Distributed XGBoost on Modin (experimental)

Modin provides an implementation of the distributed XGBoost machine learning algorithm on Modin DataFrames. Please note that this feature is experimental and that its behavior or interfaces could change.

Install XGBoost on Modin

Modin comes with all dependencies except the xgboost package by default. Currently, distributed XGBoost on Modin is only supported on the Ray backend, so see the installation page for more information on installing Modin with the Ray backend. To install the xgboost package, you can use pip:

pip install xgboost

XGBoost Train and Predict

Distributed XGBoost functionality is placed in the modin.experimental.xgboost module. modin.experimental.xgboost provides an xgboost-like API for the train and predict functions.

modin.experimental.xgboost.train(params: Dict[KT, VT], dtrain: modin.experimental.xgboost.xgboost.ModinDMatrix, *args, evals=(), nthread: Optional[int] = 2, evenly_data_distribution: Optional[bool] = True, **kwargs)

Train an XGBoost model.

Parameters:
  • params (dict) – Booster params.
  • dtrain (ModinDMatrix) – Data to be trained against.
  • evals (list of pairs (ModinDMatrix, string)) – List of validation sets for which metrics will be evaluated during training. Validation metrics help us track the performance of the model.
  • nthread (int) – Number of threads to use on each node. By default, it is equal to the number of threads on the master node.
  • evenly_data_distribution (boolean, default True) – Whether to distribute partitions evenly between nodes. If False, data transfer between nodes is minimized, but the data may not be evenly distributed.
  • **kwargs – Other parameters are the same as for xgboost.train, except for evals_result, which is returned as part of the function return value instead of being an argument.
Returns:

A dictionary containing the trained booster and the evaluation history. The history field is the same as evals_result from xgboost.train.

{'booster': xgboost.Booster,
 'history': {'train': {'logloss': ['0.48253', '0.35953']},
             'eval': {'logloss': ['0.480385', '0.357756']}}}

Return type:

dict

train accepts all arguments of the xgboost.train function except for the evals_result parameter, which is returned as part of the function return value instead of being an argument.
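
For illustration, here is a minimal training sketch; the toy data and parameter values are made up for this example, and a started Ray runtime is assumed (see the setup section below):

import modin.pandas as pd
import modin.experimental.xgboost as xgb

# Toy data, purely illustrative; any modin.pandas.DataFrame works.
X = pd.DataFrame({"f0": [1, 2, 3, 4], "f1": [4, 3, 2, 1]})
y = pd.DataFrame([0, 1, 0, 1])
dtrain = xgb.ModinDMatrix(X, y)

# Train for 10 boosting rounds and track metrics on the training set.
result = xgb.train(
    {"objective": "binary:logistic", "eval_metric": "logloss"},
    dtrain,
    10,
    evals=[(dtrain, "train")],
)
booster = result["booster"]  # xgboost.Booster
history = result["history"]  # same as evals_result from xgboost.train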

modin.experimental.xgboost.predict(model, data: modin.experimental.xgboost.xgboost.ModinDMatrix, nthread: Optional[int] = 2, evenly_data_distribution: Optional[bool] = True, **kwargs)

Run prediction with a trained booster.

Parameters:
  • model (Booster or dict returned by modin.experimental.xgboost.train) – The trained model.
  • data (ModinDMatrix.) – Input data used for prediction.
  • nthread (int) – Number of threads to use on each node. By default, it is equal to the number of threads on the master node.
  • evenly_data_distribution (boolean, default True) – Whether to distribute partitions evenly between nodes. If False, data transfer between nodes is minimized, but the data may not be evenly distributed.
Returns:

Array with prediction results.

Return type:

numpy.array

predict is similar to xgboost.Booster.predict with an additional argument, model.
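
As a sketch of both accepted model forms, again with made-up toy data and assuming a started Ray runtime:

import modin.pandas as pd
import modin.experimental.xgboost as xgb

X = pd.DataFrame({"f0": [1, 2, 3, 4], "f1": [4, 3, 2, 1]})
y = pd.DataFrame([0, 1, 0, 1])
dmatrix = xgb.ModinDMatrix(X, y)
result = xgb.train({"objective": "binary:logistic"}, dmatrix, 10)

# The model argument may be the dictionary returned by train...
pred = xgb.predict(result, dmatrix)
# ...or the bare booster extracted from that dictionary.
pred = xgb.predict(result["booster"], dmatrix)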

ModinDMatrix

Data is passed to modin.experimental.xgboost functions via a ModinDMatrix object.

class modin.experimental.xgboost.ModinDMatrix(data, label)

DMatrix that holds references to a DataFrame.

Parameters:
  • data (DataFrame) – Data source of DMatrix.
  • label (DataFrame) – Labels used for training.

Notes

Currently, ModinDMatrix supports only the data and label parameters, and only modin.pandas.DataFrame is accepted as input.
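
A minimal construction sketch, with made-up values; both arguments must be modin.pandas.DataFrame objects:

import modin.pandas as pd
import modin.experimental.xgboost as xgb

# Feature matrix and labels, both as modin.pandas.DataFrame.
data = pd.DataFrame({"f0": [1.0, 2.0, 3.0], "f1": [3.0, 2.0, 1.0]})
label = pd.DataFrame([0, 1, 0])
dmatrix = xgb.ModinDMatrix(data, label)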

Single Node / Cluster setup

The XGBoost part of Modin uses Ray resources in the same way as all other Modin functions.

To start the Ray runtime on a single node:

import ray
ray.init()

If you already have a Ray cluster running, you can connect to it as follows:

import ray
ray.init(address='auto')

Detailed information about initializing the Ray runtime can be found on the starting Ray page.

Usage example

In the example below, we train an XGBoost model on the Iris dataset and run prediction on the same data. All processing will be done in single-node mode.

from sklearn import datasets

import ray
ray.init() # Start the Ray runtime for single-node

import modin.pandas as pd
import modin.experimental.xgboost as xgb

# Load iris dataset from sklearn
iris = datasets.load_iris()

# Create Modin DataFrames
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)

# Create ModinDMatrix
dtrain = xgb.ModinDMatrix(X, y)
dtest = xgb.ModinDMatrix(X, y)

# Set training parameters
xgb_params = {
    "eta": 0.3,
    "max_depth": 3,
    "objective": "multi:softprob",
    "num_class": 3,
    "eval_metric": "mlogloss",
}
steps = 20

# Run training
model = xgb.train(
    xgb_params,
    dtrain,
    steps,
    evals=[(dtrain, "train")]
)

# Save the evaluation history and the trained booster
evals_result = model["history"]
booster = model["booster"]

# Predict results
prediction = xgb.predict(model, dtest)

Modes of data distribution

Modin XGBoost provides two approaches to internal data distribution, which can be switched via the evenly_data_distribution parameter of the train/predict functions:

  • evenly_data_distribution = True: the input data of the train/predict functions is distributed evenly between the nodes of the cluster to ensure even utilization of the nodes (default behavior).
  • evenly_data_distribution = False: partitions of the input data of the train/predict functions are not transferred between the nodes of the cluster as long as the portion of empty nodes is <10%; if the portion of empty nodes is ≥10%, even data distribution is applied. This mode minimizes data transfer between nodes but does not guarantee effective utilization of all nodes. It is most effective when all cluster nodes are already occupied by data (see the sketch below).
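
As a sketch, the mode is chosen per call; the toy data below is made up, and a started Ray runtime is assumed:

import modin.pandas as pd
import modin.experimental.xgboost as xgb

X = pd.DataFrame({"f0": list(range(100)), "f1": list(range(100))})
y = pd.DataFrame([i % 2 for i in range(100)])
dtrain = xgb.ModinDMatrix(X, y)

params = {"objective": "binary:logistic"}

# Default mode: partitions are spread evenly across cluster nodes.
model = xgb.train(params, dtrain, 10)

# Minimal-transfer mode: partitions stay on their current nodes
# unless the portion of empty nodes reaches 10%.
model = xgb.train(params, dtrain, 10, evenly_data_distribution=False)
prediction = xgb.predict(model, dtrain, evenly_data_distribution=False)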