Distributed XGBoost on Modin (experimental)

Modin provides an implementation of the distributed XGBoost machine learning algorithm on Modin DataFrames. Please note that this feature is experimental: its behavior or interfaces may change.

Install XGBoost on Modin

By default, Modin comes with all of its dependencies except the xgboost package. Currently, distributed XGBoost on Modin is only supported on the Ray backend; see the installation page for more information on installing Modin with the Ray backend. The xgboost package can be installed with pip:

pip install xgboost

XGBoost Train and Predict

Distributed XGBoost functionality lives in the modin.experimental.xgboost module, which provides a drop-in replacement API for the xgboost train and Booster.predict functions.

modin.experimental.xgboost.train(params: Dict, dtrain: modin.experimental.xgboost.DMatrix, *args, evals=(), num_actors: Optional[int] = None, evals_result: Optional[Dict] = None, **kwargs)

Train XGBoost model.

Parameters:
  • params (dict) – Booster params.
  • dtrain (DMatrix) – Data to be trained against.
  • evals (list of pairs (DMatrix, string)) – List of validation sets for which metrics will be evaluated during training. Validation metrics help track the performance of the model.
  • num_actors (int. Default is None) – Number of actors for training. If it’s None, this value will be computed automatically.
  • evals_result (dict. Default is None) – Dict to store evaluation results in.
  • **kwargs – Other parameters are the same as xgboost.train.
Returns:
  A trained booster.

Return type:
  modin.experimental.xgboost.Booster

class modin.experimental.xgboost.Booster(params=None, cache=(), model_file=None)

A Modin Booster of XGBoost.

Booster is the XGBoost model; it contains low-level routines for training, prediction, and evaluation.

Parameters:
  • params (dict. Default is None) – Parameters for boosters.
  • cache (list) – List of cache items.
  • model_file (string/os.PathLike/Booster/bytearray) – Path to the model file if it's a string or PathLike.
predict(data: modin.experimental.xgboost.DMatrix, num_actors: Optional[int] = None, **kwargs)

Run prediction with a trained booster.

Parameters:
  • data (DMatrix) – Input data used for prediction.
  • num_actors (int. Default is None) – Number of actors for prediction. If it’s None, this value will be computed automatically.
  • **kwargs – Other parameters are the same as xgboost.Booster.predict.
Returns:
  Modin DataFrame with prediction results.

Return type:
  modin.pandas.DataFrame

Modin DMatrix

Data is passed to modin.experimental.xgboost functions via a Modin DMatrix object.

class modin.experimental.xgboost.DMatrix(data, label)

DMatrix holds references to the underlying DataFrames.

Parameters:
  • data (DataFrame) – Data source of DMatrix.
  • label (DataFrame) – Labels used for training.

Notes

Currently, DMatrix supports only the data and label parameters, and accepts only modin.pandas.DataFrame as input.

Single Node / Cluster Setup

The XGBoost part of Modin uses Ray resources in the same way as other Modin functions.

To start the Ray runtime on a single node:

import ray
ray.init()

If a Ray cluster is already running, you can connect to it as follows:

import ray
ray.init(address='auto')

Detailed information about initializing the Ray runtime can be found on the starting Ray page.

Usage example

In the example below, we train an XGBoost model on the Iris dataset and run prediction on the same data. All processing runs on a single node.

from sklearn import datasets

import ray
ray.init() # Start the Ray runtime for single-node

import modin.pandas as pd
import modin.experimental.xgboost as xgb

# Load iris dataset from sklearn
iris = datasets.load_iris()

# Create Modin DataFrames
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)

# Create DMatrix
dtrain = xgb.DMatrix(X, y)
dtest = xgb.DMatrix(X, y)

# Set training parameters
xgb_params = {
    "eta": 0.3,
    "max_depth": 3,
    "objective": "multi:softprob",
    "num_class": 3,
    "eval_metric": "mlogloss",
}
steps = 20

# Create dict for evaluation results
evals_result = dict()

# Run training
model = xgb.train(
    xgb_params,
    dtrain,
    steps,
    evals=[(dtrain, "train")],
    evals_result=evals_result
)

# Print evaluation results
print(f'Evals results:\n{evals_result}')

# Predict results
prediction = model.predict(dtest)

# Print prediction results
print(f'Prediction results:\n{prediction}')