Distributed XGBoost on Modin (experimental)

Modin provides an implementation of distributed XGBoost machine learning algorithm on Modin DataFrames. Please note that this feature is experimental and behavior or interfaces could be changed.

Install XGBoost on Modin

Modin comes with all the dependencies except xgboost package by default. Currently, distributed XGBoost on Modin is only supported on the Ray backend, therefore, see the installation page for more information on installing Modin with the Ray backend. To install xgboost package you can use pip:

pip install xgboost

XGBoost Train and Predict

Distributed XGBoost functionality is placed in modin.experimental.xgboost module. modin.experimental.xgboost provides a drop-in replacement API for train and Booster.predict xgboost functions.

Module holds public interfaces for Modin XGBoost.

modin.experimental.xgboost.train(params: Dict, dtrain: modin.experimental.xgboost.xgboost.DMatrix, *args, evals=(), num_actors: Optional[int] = None, evals_result: Optional[Dict] = None, **kwargs)

Run distributed training of XGBoost model.

During work it evenly distributes dtrain between workers according to IP addresses partitions (in case of not even distribution of dtrain over nodes, some partitions will be re-distributed between nodes), runs xgb.train on each worker for subset of dtrain and reduces training results of each worker using Rabit Context.

Parameters
  • params (dict) – Booster params.

  • dtrain (modin.experimental.xgboost.DMatrix) – Data to be trained against.

  • *args (iterable) – Other parameters for xgboost.train.

  • evals (list of pairs (modin.experimental.xgboost.DMatrix, str), default: empty) – List of validation sets for which metrics will evaluated during training. Validation metrics will help us track the performance of the model.

  • num_actors (int, optional) – Number of actors for training. If unspecified, this value will be computed automatically.

  • evals_result (dict, optional) – Dict to store evaluation results in.

  • **kwargs (dict) – Other parameters are the same as xgboost.train.

Returns

A trained booster.

Return type

modin.experimental.xgboost.Booster

class modin.experimental.xgboost.Booster(params=None, cache=(), model_file=None)

A Modin Booster of XGBoost.

Booster is the model of XGBoost, that contains low level routines for training, prediction and evaluation.

Parameters
  • params (dict, optional) – Parameters for boosters.

  • cache (list, default: empty) – List of cache items.

  • model_file (string/os.PathLike/xgb.Booster/bytearray, optional) – Path to the model file if it’s string or PathLike or xgb.Booster.

predict(data: modin.experimental.xgboost.xgboost.DMatrix, num_actors: Optional[int] = None, **kwargs)

Run distributed prediction with a trained booster.

During work it evenly distributes data between workers, runs xgb.predict on each worker for subset of data and creates Modin DataFrame with prediction results.

Parameters
  • data (modin.experimental.xgboost.DMatrix) – Input data used for prediction.

  • num_actors (int, optional) – Number of actors for prediction. If unspecified, this value will be computed automatically.

  • **kwargs (dict) – Other parameters are the same as xgboost.Booster.predict.

Returns

Modin DataFrame with prediction results.

Return type

modin.pandas.DataFrame

ModinDMatrix

Data is passed to modin.experimental.xgboost functions via a Modin DMatrix object.

Module holds public interfaces for Modin XGBoost.

class modin.experimental.xgboost.DMatrix(data, label)

DMatrix holds references to partitions of Modin DataFrame.

On init stage unwrapping partitions of Modin DataFrame is started.

Parameters
  • data (modin.pandas.DataFrame) – Data source of DMatrix.

  • label (modin.pandas.DataFrame or modin.pandas.Series) – Labels used for training.

Notes

Currently DMatrix supports only data and label parameters.

Currently, the Modin DMatrix supports modin.pandas.DataFrame only as an input.

A Single Node / Cluster setup

The XGBoost part of Modin uses a Ray resources by similar way as all Modin functions.

To start the Ray runtime on a single node:

import ray
ray.init()

If you already had the Ray cluster you can connect to it by next way:

import ray
ray.init(address='auto')

A detailed information about initializing the Ray runtime you can find in starting ray page.

Usage example

In example below we train XGBoost model using the Iris Dataset and get prediction on the same data. All processing will be in a single node mode.

from sklearn import datasets

import ray
ray.init() # Start the Ray runtime for single-node

import modin.pandas as pd
import modin.experimental.xgboost as xgb

# Load iris dataset from sklearn
iris = datasets.load_iris()

# Create Modin DataFrames
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)

# Create DMatrix
dtrain = xgb.DMatrix(X, y)
dtest = xgb.DMatrix(X, y)

# Set training parameters
xgb_params = {
    "eta": 0.3,
    "max_depth": 3,
    "objective": "multi:softprob",
    "num_class": 3,
    "eval_metric": "mlogloss",
}
steps = 20

# Create dict for evaluation results
evals_result = dict()

# Run training
model = xgb.train(
    xgb_params,
    dtrain,
    steps,
    evals=[(dtrain, "train")],
    evals_result=evals_result
)

# Print evaluation results
print(f'Evals results:\n{evals_result}')

# Predict results
prediction = model.predict(dtest)

# Print prediction results
print(f'Prediction results:\n{prediction}')