Distributed XGBoost on Modin#

Modin provides an implementation of the distributed XGBoost machine learning algorithm on Modin DataFrames. Please note that this feature is experimental, and its behavior or interfaces may change.

Install XGBoost on Modin#

Modin comes with all dependencies by default except the xgboost package. Distributed XGBoost on Modin is currently supported only on the Ray execution engine, so see the installation page for details on installing Modin with the Ray engine. To install the xgboost package, use pip:

pip install xgboost

XGBoost Train and Predict#

Distributed XGBoost functionality lives in the modin.experimental.xgboost module, which provides a drop-in replacement API for the xgboost train and Booster.predict functions.

Module holds public interfaces for Modin XGBoost.

modin.experimental.xgboost.train(params: Dict, dtrain: DMatrix, *args, evals=(), num_actors: Optional[int] = None, evals_result: Optional[Dict] = None, **kwargs)

Run distributed training of XGBoost model.

During training, dtrain is distributed evenly between workers according to the IP addresses of its partitions (if dtrain is not evenly spread over the nodes, some partitions are re-distributed between nodes). xgb.train then runs on each worker for its subset of dtrain, and the per-worker training results are reduced using the Rabit context.
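The distribution step described above can be pictured with a small, self-contained sketch. This is not Modin's implementation: `assign_partitions` is a hypothetical helper, and the quota-based rebalancing rule is an assumption chosen only to illustrate the idea of keeping partitions on the node where they already live and moving the surplus elsewhere.

```python
def assign_partitions(partition_ips, actor_ips):
    """Return, for each actor, the list of partition indices it should train on.

    Hypothetical helper for illustration only, not Modin's actual algorithm.
    """
    n_actors = len(actor_ips)
    base, extra = divmod(len(partition_ips), n_actors)
    # Even split: the first `extra` actors take one additional partition.
    quota = [base + (1 if i < extra else 0) for i in range(n_actors)]

    assignment = [[] for _ in actor_ips]
    surplus = []

    # Pass 1: keep each partition on an actor that shares its IP, if one has room.
    for part, ip in enumerate(partition_ips):
        local = [
            a
            for a, actor_ip in enumerate(actor_ips)
            if actor_ip == ip and len(assignment[a]) < quota[a]
        ]
        if local:
            target = min(local, key=lambda a: len(assignment[a]))
            assignment[target].append(part)
        else:
            surplus.append(part)

    # Pass 2: re-distribute the surplus to actors that are still below quota.
    for part in surplus:
        target = next(a for a in range(n_actors) if len(assignment[a]) < quota[a])
        assignment[target].append(part)

    return assignment


# Five partitions live on one node and one on another; two actors, one per node.
partition_ips = ["10.0.0.1"] * 5 + ["10.0.0.2"]
actor_ips = ["10.0.0.1", "10.0.0.2"]
print(assign_partitions(partition_ips, actor_ips))
```

In this toy input, partitions 3 and 4 are moved to the second actor so that each actor ends up with three partitions.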

Parameters:

params (dict) – Booster params.
dtrain (modin.experimental.xgboost.DMatrix) – Data to be trained against.
*args (iterable) – Other parameters for xgboost.train.
evals (list of pairs (modin.experimental.xgboost.DMatrix, str), default: empty) – List of validation sets for which metrics will be evaluated during training. Validation metrics help track the performance of the model.
num_actors (int, optional) – Number of actors for training. If unspecified, this value will be computed automatically.
evals_result (dict, optional) – Dict to store evaluation results in.
**kwargs (dict) – Other parameters are the same as xgboost.train.

Returns:

A trained booster.

Return type:

modin.experimental.xgboost.Booster
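As a sketch of how the dict passed as evals_result might be inspected after training: xgboost.train fills it as a nested mapping from evaluation-set name to metric name to a per-round list of values, and the example below assumes the Modin version uses the same layout. `summarize_evals` is a hypothetical helper, and the numbers are made up for illustration.

```python
def summarize_evals(evals_result):
    """Return {(eval_name, metric_name): (best_round, best_value, last_value)}.

    Assumes lower metric values are better, which holds for losses such as
    mlogloss but not for metrics such as auc.
    """
    summary = {}
    for eval_name, metrics in evals_result.items():
        for metric_name, history in metrics.items():
            # Index of the boosting round with the smallest metric value.
            best_round = min(range(len(history)), key=history.__getitem__)
            summary[(eval_name, metric_name)] = (
                best_round,
                history[best_round],
                history[-1],
            )
    return summary


# Made-up values arranged in the layout described above.
example = {"train": {"mlogloss": [0.74, 0.52, 0.39, 0.41]}}
print(summarize_evals(example))
```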

class modin.experimental.xgboost.Booster(params=None, cache=(), model_file=None)

A Modin Booster of XGBoost.

Booster is the XGBoost model; it contains low-level routines for training, prediction, and evaluation.

Parameters:

params (dict, optional) – Parameters for boosters.
cache (list, default: empty) – List of cache items.
model_file (string/os.PathLike/xgb.Booster/bytearray, optional) – Path to the model file if it’s string or PathLike or xgb.Booster.

predict(data: DMatrix, **kwargs)

Run distributed prediction with a trained booster.

During execution, it runs xgb.Booster.predict on each worker for its subset of data and creates a Modin DataFrame with the prediction results.

Parameters:

data (modin.experimental.xgboost.DMatrix) – Input data used for prediction.
**kwargs (dict) – Other parameters are the same as for xgboost.Booster.predict.

Returns:

Modin DataFrame with prediction results.

Return type:

modin.pandas.DataFrame

ModinDMatrix#

Data is passed to modin.experimental.xgboost functions via a Modin DMatrix object.

class modin.experimental.xgboost.DMatrix(data, label=None, missing=None, silent=False, feature_names=None, feature_types=None, feature_weights=None, enable_categorical=None)

DMatrix holds references to partitions of Modin DataFrame.

Unwrapping of the Modin DataFrame partitions starts at initialization.

Parameters:

data (modin.pandas.DataFrame) – Data source of DMatrix.
label (modin.pandas.DataFrame or modin.pandas.Series, optional) – Labels used for training.
missing (float, optional) – Value in the input data that should be treated as a missing value. If None, defaults to np.nan.
silent (boolean, optional) – Whether to print messages during construction or not.
feature_names (list, optional) – Set names for features.
feature_types (list, optional) – Set types for features.
feature_weights (array_like, optional) – Set feature weights for column sampling.
enable_categorical (boolean, optional) – Experimental support of specializing for categorical features.

Notes

Currently DMatrix doesn’t support weight, base_margin, nthread, group, qid, label_lower_bound, label_upper_bound parameters.
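When porting code written against xgboost.DMatrix, a small pre-check along the following lines may help catch the unsupported parameters listed above before the DMatrix is constructed. `UNSUPPORTED_DMATRIX_PARAMS` and `split_dmatrix_kwargs` are hypothetical names introduced for this sketch; they are not part of Modin.

```python
# Parameters the note above lists as unsupported by the Modin DMatrix.
UNSUPPORTED_DMATRIX_PARAMS = frozenset(
    {
        "weight",
        "base_margin",
        "nthread",
        "group",
        "qid",
        "label_lower_bound",
        "label_upper_bound",
    }
)


def split_dmatrix_kwargs(kwargs):
    """Split keyword arguments into (supported dict, sorted unsupported names)."""
    supported = {
        name: value
        for name, value in kwargs.items()
        if name not in UNSUPPORTED_DMATRIX_PARAMS
    }
    rejected = sorted(name for name in kwargs if name in UNSUPPORTED_DMATRIX_PARAMS)
    return supported, rejected


# Keyword arguments taken from a hypothetical xgboost.DMatrix call.
supported, rejected = split_dmatrix_kwargs(
    {"missing": 0.0, "nthread": 4, "weight": [1.0, 2.0]}
)
print(supported)
print(rejected)
```

Whether to drop the rejected names silently or raise an error is a design choice left to the caller; the sketch only reports them.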

property feature_names

Get column labels.

Return type: Column labels.

property feature_types

Get column types.

Return type: Column types.

get_dmatrix_params()

Get dict of DMatrix parameters excluding self.data/self.label.

Return type: dict

get_float_info(name)

Get float property from the DMatrix.

Parameters: name (str) – The field name of the information.
Return type: A NumPy array of float information of the data.

num_col()

Get number of columns.

Return type: int

num_row()

Get number of rows.

Return type: int

set_info(*, label=None, feature_names=None, feature_types=None, feature_weights=None) → None

Set meta info for DMatrix.

Parameters:

label (modin.pandas.DataFrame or modin.pandas.Series, optional) – Labels used for training.
feature_names (list, optional) – Set names for features.
feature_types (list, optional) – Set types for features.
feature_weights (array_like, optional) – Set feature weights for column sampling.

Currently, the Modin DMatrix accepts only modin.pandas.DataFrame as input data.

A Single Node / Cluster setup#

The XGBoost part of Modin uses Ray resources in the same way as all other Modin functions.

To start the Ray runtime on a single node:

import ray
# See the Ray documentation for the configuration that best suits your setup.
ray.init()

If you already have a Ray cluster, you can connect to it as follows:

import ray
ray.init(address='auto')

Detailed information about initializing the Ray runtime can be found on the starting ray page.

Usage example#

In the example below, we train an XGBoost model on the Iris dataset and get predictions on the same data. All processing runs in single-node mode.

from sklearn import datasets

import ray
# See the Ray documentation for the configuration that best suits your setup.
ray.init()  # Start the Ray runtime on a single node

import modin.pandas as pd
import modin.experimental.xgboost as xgb

# Load iris dataset from sklearn
iris = datasets.load_iris()

# Create Modin DataFrames
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)

# Create DMatrix
dtrain = xgb.DMatrix(X, y)
dtest = xgb.DMatrix(X, y)

# Set training parameters
xgb_params = {
    "eta": 0.3,
    "max_depth": 3,
    "objective": "multi:softprob",
    "num_class": 3,
    "eval_metric": "mlogloss",
}
steps = 20

# Create dict for evaluation results
evals_result = dict()

# Run training
model = xgb.train(
    xgb_params,
    dtrain,
    steps,
    evals=[(dtrain, "train")],
    evals_result=evals_result
)

# Print evaluation results
print(f'Evals results:\n{evals_result}')

# Predict results
prediction = model.predict(dtest)

# Print prediction results
print(f'Prediction results:\n{prediction}')
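With the multi:softprob objective used above, XGBoost produces one probability per class for each row rather than a class label. The following sketch turns such rows into labels and computes accuracy in plain Python. `predicted_classes` and `accuracy` are hypothetical helpers, the probabilities are made up, and the assumption that the prediction DataFrame holds one column per class (so that something like `prediction.to_numpy().tolist()` would yield rows in this shape) has not been checked against Modin.

```python
def predicted_classes(probabilities):
    """Return the index of the most probable class for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def accuracy(predicted, expected):
    """Return the fraction of predictions that match the expected labels."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Made-up per-class probabilities for three rows and three classes.
rows = [
    [0.90, 0.07, 0.03],
    [0.10, 0.70, 0.20],
    [0.20, 0.45, 0.35],
]
labels = [0, 1, 2]

classes = predicted_classes(rows)
print(classes)  # [0, 1, 1]
print(accuracy(classes, labels))
```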