Modin XGBoost module description
High-level Module Overview
This module holds the classes, public interface, and internal functions for distributed XGBoost in Modin.
The public classes Booster and DMatrix and the function train() provide the user with familiar XGBoost interfaces. They are located in the modin.experimental.xgboost.xgboost module.
The internal module modin.experimental.xgboost.xgboost_ray contains the implementation of Modin XGBoost for the Ray execution engine. It mainly consists of the Ray actor class ModinXGBoostActor, the function _assign_row_partitions_to_actors() that distributes Modin’s partitions between actors, the internal functions _train() and _predict() used by the public interfaces, and utility functions for computing cluster resources, creating actors, and so on.
Public interfaces
DMatrix inherits from the original xgboost.DMatrix class and overrides its constructor, which currently supports only the data and label parameters. Both parameters must be modin.pandas.DataFrame objects, which are internally unwrapped into lists of delayed objects of Modin’s row partitions using the function unwrap_partitions().
- class modin.experimental.xgboost.DMatrix(data, label=None, missing=None, silent=False, feature_names=None, feature_types=None, feature_weights=None, enable_categorical=None)
DMatrix holds references to the partitions of a Modin DataFrame.
Unwrapping of the Modin DataFrame partitions starts at initialization.
- Parameters
data (modin.pandas.DataFrame) – Data source of DMatrix.
label (modin.pandas.DataFrame or modin.pandas.Series, optional) – Labels used for training.
missing (float, optional) – Value in the input data that is to be treated as a missing value. If None, defaults to np.nan.
silent (boolean, optional) – Whether to print messages during construction or not.
feature_names (list, optional) – Set names for features.
feature_types (list, optional) – Set types for features.
feature_weights (array_like, optional) – Set feature weights for column sampling.
enable_categorical (boolean, optional) – Experimental support of specializing for categorical features.
Notes
Currently DMatrix doesn’t support weight, base_margin, nthread, group, qid, label_lower_bound, label_upper_bound parameters.
- property feature_names
Get column labels.
- Returns
Column labels.
- property feature_types
Get column types.
- Returns
Column types.
- get_dmatrix_params()
Get a dict of DMatrix parameters, excluding self.data and self.label.
- Return type
dict
- get_float_info(name)
Get a float property from the DMatrix.
- Parameters
name (str) – The field name of the information.
- Returns
A NumPy array of float information of the data.
- num_col()
Get number of columns.
- Return type
int
- num_row()
Get number of rows.
- Return type
int
- set_info(*, label=None, feature_names=None, feature_types=None, feature_weights=None) -> None
Set meta info for DMatrix.
- Parameters
label (modin.pandas.DataFrame or modin.pandas.Series, optional) – Labels used for training.
feature_names (list, optional) – Set names for features.
feature_types (list, optional) – Set types for features.
feature_weights (array_like, optional) – Set feature weights for column sampling.
Booster inherits from the original xgboost.Booster class and overrides the predict method. The only difference from the original predict interface is that the data parameter must be of type DMatrix.
- class modin.experimental.xgboost.Booster(params=None, cache=(), model_file=None)
A Modin Booster of XGBoost.
Booster is the XGBoost model; it contains low-level routines for training, prediction, and evaluation.
- Parameters
params (dict, optional) – Parameters for boosters.
cache (list, default: empty) – List of cache items.
model_file (string/os.PathLike/xgb.Booster/bytearray, optional) – Path to the model file if it’s string or PathLike or xgb.Booster.
- predict(data: modin.experimental.xgboost.xgboost.DMatrix, **kwargs)
Run distributed prediction with a trained booster.
During execution it runs xgboost.Booster.predict on each worker for its subset of data and creates a Modin DataFrame with the prediction results.
- Parameters
data (modin.experimental.xgboost.DMatrix) – Input data used for prediction.
**kwargs (dict) – Other parameters are the same as for xgboost.Booster.predict.
- Returns
Modin DataFrame with prediction results.
- Return type
modin.pandas.DataFrame
The train() function differs from the original train function in two ways: (1) the dtrain parameter must be of type DMatrix, and (2) there is a new parameter, num_actors.
- modin.experimental.xgboost.train(params: Dict, dtrain: modin.experimental.xgboost.xgboost.DMatrix, *args, evals=(), num_actors: Optional[int] = None, evals_result: Optional[Dict] = None, **kwargs)
Run distributed training of XGBoost model.
It evenly distributes dtrain between workers according to the IP addresses of the partitions (if dtrain is not evenly distributed over nodes, some partitions are re-distributed between nodes), runs xgb.train on each worker for its subset of dtrain, and reduces the training results of each worker using Rabit Context.
- Parameters
params (dict) – Booster params.
dtrain (modin.experimental.xgboost.DMatrix) – Data to be trained against.
*args (iterable) – Other parameters for xgboost.train.
evals (list of pairs (modin.experimental.xgboost.DMatrix, str), default: empty) – List of validation sets for which metrics will be evaluated during training. Validation metrics help track the performance of the model.
num_actors (int, optional) – Number of actors for training. If unspecified, this value will be computed automatically.
evals_result (dict, optional) – Dict to store evaluation results in.
**kwargs (dict) – Other parameters are the same as xgboost.train.
- Returns
A trained booster.
- Return type
modin.experimental.xgboost.Booster
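The public interfaces above are typically used together: build DMatrix objects, call train(), then call predict() on the returned booster. The following is a minimal sketch of that call order, not a definitive recipe. The helper name train_and_predict is hypothetical, and the xgb argument is assumed to be the modin.experimental.xgboost module (it is passed in so the sketch does not import Modin itself). num_boost_round is assumed to be forwarded to xgboost.train through **kwargs.

```python
def train_and_predict(xgb, X_train, y_train, X_test, params,
                      num_boost_round=20, num_actors=None):
    """Hypothetical helper showing the order of the public calls.

    `xgb` is assumed to be the `modin.experimental.xgboost` module and the
    data arguments are assumed to be `modin.pandas.DataFrame` objects.
    """
    # Wrap the Modin DataFrames; only `data` and `label` are passed here.
    dtrain = xgb.DMatrix(X_train, y_train)
    dtest = xgb.DMatrix(X_test)

    # `evals_result` is filled in by train() with the evaluation history.
    evals_result = {}
    booster = xgb.train(
        params,
        dtrain,
        num_boost_round=num_boost_round,  # forwarded via **kwargs (assumption)
        evals=[(dtrain, "train")],
        num_actors=num_actors,            # None lets the value be computed automatically
        evals_result=evals_result,
    )

    # predict() takes a DMatrix and returns the prediction results.
    predictions = booster.predict(dtest)
    return booster, predictions, evals_result
```

With Modin and Ray available, the helper would be called with the real module, for example train_and_predict(modin.experimental.xgboost, X_train, y_train, X_test, {"eta": 0.3}).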
Internal execution flow on Ray engine
The internal functions _train() and _predict() work similarly to their xgboost counterparts.
Training
The data is passed to the _train() function as a DMatrix object. Iterating over the DMatrix yields lists of ray.ObjectRef objects that hold the row partitions of the Modin DataFrame. Example:
    # Extract lists of row partitions from dtrain (DMatrix object)
    X_row_parts, y_row_parts = dtrain
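The tuple-style unpacking of dtrain works for any object whose iterator yields exactly two items. The sketch below illustrates the idea with a hypothetical stand-in class, PartitionHolder, and plain strings in place of ray.ObjectRef objects. It is an illustration of the iterator protocol under those assumptions, not Modin's DMatrix implementation.

```python
class PartitionHolder:
    """Hypothetical stand-in for an object holding X and y row partitions."""

    def __init__(self, data, label=None):
        self.data = data
        self.label = label

    def __iter__(self):
        # Yield the data partitions first and the label partitions second,
        # so the object can be unpacked as `X, y = holder`.
        yield self.data
        yield self.label


# Strings stand in for references to row partitions.
dtrain = PartitionHolder(["x_part_0", "x_part_1"], ["y_part_0", "y_part_1"])
X_row_parts, y_row_parts = dtrain
```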
At this step, the num_actors parameter is processed. The internal function _get_num_actors() examines the value provided by the user. If no value is provided, num_actors is computed under the condition that one actor uses at most 2 CPUs. This condition was chosen to maximize the number of parallel workers while keeping XGBoost training multithreaded (2 threads per worker are used in this case).
Note
The num_actors parameter is exposed in the public function train() to allow fine-tuning for the best performance in specific use cases.
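The "at most 2 CPUs per actor" rule can be sketched as a small pure function. The function name get_num_actors and its min_cpus_per_node and num_nodes arguments are assumptions made for illustration; the text above does not say how the real _get_num_actors() queries cluster resources.

```python
def get_num_actors(min_cpus_per_node, num_nodes, num_actors=None):
    """Sketch of the 2-CPUs-per-actor rule (hypothetical helper).

    `min_cpus_per_node` and `num_nodes` are assumed inputs describing the
    cluster; they are not part of the documented signature.
    """
    if num_actors is not None:
        # A user-provided value takes precedence over the computed one.
        if not isinstance(num_actors, int) or num_actors < 1:
            raise ValueError("num_actors must be a positive integer or None")
        return num_actors

    # One actor per 2 CPUs on each node, but never fewer than one actor.
    actors_per_node = max(1, min_cpus_per_node // 2)
    return actors_per_node * num_nodes
```

For instance, under these assumptions a two-node cluster with 8 CPUs per node would get 8 actors, while single-CPU nodes would still get one actor each.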
ModinXGBoostActor objects are created.
The dtrain data is split evenly between actors. The internal function _split_data_across_actors() assigns row partitions to actors using the internal function _assign_row_partitions_to_actors(). This function creates a dictionary of the form {actor_rank: ([part_i0, part_i3, ..], [0, 3, ..]), ..}.
Note
_assign_row_partitions_to_actors() takes into account the IP addresses of the row partitions of dtrain to minimize excess data transfer.
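The IP-aware assignment can be illustrated with a simplified, two-pass sketch: first keep each partition on an actor that shares its IP while that actor has room, then move the remaining partitions to actors that still lack them. The function assign_row_partitions and its list-of-IPs inputs are hypothetical, and the even-split capacity rule is an assumption; the real function may balance differently.

```python
from collections import defaultdict


def assign_row_partitions(actor_ips, partitions, partition_ips):
    """Return {actor_rank: (partitions, order)} (simplified sketch).

    actor_ips[rank] is the IP of actor `rank`; partition_ips[i] is the IP
    of the node that holds partitions[i]. Both inputs are assumptions.
    """
    num_actors = len(actor_ips)
    # Even split: the first `extra` actors take one additional partition.
    base, extra = divmod(len(partitions), num_actors)
    capacity = [base + (1 if rank < extra else 0) for rank in range(num_actors)]

    ranks_by_ip = defaultdict(list)
    for rank, ip in enumerate(actor_ips):
        ranks_by_ip[ip].append(rank)

    order = {rank: [] for rank in range(num_actors)}
    leftovers = []

    # Pass 1: keep a partition on an actor with the same IP if it has room.
    for index, ip in enumerate(partition_ips):
        free_local = [r for r in ranks_by_ip.get(ip, []) if len(order[r]) < capacity[r]]
        if free_local:
            target = min(free_local, key=lambda r: len(order[r]))
            order[target].append(index)
        else:
            leftovers.append(index)

    # Pass 2: move the remaining partitions to actors that lack partitions.
    for index in leftovers:
        free = [r for r in range(num_actors) if len(order[r]) < capacity[r]]
        target = min(free, key=lambda r: len(order[r]))
        order[target].append(index)

    return {rank: ([partitions[i] for i in idxs], idxs) for rank, idxs in order.items()}


assignment = assign_row_partitions(
    actor_ips=["10.0.0.1", "10.0.0.2"],
    partitions=["part_0", "part_1", "part_2", "part_3"],
    partition_ips=["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
)
```

In this example three of the four partitions live on the first node, so one of them is moved to the actor on the second node to keep the split even.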
For each ModinXGBoostActor object, the set_train_data method is called remotely. This method loads row partitions into the actor according to the partition-distribution dictionary from the previous step. When data is passed to the actor, the row partitions are automatically materialized (ray.ObjectRef -> pandas.DataFrame).
The train method of each ModinXGBoostActor object is called remotely. This method runs XGBoost training on the actor’s local data, connects to the Rabit Tracker to share training state between actors, and returns a dictionary with the booster and evaluation results.
At the final stage, results from the actors are returned. booster and evals_result are retrieved from a remote actor using the ray.get function.
Prediction
The data is passed to the _predict() function as a DMatrix object.
The _map_predict() function is applied remotely to each partition of the data to make a partial prediction.
The resulting modin.pandas.DataFrame is created from the ray.ObjectRef objects obtained in the previous step.
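The map-then-assemble pattern of the prediction flow can be sketched without Ray. In the sketch below a thread pool stands in for remote Ray tasks, nested lists stand in for row partitions, and toy_model stands in for a trained booster. All three are assumptions made for illustration, as is the function name distributed_predict.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_model(rows):
    """Stand-in for a booster: 'predicts' the sum of each row."""
    return [sum(row) for row in rows]


def distributed_predict(predict_fn, row_partitions, max_workers=2):
    """Apply predict_fn to every partition, then join the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the input partitions,
        # so the assembled output keeps the original row order.
        partial_results = list(pool.map(predict_fn, row_partitions))
    # Concatenate the partial predictions into a single result.
    return [value for partial in partial_results for value in partial]


row_partitions = [[[1, 2], [3, 4]], [[5, 6]]]
predictions = distributed_predict(toy_model, row_partitions)
```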
Internal API
- class modin.experimental.xgboost.xgboost_ray.ModinXGBoostActor(rank, nthread)
Ray actor class that runs training on the remote worker.
- Parameters
rank (int) – Rank of this actor.
nthread (int) – Number of threads used by XGBoost in this actor.
- _get_dmatrix(X_y, **dmatrix_kwargs)
Create xgboost.DMatrix from a sequence of pandas.DataFrame objects.
The first half of X_y should contain the objects for X, the second half the objects for y.
- Parameters
X_y (list) – List of pandas.DataFrame objects.
**dmatrix_kwargs (dict) – Keyword parameters for xgb.DMatrix.
- Returns
An XGBoost DMatrix.
- Return type
xgb.DMatrix
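The "first half X, second half y" convention means a single flat sequence carries both sets of partitions. A minimal sketch of packing and splitting such a sequence follows; split_x_y is a hypothetical helper, and strings stand in for pandas.DataFrame objects.

```python
def split_x_y(X_y):
    """Split a flat sequence into its X half and its y half."""
    if len(X_y) % 2 != 0:
        raise ValueError("X_y must contain as many y objects as X objects")
    half = len(X_y) // 2
    return list(X_y[:half]), list(X_y[half:])


# Packing: X partitions first, then the matching y partitions.
X_parts = ["x_part_0", "x_part_1"]
y_parts = ["y_part_0", "y_part_1"]
packed = (*X_parts, *y_parts)

# Splitting recovers the two halves in their original order.
X_out, y_out = split_x_y(packed)
```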
- add_eval_data(*X_y, eval_method, **dmatrix_kwargs)
Add evaluation data for actor.
- Parameters
*X_y (iterable) – Sequence of ray.ObjectRef objects. The first half of the sequence is for X data, the second half for y. When it is passed to the actor, the ray.ObjectRef objects are automatically materialized as pandas.DataFrame objects.
eval_method (str) – Name of eval data.
**dmatrix_kwargs (dict) – Keyword parameters for xgb.DMatrix.
- set_train_data(*X_y, add_as_eval_method=None, **dmatrix_kwargs)
Set train data for actor.
- Parameters
*X_y (iterable) – Sequence of ray.ObjectRef objects. The first half of the sequence is for X data, the second half for y. When it is passed to the actor, the ray.ObjectRef objects are automatically materialized as pandas.DataFrame objects.
add_as_eval_method (str, optional) – Name of eval data. Used when the train data is also used for evaluation.
**dmatrix_kwargs (dict) – Keyword parameters for xgb.DMatrix.
- train(rabit_args, params, *args, **kwargs)
Run local XGBoost training.
Connects to the Rabit Tracker environment to share training data between actors and trains an XGBoost booster using self._dtrain.
- Parameters
rabit_args (list) – List with environment variables for Rabit Tracker.
params (dict) – Booster params.
*args (iterable) – Other parameters for xgboost.train.
**kwargs (dict) – Other parameters for xgboost.train.
- Returns
A dictionary with trained booster and dict of evaluation results as {“booster”: xgb.Booster, “history”: dict}.
- Return type
dict
- modin.experimental.xgboost.xgboost_ray._assign_row_partitions_to_actors(actors: List, row_partitions, data_for_aligning=None)
Assign row_partitions to actors.
row_partitions are assigned to actors according to their IPs. If the distribution is not even, partitions are moved from actors with excess partitions to actors that lack them.
- Parameters
actors (list) – List of used actors.
row_partitions (list) – Row partitions of data to assign.
data_for_aligning (dict, optional) – Data according to the order of which should be distributed row_partitions. Used to align y with X.
- Returns
Dictionary of assigned to actors partitions as {actor_rank: (partitions, order)}.
- Return type
dict
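The role of data_for_aligning can be pictured as reusing the order already chosen for X when distributing y, so that each actor receives matching X and y partitions. The sketch below shows only that reuse step. The helper align_to_assignment is hypothetical and assumes the input has the documented {actor_rank: (partitions, order)} shape.

```python
def align_to_assignment(assignment, y_partitions):
    """Distribute y_partitions in the same order as an existing assignment.

    `assignment` is assumed to look like {actor_rank: (partitions, order)},
    where `order` lists the original indices of the assigned partitions.
    """
    return {
        rank: ([y_partitions[i] for i in order], list(order))
        for rank, (_, order) in assignment.items()
    }


# An X assignment in which actor 1 received partitions 3 and 2, in that order.
x_assignment = {0: (["x0", "x1"], [0, 1]), 1: (["x3", "x2"], [3, 2])}
y_assignment = align_to_assignment(x_assignment, ["y0", "y1", "y2", "y3"])
```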
- modin.experimental.xgboost.xgboost_ray._train(dtrain, params: Dict, *args, num_actors=None, evals=(), **kwargs)
Run distributed training of XGBoost model on Ray engine.
It evenly distributes dtrain between workers according to the IP addresses of the partitions (if dtrain is not evenly distributed over nodes, some partitions are re-distributed between nodes), runs xgb.train on each worker for its subset of dtrain, and reduces the training results of each worker using Rabit Context.
- Parameters
dtrain (modin.experimental.xgboost.DMatrix) – Data to be trained against.
params (dict) – Booster params.
*args (iterable) – Other parameters for xgboost.train.
num_actors (int, optional) – Number of actors for training. If unspecified, this value will be computed automatically.
evals (list of pairs (modin.experimental.xgboost.DMatrix, str), default: empty) – List of validation sets for which metrics will be evaluated during training. Validation metrics will help us track the performance of the model.
**kwargs (dict) – Other parameters are the same as xgboost.train.
- Returns
A dictionary with trained booster and dict of evaluation results as {“booster”: xgboost.Booster, “history”: dict}.
- Return type
dict
- modin.experimental.xgboost.xgboost_ray._predict(booster, data, **kwargs)
Run distributed prediction with a trained booster on Ray engine.
During execution it runs xgboost.Booster.predict on each worker for its subset of data and creates a Modin DataFrame with the prediction results.
- Parameters
booster (xgboost.Booster) – A trained booster.
data (modin.experimental.xgboost.DMatrix) – Input data used for prediction.
**kwargs (dict) – Other parameters are the same as for xgboost.Booster.predict.
- Returns
Modin DataFrame with prediction results.
- Return type
modin.pandas.DataFrame
- modin.experimental.xgboost.xgboost_ray._get_num_actors(num_actors=None)
Get number of actors to create.
- Parameters
num_actors (int, optional) – Desired number of actors. If None, an integer number of actors is computed under the condition of 2 CPUs per actor.
- Returns
Number of actors to create.
- Return type
int
- modin.experimental.xgboost.xgboost_ray._split_data_across_actors(actors: List, set_func, X_parts, y_parts)
Split row partitions of data between actors.
- Parameters
actors (list) – List of used actors.
set_func (callable) – The function for setting data in actor.
X_parts (list) – Row partitions of X data.
y_parts (list) – Row partitions of y data.
- modin.experimental.xgboost.xgboost_ray._map_predict(booster, part, columns, dmatrix_kwargs={}, **kwargs)
Run prediction on a remote worker.
- Parameters
booster (xgboost.Booster or ray.ObjectRef) – A trained booster.
part (pandas.DataFrame or ray.ObjectRef) – Partition of full data used for local prediction.
columns (list or ray.ObjectRef) – Columns for the result.
dmatrix_kwargs (dict, optional) – Keyword parameters for xgb.DMatrix.
**kwargs (dict) – Other parameters are the same as for xgboost.Booster.predict.
- Returns
ray.ObjectRef with partial prediction.
- Return type
ray.ObjectRef