Distributed XGBoost on Modin ============================ Modin provides an implementation of `distributed XGBoost`_ machine learning algorithm on Modin DataFrames. Please note that this feature is experimental and behavior or interfaces could be changed. Install XGBoost on Modin ------------------------ Modin comes with all the dependencies except ``xgboost`` package by default. Currently, distributed XGBoost on Modin is only supported on the Ray execution engine, therefore, see the :doc:`installation page ` for more information on installing Modin with the Ray engine. To install ``xgboost`` package you can use ``pip``: .. code-block:: bash pip install xgboost XGBoost Train and Predict ------------------------- Distributed XGBoost functionality is placed in ``modin.experimental.xgboost`` module. ``modin.experimental.xgboost`` provides a drop-in replacement API for ``train`` and ``Booster.predict`` xgboost functions. .. automodule:: modin.experimental.xgboost :noindex: :members: train .. autoclass:: modin.experimental.xgboost.Booster :noindex: :members: predict ModinDMatrix ------------ Data is passed to ``modin.experimental.xgboost`` functions via a Modin ``DMatrix`` object. .. automodule:: modin.experimental.xgboost :noindex: :members: DMatrix Currently, the Modin ``DMatrix`` supports ``modin.pandas.DataFrame`` only as an input. A Single Node / Cluster setup ----------------------------- The XGBoost part of Modin uses a Ray resources by similar way as all Modin functions. To start the Ray runtime on a single node: .. code-block:: python import ray # Look at the Ray documentation with respect to the Ray configuration suited to you most. ray.init() If you already had the Ray cluster you can connect to it by next way: .. code-block:: python import ray ray.init(address='auto') A detailed information about initializing the Ray runtime you can find in `starting ray`_ page. Usage example ------------- In example below we train XGBoost model using `the Iris Dataset`_ and get prediction on the same data. All processing will be in a `single node` mode. .. code-block:: python from sklearn import datasets import ray # Look at the Ray documentation with respect to the Ray configuration suited to you most. ray.init() # Start the Ray runtime for single-node import modin.pandas as pd import modin.experimental.xgboost as xgb # Load iris dataset from sklearn iris = datasets.load_iris() # Create Modin DataFrames X = pd.DataFrame(iris.data) y = pd.DataFrame(iris.target) # Create DMatrix dtrain = xgb.DMatrix(X, y) dtest = xgb.DMatrix(X, y) # Set training parameters xgb_params = { "eta": 0.3, "max_depth": 3, "objective": "multi:softprob", "num_class": 3, "eval_metric": "mlogloss", } steps = 20 # Create dict for evaluation results evals_result = dict() # Run training model = xgb.train( xgb_params, dtrain, steps, evals=[(dtrain, "train")], evals_result=evals_result ) # Print evaluation results print(f'Evals results:\n{evals_result}') # Predict results prediction = model.predict(dtest) # Print prediction results print(f'Prediction results:\n{prediction}') .. _Dataframe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html .. _`starting ray`: https://docs.ray.io/en/master/starting-ray.html .. _`the Iris Dataset`: https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html .. _`distributed XGBoost`: https://medium.com/intel-analytics-software/distributed-xgboost-with-modin-on-ray-fc17edef7720