
Distributed XGBoost with Modin on Ray

Scale pandas and XGBoost by Changing a Single Line of Code


Authors: Alexey Prutskov and Yaroslav Igoshev, Intel Corporation

This blog demonstrates distributed XGBoost training and prediction. XGBoost is a gradient boosting library that focuses on tree models. It supports distributed training, in which a model is trained with a subset of data on each worker. This is enabled by syncing the histogram of gradients for each tree node across the workers. XGBoost by itself can’t perform feature engineering or allocate workers. Dedicated libraries and higher-level distributed computing frameworks like Modin and Ray are required.

Modin is a scalable DataFrame library that uses the pandas API:

We have chosen to partition Modin DataFrames along column and row indices for maximum flexibility and scalability (Figure 1). Currently, each partition’s memory format is a pandas DataFrame. Future Modin versions will support additional in-memory formats for the backend (e.g., Arrow tables).

Figure 1. Partitioning Modin DataFrames

Modin can accelerate pandas queries on both single- and multi-node (i.e., clusters) systems by changing a single line of code:

# import pandas as pd
import modin.pandas as pd

The performance advantage of Modin DataFrames is demonstrated using the following common operations on the HIGGS dataset (a short combined code sketch follows below):

  • Reading a CSV file (Figure 2): df = pd.read_csv("HIGGS.csv")
  • Grouping data (Figure 3): df.groupby(df.columns[0]).count()
  • Finding a maximum (Figure 4): df.max()
Figure 2. Reading data from a CSV file
Figure 3. Computing the count of each group
Figure 4. Finding the maximum value over the rows
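
Taken together, these operations look exactly as they would in pandas. A minimal sketch, assuming HIGGS.csv is available locally:

import modin.pandas as pd

df = pd.read_csv("HIGGS.csv")               # read the CSV file (Figure 2)
counts = df.groupby(df.columns[0]).count()  # count per group (Figure 3)
maxima = df.max()                           # maximum value over the rows (Figure 4)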

Modin XGBoost

In addition to distributing common DataFrame operations, Modin XGBoost provides a drop-in replacement API for xgboost.train, xgboost.Booster.predict and xgboost.DMatrix. This makes it possible to distribute XGBoost training and prediction by changing a single line of code:

# import xgboost as xgb
import modin.experimental.xgboost as xgb

Installation instructions for Modin and its XGBoost implementation can be found in the Modin documentation. Currently, Modin XGBoost uses Ray to distribute the DataFrame partitions among workers.

Modin XGBoost Training

As mentioned previously, Modin supports single- and multi-node execution; Ray was used for the latter. We followed the instructions in the Ray documentation to manually start a Ray cluster. The following example uses a three-node cluster. Once the cluster is set up, we can load the data:

import modin.pandas as pd
import ray
ray.init(address="auto")        # connect to the already-running Ray cluster
X_y = pd.read_csv("HIGGS.csv")  # read the dataset as a distributed Modin DataFrame

Figure 5 shows a possible data partitioning among the cluster nodes. Ray doesn’t ensure even data distribution, but this is easily fixed, as we’ll see.

Figure 5. Uneven distribution of the Modin DataFrame partitions among the cluster nodes

The next step is DMatrix construction, in which the data and labels are provided to the constructor as Modin DataFrames:

# Features: every column except the last; label: the last column
dmatrix = xgb.DMatrix(X_y.iloc[:, :-1], X_y.iloc[:, -1])

Now, we can start training:

model = xgb.train({}, dmatrix, num_actors=6)

In this example, six Ray actors are created to perform training in parallel. An actor is a stateful worker process whose methods can access and mutate that worker’s state (see the Ray documentation for more information). Each actor’s job is local training and prediction on its portion of the data. If the number of actors isn’t set via the num_actors parameter, it is computed automatically so that the available resources of the Ray cluster are used as evenly as possible. In this example, there are two actors per node, with two DataFrame partitions per actor (Figure 6).

Figure 6. Even distribution of the Modin DataFrame partitions among the nodes
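
If num_actors is omitted, the actor count is chosen automatically as described above; a minimal sketch under the same setup:

model = xgb.train({}, dmatrix)  # actor count derived from the Ray cluster's resources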

Next, the partitions in each actor are concatenated and a local XGBoost DMatrix is created (Figure 7). Each actor trains an XGBoost model on its DMatrix, and the results are then combined into a single model.

Figure 7. Even distribution of the DMatrix among the nodes

Modin XGBoost Prediction

Prediction is largely the same as training, with two differences. First, the layout of DataFrame partitions among actors is different (Figure 8, compared with Figure 6); this is necessary to preserve the correct order of local predictions. Second, actors do not need to synchronize with each other during prediction. Each actor stores its partial prediction result (a pandas DataFrame) in distributed storage, and the final result is a distributed Modin DataFrame assembled from the partial results of all actors.

Figure 8. Distribution of Modin DataFrame partitions during prediction
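
As a code sketch of the prediction step, continuing from the training example above (the predict call mirrors xgboost.Booster.predict; the exact signature may vary between Modin versions):

# model and dmatrix come from the training example above
predictions = model.predict(dmatrix)  # a distributed Modin DataFrame of predictions
print(predictions.head())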

Modin XGBoost Scalability

Modin XGBoost is designed to scale on a cluster, as Figures 9 and 10 demonstrate. Scalability isn’t perfect, but adding more nodes to the Ray cluster decreases training time and especially prediction time. Training and prediction both used the HIGGS benchmark running on an AWS EMR cluster of m5.4xlarge instances (16 vCPUs and 64 GB memory per instance). The following training parameters were used: tree_method="hist", eval_metric=["logloss", "error"], num_boost_round=400, evals=[(training_dm, "train")], and num_actors=2 per node.

Figure 9. Scalability of Modin XGBoost training on the HIGGS dataset
Figure 10. Scalability of Modin XGBoost prediction on the HIGGS dataset
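
For reference, a sketch of how these benchmark parameters map onto the training call, assuming HIGGS.csv is reachable from every node and the Ray cluster is already running as described above:

import ray
import modin.pandas as pd
import modin.experimental.xgboost as xgb

ray.init(address="auto")

X_y = pd.read_csv("HIGGS.csv")
training_dm = xgb.DMatrix(X_y.iloc[:, :-1], X_y.iloc[:, -1])

model = xgb.train(
    {"tree_method": "hist", "eval_metric": ["logloss", "error"]},
    training_dm,
    num_boost_round=400,
    evals=[(training_dm, "train")],
    num_actors=2 * len(ray.nodes()),  # two actors per node, as in the benchmark
)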

It is worth noting that Modin XGBoost is in active development with respect to functional coverage, scalability, and performance, and improvements are constantly being made. Currently, Modin XGBoost is implemented with the Ray execution engine, but support for the other execution engines used in Modin is being added. Information about supported features can be found in the Modin XGBoost documentation.
