Artificial Intelligence

Scalable AutoXGBoost Using Analytics Zoo AutoML

Build Machine Learning Pipelines with Less Effort

Intel Analytics Software
Jul 6, 2021

Authors: Wesley Du (wesley.du@intel.com), Ding Ding (ding.ding@intel.com), Shan Yu (shan.yu@intel.com), Yabai Hu (yabai.hu@intel.com), Shengsheng Huang (shengsheng.huang@intel.com), and Jason Dai (jason.dai@intel.com)

Machine learning (ML) is widely used in many real-world applications like computer vision, natural language processing, and time series forecasting, but choosing the right ML model, training it, and tuning it for best performance is a tedious and time-consuming process. Consequently, automated ML (AutoML) is gaining popularity. AutoML is the process of automating the tasks in the ML pipeline, from raw-data ingestion to model deployment. It allows non-experts to create ML solutions more quickly and with less effort, without sacrificing model accuracy.

This blog will introduce the Analytics Zoo AutoML framework and demonstrate its use with an AutoXGBoost example. We compared AutoXGBoost to a similar XGBoost and Ray Tune solution running on an Nvidia A100: training with AutoXGBoost is ~1.7x faster by elapsed time, and the final model is more accurate.

Analytics Zoo AutoML

Analytics Zoo is a scalable, open-source platform for end-to-end data analytics. It implements an AutoML framework that provides users with an easy and efficient way to build ML applications by automating various stages of the pipeline including feature generation, model selection, exploring model configurations, and eventually suggesting a model with the best accuracy. There are currently four basic components in the Analytics Zoo AutoML framework: FeatureTransformer, Model, SearchEngine, and Pipeline (Figure 1).

Figure 1. Analytics Zoo AutoML Framework

The Analytics Zoo AutoML framework uses Ray Tune (running on top of RayOnSpark) for hyperparameter search. In our implementation, hyperparameter search covers both feature engineering and modeling. For feature engineering, the search engine selects the best subset of features that are automatically generated by various feature generation tools. For modeling, the search engine searches for hyperparameters such as the number of nodes per layer, the learning rate, etc. For building and training the models, popular deep learning frameworks like TensorFlow and Keras are used. In addition, we use Apache Spark and Ray for distributed execution where necessary.

Figure 2. Analytics Zoo AutoML workflow

A typical AutoML training workflow proceeds as follows:

  1. A FeatureTransformer and a Model are first instantiated. A SearchEngine is then instantiated and configured with the FeatureTransformer and Model, along with search presets (which specify how the hyperparameters are searched, the reward metric, etc.).
  2. The SearchEngine runs the search procedure. Each run will generate several trials at a time and distribute the trials to a cluster using Ray Tune. Each trial performs feature engineering and the model-fitting process with a different combination of hyperparameters and returns the specified metrics.
  3. When all trials are finished, the best set of hyperparameters and an optimized model are retrieved. They are used to generate the FeatureTransformer and Model, which are in turn used to compose a Pipeline that can be saved to file and loaded later for inference and/or incremental training.
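The three-step workflow above can be sketched in plain Python. This is a minimal, hypothetical sketch: every name here (SEARCH_SPACE, run_trial, search) is an illustrative stand-in, not the Analytics Zoo API, and the "metric" is a mock value rather than a real model score.

```python
# Hypothetical sketch of the AutoML workflow; not the Analytics Zoo API.
import random

# Step 1: a search preset specifying how hyperparameters are searched.
SEARCH_SPACE = {"learning_rate": [0.01, 0.1, 0.3],
                "num_features": [2, 4, 8]}

def run_trial(params):
    # Stand-in for feature engineering plus model fitting; returns a
    # mock error metric to be minimized.
    return abs(params["learning_rate"] - 0.1) + 0.001 * params["num_features"]

def search(space, n_trials=6, seed=42):
    # Step 2: generate several trials, each with a different sampled
    # combination of hyperparameters, and collect the returned metrics.
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in space.items()}
        results.append((run_trial(params), params))
    # Step 3: when all trials finish, keep the best hyperparameter set.
    return min(results, key=lambda r: r[0])

best_metric, best_params = search(SEARCH_SPACE)
```

In the real framework the trials are distributed across a cluster by Ray Tune, and the best hyperparameters are used to compose a Pipeline rather than simply returned.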

AutoXGBoost with Analytics Zoo AutoML

XGBoost is a popular gradient boosting library that provides excellent model accuracy. We have implemented AutoXGBoost in the Analytics Zoo AutoML framework to automatically fit and optimize XGBoost models. The following steps show how to use AutoXGBoost to automate hyperparameter optimization, after which we will compare training time and model accuracy to a similar XGBoost and Ray Tune solution.

Step 1: Data Loading and Preprocessing

Our comparison will use a large, publicly available airline dataset containing flight arrival and departure details for commercial flights within the USA. The model will predict whether a flight's arrival will be delayed. The necessary fields from the airline dataset are loaded into a pandas dataframe. A new field, ArrDelayBinary, is added to each flight record and set to true if the flight's arrival delay exceeds delayed_threshold, and false otherwise.
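The labeling step can be sketched as follows. Note this is illustrative only: a tiny in-memory frame stands in for the airline dataset, and the ArrDelay column name and the threshold value are assumptions based on the description above, not the actual dataset schema.

```python
# Illustrative sketch of deriving ArrDelayBinary with pandas; the sample
# rows, the ArrDelay column name, and the threshold are assumed values.
import pandas as pd

delayed_threshold = 10  # minutes; assumed value for illustration

df = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "ArrDelay": [3, 25, -5],  # arrival delay in minutes
})

# ArrDelayBinary is true when the arrival delay exceeds the threshold.
df["ArrDelayBinary"] = df["ArrDelay"] > delayed_threshold
```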

Step 2: Prepare Context

We need to prepare a context before using Analytics Zoo AutoML to train a model:

from zoo.orca import init_orca_context, stop_orca_context

init_orca_context(cluster_mode="local", cores=112, memory='20g',
                  init_ray_on_spark=True)

Step 3: Create an AutoXGBoost Classifier

We’re building a classification model, so we’ll create an AutoXGBoost classifier with a few fixed hyperparameters for XGBoost:

config = {"tree_method": 'hist', "learning_rate": 0.1, "gamma": 0.1,
          "min_child_weight": 30, "reg_lambda": 1,
          "scale_pos_weight": 2, "subsample": 1, "n_jobs": 56}

auto_xgb_clf = AutoXGBClassifier(cpus_per_trial=4,
                                 name="auto_xgb_classifier",
                                 **config)

Step 4: Train the AutoXGBoost Classifier

We train the classifier on the input data by calling fit, which launches the AutoML hyperparameter search:

search_space = {"n_estimators": hp.grid_search([50, 1000]),
                "max_depth": hp.randint(2, 15)}

auto_xgb_clf.fit(data=(X_train, y_train),
                 validation_data=(X_val, y_val),
                 metric="error",
                 metric_mode="min",
                 n_sampling=2,
                 search_space=search_space)
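To build intuition for the search space above, here is a rough sketch of how it expands into trials. It reflects our understanding of Ray Tune semantics (grid_search values are fully enumerated, and each grid point is crossed with n_sampling random draws of the randint parameter), but the function below is an illustrative stand-in, not the Ray Tune or Analytics Zoo API.

```python
# Hypothetical sketch of trial expansion; not the actual Ray Tune API.
import random

def expand_trials(grid_values, depth_low, depth_high, n_sampling, seed=0):
    rng = random.Random(seed)
    trials = []
    for n_estimators in grid_values:      # grid_search: every listed value
        for _ in range(n_sampling):       # randint: n_sampling random draws
            trials.append({"n_estimators": n_estimators,
                           "max_depth": rng.randrange(depth_low, depth_high)})
    return trials

# Mirrors hp.grid_search([50, 1000]) and hp.randint(2, 15) with n_sampling=2,
# yielding 2 grid values x 2 samples = 4 trials.
trials = expand_trials([50, 1000], 2, 15, n_sampling=2)
```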

Step 5: Retrieve the Best Model from AutoXGBoost

Finally, we retrieve the best model for evaluation:

best_model = auto_xgb_clf.get_best_model()

accuracy = best_model.evaluate(X_val, y_val, metrics=["accuracy"])

AutoXGBoost Performance

The Analytics Zoo AutoXGBoost pipeline was verified on a dual-socket server with 3rd Generation Intel Xeon Scalable processors (Platinum 8368). With 30 training trials, accuracy on the validation dataset improves from 78% to 85%, with a total training time of 135 seconds (Figure 3). We compared this to XGBoost with Ray Tune on an Nvidia A100 GPU using the same XGBoost classifier hyperparameters: after 30 trials, validation accuracy is 84% with a total training time of 230 seconds.

Figure 3. Performance improvement from AutoXGBoost on Intel Xeon processors

Performance tests were run by Intel using the following configurations:

Concluding Remarks

Analytics Zoo provides an AutoML framework that significantly reduces the effort required to build ML pipelines. In this blog, we demonstrated how AutoXGBoost in Analytics Zoo AutoML delivers great performance in both model training time and model accuracy.

You are encouraged to try the Analytics Zoo AutoML framework. For more information, visit the Analytics Zoo project at Github and take a look at the Analytics Zoo documentation.
