Build Accelerated Kubeflow Pipelines with XGBoost on an Azure Confidential Computing Cluster

Accelerate Your Kubeflow Pipelines with Intel® Optimization for XGBoost and Intel® oneDAL

6 min readAug 21, 2023

[Re-published with permission from Intel]

This tutorial picks up where the article Install Kubeflow with Confidential Computing VMs on Microsoft Azure left off. After installing Kubeflow* on your Microsoft Azure* Kubernetes Services (AKS) cluster, this tutorial will walk through how to build and deploy a Kubeflow pipeline to predict the probability of a loan default using Intel® Optimization for XGBoost* and Intel® oneAPI Data Analytics Library (oneDAL) to accelerate training and inference. The full code for this tutorial can be found here. Figure 1 shows a graph of the full pipeline.

Figure 1: Sample Kubeflow pipeline. Image from author.

The Kubeflow pipeline consists of the following 7 components:

Load data: Load the dataset (credit_risk_dataset.csv) from the URL specified in the pipeline run parameters, and performs synthetic data augmentation. Download the Kaggle* Credit Risk Dataset.
Create training and test sets: Split the data into training and test sets of an approximately 75:25 split for model evaluation.
Preprocess features: Transform the categorical features of the of the training and test sets by using one-hot encoding, in addition to imputing missing values and power-transforming numerical features.
Train XGBoost model: This component trains an XGBoost model using the accelerations provided by the Intel Optimizations for XGBoost.
Convert XGBoost model to daal4py: Convert the XGBoost model to an inference-optimized daal4py classifier.
daal4py Inference: Compute predictions using the inference-optimized daal4py classifier and evaluate model performance. It returns an output summary of the precision, recall, and F1 score for each class, as well as the area under the curve (AUC) and accuracy score of the model.
Plot ROC Curve: Perform model validation on the test data and generate a plot of the receiver operating characteristic (ROC) curve.

The code for the pipeline can be found here: intel-xgboost-daal4py-pipeline-azure.py. Before running the Python* script, ensure that you have installed version 2.0 or above of the Kubeflow Pipelines SDK. To update your SDK version, you can use:

pip install -U kfp

Intel® Software Guard Extensions

The Kubeflow pipeline leverages secure and confidential computing nodes using Intel® Software Guard Extensions (Intel® SGX) virtual machines on the Azure Kubernetes cluster. The pipeline pods are assigned to an Intel SGX node by adding a nodeSelector to the pipeline task. The node selector will look for a node with a matching label key/value pair when scheduling the pods. When creating the Intel SGX node pool on the AKS cluster, the following node label parameter can be used:

--labels intelvm=sgx

This label can be referenced in the pipeline code for scheduling pods onto an Intel SGX node. The code below will assign the data preprocessing pipeline task to an Intel SGX node:

from kfp import kubernetes
 
kubernetes.add_node_selector(task = preprocess_features_op,  
label_key = 'intelvm', label_value = 'sgx')

Pipeline Components

The first component in the pipeline, load_data, downloads the credit_risk_dataset.csv from the URL provided in the pipeline run parameters and synthetically augments the data to the size specified. It then saves the new dataset as an output artifact in the Kubeflow MinIO* volume.

This dataset is then read into the next step of the component, create_train_test_set, which subsets the data into an approximately 75:25 split for model evaluation. The training and test sets, X_train, y_train, X_test and y_test, are then saved as output dataset artifacts.

In the preprocess_features component, a data preprocessing pipeline is created to prepare the data for the XGBoost model. This component loads the X_train and X_test files from the MinIO storage and transforms the categorical features using one-hot encoding, imputes missing values, and power-transforms the numerical features.

The next component then trains the XGBoost model. Using XGBoost on Intel CPUs takes advantage of software accelerations powered by oneAPI, without requiring any code changes. To get the highest amount of optimization for Intel CPUs, ensure that you are using the latest version of XGBoost. In the initial testing of incremental training updates of size 1 million, Intel® Optimization for XGBoost v1.4.2 offers up to 1.54x speedup over stock XGBoost v0.81. The following code snippet is implemented in the train_xgboost_model component:

# Create the XGBoost DMatrix, which is optimized for memory efficiency and training speed 
dtrain = xgb.DMatrix(X_train.values, y_train.values) 
 
# Define the model parameters 
params = {"objective": "binary:logistic", 
"eval_metric": "logloss", 
"nthread": 4,  # num_cpu 
"tree_method": "hist", 
"learning_rate": 0.02, 
"max_depth": 10, 
"min_child_weight": 6, 
"n_jobs": 4,  # num_cpu, 
"verbosity": 1}     
 
# Train XGBoost model 
clf = xgb.train(params = params,
dtrain = dtrain,
num_boost_round = 500)

To further optimize the model prediction speed, we convert the trained XGBoost model into an inference-optimized daal4py classifier in the convert_xgboost_to_daal4py. daal4py is the Python API of the oneAPI Data Analytics Library (oneDAL). In testing batch inference of size 1 million, oneDAL provided a speedup of up to 4.44x. You can convert the XGBoost model to daal4py with one line of code:

# Convert XGBoost model to daal4py  
daal_model = d4p.get_gbt_model_from_xgboost(clf)

Then use the daal4py model to evaluate the model’s performance on the test set in the daal4py_inference component.

# Compute binary class labels and probabilities  
daal_prediction = d4p.gbt_classification_prediction(
nClasses = 2,  
resultsToEvaluate = "computeClassLabels|computeClassProbabilities" 
).compute(X_test, daal_model)

We compute both the binary class labels and probabilities by calling the gbt_classification_prediction method in the code above. You can also calculate the log probabilities with this function. This component will return the classification report in CSV format, as well as two Kubeflow metrics artifacts: the area under the curve and the accuracy of the model. These can be viewed in the Visualization tab of the metrics artifact.

In the last component of the pipeline, plot_roc_curve, we load the prediction data with the probabilities and the true class labels to calculate the ROC curve using the CPU-accelerated version from the Intel® Extension for Scikit-Learn*. The results are stored as a Kubeflow ClassificationMetric Artifact, which can be viewed in the Visualization tab of the roc_curve_daal4py artifact. Once your pipeline has finished running, you should see a similar ROC plot as in Figure 2.

Figure 2: ROC Curve from Kubeflow pipeline. Image from author.

Voila! We have successfully run an optimized Kubeflow pipeline with XGBoost and on an Intel SGX node on Microsoft Azure. Be sure to check out the resources below for help on your implementation.

Additional Resources

Access the full source code on GitHub here.
Register for Office Hours here for help on your implementation.
Learn more about all of our Intel Cloud Optimization Modules here.
Come chat with us on our Intel DevHub Discord server to keep interacting with fellow developers.
View the full performance results of the Loan Default Reference Kit here.

References

This work was re-published for Medium from Kelli Belcher’s article found here.

Want to Connect?

Get in touch with me on LinkedIn if you have any questions.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.