
Performance Optimizations for End-to-End AI Pipelines

Optimized Frameworks and Libraries for Intel Processors

Meena Arunachalam
6 min read · May 6, 2021


Modern artificial intelligence (AI) and machine learning (ML) applications perform a range of tasks that convert raw data into valuable insights. Data scientists create multiphase, end-to-end pipelines for AI and ML applications (Figure 1). The phases include data ingestion, data cleaning, data exploration, and feature engineering, followed by prototyping, model building, and finally, model deployment. The phases are often repeated many times, and it may be necessary to scale the entire pipeline across a cluster or deploy it to the cloud.

Figure 1. End-to-end analytics pipeline

The Intel oneAPI AI Analytics Toolkit (AI Kit) provides high-performance APIs and Python packages to accelerate the phases of these pipelines (Figure 2).

Figure 2. Intel AI Analytics Toolkit

Intel Distribution for Modin, backed by the OmniSci DB engine, provides a scalable drop-in replacement for the pandas API: changing a single line of code significantly improves the performance and scalability of DataFrame processing.
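As a minimal sketch (the file path and column names below are placeholders, not the benchmark code), the only change is the import line:

# Stock pandas would be: import pandas as pd
import modin.pandas as pd  # Intel Distribution for Modin: same API, parallel execution

# Everything after the import is unchanged pandas-style code.
# "census.csv" and the column names are hypothetical stand-ins.
df = pd.read_csv("census.csv")
df = df.dropna(subset=["EDUC"])
summary = df.groupby("YEAR")["INCTOT"].mean()
print(summary.head())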

For classical ML training and inference, the AI Kit contains Intel Extension for Scikit-learn to accelerate common estimators (e.g., logistic regression, singular value decomposition, and principal component analysis), transformers, and clustering algorithms (e.g., k-means and DBSCAN).
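A hedged sketch of how the acceleration is typically enabled, assuming the scikit-learn-intelex package, which provides patch_sklearn() (the benchmark configuration in this article used daal4py-based patching, which works the same way): call the patch before importing the estimators, and the scikit-learn API is otherwise unchanged.

from sklearnex import patch_sklearn
patch_sklearn()  # must run before the scikit-learn imports below

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data purely for illustration
X, _ = make_blobs(n_samples=100_000, n_features=20, centers=8, random_state=0)

# k-means is one of the accelerated algorithms; the call is standard scikit-learn
model = KMeans(n_clusters=8, random_state=0).fit(X)
print(model.inertia_)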

For gradient boosting, Intel has also optimized the XGBoost and CatBoost libraries, which provide efficient parallel tree boosting to solve many data science problems quickly and accurately.
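For illustration, a minimal XGBoost training sketch on placeholder data; the histogram tree method ("hist") is the usual CPU code path, and the data and parameters here are arbitrary rather than taken from the benchmarks.

import numpy as np
import xgboost as xgb

# Placeholder data for illustration only
rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 30))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "hist",  # CPU histogram-based tree construction
    "max_depth": 6,
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
predictions = booster.predict(dtrain)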

Let’s look at two examples where the AI Kit helps data scientists accelerate their AI pipelines:

  1. Census: This workload trains a ridge regression model to predict education level using U.S. census data (1970 to 2010, published by IPUMS).
  2. PLAsTiCC Astronomical Classification: This workload is an open data challenge on Kaggle that uses simulated astronomical time-series data to classify objects in the night sky.

Both workloads have three broad phases:

  1. Ingestion loads the numerical data into DataFrames.
  2. Preprocessing and Transformation runs a variety of ETL operations to clean and prepare the data for modeling, such as dropping columns, dealing with missing values, type conversions, arithmetic operations, and aggregation.
  3. Data Modeling splits the data into training and test sets, builds and trains the model, validates it, and runs inference (a minimal sketch combining the three phases follows this list).
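The sketch below strings the three phases together in the spirit of the Census workload. It is illustrative only: the file path, column names, and feature/target choices are hypothetical stand-ins, not the actual benchmark code.

import modin.pandas as pd                 # Phases 1-2: ingestion and ETL on Modin
from sklearnex import patch_sklearn
patch_sklearn()                           # Phase 3: route scikit-learn to optimized kernels

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Ingestion: "census.csv" and all column names below are placeholders
df = pd.read_csv("census.csv")

# Preprocessing and transformation: drop an unused column, handle missing values
df = df.drop(columns=["SERIAL"]).dropna(subset=["EDUC", "INCTOT", "AGE"])
X = df[["INCTOT", "AGE"]].astype("float64").to_numpy()
y = df["EDUC"].astype("float64").to_numpy()

# Data modeling: split, train a ridge regressor, validate with a simple metric
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))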

Figure 3 shows the breakdown by phase for both workloads, which illustrates the importance of optimizing each phase to speed up the entire end-to-end pipeline.

Figure 3. Breakdown by phase for each workload

Figures 4 and 5 show the relative performance and the subsequent speedups for each phase using the Intel-optimized software stack (shown in blue) compared to the stock software (shown in orange). On 2nd-Generation Intel Xeon Scalable processors, the optimized software stack gives a 10x speedup on Census and an 18x speedup for PLAsTiCC compared to the stock software stack.

Figure 4. End-to-end performance of the Census workload showing the speedup for each phase
Figure 5. End-to-end performance of the PLAsTiCC workload showing the speedup for each phase

On Census, using the Intel Distribution for Modin instead of pandas gives a 7x speedup for read_csv and a 4x speedup for ETL operations. For training and prediction, using the Intel-optimized scikit-learn instead of the stock package gives a 9x speedup for the train_test_split function and a 59x speedup for training and inference. On PLAsTiCC, using the Intel Distribution for Modin gives a 69x speedup for read_csv and a 21x speedup for ETL operations. These speedups come from a variety of optimizations in the AI Kit, including parallelization, vectorization, core scaling, improved memory layouts, cache reuse, cache-friendly blocking, efficient use of memory bandwidth, and more effective use of the processor instruction sets.

Figures 6 and 7 show the end-to-end performance of the two workloads on 2nd- and 3rd-Generation Intel Xeon Scalable processors compared to 2nd- and 3rd-Generation AMD EPYC processors and Nvidia Tesla V100 and A100 GPUs. The same optimized software stack is used on the Intel and AMD CPUs, while the RAPIDS stack is used on the Nvidia GPUs. The complete hardware and software configurations are listed below.

Figure 6. Competitive performance for all phases of the Census pipeline
Figure 7. Competitive performance for all phases of the PLAsTiCC pipeline

In this article, we demonstrate a significant performance boost (roughly 10x to 18x speedups) on Intel Xeon processors using the optimized packages included in the AI Kit, which are simple drop-in replacements for stock data analytics software. The results also show that CPUs and GPUs excel in different phases of the pipelines, but the 3rd-Generation Intel Xeon Platinum 8380 CPU outperforms the Nvidia V100 and is competitive with the Nvidia A100, while also being cheaper and more power-efficient. These observations reinforce the point that generality is critical in data analytics, since different phases of the pipeline favor different hardware.

You can get Modin, XGBoost, Intel Extension for Scikit-learn, and other optimized Python packages for Intel architectures through many common channels, such as Intel's website, YUM, APT, and Anaconda. Select and download the distribution package that you prefer and follow the Get Started Guide for post-installation instructions.

Hardware and Software Configurations

3rd Generation Intel Xeon Platinum 8380: dual-socket server, 40 cores per socket, 2.30 GHz base frequency, Intel Turbo mode enabled, hyperthreading enabled. OS: Ubuntu 20.04.1 LTS, 512 GB RAM (16x 32 GB 3200 MHz), kernel: 5.4.0-64-generic, microcode: 0x8d055260, BIOS: SE5C620.86B.OR.64.2021.10.3.02.0417, CPU governor: performance.

2nd Generation Intel Xeon Platinum 8280L: dual-socket server, 28 cores per socket, 2.70 GHz base frequency, Intel Turbo mode enabled, hyperthreading enabled. OS: Ubuntu 20.04.1 LTS, 384 GB RAM (12x 32 GB 2933 MHz), kernel: 5.4.0-65-generic, microcode: 0x4003003, BIOS: SE5C620.86B.OR.64.2020.51.2.04.0651, CPU governor: performance.

3rd Generation AMD EPYC 7763: dual-socket server, 64 cores per socket, 1.50 GHz base frequency, simultaneous multithreading enabled. OS: Red Hat Enterprise Linux 8.3 (Ootpa), 1024 GB RAM (16x 64 GB 3200 MHz), kernel: 4.18.0-240.el8.x86_64, microcode: 0xa001119, BIOS: Gigabyte version M03, CPU governor: performance.

2nd Generation AMD EPYC 7742: dual-socket server, 64 cores per socket, 1.50 GHz base frequency, simultaneous multithreading enabled. OS: Ubuntu 20.04.1 LTS, 512 GB RAM (16x 32 GB 3200 MHz), kernel: 5.4.0-62-generic, microcode: 0x8301038, BIOS: American Megatrends Inc 2.1c, CPU governor: performance.

Nvidia Tesla A100 GPU: part of a DGX A100 system, dual-socket 2nd Generation AMD EPYC 7742 host CPU. OS: Ubuntu 18.04.5 LTS, 512 GB RAM (16x 32 GB 3200 MHz), kernel: 5.4.0-42-generic, microcode: 0x8301034, BIOS revision 0.23, CPU governor: performance.

Nvidia Tesla V100 GPU: 32 GB GPU memory, dual-socket 2nd Generation Intel Xeon Platinum 8268 host CPU. OS: CentOS Linux release 7.8.2003, 384 GB RAM (12x 32 GB 2933 MHz), kernel: 5.4.69, microcode: 0x5003003, BIOS: SE5C620.86B.OR.64.2020.51.2.04.0651, CPU governor: performance.

CPU SW: Scikit-learn 0.24.1 accelerated by daal4py 2021.2, modin 0.8.3, omniscidbe v5.4.1, Pandas 1.2.2, XGBoost 1.3.3, Python 3.9.7

GPU SW: Nvidia RAPIDS 0.17, CUDA Toolkit 11.0.221, Python 3.7.9
