Leverage Intel Optimizations in Scikit-Learn
Better SVM Performance for Model Training and Inference
The support vector machine (SVM) is a machine learning (ML) algorithm that achieves high accuracy on many practical tasks. However, the quadratic complexity of the algorithm makes SVM compute-intensive, so it is important to use an optimized software stack. In this blog, I will compare ML software for CPUs and GPUs using publicly available datasets running on different AWS EC2 instances.
I showed in a previous blog that Intel Extension for Scikit-learn provides a fast CPU implementation of SVM, outperforming the stock scikit-learn and ThunderSVM implementations.
From Hours to Minutes: 600x Faster SVM
Patching scikit-learn for Better Machine Learning Performance
Intel Extension for Scikit-learn contains drop-in replacement patching functionality for scikit-learn. Let’s do another performance comparison, this time using probabilistic support vector classification (SVC) (Table 1). In practice, using Intel Extension for Scikit-learn reduces the training time from 14 hours to 10 minutes (an 84x speedup) for the covertype dataset and speeds prediction over 1000x for the codrnanorm dataset. This is pretty astonishing considering that there’s no loss in model quality.
RAPIDS cuML is the fastest SVM implementation for NVIDIA GPUs, so we will use it to compare performance against Intel and AMD CPUs. To enable the optimizations of Intel Extension for Scikit-learn, add just two lines of code before the usual scikit-learn imports:
from sklearnex import patch_sklearn
patch_sklearn()  # the start of the user's code
from sklearn.svm import SVC
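To show the patching workflow end to end, here is a minimal, self-contained sketch. The synthetic dataset, split, and SVC parameters are illustrative assumptions, not the benchmark configuration; the try/except lets the snippet fall back to stock scikit-learn if the extension is not installed.

```python
# Hedged sketch: patch scikit-learn, then train a probabilistic SVC.
# The synthetic dataset below is illustrative, not the benchmark data.
try:
    from sklearnex import patch_sklearn  # Intel Extension for Scikit-learn
    patch_sklearn()  # must run before the scikit-learn imports below
except ImportError:
    pass  # fall back to stock scikit-learn if the extension is absent

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True enables the internal calibration used for predict_proba
clf = SVC(probability=True, random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # one row per test sample, one column per class
```

Because patching happens before the scikit-learn imports, the rest of the script is unchanged whether or not the extension is present.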
You can get the Intel Extension for Scikit-learn by downloading the Intel oneAPI AI Analytics Toolkit (AI Kit). If you prefer to get the optimizations independently of the AI Kit, you can use either Anaconda Cloud (Conda-Forge channel) or PyPI:
conda install scikit-learn-intelex -c conda-forge
pip install scikit-learn-intelex
Intel vs. AMD CPUs
This performance comparison used the following AWS EC2 instances: c5.24xlarge (2nd Generation Intel Xeon Scalable processor) and c5a.24xlarge (AMD EPYC Rome). We compared the performance of Intel Extension for Scikit-learn across these instances without any environment or code changes (Figure 1). Performance on the Xeon instance is better than on the EPYC instance: based on the geometric mean, it is 1.85x faster during training and 3.1x faster during prediction. The improvement can be explained by the AVX-512 instructions and faster NUMA interconnect of the Intel architecture, both of which Intel Extension for Scikit-learn exploits. Moreover, despite its higher hourly price, the price:performance ratio of the c5.24xlarge instance ($4.08/hour) is significantly better than that of the c5a.24xlarge instance ($3.70/hour): to match the work the Xeon instance does in one hour, the EPYC instance would cost $3.70/hour * 1.85 hours = $6.85 for training and $3.70/hour * 3.1 hours = $11.47 for prediction.
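The cost arithmetic above can be sketched directly. The hourly prices and geometric-mean speedups are taken from the text; the assumption (labeled in the comments) is that the speedups hold uniformly across the whole workload.

```python
# Cost to match one hour of Xeon work, using the hourly prices and
# geometric-mean speedups quoted above. Assumption: the speedups are
# taken to hold uniformly across the entire workload.
xeon_price = 4.08  # c5.24xlarge, $/hour
epyc_price = 3.70  # c5a.24xlarge, $/hour
train_slowdown, predict_slowdown = 1.85, 3.1  # EPYC time relative to Xeon

epyc_train_cost = epyc_price * train_slowdown      # ~$6.85 vs. $4.08 on Xeon
epyc_predict_cost = epyc_price * predict_slowdown  # ~$11.47 vs. $4.08 on Xeon
print(epyc_train_cost, epyc_predict_cost)
```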
Intel CPU vs. NVIDIA GPU
This performance comparison used the same AWS EC2 c5.24xlarge Xeon instance and a p3.2xlarge instance (NVIDIA V100). We compared Intel Extension for Scikit-learn on the Xeon to RAPIDS cuML on the V100 (Figure 2). Based on the geometric mean, the Xeon is 2.11x faster for training and 2.74x faster for prediction. As in the Xeon vs. EPYC comparison, the price:performance ratio of the c5.24xlarge instance ($4.08/hour) is significantly better than that of the p3.2xlarge instance ($3.06/hour): to match the work the Xeon instance does in one hour, the V100 instance would cost $3.06/hour * 2.11 hours = $6.46 for training and $3.06/hour * 2.74 hours = $8.38 for prediction.
We see why the Intel implementation has an advantage. During both training and prediction, we have to work with multiple SVM models. The multiclass task uses the one-vs-one method, which requires k(k-1)/2 binary models for k classes. Probability prediction requires training five more copies via CalibratedClassifierCV. This is easy to parallelize on CPUs and is implemented in Intel Extension for Scikit-learn. It is much harder to implement on GPUs because it requires continuous data synchronization in global memory.
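To make the model count concrete, here is a small sketch; the four-class toy dataset is an assumption for illustration. With k classes, one-vs-one fits k(k-1)/2 binary SVMs, and 5-fold calibration fits five copies of the whole classifier.

```python
# Why probabilistic multiclass SVC trains so many models:
# CalibratedClassifierCV with cv=5 fits 5 copies of the classifier,
# and each multiclass SVC internally solves k*(k-1)/2 one-vs-one
# binary problems. The 4-class toy data below is illustrative only.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

k = 4  # number of classes in this toy example
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=k, random_state=0)

calibrated = CalibratedClassifierCV(SVC(), cv=5).fit(X, y)
n_copies = len(calibrated.calibrated_classifiers_)  # 5 fitted copies
n_binary = k * (k - 1) // 2                         # 6 one-vs-one SVMs each
print(n_copies, n_binary, n_copies * n_binary)      # total binary subproblems
```

Each of these binary subproblems is an independent fit, which is why the workload parallelizes naturally across CPU cores.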
SVM also has sequential regions of code (e.g., the SMO solver) that cannot execute in parallel, which makes efficient execution on GPUs difficult. However, Intel Extension for Scikit-learn applies smart data blocking to maximize cache reuse. This allows the Xeon to outperform the V100.
Using Intel Extension for Scikit-learn on Intel Xeon processors saves a lot of time and money on SVM model training and inference. It is up to 1000x faster than stock scikit-learn with no loss in accuracy, and it requires no code changes. The competitive performance advantages are also clear (Figures 1 and 2). Xeon shows a 1.85x speedup for training and a 3.1x speedup for prediction compared to AMD EPYC. It shows a 2.11x speedup for training and a 2.74x speedup for prediction compared to NVIDIA V100. In both cases, the price:performance ratio of the Xeon instance is significantly better than that of the EPYC or V100 instances.
Hardware and Software Configurations
All tests were performed by the author on March 29, 2021.
Benchmarks are available in the scikit-learn_bench repository. Run the following command to reproduce the results with Intel Extension for Scikit-learn:
python runner.py --config configs/svm/svc_proba_sklearn.json
Run the following command to reproduce the results with RAPIDS cuML:
python runner.py --config configs/svm/svc_proba_cuml.json --no-intel-optimized
Notices and Disclaimers
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available options. Learn more at www.Intel.com/PerformanceIndex.
Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.