Why Pay More for Machine Learning?
Accelerate Your Imbalanced Learning Workloads with Intel Extension for Scikit-learn
Ethan Glaser, Nikolay Petrov, Henry Gabb, and Jui Mhatre, Intel Corporation
A recent NVIDIA blog caught our eye with its misleading results. What’s the point of comparing an A100 GPU to a nine-year-old CPU (the Intel Xeon E5-2698 was launched in 2014 and has since been discontinued), or of comparing optimized CUDA code (the RAPIDS cuML library) to unoptimized, single-threaded Python code (stock scikit-learn with the imbalanced-learn library), unless you’re deliberately trying to inflate the GPU vs. CPU speedup? The imbalanced-learn library accepts scikit-learn-compatible estimators, so NVIDIA plugged in cuML estimators for acceleration. We can do the same with the optimized estimators in Intel Extension for Scikit-learn just by adding a call to patch_sklearn():
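# Patch scikit-learn before importing any estimators so the optimized versions are picked up
from sklearnex import patch_sklearn
patch_sklearn()
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.neighbors import NearestNeighbors
...
# Hand the patched nearest-neighbors estimator to imbalanced-learn; n_jobs=-1 uses all cores
nn = NearestNeighbors(n_neighbors=4, n_jobs=-1)
X_resampled, y_resampled = EditedNearestNeighbours(n_neighbors=nn).fit_resample(X, y)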
Let’s see what happens when we use an optimized version of scikit-learn on newer Xeon processors. Since we no longer have ready access to Xeon E5-2698 systems, we’ll use Google Cloud Platform (GCP) instances to run the same benchmarks that NVIDIA used in their blog. In general, you can choose a CPU that meets your performance and price requirements, so we’ll compare Intel Xeon-based instances at two price points to the A100 results (Table 1).
Performance Comparison
The Intel Extension for Scikit-learn delivers speedups over stock scikit-learn across the board on the same benchmarks that NVIDIA ran (Figure 1). The speedups range from ~2x up to ~140x, depending on the algorithm and parameters. Note that the stock scikit-learn library ran out of memory on the SMOTE and ADASYN “100 features, 5 classes” benchmarks. If performance matters, these results show that Intel Extension for Scikit-learn is a significant improvement over stock scikit-learn.
How does this compare to NVIDIA’s A100 results? Let’s take a look at the two algorithms where NVIDIA achieved the highest speedups over scikit-learn: SVMSMOTE and CondensedNearestNeighbours (Figure 2). These results show that our performance is within the same order of magnitude as cuML when a newer processor and optimized scikit-learn are used for the comparison. Intel Extension for Scikit-learn even outperforms cuML in some tests. Now, let’s talk about price.
Cost Comparison
It’s worth noting that the hourly cost of an a2-highgpu-1g A100 instance on GCP is 60% higher than that of the n2-highcpu-64 instance (Table 1). That means the A100 instance must deliver at least a 1.6x speedup over the Xeon Gold 6268CL (n2-highcpu-64) instance to be cost-competitive. (An A100 also consumes 1.7x and 1.2x more power than the Xeon E5-2696 v4 and Xeon Gold 6268CL, respectively, but we’ll put that aside for now because power consumption is baked into the instance cost.)
Let’s compare the price-to-performance ratios for the benchmarks selected by NVIDIA to see if the A100 instance justifies its premium price. The total cost (USD) of a benchmark run is simply the instance cost per hour (USD/hr) times the runtime (hr). A detailed cost comparison shows that running these benchmarks on the Xeon instance is often the more cost-effective option (Figure 3). In the charts below, each value is the cost of a benchmark on the A100 instance divided by its cost on the Xeon instance, so a value greater than one indicates that the benchmark is more expensive on the A100 instance. For example, a value of 1.29 means the A100 instance is 29% more expensive than the Xeon instance.
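To make the arithmetic concrete, here is a minimal sketch of the cost calculation described above. The function names, hourly prices, and runtimes are our own placeholders for illustration, not the measured values behind Figure 3; only the 60% price premium comes from the comparison above.

```python
def run_cost(price_per_hour, runtime_hours):
    """Total cost (USD) of one run: hourly price times runtime in hours."""
    return price_per_hour * runtime_hours


def cost_ratio(gpu_price, gpu_runtime, cpu_price, cpu_runtime):
    """GPU run cost divided by CPU run cost; above 1.0 means the GPU run costs more."""
    return run_cost(gpu_price, gpu_runtime) / run_cost(cpu_price, cpu_runtime)


# Placeholder inputs, NOT measured results.
cpu_price = 1.00               # hypothetical USD/hr for the Xeon instance
gpu_price = 1.60 * cpu_price   # 60% higher hourly price, as stated in the article
cpu_runtime = 0.50             # hypothetical runtime in hours
gpu_runtime = 0.40             # hypothetical runtime in hours (a 1.25x speedup)

ratio = cost_ratio(gpu_price, gpu_runtime, cpu_price, cpu_runtime)
print(f"cost ratio (A100 / Xeon): {ratio:.2f}")

# Break-even: with a 1.6x price premium, a 1.6x speedup brings the ratio to 1.0.
break_even = cost_ratio(gpu_price, cpu_runtime / 1.6, cpu_price, cpu_runtime)
print(f"cost ratio at a 1.6x speedup: {break_even:.2f}")
```

With these placeholder numbers the ratio works out to 1.28: a 1.25x speedup is not enough to cover a 1.6x price premium, so the GPU run costs 28% more.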
Benchmark cost varies depending on the algorithm and parameters used, but the results generally favor the Xeon instance: the geometric mean of the cost ratio is greater than one for four of the five algorithms, and the overall geometric mean is 1.36 (Table 2).
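For readers who want to summarize their own measurements the same way, the sketch below computes a geometric mean of cost ratios. The helper name and the sample ratios are hypothetical and are not the values in Table 2. The geometric mean is the natural average for ratios because a benchmark that is twice as expensive and one that is half as expensive cancel out to 1.0, whereas the arithmetic mean would be skewed upward.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space for stability."""
    values = list(values)
    if not values:
        raise ValueError("geometric_mean requires at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric_mean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical A100/Xeon cost ratios for one algorithm (placeholders, not Table 2 data).
hypothetical_ratios = [0.5, 1.0, 2.0, 4.0]

print(f"geometric mean:  {geometric_mean(hypothetical_ratios):.3f}")
print(f"arithmetic mean: {sum(hypothetical_ratios) / len(hypothetical_ratios):.3f}")

# A 2x-more-expensive run and a 2x-cheaper run balance out under the geometric mean.
balanced = geometric_mean([0.5, 2.0])
print(f"geometric mean of 0.5 and 2.0: {balanced:.3f}")
```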
Additionally, CPUs offer more flexibility in instance selection, which further improves efficiency. It is more cost-effective to select the smallest Xeon instance that can handle a given problem size while satisfying performance requirements and budget constraints. Figure 4 shows one such example for the two smallest benchmarks. These results demonstrate that it can be significantly cheaper to run on the hardware that best matches the needs of the model configuration. For example, running the two ADASYN benchmarks with Intel Extension for Scikit-learn on an e2-highcpu-8 instance costs only 1.5% and 2.1% of what it costs to run cuML on the A100 instance.
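The selection rule described above can be sketched in a few lines. This is only an illustration under assumptions of our own: the Candidate type, the instance names, and every price, memory size, and runtime below are made-up placeholders, and in practice the runtimes would come from benchmarking your workload on each instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    price_per_hour: float   # USD/hr
    memory_gb: float        # available memory
    runtime_hours: float    # runtime of the workload measured on this instance

    @property
    def run_cost(self):
        return self.price_per_hour * self.runtime_hours


def cheapest_capable(candidates, required_memory_gb, max_runtime_hours):
    """Return the lowest-cost candidate that fits in memory and meets the
    runtime limit, or None if no candidate qualifies."""
    capable = [
        c for c in candidates
        if c.memory_gb >= required_memory_gb and c.runtime_hours <= max_runtime_hours
    ]
    return min(capable, key=lambda c: c.run_cost) if capable else None


# Hypothetical instances; all figures are placeholders, not GCP prices or results.
candidates = [
    Candidate("small-cpu", price_per_hour=0.20, memory_gb=8, runtime_hours=0.20),
    Candidate("large-cpu", price_per_hour=1.00, memory_gb=64, runtime_hours=0.05),
    Candidate("gpu", price_per_hour=1.60, memory_gb=85, runtime_hours=0.04),
]

relaxed = cheapest_capable(candidates, required_memory_gb=4, max_runtime_hours=0.25)
strict = cheapest_capable(candidates, required_memory_gb=4, max_runtime_hours=0.10)
print(relaxed.name if relaxed else None)
print(strict.name if strict else None)
```

With a relaxed runtime limit the smallest instance wins on cost; tightening the limit rules it out and the choice moves to the next cheapest instance that still qualifies.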
Conclusion
The results above demonstrate that the Intel Extension for Scikit-learn dramatically improves performance compared to stock scikit-learn and even outperforms the A100 in some tests. When cost is considered, the Intel Extension for Scikit-learn results are even more favorable because the Xeon instances are so much cheaper than the A100 instance. Users can select a Xeon instance that meets their performance, power, and price requirements.