Accelerate Kaggle Challenges Using Intel® Extension for Scikit-learn

There’s a Faster Way to Run scikit-learn

Kirill Petrov
4 min read · Nov 10, 2020


A few months ago, some of my colleagues showed how to improve machine learning performance using the optimized scikit-learn (SKL), which is part of the Intel oneAPI AI Analytics Toolkit (AI Kit):

Installing Intel® Extension for Scikit-learn and changing a couple of lines of code significantly improved times on the scikit-learn_bench benchmark. Internally, the extension uses the Intel oneAPI Data Analytics Library (oneDAL) to deliver the best performance.

The AI Kit includes all of Intel's scikit-learn optimizations and is distributed through many common channels, including Intel's website, YUM, APT, and Anaconda, as well as independently through PyPI and the Anaconda Intel channel. It bundles Intel's latest deep learning and machine learning optimizations in one place, with seamless interoperability and high performance, to streamline end-to-end data science and AI workflows on Intel architectures. Select and download the distribution package you prefer, then follow the Get Started Guide for post-installation instructions.

Alternatively, you can install just the Intel® Extension for Scikit-learn package via pip or conda, like this:

$ pip install scikit-learn-intelex
$ conda install scikit-learn-intelex -c conda-forge

Once installed, you can accelerate SKL applications in either of two ways. First, you can simply load the Intel® Extension for Scikit-learn module from the Python command line, e.g.:

$ python -m sklearnex your_application.py

This is fine for testing and experimentation, but you can also patch scikit-learn inside your Python program, before importing any scikit-learn modules, e.g.:

from sklearnex import patch_sklearn 
patch_sklearn()
from sklearn.svm import SVC # your usual code without any changes

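To see the patch in context, here is a minimal, self-contained sketch that times an SVC fit on synthetic data. The try/except is an assumption for environments where scikit-learn-intelex is not installed; in that case the script falls back to stock scikit-learn unchanged.

```python
import time

# If scikit-learn-intelex is installed, these two lines route supported
# estimators through oneDAL; otherwise we fall back to stock scikit-learn.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data; the patch requires no changes below this line.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = SVC(kernel="rbf")
t0 = time.perf_counter()
clf.fit(X, y)
print(f"SVC fit time: {time.perf_counter() - t0:.3f} s")
```

The point is that everything after `patch_sklearn()` is ordinary scikit-learn code; the acceleration is transparent.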
We applied this patch to the Jupyter notebooks from various Kaggle challenges to demonstrate the performance gains that are possible for real-world workloads (Table 1). Intel® Extension for Scikit-learn improves the performance of these notebooks by as much as 227x, with little or no code modification. Most of the notebooks experience double-digit speedups. Even the least impressive speedup is still twice as fast as the baseline performance.

Table 1. The Intel-optimized version of scikit-learn yields significant performance improvements for common machine learning algorithms.
  • kddcup99-knn-classification (KDD Cup 1999): This computer networking (CN) challenge asked competitors to build a predictive model to detect unauthorized network intrusions. This notebook used the k-nearest neighbors (KNN) algorithm.
  • p2-sklearn-svm-hyperparameter-optimization (Credit Card Default): This challenge asked competitors to predict the likelihood of credit card payment default. This notebook used the support vector classification (SVC) algorithm.
  • digit-recognition-using-knn [Digit Recognizer (KNN)] and gridsearch-svc [Digit Recognizer (SVC)]: This is the familiar image classification (IC) problem of recognizing hand-drawn digits. These notebooks used the KNN, SVC, and principal component analysis (PCA) algorithms.
  • melanoma-svc-32x32 (Melanoma Identification): This IC challenge asked competitors to identify cancerous lesions in medical images. This notebook used the SVC algorithm.
  • xg-svm-tf-cooking (What’s Cooking?): Competitors were asked to predict cuisine category based on the list of ingredients in the dish. The notebook for this natural language processing (NLP) challenge used SVC and the XGBoost implementation of the gradient boosted decision trees algorithm.
  • transformer-svm-semantically-identical-tweets (Real or Not? Disaster Tweets): This NLP challenge asked competitors to predict whether a tweet is about a real disaster. This notebook used SVC.
  • random-forest-k-fold-cross-validation (Home Credit Default): This challenge asked competitors to predict whether an applicant would be able to repay a home loan. This notebook used the random forest algorithm.
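
As an illustration of the KNN-style workloads above, here is a minimal patched sketch. The synthetic data is an assumption standing in for the actual Kaggle datasets, and the try/except again covers environments without the extension.

```python
# Patch first (no-op fallback if the extension is unavailable),
# then train and evaluate exactly as with stock scikit-learn.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for the Kaggle datasets used in the notebooks.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(f"KNN test accuracy: {knn.score(X_te, y_te):.3f}")
```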

Each release of Intel® Extension for Scikit-learn accelerates more scikit-learn algorithms; the documentation lists which algorithms are optimized. Don't leave performance on the table when using SKL. Accelerate these workloads with the AI Kit and see the performance improvements for yourself.
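
If you want to compare patched and stock scikit-learn in the same session, the extension also provides unpatch_sklearn(). A small sketch (the try/except is an assumption for environments without the extension; note that scikit-learn modules must be re-imported after patching or unpatching for the change to take effect):

```python
try:
    from sklearnex import patch_sklearn, unpatch_sklearn
    patch_sklearn()    # scikit-learn modules imported after this use oneDAL
    # ... run the accelerated workload ...
    unpatch_sklearn()  # restore stock implementations for later imports
    extension_available = True
except ImportError:
    extension_available = False  # stock scikit-learn only
print("extension available:", extension_available)
```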

Hardware and Software

Intel Xeon Gold 5218 @ 2.3 GHz (2nd generation Intel Xeon Scalable processors): 2 sockets, 16 cores per socket, HT: off, Turbo: off. OS: Red Hat Enterprise Linux 8.0, total memory of 193 GB (12 slots/16 GB/2933 MHz). Software: scikit-learn 0.23.2, daal4py 2020.3, NumPy 1.18.5, and XGBoost 1.2.1.
