Hands On Guide to Intel® AI Analytics Toolkit

Tamal Acharya
10 min readApr 3, 2022


Continuing from my first blog on OpenVINO (Intel® Distribution of OpenVINO™ toolkit — Optimised Deep Learning | by TAMAL ACHARYA | Mar, 2022 | Medium), I will introduce the Intel® oneAPI AI Analytics Toolkit (AI Kit). This toolkit is a must-have for data scientists, analysts, and anyone working on machine learning or deep learning who wants to build and train models.

What is Intel® AI Analytics Toolkit?


The Intel® oneAPI AI Analytics Toolkit (AI Kit) gives data scientists, AI developers, and researchers familiar Python* tools and frameworks to accelerate end-to-end data science and analytics pipelines on Intel® architectures. The components are built using oneAPI libraries for low-level compute optimizations. This toolkit maximizes performance from pre-processing through machine learning, and provides interoperability for efficient model development.

(Source: Intel® oneAPI AI Analytics Toolkit for Data Science)

Brief Architecture:

Below is a short summary of the architecture and what’s included in the toolkit.

(Source: Performance Optimizations for End-to-End AI Pipelines | by Meena Arunachalam | Intel Analytics Software | Medium)

  • Intel Distribution for Python: Intel’s build of Python, bundling the standard numerical packages (numpy, scipy, and others) into a solid base distribution. It achieves greater performance by accelerating the core Python numerical and scientific packages with Intel® Performance Libraries. It also includes the Numba* compiler, a just-in-time compiler for decorated Python code that exploits the latest Single Instruction Multiple Data (SIMD) features and multicore execution to fully use modern CPUs (see the short Numba sketch after this list), and DPPy (Data Parallel Python), which lets you program multiple devices with the same programming model, without rewriting CPU code as device code.
  • Intel Optimization for TensorFlow: Intel’s optimized build of TensorFlow. In collaboration with Google*, TensorFlow has been directly optimized for Intel architecture using the primitives of the Intel® oneAPI Deep Neural Network Library (oneDNN) to maximize performance. This package provides the latest TensorFlow binary compiled with CPU-enabled settings (--config=mkl).
  • Intel Optimization for PyTorch: Intel’s optimized build of PyTorch. In collaboration with Facebook*, this popular deep-learning framework is now directly combined with many optimizations from Intel to provide superior performance on Intel® architecture. This package provides the binary version of the latest PyTorch release for CPUs, and further adds Intel extensions and bindings for the Intel® oneAPI Collective Communications Library (oneCCL) for efficient distributed training.
  • Intel Distribution of Modin: Intel’s optimized build of Modin (a fast parallel alternative to Pandas and Dask). Accelerate your pandas workflows and scale data preprocessing across multi-nodes using this intelligent, distributed DataFrame library with an identical API to pandas. The library integrates with OmniSci in the back end for accelerated analytics. This component is available only via the Anaconda* distribution of the toolkit. To download and install it, refer to the Installation Guide.
  • Model Zoo for Intel Architecture: a GitHub repository of pre-trained models, sample scripts, and best practices for running popular open-source machine learning models optimized on Intel architecture.
  • Intel Low Precision Optimization Tool: a Python library that assists with deploying fast low-precision models for inference. It takes advantage of Intel® Deep Learning Boost and the Vector Neural Network Instructions (VNNI) introduced with 2nd Generation Intel® Xeon® Scalable processors as part of Intel® AVX-512.
  • Intel® Neural Compressor: an open-source Python library (the successor to the Low Precision Optimization Tool above) that provides a unified low-precision inference interface across the deep-learning frameworks optimized by Intel.
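
As a quick taste of the Numba compiler mentioned above, here is a minimal sketch (the function and array sizes are illustrative, not from the toolkit docs) that JIT-compiles a NumPy loop for SIMD and multicore execution:

import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)  # JIT-compile with SIMD and threading enabled
def scaled_sum(a, b):
    out = np.empty_like(a)
    for i in prange(a.size):  # prange distributes iterations across CPU cores
        out[i] = 2.0 * a[i] + b[i]
    return out

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
print(scaled_sum(x, y)[:5])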

Below are the conda package bundles (on the intel channel of Anaconda.org) for the components listed above:

· intel-aikit-tensorflow :: Anaconda.org

· intel-aikit-pytorch :: Anaconda.org

· intel-aikit-modin :: Anaconda.org
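
To try one of these bundles, install it from the intel channel; for example, conda create -n aikit-tf -c intel intel-aikit-tensorflow creates a fresh environment with the TensorFlow bundle (the environment name here is illustrative).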

Now for Machine Learning 101 using the Intel AI Analytics Toolkit:

You might ask: why use the Intel® oneAPI AI Analytics Toolkit (AI Kit) when I can simply run everything on a normal laptop with standard Python packages and hardware? Well, you are in for a surprise! (Can’t wait. Tell me more!)

What if I told you that the Intel® oneAPI AI Analytics Toolkit (AI Kit) has delivered real speedups for the scikit-learn package in Kaggle challenges? Wait, what?

Still don’t believe me? See for yourself: (Accelerate Kaggle Challenges Using Intel® Extension for Scikit-learn | by Kirill Petrov | Intel Analytics Software | Medium)

The Intel-optimized version of scikit-learn yields significant performance improvements for common machine learning algorithms.

Okay, enough teasing. Let’s get our hands dirty and show the power of the Intel® Extension for Scikit-learn. For the purposes of this blog I have chosen Support Vector Machines (SVMs), but you can play with any other algorithm of your choice.

Installing Intel® Distribution for Python and Intel® Performance Libraries with Anaconda

· Please follow the steps to install: Installing Intel® Distribution for Python* and Intel® Performance…

Learning Objectives:

  • Apply support vector machines (SVMs), a popular algorithm used for classification problems
  • Recognize SVM similarity to logistic regression
  • Compute the cost function of SVMs
  • Apply regularization in SVMs and some tips to obtain non-linear classifications with SVMs
  • Apply Intel® Extension for Scikit-learn* to leverage underlying compute capabilities of hardware

This will demonstrate how to apply the Intel® Extension for Scikit-learn*, a seamless way to speed up your scikit-learn application. The acceleration is achieved through the Intel® oneAPI Data Analytics Library (oneDAL). “Patching” is the term for extending scikit-learn with these Intel optimizations, and it makes scikit-learn a well-suited machine learning framework for dealing with real-life problems.

To get optimized versions of many scikit-learn algorithms, patch scikit-learn by adding these two lines before importing the estimators you want accelerated (estimators imported before the patch are not accelerated):

from sklearnex import patch_sklearn
patch_sklearn()
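
The patch can also be limited or rolled back: patch_sklearn() accepts a list of algorithm names, and unpatch_sklearn() restores stock scikit-learn. A minimal sketch (the ['SVC'] name list follows the form documented in the scikit-learn-intelex repository):

from sklearnex import patch_sklearn, unpatch_sklearn

patch_sklearn(['SVC'])        # patch only the estimators you name
from sklearn.svm import SVC   # import after patching to get the optimized class
# ... run your SVC workload ...
unpatch_sklearn()             # restore stock scikit-learn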

Support Vector Machines and Kernels

(Code credits: Intel AI Academy — Machine Learning 101 Course)

Introduction

We will be using the wine quality dataset for this hands-on exercise. The dataset contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3–9, higher is better) and a color (red or white).

  • Import the data.
  • Create the target variable y as a 1/0 column where 1 means red.
  • Create a pairplot for the dataset.
  • Create a bar plot showing the correlations between each column and y.
  • Pick the 2 most correlated fields (using the absolute value of correlations) and create X.
  • Use MinMaxScaler to scale X. Note that this outputs an np.array; make it a DataFrame again and rename the columns appropriately.
from __future__ import print_function
import os
data_path = ['data']

# Patch scikit-learn first, so the estimators imported below are the optimized ones
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.svm import LinearSVC, SVC
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler

After you run the above code snippets, you will get the below message:

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)

Loading the dataset and checking correlations:

import pandas as pd
import numpy as np
filepath = os.sep.join(data_path + ['Wine_Quality_Data.csv'])
data = pd.read_csv(filepath, sep=',')
y = (data['color'] == 'red').astype(int)
fields = list(data.columns[:-1]) # everything except "color"
correlations = data[fields].corrwith(y)
correlations.sort_values(inplace=True)
correlations

Output from the above code snippets:

total_sulfur_dioxide   -0.700357
free_sulfur_dioxide    -0.471644
residual_sugar         -0.348821
citric_acid            -0.187397
quality                -0.119323
alcohol                -0.032970
pH                      0.329129
density                 0.390645
fixed_acidity           0.486740
sulphates               0.487218
chlorides               0.512678
volatile_acidity        0.653036
dtype: float64

Now for basic Exploratory Data Analysis (EDA):

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context('talk')
sns.set_palette('dark')
sns.set_style('white')
sns.pairplot(data, hue='color')

Output from the above is this beautiful chart: a seaborn pairplot of every feature pair, colored by wine color.

# Pearson correlation chart
ax = correlations.plot(kind='bar')
ax.set(ylim=[-1, 1], ylabel='pearson correlation');

# Pick the two fields with the largest absolute correlation to y
fields = correlations.map(abs).sort_values().iloc[-2:].index
print(fields)
X = data[fields]

# MinMaxScaler returns an np.array, so rebuild the DataFrame with new column names
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
X = pd.DataFrame(X, columns=['%s_scaled' % fld for fld in fields])
print(X.columns)

Output from the above code snippets:

Index(['volatile_acidity', 'total_sulfur_dioxide'], dtype='object')
Index(['volatile_acidity_scaled', 'total_sulfur_dioxide_scaled'], dtype='object')

The goal now is to look at the decision boundary of a LinearSVC classifier on this dataset. Check out this example in sklearn’s documentation.

  • Fit a Linear Support Vector Machine Classifier to X, y.
  • Pick 300 samples from X and get the corresponding y values. Store them in variables X_color and y_color (the original dataset is too large and produces a crowded plot).
  • Modify y_color so that it has the value 'red' instead of 1 and 'yellow' instead of 0.
  • Scatter plot X_color’s columns. Use the keyword argument color=y_color to color-code the samples.
  • Feel free to experiment with different parameter choices for LinearSVC and see the decision boundary.
LSVC = LinearSVC()
LSVC.fit(X, y)

X_color = X.sample(300, random_state=45)
y_color = y.loc[X_color.index]
y_color = y_color.map(lambda r: 'red' if r == 1 else 'yellow')

ax = plt.axes()
ax.scatter(X_color.iloc[:, 0], X_color.iloc[:, 1],
           color=y_color, alpha=1)

# Predict on a dense grid of points to draw the decision boundary
x_axis, y_axis = np.arange(0, 1.005, .005), np.arange(0, 1.005, .005)
xx, yy = np.meshgrid(x_axis, y_axis)
xx_ravel = xx.ravel()
yy_ravel = yy.ravel()
X_grid = pd.DataFrame([xx_ravel, yy_ravel]).T
y_grid_predictions = LSVC.predict(X_grid)
y_grid_predictions = y_grid_predictions.reshape(xx.shape)
ax.contourf(xx, yy, y_grid_predictions, cmap=plt.cm.autumn_r, alpha=.3)

ax.set(xlabel=fields[0],
       ylabel=fields[1],
       xlim=[0, 1],
       ylim=[0, 1],
       title='decision boundary for LinearSVC');

Output of the above code snippets is a scatter plot of the two scaled features with the LinearSVC decision boundary overlaid.

Let’s now fit a Gaussian kernel SVC and see how the decision boundary changes.

We now consolidate the code snippets from above into one function that takes an estimator, X, and y, and produces the final plot with the decision boundary. The steps are:

  1. fit model
  2. sample 300 records from X and get the corresponding y values
  3. create grid, predict, plot using ax.contourf
  4. add on the scatter plot
  • After copying and pasting the code, make sure the finished function uses your input estimator and not the LinearSVC model you built earlier.
  • For the following values of gamma, create a Gaussian-kernel SVC and plot the decision boundary. Gamma controls the width of the RBF kernel: larger values give tighter, more complex boundaries.
    gammas = [.5, 1, 2, 10]
  • Holding gamma constant, plot the decision boundary for various values of C, the regularization parameter: larger values penalize misclassification more heavily, again giving more complex boundaries. You may try
    Cs = [.1, 1, 10]
def plot_decision_boundary(estimator, X, y):
    estimator.fit(X, y)
    X_color = X.sample(300)
    y_color = y.loc[X_color.index]
    y_color = y_color.map(lambda r: 'red' if r == 1 else 'yellow')
    # Predict on a dense grid of points to draw the decision boundary
    x_axis, y_axis = np.arange(0, 1, .005), np.arange(0, 1, .005)
    xx, yy = np.meshgrid(x_axis, y_axis)
    xx_ravel = xx.ravel()
    yy_ravel = yy.ravel()
    X_grid = pd.DataFrame([xx_ravel, yy_ravel]).T
    y_grid_predictions = estimator.predict(X_grid)
    y_grid_predictions = y_grid_predictions.reshape(xx.shape)
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.contourf(xx, yy, y_grid_predictions, cmap=plt.cm.autumn_r, alpha=.3)
    ax.scatter(X_color.iloc[:, 0], X_color.iloc[:, 1], color=y_color, alpha=1)
    ax.set(xlabel=fields[0],
           ylabel=fields[1],
           title=str(estimator))

gammas = [.5, 1, 2, 10]
for gamma in gammas:
    SVC_Gaussian = SVC(kernel='rbf', gamma=gamma)
    plot_decision_boundary(SVC_Gaussian, X, y)

Cs = [.1, 1, 10]
for C in Cs:
    SVC_Gaussian = SVC(kernel='rbf', gamma=2, C=C)
    plot_decision_boundary(SVC_Gaussian, X, y)

Here, we will compare fitting times for a kernel SVC versus the Nystroem kernel approximation followed by an SGDClassifier, both using the rbf kernel.

Jupyter Notebooks provide the useful magic function %timeit, which executes a line and prints its running time. If you put %%timeit at the beginning of a cell, it will run the whole cell and report the running time.

  • Re-load the wine quality data if you made changes to the original.
  • Create y from data.color, and X from the rest of the columns.
  • Use %%timeit to get the time for fitting an SVC with rbf kernel.
  • Use %%timeit to get the time for the following: fit_transform the data with Nystroem and then fit a SGDClassifier.

Nystroem+SGD takes much less time to fit: Nystroem approximates the RBF kernel with an explicit low-rank feature map, so the linear SGDClassifier trains on the transformed features instead of solving the full kernel problem. The difference becomes even more pronounced as the dataset grows.
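
The two steps can also be wrapped into a single scikit-learn Pipeline; a minimal sketch, given X and y from the cells above (n_components=100 is an illustrative choice, not from the original notebook):

from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier

# Approximate the RBF kernel map, then train a linear classifier on top of it
approx_svm = make_pipeline(
    Nystroem(kernel='rbf', n_components=100),
    SGDClassifier()
)
approx_svm.fit(X, y)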

  • Make 5 copies of X and concatenate them.
  • Make 5 copies of y and concatenate them.
  • Compare the time it takes to fit both methods above.
y = data.color == 'red'
X = data[data.columns[:-1]]
kwargs = {'kernel': 'rbf'}
svc = SVC(**kwargs)
nystroem = Nystroem(**kwargs)
sgd = SGDClassifier()

%%timeit
svc.fit(X, y)

435 ms ± 28.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
X_transformed = nystroem.fit_transform(X)
sgd.fit(X_transformed, y)

75.8 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

X2 = pd.concat([X]*5)
y2 = pd.concat([y]*5)
print(X2.shape)
print(y2.shape)

(32485, 12)
(32485,)

%timeit svc.fit(X2, y2)

10.8 s ± 797 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
X2_transformed = nystroem.fit_transform(X2)
sgd.fit(X2_transformed, y2)

313 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
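
Finally, to see the AI Kit’s contribution directly, you can time the same fit with and without the scikit-learn patch. A minimal sketch (absolute timings will vary with your hardware; the time_svc_fit helper is illustrative, not from the original notebook):

import time

from sklearnex import patch_sklearn, unpatch_sklearn

def time_svc_fit(X, y):
    # Import inside the function so each call picks up the currently (un)patched class
    from sklearn.svm import SVC
    start = time.time()
    SVC(kernel='rbf').fit(X, y)
    return time.time() - start

patch_sklearn()
patched = time_svc_fit(X2, y2)    # accelerated by oneDAL

unpatch_sklearn()
stock = time_svc_fit(X2, y2)      # stock scikit-learn

print('patched: %.2f s, stock: %.2f s' % (patched, stock))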

So that’s all, folks! Please play around with different hyperparameters and datasets, fit SVMs, and check the results.

Happy Coding!

#oneAPI

Additional References:

Intel® AI Analytics Toolkit Resources

  1. Intel® oneAPI AI Analytics Toolkit (AI Kit)

Intel® AI Analytics Toolkit Training Modules

  1. https://youtu.be/8lUYVm4cbOA

Articles References:

  1. Accelerate Kaggle Challenges Using Intel® Extension for Scikit-learn
  2. Performance Optimizations for End-to-End AI Pipelines


Tamal Acharya

AI professional and practitioner, AI/ML enthusiast. Part time researcher in AI, AGI and Quantum Computing especially QML, QNN