Hitaya OneAPI: Breast Cancer Machine Learning Model Training Using Intel oneAPI

Raj
5 min read · Apr 30, 2023


In this article, we cover the approach behind our healthcare solution for underserved communities. It continues our series, walking through the steps of making predictions with machine learning using the oneAPI AI Analytics Toolkit.

We demonstrate the steps we used for breast cancer detection. We follow classical machine-learning techniques for tabular data and later fine-tune them using Intel’s oneAPI AI Analytics Toolkit & libraries and Intel DevCloud, which lets developers use a wide range of readily accessible tools.

Figure: Deployment cycle of ML/AI solutions for clinical care

Data Collection & Preparation

Any machine learning use case starts with data gathering. We use open-source, readily available data from Kaggle: the Breast Cancer Wisconsin (Diagnostic) Data Set.

Data Dictionary — The dataset consists of several medical predictor variables and one target variable, Diagnosis. This is a classification problem. The parameters are the following:


1) ID number

2) Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

Now let us proceed with some Exploratory Data Analysis. Here we use Intel’s Modin, which gives us faster results than traditional pandas data frames: Modin exposes the same API as pandas with improved execution time. We also import other commonly used data science libraries: numpy for n-dimensional array calculations, and matplotlib & seaborn for visualization.


# importing libraries

import modin.pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import time
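A note on setup: Modin picks an available execution engine automatically. If you want to pin one explicitly (optional, and assuming Ray or Dask is installed), the engine can be selected via an environment variable before modin.pandas is imported:

import os
os.environ["MODIN_ENGINE"] = "ray"  # or "dask"; must be set before importing modin.pandas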

It’s time to use one more of Intel’s performance-enhancing products: scikit-learn-intelex. It is simple to use, a two-line patch that requires no changes to existing code. The following code patches scikit-learn to accelerate execution. After the patch is applied, the sklearn library is imported as usual.

from sklearnex import patch_sklearn
patch_sklearn()
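As a side note, patch_sklearn can also take a list of estimator names if you only want to accelerate specific algorithms; the blanket patch above is what we use in this notebook:

from sklearnex import patch_sklearn
patch_sklearn(["SVC", "RandomForestClassifier"])  # patch only these two estimators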

For EDA, we first read the data and try to get some insights out of it. We look at the types of data it holds (strings, floats, integers, dates, etc.) and check the statistics, correlated parameters, null values, and other properties.

# Reading data (timed with the %time line magic)
%time df = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
df.info()
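We also take a quick look at summary statistics and missing values; these are the standard pandas calls, which Modin mirrors:

# summary statistics for the numeric columns
print(df.describe())

# per-column count of missing values
print(df.isnull().sum())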

Checking the target column distribution

malignant_count = len(df.loc[df['diagnosis'] == 'M'])
benign_count = len(df.loc[df['diagnosis'] == 'B'])
print("malignant:", malignant_count, "benign:", benign_count)
df.hist(figsize=(30, 30))

Working on some data cleaning and pre-processing

# removing the empty 'Unnamed: 32' artifact column
df.drop('Unnamed: 32', axis=1, inplace=True)
# encoding the target so it can be used in correlations and plots: malignant = 1, benign = 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

Plotting the Correlation Graph

%matplotlib inline
corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
# plot heat map of the feature correlations
g = sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="RdYlGn")

# separating feature cols & target
feature_columns = df.loc[:, df.columns != 'diagnosis']
target_column = df['diagnosis']

feature_columns.shape, target_column.shape

X = feature_columns
y = target_column

Reducing dimensionality with Principal Component Analysis and scaling with StandardScaler

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# split first, then fit the scaler on the training set only to avoid leakage
scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# project the scaled features onto 3 principal components
pca = PCA(n_components=3)
scaler = StandardScaler()  # a fresh scaler for the PCA outputs

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# visualize the first two components, colored by diagnosis
plt.figure(figsize=(8, 6))
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
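As a quick diagnostic (not part of the original notebook output), the fitted PCA object reports how much variance the three components retain:

# fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
print("Total variance retained: {0:.3f}".format(pca.explained_variance_ratio_.sum()))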

Training the model with SVC

# model fitting with support vector classifier
from sklearn.svm import SVC
model = SVC(kernel='poly', degree=2, gamma='auto')
model.fit(X_train, y_train)


# evaluating on the held-out test set
test_predictions = model.predict(X_test)
from sklearn import metrics
print("Accuracy Using OneAPI = {0:.3f}".format(metrics.accuracy_score(y_test, test_predictions)))

Accuracy Using OneAPI = 0.789

SVC didn’t yield good results.

Trying with RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(random_state=10)
model = random_forest_model.fit(X_train, y_train)


test_predictions = model.predict(X_test)
print("Accuracy Using OneAPI = {0:.3f}".format(metrics.accuracy_score(y_test, test_predictions)))

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, test_predictions)
print(cm)

Accuracy Using OneAPI = 0.930

Evaluation Metrics

As the scores show, the model performs well with the help of the Intel oneAPI AI Analytics Toolkit and libraries. We also report the other available metrics: precision, recall, f1-score & the confusion matrix.
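A minimal sketch of these metrics using scikit-learn’s classification_report; target_names assumes the benign = 0, malignant = 1 encoding applied earlier:

from sklearn.metrics import classification_report
print(classification_report(y_test, test_predictions, target_names=['benign', 'malignant']))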

Saving the model with joblib, so it can be loaded later for prediction.

# saving model
import joblib
joblib.dump(model, "./RF_breast_cancer_OneAPI.joblib")

# undo the sklearnex patch once we are done
from sklearnex import unpatch_sklearn
unpatch_sklearn()
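Loading the saved model later is a single call; note that any new samples must first go through the same StandardScaler and PCA transforms that the training data saw (X_new below is a placeholder for such transformed data):

loaded_model = joblib.load("./RF_breast_cancer_OneAPI.joblib")
# X_new: new samples, already scaled and PCA-transformed like the training data
# predictions = loaded_model.predict(X_new)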

Conclusion

After the model is ready, it is saved and ready for deployment, with machine learning pipelines created and connected to API endpoints (one way to package this is sketched below). Using Intel’s AI libraries has been a boon for performance and execution times. A link to our previous article, covering the overview of our problem statement and API structure, is given below.
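One way to package those deployment steps (a sketch under the same preprocessing assumptions as above, not the exact deployed service) is scikit-learn’s Pipeline, which bundles the scaler, PCA, and classifier into a single artifact so an API endpoint only has to call predict on raw feature rows:

from sklearn.pipeline import Pipeline

# X and y are the raw features and encoded target from earlier
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("rescale", StandardScaler()),  # mirrors the second scaling step used above
    ("rf", RandomForestClassifier(random_state=10)),
])
pipe.fit(Xr_train, yr_train)
joblib.dump(pipe, "./RF_breast_cancer_pipeline.joblib")  # one file for the endpoint to load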
