Hitaya OneAPI: Heart Disease Machine Learning Model Training Using Intel oneAPI Extension for XGBoost

Raj
May 16, 2023

This article continues our series on a healthcare solution for underserved communities. Here we walk through the steps of building heart disease predictions with machine learning, using the Intel oneAPI AI Analytics Toolkit and its extension for XGBoost.

We will demonstrate the steps we used to predict a patient's heart disease risk. We follow classical machine-learning techniques for tabular data and then accelerate them using Intel's oneAPI AI Analytics Toolkit and libraries on Intel DevCloud, which gives developers access to a wide range of readily available tools.

Data Collection & Preparation

Any machine learning use case starts with data gathering. We use open-source, readily available data from Kaggle: this particular Heart Disease Data Set.

This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The “target” field refers to the presence of heart disease in the patient. It is integer-valued: 0 = no disease, 1 = disease.

Data Dictionary — The dataset consists of several medical predictor variables and one target variable, target. This is a classification problem. The parameters are the following:

01. age
02. sex
03. chest pain type (4 values)
04. resting blood pressure
05. serum cholesterol in mg/dl
06. fasting blood sugar > 120 mg/dl
07. resting electrocardiographic results (values 0,1,2)
08. maximum heart rate achieved
09. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
14. target: 0 = no disease, 1 = disease. (The names and social security numbers of the patients were removed from the database and replaced with dummy values.)

Now let us proceed with some Exploratory Data Analysis. Here we use Intel's Modin, which gives faster results than traditional pandas data frames: Modin exposes the same API as pandas but with improved execution time. We also import other commonly used data science libraries: numpy for n-dimensional array calculations, and matplotlib & seaborn for visualization.

Installing the Required Intel oneAPI AI Analytics Toolkit Package

!pip install modin
!pip install scikit-learn-intelex
# importing libraries
import modin.pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import time
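
One note on Modin: it runs pandas operations on an execution engine such as Ray or Dask. If the plain pip install above did not pull an engine in, installing modin[ray] and selecting the engine before the modin.pandas import is one option (a minimal sketch, not part of the original notebook):

import os
os.environ["MODIN_ENGINE"] = "ray"  # must be set before "import modin.pandas"; "dask" also works
import modin.pandas as pd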

It’s time to use one more of Intel's performance-enhancing products: scikit-learn-intelex. It is simple to use, requires only two lines of code, and needs no changes to existing code. The following patch accelerates scikit-learn execution; after the patch is applied, the sklearn library is imported and used as usual.

from sklearnex import patch_sklearn
patch_sklearn()

For EDA, we first read the data and try to get some insights out of it. We look at the types of data it holds (strings, floats, integers, dates, etc.), check the statistics, correlated parameters, null values, and other properties.

df = pd.read_csv("../input/heart-disease-dataset/heart.csv")
df.info()
%time
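
Beyond df.info(), a few quick checks cover the rest of what was listed above: data types, summary statistics, and null counts (a minimal sketch):

df.dtypes          # data type of each column
df.describe()      # summary statistics
df.isnull().sum()  # null values per column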

Checking target column distribution

# checking dataset is balanced or not
target_true_count = len(df.loc[df['target'] == 1])
target_false_count = len(df.loc[df['target'] == 0])
target_true_count, target_false_count

(526, 499)

We can conclude the dataset is balanced.
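
For a visual check of the class balance, a seaborn count plot of the target column works as well (a quick sketch):

sns.countplot(x='target', data=df)
plt.title("Target class distribution")
plt.show()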

# plotting variation graphs for each property
df.hist(figsize = (30,30))

Plotting the Correlation Graph

corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g = sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="RdYlGn")
%time
df.columns
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
dtype='object')
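
To read the heat map more directly, the correlation of each feature with the target can be ranked (a minimal sketch based on the corrmat computed above):

# features ranked by absolute correlation with the target
corrmat['target'].drop('target').abs().sort_values(ascending=False)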

Checking if data has 0 values present

print("Age: {0}".format(len(df.loc[df['age'] == 0])))
print("gender: {0}".format(len(df.loc[df['sex'] == 0])))
print("chest pain type: {0}".format(len(df.loc[df['cp'] == 0])))
print("resting blood pressure: {0}".format(len(df.loc[df['trestbps'] == 0])))
print("serum cholestoral: {0}".format(len(df.loc[df['chol'] == 0])))
print("fasting blood sugar: {0}".format(len(df.loc[df['fbs'] == 0])))
print("resting electrocardiographic results: {0}".format(len(df.loc[df['restecg'] == 0])))
print("maximum heart rate achieved: {0}".format(len(df.loc[df['thalach'] == 0])))
print("exercise induced angina: {0}".format(len(df.loc[df['exang'] == 0])))
print("oldpeak : {0}".format(len(df.loc[df['oldpeak'] == 0])))
print("the slope of the peak exercise ST segment: {0}".format(len(df.loc[df['slope'] == 0])))
print("number of major vessels (0-3) colored by flourosopy: {0}".format(len(df.loc[df['ca'] == 0])))
print("thal: {0}".format(len(df.loc[df['thal'] == 0])))
Age: 0
gender: 312
chest pain type: 497
resting blood pressure: 0
serum cholestoral: 0
fasting blood sugar: 872
resting electrocardiographic results: 497
maximum heart rate achieved: 0
exercise induced angina: 680
oldpeak : 329
the slope of the peak exercise ST segment: 74
number of major vessels (0-3) colored by flourosopy: 578
thal: 7

Preparing the data

from sklearn.model_selection import train_test_split
feature_columns = ['age', 'sex', 'cp','trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang' , 'oldpeak', 'slope', 'ca', 'thal']
predicted_class = ['target']
%time
X = df[feature_columns]
y = df[predicted_class]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=10)
%time
X_train.shape, y_train.shape, X_test.shape, y_test.shape
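
Since the target is roughly balanced, a plain random split is fine here. If you want the class proportions preserved in both splits, train_test_split also accepts a stratify argument (a sketch, not what was used above):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=10, stratify=y)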

Filling in 0 values

from sklearn.impute import SimpleImputer

fill_values = SimpleImputer(missing_values=0, strategy="mean")

X_train = fill_values.fit_transform(X_train)
X_test = fill_values.transform(X_test)  # transform only, so the test set is imputed with means learned from the training set
%time
CPU times: user 10 µs, sys: 1e+03 ns, total: 11 µs
Wall time: 16.7 µs
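
A common alternative is to wrap the imputer and the classifier in a scikit-learn Pipeline, so the imputer is fitted only on the training data and applied identically at prediction time (a sketch, assuming the same estimator used below, not part of the original notebook):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ("impute", SimpleImputer(missing_values=0, strategy="mean")),
    ("model", RandomForestClassifier(random_state=10)),
])
pipe.fit(X_train, y_train.values.ravel())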

Training the Model using RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(random_state=10)

model = random_forest_model.fit(X_train, y_train)
%time
CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 8.82 µs
predict_train_data = model.predict(X_test)

from sklearn import metrics

print("Accuracy Using Intel OneAPI = {0:.3f}".format(metrics.accuracy_score(y_test, predict_train_data)))
%time
Accuracy Using Intel OneAPI = 0.961
CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 8.58 µs
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predict_train_data)
cm
%time
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 10.3 µs
import joblib
joblib.dump(model, "./random_forest_heart_Oneapi.joblib")
%time

Training the Model using XGBoost

from xgboost import XGBClassifier
xg_model = XGBClassifier(random_state=42)

model_1 = xg_model.fit(X_train, y_train)
%time
[18:58:17] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 8.82 µs
predict_train_data = model_1.predict(X_test)

print("Accuracy Using OneAPI= {0:.3f}".format(metrics.accuracy_score(y_test, predict_train_data)))
%time
Accuracy Using OneAPI= 0.951
CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 9.54 µs
cm = confusion_matrix(y_test, predict_train_data)
cm
%time
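
The warning printed during training notes that XGBoost's default evaluation metric for 'binary:logistic' changed from 'error' to 'logloss'; setting the metric explicitly silences it (a small sketch):

xg_model = XGBClassifier(random_state=42, eval_metric="logloss")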

Evaluation Metrics

As the scores show, the models perform well, and the Intel oneAPI AI Analytics Toolkit and libraries help speed up training and inference. We also report other available metrics: precision, recall, F1-score, and the confusion matrix.
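
A compact way to get precision, recall, and F1-score together is scikit-learn's classification_report, alongside the confusion matrices already computed (a minimal sketch using the XGBoost predictions):

from sklearn.metrics import classification_report
print(classification_report(y_test, predict_train_data))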

We save the model with joblib so it can be loaded later for prediction.

joblib.dump(model_1, "./XGboost_Oneapi.joblib")
%time
from sklearnex import unpatch_sklearn
unpatch_sklearn()
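
Loading the saved model back for prediction is the mirror of the dump call (a sketch; the feature order must match the training columns):

loaded_model = joblib.load("./XGboost_Oneapi.joblib")
sample = X_test[:1]              # one patient record, already imputed
print(loaded_model.predict(sample))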

Conclusion

After the model is ready, it is saved and prepared for deployment, with machine learning pipelines created and connected to API endpoints. Using Intel's AI libraries has been a boon for performance and execution times. A link to our previous article, covering the overview of our problem statement and API structure, is given below.
