Hitaya OneAPI: Liver Disease Machine Learning Model Prediction Using Intel oneAPI

Raj
May 7, 2023


In this article, we cover the approach behind our healthcare solution for underserved communities. It continues that series, walking through the steps of making our predictions with machine learning using the Intel oneAPI AI Analytics Toolkit.

We demonstrate the steps we used for liver disease detection for a patient. We follow classical machine-learning techniques for tabular data and then fine-tune them using Intel's oneAPI AI Analytics Toolkit & libraries and Intel DevCloud, which gives developers access to a wide range of readily available tools.

Data Collection & Preparation

Any machine learning use case starts with data gathering. We use open-source, readily available data from Kaggle, specifically the Indian Liver Patient Records dataset.
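If you are running outside a Kaggle notebook, the data can also be pulled down with the Kaggle CLI. The snippet below is a minimal sketch; the dataset slug is an assumption inferred from the input path used later in this notebook, and the CLI must already be configured with an API token.

# assumed dataset slug; adjust if your copy of the dataset lives elsewhere
!pip install kaggle
!kaggle datasets download -d uciml/indian-liver-patient-records
!unzip -o indian-liver-patient-records.zip -d ../input/indian-liver-patient-records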

Data Dictionary — The dataset consists of several medical predictor variables and one target variable, Dataset. This is a classification problem. The parameters are the following:

1) Age of the patient
2) Gender of the patient
3) Total Bilirubin
4) Direct Bilirubin
5) Alkaline Phosphotase
6) Alamine Aminotransferase
7) Aspartate Aminotransferase
8) Total Proteins
9) Albumin
10) Albumin and Globulin Ratio
11) Dataset: the field used to split the data into two sets (patients with liver disease vs. patients without liver disease)

Let’s first install the Intel oneAPI AI Analytics Toolkit packages:

!pip install modin  # Modin may also need an execution engine, e.g. modin[ray] or modin[dask]
!pip install scikit-learn-intelex

Now let us proceed with some exploratory data analysis (EDA). Here we use Intel's Modin, which gives us faster results than traditional pandas data frames. Modin works the same as pandas but with improved execution time. We also import other commonly used data science libraries: numpy for n-dimensional array calculations, and matplotlib & seaborn for visualization.


# importing libraries

import modin.pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import time

It’s time to use another of Intel’s performance-enhancing products: scikit-learn-intelex. It is simple to use, requires only two lines of code, and needs no changes to existing code. The following snippet patches scikit-learn to accelerate execution. After the patch is applied, the sklearn library is imported as usual.

from sklearnex import patch_sklearn
patch_sklearn()

For EDA, we first read the data and try to get some insights out of it. We look at the types of data it holds (strings, floats, integers, dates, etc.), check the statistics, correlated parameters, null values, and perform other checks.

# load the dataset and inspect data types, non-null counts, and missing values
df = pd.read_csv("../input/indian-liver-patient-records/indian_liver_patient.csv")
df.info()
df[df['Albumin_and_Globulin_Ratio'].isnull()]

Next comes some data cleaning and pre-processing: missing Albumin_and_Globulin_Ratio values are filled with the column mean.

df["Albumin_and_Globulin_Ratio"] = df.Albumin_and_Globulin_Ratio.fillna(df['Albumin_and_Globulin_Ratio'].mean())
df.describe()

Plotting the Correlation Graph

corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g=sns.heatmap(df[top_corr_features].corr(),annot=True,cmap="RdYlGn")
df.columns

Checking whether the dataset is balanced:

# counting records in each class (1 = liver disease, 2 = no disease)
true_count = len(df.loc[df['Dataset'] == 1])
false_count = len(df.loc[df['Dataset'] == 2])
true_count, false_count

(416, 167)

Cleaning and Training the Data

The class counts (416 vs. 167) show the dataset is imbalanced, so we oversample the minority class with SMOTE before splitting the data into training and test sets.

from sklearn.model_selection import train_test_split

# one-hot encode Gender so that the Gender_Female / Gender_Male columns used below exist
df = pd.get_dummies(df, columns=['Gender'])

feature_columns = ['Age', 'Total_Bilirubin', 'Direct_Bilirubin', 'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
                   'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin', 'Albumin_and_Globulin_Ratio',
                   'Gender_Female', 'Gender_Male']
predicted_class = ['Dataset']
X = df[feature_columns]
y = df[predicted_class]

# balance the two classes by oversampling the minority class with SMOTE
from imblearn.over_sampling import SMOTE
smk = SMOTE(random_state = 42)
X, y = smk.fit_resample(X, y)
X.shape, y.shape
%time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=10)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((582, 11), (250, 11), (582, 1), (250, 1))

Training the Model using RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(random_state=10)

# fit on the training split and predict on the held-out test split
model = random_forest_model.fit(X_train, y_train)
predict_train_data = model.predict(X_test)

from sklearn import metrics

print("Accuracy Using Intel OneAPI = {0:.3f}".format(metrics.accuracy_score(y_test, predict_train_data)))
Accuracy Using Intel OneAPI = 0.808
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 42.2 µs
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predict_train_data)
cm
%time
CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 8.58 µs

Training the Model Using Intel-Optimized XGBoost

from xgboost import XGBClassifier
from sklearn import metrics

xg_model = XGBClassifier(random_state=42)
model_1 = xg_model.fit(X_train, y_train)

predict_train_data = model_1.predict(X_test)
print("Accuracy Using Intel OneAPI = {0:.3f}".format(metrics.accuracy_score(y_test, predict_train_data)))
%time
[16:20:12] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Accuracy Using Intel OneAPI = 0.812
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.78 µs
We also train an ExtraTreesClassifier for comparison.

from sklearn.ensemble import ExtraTreesClassifier

model_2 = ExtraTreesClassifier(random_state=123)
model_2.fit(X_train, y_train)
predict_train_data = model_2.predict(X_test)
print("Accuracy Using Intel OneAPI= {0:.3f}".format(metrics.accuracy_score(y_test, predict_train_data)))
%time
Accuracy Using Intel OneAPI= 0.832
CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 8.11 µs

Evaluation Metrics

As the scores show, the model performs well with the help of the Intel oneAPI AI Analytics Toolkit and libraries. We also report the other available metrics: precision, recall, F1-score, and the confusion matrix.
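Below is a minimal sketch (not part of the original notebook output) of how those additional metrics can be printed with scikit-learn, reusing the ExtraTrees predictions from model_2 above.

from sklearn.metrics import classification_report, confusion_matrix

# per-class precision, recall and F1-score (1 = liver disease, 2 = no disease)
predict_test_data = model_2.predict(X_test)
print(classification_report(y_test, predict_test_data))

# confusion matrix for the same predictions
print(confusion_matrix(y_test, predict_test_data))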

Finally, we save the model with joblib so it can be loaded later for prediction.

import joblib
# persist the best-performing model (the ExtraTrees classifier) as a .joblib file
joblib.dump(model_2, "./xtrees_liver_oneapi.joblib")
%time
from sklearnex import unpatch_sklearn
unpatch_sklearn()

Conclusion

After the model is ready, it is saved and prepared for deployment, with machine-learning pipelines connected to API endpoints. Using Intel's AI libraries has been a boon for performance and execution times. A link to our previous article covering the overview of our problem statement and API structure is given below.
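As an illustration of the prediction step such an endpoint would perform, here is a minimal sketch (not from the original notebook) that loads the saved model and scores one hypothetical patient record; the feature values are made up, and their order follows the feature_columns list used during training.

import joblib
import numpy as np

# load the persisted classifier
clf = joblib.load("./xtrees_liver_oneapi.joblib")

# one hypothetical patient record, in the same order as feature_columns:
# Age, Total_Bilirubin, Direct_Bilirubin, Alkaline_Phosphotase, Alamine_Aminotransferase,
# Aspartate_Aminotransferase, Total_Protiens, Albumin, Albumin_and_Globulin_Ratio,
# Gender_Female, Gender_Male
sample = np.array([[45, 0.9, 0.2, 190, 25, 30, 6.8, 3.3, 0.9, 0, 1]])

# the model returns 1 for liver disease and 2 for no disease (matching the Dataset field)
print(clf.predict(sample))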
