Building a Diabetes Predictor

A Machine Learning approach.

Aditya Mankar
Analytics Vidhya

--

In this project, the objective is to predict whether a person has diabetes based on features such as glucose level, insulin, age, and BMI. We will use the Pima Indians Diabetes dataset from the UCI Machine Learning Repository and develop the project in seven steps, from data gathering to model deployment.


Motivation:

Diabetes is a growing health issue, driven in part by our increasingly inactive lifestyles. If it is detected in time, its adverse effects can be prevented through proper medical treatment. Technology can aid early detection reliably and efficiently, so we use machine learning to build a predictive model that tells whether a patient is diabetes positive or not.

Step 0: Data gathering and Importing libraries.

In this step we import the standard libraries: numpy for linear algebra operations, pandas for data frames, and matplotlib and seaborn for plotting. The dataset is then loaded with the pandas function read_csv().

# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Importing dataset
dataset = pd.read_csv('diabetes.csv')

Step 1: Descriptive Analysis

# Preview data
dataset.head()
Data Preview
# Dataset dimensions - (rows, columns)
dataset.shape
output: (768, 9)
# Features data-type
dataset.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies 768 non-null int64
Glucose 768 non-null int64
BloodPressure 768 non-null int64
SkinThickness 768 non-null int64
Insulin 768 non-null int64
BMI 768 non-null float64
DiabetesPedigreeFunction 768 non-null float64
Age 768 non-null int64
Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
# Statistical summary
dataset.describe().T
Statistical Summary
# Count of null values
dataset.isnull().sum()
output:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

Observations:

1. There are a total of 768 records and 9 features in the dataset.

2. Each feature is of either integer or float data type.

3. Some features such as Glucose, BloodPressure, Insulin, and BMI contain zero values, which represent missing data (see the check after this list).

4. There are zero NaN values in the dataset.

5. In the Outcome column, 1 represents diabetes positive and 0 represents diabetes negative.
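
As a quick sanity check on observation 3, we can count the zero entries in the affected columns directly:

# Count of zero values in columns where zero is physiologically implausible
zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print((dataset[zero_cols] == 0).sum())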

Step 2: Data Visualizations

# Outcome countplot
sns.countplot(x = 'Outcome', data = dataset)
Outcome Countplot
# Histogram of each feature
import itertools
col = dataset.columns[:8]
plt.subplots(figsize = (20, 15))
length = len(col)
for i, j in itertools.zip_longest(col, range(length)):
    plt.subplot(length // 2, 3, j + 1)  # integer division keeps subplot() happy
    plt.subplots_adjust(wspace = 0.1, hspace = 0.5)
    dataset[i].hist(bins = 20)
    plt.title(i)
plt.show()
Histogram of each Feature
# Pairplot 
sns.pairplot(data = dataset, hue = 'Outcome')
plt.show()
Pairplot of all features
# Heatmap
sns.heatmap(dataset.corr(), annot = True)
plt.show()
Heatmap of Feature correlation

Observations:

1. The countplot tells us that the dataset is imbalanced: the number of patients who don't have diabetes is greater than the number who do.

2. From the correlation heatmap, we can see that Outcome correlates most strongly with Glucose, BMI, Age, and Insulin (the snippet after this list reads these values off numerically). We can select these features to accept input from the user and predict the outcome.
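
These observations can also be checked without the plots, using two short commands:

# Correlation of every feature with Outcome, strongest first
print(dataset.corr()['Outcome'].sort_values(ascending = False))
# Class balance behind the countplot
print(dataset['Outcome'].value_counts())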

Step 3: Data Preprocessing

# Replacing zero values with NaN
dataset[["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]] = dataset[["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]].replace(0, np.NaN)
# Count of NaN
dataset.isnull().sum()
Output:
Pregnancies 0
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
# Replacing NaN with mean values
dataset["Glucose"] = dataset["Glucose"].fillna(dataset["Glucose"].mean())
dataset["BloodPressure"] = dataset["BloodPressure"].fillna(dataset["BloodPressure"].mean())
dataset["SkinThickness"] = dataset["SkinThickness"].fillna(dataset["SkinThickness"].mean())
dataset["Insulin"] = dataset["Insulin"].fillna(dataset["Insulin"].mean())
dataset["BMI"] = dataset["BMI"].fillna(dataset["BMI"].mean())

In this dataset, missing values are recorded as zeros, so they must be replaced before modelling. We first convert the zeros to NaN so that they can be imputed easily with fillna(), here using each column's mean.
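
The same mean imputation can be written more compactly with scikit-learn's SimpleImputer; this sketch is an equivalent alternative, not the code used in the rest of this article:

# Alternative: mean imputation with scikit-learn (equivalent result)
from sklearn.impute import SimpleImputer
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
imputer = SimpleImputer(strategy = 'mean')
dataset[cols] = imputer.fit_transform(dataset[cols])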

# Feature scaling using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
dataset_scaled = sc.fit_transform(dataset)
dataset_scaled = pd.DataFrame(dataset_scaled, columns = dataset.columns)

We scale the dataset with MinMaxScaler() so that every feature lies between 0 and 1. This is an important preprocessing step for many algorithms, including the SVC we use below.
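
Under the hood, MinMaxScaler applies x' = (x - min) / (max - min) to each column. A quick manual check on the Glucose column (assuming dataset_scaled keeps the original column names, as constructed above):

# Manual min-max scaling of one column, for comparison with MinMaxScaler
g = dataset["Glucose"]
g_manual = (g - g.min()) / (g.max() - g.min())
print(g_manual.head())
print(dataset_scaled["Glucose"].head())  # should match the manual values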

# Selecting features - [Glucose, Insulin, BMI, Age]
X = dataset_scaled.iloc[:, [1, 4, 5, 7]].values
Y = dataset_scaled.iloc[:, 8].values
# Splitting X and Y
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 42, stratify = Y)
# Checking dimensions
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)
Output:
X_train shape: (614, 4)
X_test shape: (154, 4)
Y_train shape: (614,)
Y_test shape: (154,)

In the feature correlation heatmap, we observed that Glucose, Insulin, BMI, and Age are the features most highly correlated with the outcome, so we select them as X and the Outcome column as Y. The dataset is then split with train_test_split in an 80:20 ratio, stratified on the outcome so that both splits keep the same class balance.

Step 4: Data Modelling

# Support Vector Classifier Algorithm
from sklearn.svm import SVC
svc = SVC(kernel = 'linear', random_state = 42)
svc.fit(X_train, Y_train)
# Making predictions on test dataset
Y_pred = svc.predict(X_test)

The Algorithm:

(SVC illustration credit: chrisalbon.com)

The Support Vector Classifier (SVC) is a supervised classification model that separates classes with a maximal-margin hyperplane. This hyperplane is a decision boundary built from the support vectors, which are the training points lying closest to the boundary. Among candidate hyperplanes, the one with the largest margin is selected as the decision boundary.

SVCs can classify linear as well as non-linear data via the kernel trick, which implicitly maps the input into a higher-dimensional vector space. The kernel trick converts a lower-dimensional feature space that is not linearly separable into a higher-dimensional one that is. For example, data that cannot be separated by a line in 2D may become linearly separable once the kernel function lifts it into 3D.

SVC has three main parameters that affect model performance: kernel, gamma, and C. The kernel parameter selects the type of kernel: 'linear' for linearly separable data, or 'rbf' and 'poly' for data that is not linearly separable. Gamma is the kernel coefficient; as gamma increases, the model fits the training set ever more closely, which hurts generalization and causes overfitting. C is the cost of misclassification: a high value of C gives low bias and high variance, whereas a low value of C gives high bias and low variance. A tuning sketch is shown below.
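
We fixed kernel = 'linear' above, but in practice these three parameters are tuned together. The following is an illustrative sketch using scikit-learn's GridSearchCV; the grid values are assumptions for demonstration, not settings from this article:

# Illustrative tuning of kernel, C and gamma with cross-validated grid search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.1, 1]   # gamma is ignored by the linear kernel
}
grid = GridSearchCV(SVC(random_state = 42), param_grid, cv = 5, scoring = 'accuracy')
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)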

Step 5: Model Evaluation

# Evaluating using accuracy_score metric
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(Y_test, Y_pred)
print("Accuracy: " + str(accuracy * 100))Output:
Accuracy: 73.37662337662337
# Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
cm
Output:
array([[87, 13],
       [20, 34]], dtype=int64)
# Heatmap of Confusion matrix
sns.heatmap(pd.DataFrame(cm), annot=True)
Heatmap of Confusion Matrix
# Classification report
from sklearn.metrics import classification_report
print(classification_report(Y_test, Y_pred))
Output:
              precision    recall  f1-score   support

         0.0       0.81      0.87      0.84       100
         1.0       0.72      0.63      0.67        54

   micro avg       0.79      0.79      0.79       154
   macro avg       0.77      0.75      0.76       154
weighted avg       0.78      0.79      0.78       154

We evaluate the model with three metrics: accuracy score, confusion matrix, and classification report.
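
As a cross-check, the positive-class precision and recall in the report can be recomputed by hand from the confusion matrix (scikit-learn arranges it with true labels as rows and predicted labels as columns):

# Deriving positive-class precision and recall from the confusion matrix
tn, fp, fn, tp = cm.ravel()
print("Precision:", tp / (tp + fp))  # 34 / (34 + 13) ~ 0.72
print("Recall:", tp / (tp + fn))     # 34 / (34 + 20) ~ 0.63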

Step 6: Model Deployment

Flask application

In this step, we use the Flask micro-framework to turn our model into a web application. All the required files can be found in my GitHub repository here. A minimal sketch of such an app is shown below.
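
As a rough sketch of what such an app can look like, assuming the trained SVC has been saved with pickle (the file name and route below are hypothetical; the actual code is in the repository):

# Minimal Flask sketch (hypothetical file name and route; repo code may differ)
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open('svc_model.pkl', 'rb'))  # assumed: pickled trained SVC

@app.route('/predict', methods = ['POST'])
def predict():
    # Expect JSON with the four selected features, already scaled to [0, 1]
    data = request.get_json()
    features = np.array([[data['Glucose'], data['Insulin'],
                          data['BMI'], data['Age']]])
    outcome = int(model.predict(features)[0])
    return jsonify({'diabetes_positive': outcome})

if __name__ == '__main__':
    app.run(debug = True)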

With this step, we have completed our project from data gathering to model deployment.

Further Applications:

Similar models can be built for a variety of diseases such as breast cancer and malaria, in much greater detail, and can be highly reliable once they reach a high enough accuracy.
