Predict Diabetes With Machine Learning Algorithms

Nutan
Nov 15, 2021


In this blog, our objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

Photo by Diabetesmagazijn.nl on Unsplash

Diabetes is a common chronic disease and a serious threat to human health.

Diabetes is characterized by blood glucose levels higher than normal, caused by defective insulin secretion, an impaired biological response to insulin, or both.

Diabetes can lead to chronic damage and dysfunction of various tissues, especially eyes, kidneys, heart, blood vessels and nerves. Diabetes can be divided into two categories, type 1 diabetes (T1D) and type 2 diabetes (T2D).

Patients with type 1 diabetes are typically younger, mostly under 30 years old. The typical clinical symptoms are increased thirst, frequent urination, and high blood glucose levels. This type of diabetes cannot be managed effectively with oral medications alone, and patients require insulin therapy.

Type 2 diabetes occurs more commonly in middle-aged and elderly people and is often associated with obesity, hypertension, dyslipidemia, arteriosclerosis, and other conditions.

Pima Indians Diabetes dataset information

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The dataset has nine columns:

1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)²)
7. DiabetesPedigreeFunction: Diabetes pedigree function
8. Age: Age (years)
9. Outcome: Class variable (0 or 1)

This dataset is available from the UCI Machine Learning Repository. You can download it from the link below:

Download

Import Libraries

import pandas as pd
import numpy as np

Load the data

df = pd.read_csv("input/pima-indians-diabetes.csv")
df.head()

Output:

Dataset details

Shape of dataframe

df.shape

Output: (768, 9)

View dataframe information

df.info()

Output:

View dataset column names

df.columns

Output:

Statistical summary of data

df.describe()

Output:

Data pre-processing

Check null records

df.isnull().values.any()

Output: False

df.isnull().sum()

Output:

df.isnull().head()

Output:

Drop records if there is null values

df.dropna(axis = 0, inplace = True)
df.shape

Output: (768, 9)
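Note that dropna() removes nothing here: this dataset is known to encode missing physiological measurements as zeros rather than NaN. A quick check of how many zeros appear in columns where a zero reading is physiologically implausible (a sketch using the column names listed above):

cols_with_zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Zeros in these columns almost certainly mean "not measured" rather than a true reading.
(df[cols_with_zero_as_missing] == 0).sum()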

Visualize data

import seaborn as sns
import matplotlib.pyplot as plt

View diabetic and non-diabetic patient counts

df['Outcome'].value_counts()

Output:

As per our dataset, 500 patients have no diabetes and 268 patients have diabetes.

Plot diabetic patient distribution

plt.figure(figsize =(8, 6))
f = sns.countplot(x = 'Outcome', data = df)
f.set_title("Diabetic Patient Distribution")
f.set_xticklabels(['No', 'Yes'])
plt.xlabel("");

Output:

Correlation between all variables

Correlation denotes the association between two quantitative variables. The degree of association is measured by a correlation coefficient, and a correlation coefficient matrix is a simple table summarizing the correlations between all pairs of variables. Correlation gives us a basic understanding of the relationships among the variables in the dataset.

df.corr()

Output:

Correlation plot for all variables

plt.figure(figsize =(10, 6))
sns.heatmap(df.corr(),annot=True)

Output:
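If you mainly care about how each feature relates to the target, you can pull that single column out of the matrix and sort it (a small convenience sketch):

# Correlation of each feature with Outcome, strongest first.
df.corr()['Outcome'].drop('Outcome').sort_values(ascending=False)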

Distribution of glucose of diabetic patients

In this dataset, patients with a Glucose value above 110 are more likely to be diabetic; likewise, patients with an Insulin value above 150 are more likely to be diabetic. Let us plot this.

# Bin Glucose into 65-wide intervals; the upper bound is extended past the
# data maximum (199) so no glucose values fall outside the bins.
df['glucose_category'] = pd.cut(df['Glucose'], bins=list(np.arange(45, 250, 65)))
df['glucose_category']

Output:

df.head()

Output:

count_of_positive_diabetes_diagnosed = df[df['Outcome'] == 1].groupby('glucose_category')['Glucose'].count()
count_of_positive_diabetes_diagnosed

Output:

count_of_positive_diabetes_diagnosed.plot(kind='bar')
plt.title('Glucose Distribution of Diabetic Patients')

Output:
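Those 110 and 150 cutoffs can be sanity-checked directly by comparing the diabetic rate above and below each threshold (a quick sketch; the cutoffs are this article's rough values, not clinical thresholds):

# Outcome is 0/1, so its mean is the share of diabetic patients in each group.
for col, cutoff in [('Glucose', 110), ('Insulin', 150)]:
    rate_above = df.loc[df[col] > cutoff, 'Outcome'].mean()
    rate_below = df.loc[df[col] <= cutoff, 'Outcome'].mean()
    print(f"{col} > {cutoff}: {rate_above:.2f} diabetic rate, <= {cutoff}: {rate_below:.2f}")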

Distribution of glucose

fig = plt.figure(figsize=(10,6))
sns.histplot(df['Glucose'], kde=True)  # distplot is deprecated in recent seaborn; histplot is its replacement
plt.show()

Output:

Age distribution of diabetes patients

# Bin Age into decade-wide intervals; the upper bound is extended past the
# data maximum (81) so no ages fall outside the bins.
df['age_category'] = pd.cut(df['Age'], bins=list(np.arange(20, 100, 10)))
df['age_category']

Output:

count_of_positive_diabetes_diagnosed_by_age = df[df['Outcome'] == 1].groupby('age_category')['Age'].count()
count_of_positive_diabetes_diagnosed_by_age

Output:

count_of_positive_diabetes_diagnosed_by_age.plot(kind='bar')
plt.title('Age Distribution of Diabetic Patients')

Output:

Distribution of Age

fig = plt.figure(figsize=(10,6))
sns.histplot(df['Age'], kde = True)
plt.show()

Output:

Distribution of BMI

fig = plt.figure(figsize=(10,6))
sns.histplot(df['BMI'], kde = True)
plt.show()

Output:
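To compare BMI between diabetic and non-diabetic patients directly, you can overlay the two groups in one plot (a sketch using seaborn's hue parameter):

# One histogram per Outcome class, overlaid on the same axes.
fig = plt.figure(figsize=(10,6))
sns.histplot(data=df, x='BMI', hue='Outcome', kde=True)
plt.show()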

Plot Diabetes Patients

df[df['Outcome'] == 1].hist(figsize = (20,20))
plt.suptitle('Diabetes Patients')  # suptitle labels the whole grid, not just the last subplot

Output:

Plot Non Diabetes Patients

df[df['Outcome'] == 0].hist(figsize = (20,20))
plt.suptitle('Non-Diabetes Patients')

Output:

Create feature and target columns

# Drop the target and the two helper category columns added earlier;
# what remains are the eight original feature columns.
x = df.drop(columns=['Outcome', 'glucose_category', 'age_category'])
y = df['Outcome']

View x and y

x.head()

Output:

y.head()

Output:

Split data into train and test

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

x_train.shape

Output: (614, 8)

y_train.shape

Output: (614,)

x_test.shape

Output: (154, 8)

y_test.shape

Output: (154,)
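Since the classes are imbalanced (500 non-diabetic vs. 268 diabetic), you may prefer a stratified split that keeps the class ratio identical in the train and test sets. A variant for comparison; the outputs in this article come from the unstratified split above:

# Stratified variant: both splits keep the same ~65/35 class ratio.
# (Separate variable names so the article's original split is untouched.)
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)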

Rescale training and test data

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid data leakage.
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)

Logistic Regression

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

Train the model

lr.fit(x_train, y_train)

Predict test data

predictions = lr.predict(x_test)

View predicted and actual value

print("Predicted value: ", predictions)
print("Actual value: ", y_test)

Output:

View the accuracy

from sklearn.metrics import accuracy_score

print('Accuracy: ', accuracy_score(y_test, predictions))

Output:
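Accuracy alone can hide how the errors are distributed between the two classes. A confusion matrix and classification report give a fuller picture (standard scikit-learn metrics, applied to the predictions above):

from sklearn.metrics import confusion_matrix, classification_report

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))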

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

Create RandomForestClassifier model

rfc = RandomForestClassifier()

Train model

rfc.fit(x_train, y_train)

Predict test data

rfcpredictions = rfc.predict(x_test)

View predicted and actual value

print("Predicted value: ", rfcpredictions)
print("Actual value: ", y_test)

Output:

View the accuracy

print('Accuracy: ', accuracy_score(y_test, rfcpredictions))

Output:
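A random forest also reports how much each feature contributed to its decisions. A small sketch mapping the importances back to the original column names (x is the unscaled feature DataFrame defined earlier):

# Higher values mean the feature was more useful for splitting.
pd.Series(rfc.feature_importances_, index=x.columns).sort_values(ascending=False)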

SVC (Support Vector Classifier)

from sklearn.svm import SVC

Instantiate the model

svc = SVC()

Train the model

svc.fit(x_train, y_train)

Predict the test data

svcpredictions = svc.predict(x_test)

View the predicted and actual value

print("Predicted value: ", svcpredictions)
print("Actual value: ", y_test)

Output:

View the accuracy

print('Accuracy: ', accuracy_score(y_test, svcpredictions))

Output:
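SVC's behavior depends heavily on the regularization parameter C. A quick sweep over a few illustrative values (a sketch, not a tuned search):

# Larger C fits the training data more tightly; smaller C regularizes more.
for c in [0.1, 1, 10]:
    model = SVC(C=c).fit(x_train, y_train)
    print(c, accuracy_score(y_test, model.predict(x_test)))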

KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier

Instantiate the model

kn = KNeighborsClassifier()

Train the model

kn.fit(x_train, y_train)

Predict the test data

knprediction = kn.predict(x_test)

View the actual and predicted value

print("Predicted value: ", knprediction)
print("Actual value: ", y_test)

Output:

View the accuracy

print('Accuracy: ', accuracy_score(y_test, knprediction))

Output:
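KNeighborsClassifier defaults to n_neighbors=5, and the score can be sensitive to that choice. A quick loop over a few values (a sketch, not a full grid search):

# Odd neighbor counts avoid ties in binary voting.
for k in [3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    print(k, accuracy_score(y_test, model.predict(x_test)))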

Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(x_train, y_train)
dtcprediction = dtc.predict(x_test)

View the actual and predicted value

print("Predicted value: ", dtcprediction)
print("Actual value: ", y_test)

Output:

View the accuracy

print('Accuracy: ', accuracy_score(y_test, dtcprediction))

Output:
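An unpruned decision tree tends to overfit small tabular datasets like this one. Capping max_depth is the simplest control; a sketch with illustrative depths:

# None grows the tree until the leaves are pure (the default used above).
for depth in [3, 5, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(x_train, y_train)
    print(depth, accuracy_score(y_test, model.predict(x_test)))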

Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
gbcprediction = gbc.predict(x_test)

View the actual and predicted value

print("Predicted value: ", gbcprediction)
print("Actual value: ", y_test)

Output:

View the accuracy

print('Accuracy: ', accuracy_score(y_test, gbcprediction))

Output:

Model performance summary

Accuracy of six models

print('Logistic Regression: ', accuracy_score(y_test, predictions))
print('Random Forest Classifier: ', accuracy_score(y_test, rfcpredictions))
print('Support Vector Classifier: ', accuracy_score(y_test, svcpredictions))
print('KNeighbors Classifier: ', accuracy_score(y_test, knprediction))
print('Decision Tree Classifier: ', accuracy_score(y_test, dtcprediction))
print('Gradient Boosting Classifier: ', accuracy_score(y_test, gbcprediction))

Output:

We have tried six different models; the accuracy scores of each are listed above. On this test split, the Logistic Regression model has the highest accuracy, about 78%.
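Keep in mind that these scores come from a single 80/20 split, so they carry some noise. Cross-validation gives a more stable estimate; a sketch using 5-fold CV, with the scaler inside a pipeline so it is re-fit on each fold (avoiding leakage):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The pipeline re-fits StandardScaler within every fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, x, y, cv=5)
print('Mean CV accuracy: ', scores.mean())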

Thanks for reading.
