Risk Prediction of Diabetes at an Early Stage using Machine Learning Approach

Sylvia Burris
Published in Analytics Vidhya
4 min read · Jan 12, 2021

Diabetes is one of the fastest-growing chronic, life-threatening diseases and affects more than 422 million people worldwide. Type 2 diabetes is caused mainly by environmental factors and lifestyle choices. It develops slowly: metabolic indicators appear well before the condition becomes clinical disease, and it is usually diagnosed with a fasting blood sugar test. The dataset used here contains reports of diabetes-related symptoms for 520 patients. It includes each patient's age, sex, and symptoms that may be associated with diabetes. The data were collected through a direct questionnaire, filled out under a doctor's supervision, at the Sylhet Diabetes Hospital in Sylhet, Bangladesh.

For this report, other metabolic indicators were considered as a way to identify pre-diabetic people. The diabetes dataset (Early Stage Diabetes dataset) was obtained from the University of California, Irvine (UCI) Machine Learning Repository.

The research question for this project is: can both common and less common diabetes symptoms be used to predict the disease early?

Data description:

The dataset has 16 attributes that are used to predict the outcome, a class variable that is either Positive (the individual is diabetic) or Negative (the individual does not have diabetes). All attributes except Age are categorical with two unique values. Patient ages range from 16 to 90 years. There are no missing values in the dataset.

Data set description

df.head(5)

61.5% of the patients had diabetes and 38.5% did not. 37% of the patients were female and 63% were male. 90% of the women and 45% of the men had diabetes.
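Proportions like these can be computed with pandas `value_counts` and `crosstab`. The sketch below uses a tiny made-up frame: the column names `Gender` and `class` are assumed to match the UCI file, and the eight rows are illustrative only, not the real patient data.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows, assumed column names)
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male"],
    "class":  ["Positive", "Positive", "Negative", "Positive", "Negative",
               "Negative", "Positive", "Negative"],
})

# Share of each class across all patients
class_share = toy["class"].value_counts(normalize=True)

# Share of each gender across all patients (the two shares sum to 1)
gender_share = toy["Gender"].value_counts(normalize=True)

# Share of each class within each gender (each row sums to 1)
class_by_gender = pd.crosstab(toy["Gender"], toy["class"], normalize="index")

print(class_share, gender_share, class_by_gender, sep="\n\n")
```

Normalizing the crosstab by `"index"` is what gives "share of women with diabetes" rather than "share of diabetics who are women".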

Data manipulation:

The non-numeric labels in the dataset were converted to numeric labels to prepare the data for the machine learning and correlation functions.

from sklearn import preprocessing
import pandas as pd

# One shared encoder, refit on each column in turn
label = preprocessing.LabelEncoder()

def Categorical_label(col):
    # Map the column's unique values to integer codes 0..n-1
    return label.fit(col).transform(col)

# Encode every column of the DataFrame
df = df.apply(Categorical_label)
df.head()
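Applying the encoder to every column also re-codes Age, replacing each age with its rank among the distinct ages. If the raw ages are needed later, one option is to encode only the non-numeric columns. The following is a minimal sketch of that variant, not the code used in the article; the helper `encode_categoricals` and the toy rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(frame):
    """Return a copy with non-numeric columns label-encoded; numeric columns untouched."""
    out = frame.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # A fresh encoder per column keeps each column's mapping independent
            out[col] = LabelEncoder().fit_transform(out[col])
    return out

# Hypothetical rows with assumed column names
toy = pd.DataFrame({
    "Age": [40, 58, 41],
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "Yes"],
})
encoded = encode_categoricals(toy)
print(encoded)
```

`LabelEncoder` assigns codes in sorted order of the labels, so "Female" becomes 0 and "Male" becomes 1, while "No" becomes 0 and "Yes" becomes 1.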

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation Matrix Heatmap
f, ax = plt.subplots(figsize=(28, 10))
corr = df.corr()
hm = sns.heatmap(round(corr, 2), annot=True, ax=ax, cmap="coolwarm", fmt='.2f',
                 linewidths=.05)
f.subplots_adjust(top=0.93)
t = f.suptitle('Correlation matrix heatmap for the early diabetes prediction dataset', fontsize=14)

Correlation heatmap, generated with seaborn
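To read off how each feature correlates with the target without scanning a full heatmap, the target column of the correlation matrix can be sorted by absolute value. This is a small sketch on a made-up, already-encoded frame; the column names and values are hypothetical.

```python
import pandas as pd

# Toy, already-encoded frame (hypothetical values; 1 = symptom present / positive)
toy = pd.DataFrame({
    "Polyuria": [1, 1, 1, 0, 0, 0],
    "Itching":  [1, 0, 1, 0, 1, 0],
    "class":    [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["class"]
    .drop("class")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

For two binary columns the Pearson coefficient is the phi coefficient, so it measures association between two yes/no variables.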

Polyuria and polydipsia had the highest correlations with diabetes (correlation coefficients of 0.67 and 0.65, respectively). Age had a low correlation with diabetes (correlation coefficient of 0.11). To investigate its relationship with diabetes further, age was binned into six groups (1: 15–25, 2: 26–35, 3: 36–45, 4: 46–55, 5: 56–65, 6: above 65).

Category = pd.cut(df['Age'], bins=[15, 25, 35, 45, 55, 65, 90], labels=['1', '2', '3', '4', '5', '6'])

df.insert(7, 'Age Group', Category)
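By default `pd.cut` builds right-closed intervals, so these bin edges produce (15, 25], (25, 35], and so on, which lines up with the six age groups. The sketch below checks that on a handful of illustrative ages (not taken from the dataset).

```python
import pandas as pd

ages = pd.Series([16, 25, 26, 45, 46, 65, 66, 90], name="Age")  # illustrative ages
bins = [15, 25, 35, 45, 55, 65, 90]
labels = ["1", "2", "3", "4", "5", "6"]

# right=True (the default) makes each bin include its upper edge: (15, 25], (25, 35], ...
age_group = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"Age": ages, "Age Group": age_group}))
```

The bin edges assume raw ages in years. If Age has already been passed through the label encoder, its values are rank codes and these edges no longer apply.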

Age group 4 (46–55 years) contained the most patients. A chi-squared test of age group against diabetes gave a p-value of 0.076. Because the p-value is greater than 0.05, we fail to reject the null hypothesis that there is no relationship between age group and diabetes. This is consistent with diabetes being distributed similarly across the age groups.
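A chi-squared test of independence like this one can be run by building a contingency table with `pd.crosstab` and passing it to `scipy.stats.chi2_contingency`. The sketch below is one way to do it and is not necessarily how the article's notebook does it; the helper `chi2_independence` and the toy data are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_independence(frame, feature, target):
    """Chi-squared test of independence between two categorical columns."""
    table = pd.crosstab(frame[feature], frame[target])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return chi2, p_value, dof

# Hypothetical toy data: the symptom is perfectly balanced across the classes
toy = pd.DataFrame({
    "symptom": ["Yes", "Yes", "No", "No"] * 10,
    "class":   ["Positive", "Negative", "Positive", "Negative"] * 10,
})
chi2, p_value, dof = chi2_independence(toy, "symptom", "class")
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With observed counts equal to the expected counts, the statistic is 0 and the p-value is 1, so the null hypothesis of independence is not rejected. A p-value below the chosen threshold (0.05 in this article) would lead to rejecting it.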

After a chi-squared test was performed to investigate the relationship between polyuria and diabetes, the p-value obtained was less than 0.05. So, we reject the null hypothesis and conclude that there is a significant relationship between polyuria and diabetes.

Using multi-layer perceptron to predict diabetes

source: Predicting Monthly Electricity Demand Using Soft-Computing Technique (Isaac Kofi Nti, Asafo-Adjei Samuel)

The multi-layer perceptron (MLP) was the neural network of choice because of its extensive use in the medical field for predicting complex disease processes. The data was divided into training and test splits to train the network and evaluate its performance. The network used 3 hidden layers of 11 nodes each, and the maximum number of iterations was set to 1000 (scikit-learn's default is 200).

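from sklearn.neural_network import MLPClassifier

# Three hidden layers of 11 nodes each, up to 1000 training iterations
mlp = MLPClassifier(hidden_layer_sizes=(11,11,11), max_iter=1000)
mlp.fit(X_train, y_train.values.ravel())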
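The article does not show how `X_train` and `y_train` were produced. The sketch below is one plausible end-to-end setup under stated assumptions: synthetic data stands in for the patient records, `test_size=0.2` is assumed because it yields 104 test rows out of 520, and the `StandardScaler` step is an addition that the article does not mention.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 520 rows, 16 features (not the real patient data)
X, y = make_classification(n_samples=520, n_features=16, random_state=0)

# test_size=0.2 is an assumption; it yields 104 test rows out of 520
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling is a common practice for MLPs, added here as an assumption
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(11, 11, 11), max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"test rows: {len(X_test)}, accuracy: {test_accuracy:.3f}")
```

Fixing `random_state` makes the split and the weight initialization repeatable, which helps when comparing hidden layer sizes.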

After training the model, predictions were made on the test data and the performance of the algorithm was evaluated. According to the confusion matrix, 6 of the 104 test patients were misclassified, giving an accuracy of 94% and an F-score of 94.5%, which are good prediction indicators.
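Accuracy follows directly from the error count: 98 of 104 test patients were classified correctly, which is about 94.2%. The F-score also depends on how the 6 errors split between false positives and false negatives, which the text does not report. The sketch below shows both calculations using a hypothetical split; only the totals (6 errors out of 104) come from the article.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion matrix: 104 patients in total, 6 errors (3 FP + 3 FN)
tp, fp, fn, tn = 61, 3, 3, 37
acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.4f}, f1={f1:.4f}")
```

A different split of the 6 errors leaves the accuracy unchanged but shifts the F-score, so the value computed here is not expected to match the reported 94.5%.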

MLP machine learning model results

Conclusion: With 6 of 104 test patients misclassified, the model reached an accuracy of 94% and an F-score of 94.5%, which are good prediction indicators. Both common and less common diabetes symptoms can be used for the early prediction of diabetes with a machine learning approach. The MLP model is a good fit because of its accuracy.

Click to get the Github notebook:
