Predicting Diabetes Using PIMA Dataset

Choosing the best machine-learning algorithm to predict diabetes

Akshay Gautam
CodeX
7 min read · Aug 25, 2021

Diabetes is a disease in which the body cannot produce enough insulin. Insulin is a hormone produced by the pancreas. It helps move glucose from the blood into the body's cells, where it is used to produce energy. When insulin production is low, the glucose (blood sugar) level in the body rises. A high glucose level can adversely affect an individual's health, and it is associated with damage to organs and tissues. There are three types of diabetes: Type 1, Type 2, and gestational diabetes.

Diabetes is prevalent in India: 16% of the world's diabetic patients are from India. India has the second-highest number of diabetic patients globally, with 77 million people suffering from diabetes. According to the IDF (International Diabetes Federation), the number of patients with diabetes in India will increase to 134 million by 2045. These figures indicate a considerable need to understand this disease in depth in India. That is where data analysis and machine learning play an important role.

This article builds a machine learning model that predicts whether a patient has diabetes from the diagnostic measures included in the dataset, with a focus on reducing false negatives.

Methodology

I used the CRISP-DM methodology for this analysis. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is widely used for data mining projects and provides a simple, structured approach. The methodology contains six steps:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Data Description

I used the PIMA diabetes dataset, downloaded from Kaggle. It contains records for women of Pima Indian heritage aged 21 or older, with nine attributes and 768 instances. The attributes are Age, Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, and Outcome. All the attributes are numeric except Outcome, which is a boolean variable.
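As a rough sketch of how the dataset could be loaded and inspected with pandas: the column names below follow the Kaggle CSV layout, while the inline sample rows and the `load_pima` helper are illustrative assumptions standing in for the downloaded file.

```python
import io

import pandas as pd

# A few illustrative rows in the Kaggle CSV layout (assumption: this stands in
# for the real 768-row file, which would be read from disk instead).
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""


def load_pima(source):
    """Read the dataset from a file path or file-like object (hypothetical helper)."""
    return pd.read_csv(source)


df = load_pima(io.StringIO(SAMPLE_CSV))
print(df.shape)   # rows, columns
print(df.dtypes)  # every attribute is numeric; Outcome is coded 0/1
```

With the real file, the same call would take the path to the downloaded CSV instead of the in-memory sample.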

Missing Values

I used the pandas-profiling module to generate an exploratory data analysis report. The report below, generated by pandas-profiling, shows the missing values in the dataset.

Profiling Report

The report above shows that pregnancies, blood pressure, skin thickness, insulin, and BMI contain many zero values. For the pregnancies attribute, zero is a valid value because the attribute represents the number of pregnancies. However, blood pressure, skin thickness, insulin, and BMI cannot be zero. These zeros are likely the result of errors in data collection or in the testing devices, so for these attributes zero represents a missing value. Excluding the 'Pregnancies' attribute, there are 647 missing values in the dataset.
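The zero counts described above can also be checked directly, without a profiling report. This is a minimal sketch on a toy frame; the column names assume the Kaggle CSV layout, and `count_zero_as_missing` is a hypothetical helper, not part of any library.

```python
import pandas as pd

# Attributes where a zero is treated as a missing value (per the text above).
# Column names assume the Kaggle CSV layout.
ZERO_AS_MISSING = ["BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zero_as_missing(df, columns=ZERO_AS_MISSING):
    """Return the number of zero entries per column."""
    return (df[columns] == 0).sum()


# Toy data, not real patient records.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5],        # zero is valid here, so it is not counted
    "BloodPressure": [72, 0, 64],
    "SkinThickness": [0, 0, 23],
    "Insulin": [0, 94, 0],
    "BMI": [33.6, 0.0, 28.1],
})

zero_counts = count_zero_as_missing(toy)
print(zero_counts)
print("total zeros treated as missing:", int(zero_counts.sum()))
```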

Outliers

An outlier is a data point that differs markedly from the other data points or observations. To put it plainly, it is a horse in a crowd of donkeys. Finding outliers and handling them with a proper method can significantly improve the quality of a machine learning model.

I used box plots, which summarize each numeric attribute through its quartiles, so that outliers can be detected by looking at the plot.

The image below shows the box plots for six different attributes of the dataset:

Box Plot for Six Different Attributes

The points that fall outside the whiskers of each box plot are the outliers. Blood Pressure, Insulin, and Diabetes Pedigree Function seem to have many outliers. We have to deal with these outliers to conduct a proper analysis and to build a reliable model.
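For readers who want the numbers behind a box plot, here is a small sketch of the usual whisker rule. It assumes the common default of 1.5 times the interquartile range (the article does not state which setting the plots used), and the BMI values are made up for illustration.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the values lying beyond k * IQR from the first or third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up BMI values; only the last one sits far from the rest.
bmi = [22.1, 25.3, 27.8, 30.4, 31.0, 33.6, 35.2, 67.1]
flagged = iqr_outliers(bmi)
print(flagged)
```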

Data Processing

As mentioned above, there are many zeros in the dataset. To deal with them, I first replaced every zero in the affected attributes with a blank (missing) value, and then filled each blank with the median of that attribute. I used the median because almost all the attributes are skewed, and the median is less sensitive to skew and extreme values than the mean.
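The imputation step described above could look roughly like this in pandas. It is a sketch on toy data; `impute_zeros_with_median` is a hypothetical helper, and the median is computed from the non-zero values only.

```python
import pandas as pd


def impute_zeros_with_median(df, columns):
    """Treat zeros in `columns` as missing and fill them with the column median."""
    out = df.copy()
    for col in columns:
        # mask() turns zeros into NaN, so they do not influence the median.
        as_missing = out[col].mask(out[col] == 0)
        out[col] = as_missing.fillna(as_missing.median())
    return out


# Toy data, not real patient records.
toy = pd.DataFrame({
    "Insulin": [0, 94, 168, 0, 88],
    "BMI": [33.6, 0.0, 28.1, 43.1, 25.6],
})

clean = impute_zeros_with_median(toy, ["Insulin", "BMI"])
print(clean)
```

In the toy frame, the two zero Insulin readings become 94 (the median of 88, 94, and 168), and the original frame is left untouched because the helper works on a copy.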

I used the Z-score method to detect the outliers, and I removed every row that contained one because those rows did not appear to represent correct data. For example, the box plot shows many BMI values above 50, which is highly unlikely.
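A minimal sketch of Z-score filtering follows. The cutoff of 3 standard deviations is an assumption (a common convention; the article does not state the threshold used), and `remove_zscore_outliers` is a hypothetical helper.

```python
import pandas as pd


def remove_zscore_outliers(df, columns, threshold=3.0):
    """Drop rows where any of `columns` lies more than `threshold` std devs from its mean."""
    z = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
    keep = (z.abs() <= threshold).all(axis=1)
    return df[keep]


# Toy data: 19 plausible BMI values between 30 and 34, plus one extreme value.
toy = pd.DataFrame({"BMI": [30.0 + (i % 5) for i in range(19)] + [67.0]})

filtered = remove_zscore_outliers(toy, ["BMI"])
print(len(toy), "->", len(filtered))
```

Note that the Z-score itself depends on the mean and standard deviation, which outliers inflate, so the rule needs a reasonable number of rows before any point can exceed the threshold.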

Visualization

1. Are women with a high number of pregnancies more prone to diabetes?

The graph below is a count plot of the pregnancies and outcome attributes. It compares the number of pregnancies of diabetic and non-diabetic patients.

As seen in the plot, among patients with fewer pregnancies there are more non-diabetic patients than diabetic patients. Among patients with more than seven pregnancies, however, there are more diabetic patients than non-diabetic patients. Hence, a higher number of pregnancies appears to be associated with a higher chance of developing diabetes.
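The counts behind such a plot can be tabulated with a cross-tabulation. This sketch uses made-up rows and assumes the Kaggle column names `Pregnancies` and `Outcome`; it shows the tabulation only, not the plot itself.

```python
import pandas as pd

# Made-up rows, not real patient records.
toy = pd.DataFrame({
    "Pregnancies": [0, 0, 1, 1, 8, 8, 8, 9],
    "Outcome":     [0, 0, 0, 1, 1, 1, 0, 1],
})

# Rows: number of pregnancies. Columns: 0 = non-diabetic, 1 = diabetic.
counts = pd.crosstab(toy["Pregnancies"], toy["Outcome"])
print(counts)
```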

2. Do older women have a higher chance of developing diabetes?

The image below is a bar graph of the different age groups of diabetic women, created in Tableau. I grouped age into five categories: 21–24 years old, 25–30 years old, 31–40 years old, 41–55 years old, and 55+ years old.

This graph provides a significant insight. It shows a high number of diabetic women between the ages of 31 and 55. We can conclude that middle-aged women may be at a higher risk of developing diabetes. It also suggests that age can be a significant factor in predicting diabetes.
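The age grouping above was done in Tableau. A pandas equivalent might look like the sketch below. Since the stated groups 41–55 and 55+ share the value 55, the sketch assumes that age 55 belongs to the 41–55 group and that 55+ starts at 56.

```python
import numpy as np
import pandas as pd

# Bin edges are right-inclusive: (20, 24], (24, 30], (30, 40], (40, 55], (55, inf).
AGE_BINS = [20, 24, 30, 40, 55, np.inf]
AGE_LABELS = ["21-24", "25-30", "31-40", "41-55", "55+"]


def age_group(ages):
    """Map each age to one of the five groups used in the article."""
    return pd.cut(pd.Series(ages), bins=AGE_BINS, labels=AGE_LABELS, right=True)


groups = age_group([21, 24, 25, 40, 41, 55, 56, 81]).astype(str).tolist()
print(groups)
```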

Model Building

I used four different machine learning algorithms: K-NN, Random Forest, Logistic Regression, and Naïve Bayes. First, I split the data into a training set and a test set. There are 719 instances in the processed data, of which I used 575 for training and 144 for testing, which is an 80–20 split.

I chose accuracy, sensitivity, specificity, F1 score, precision, and ROC AUC as performance measures for evaluation. The results for each algorithm are shown below.
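A sketch of this setup in scikit-learn is shown below. Synthetic data stands in for the 719 processed rows, and several choices are assumptions not stated in the article: the Gaussian variant of Naïve Bayes, the stratified split, the random seeds, and `max_iter=1000` for Logistic Regression.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the processed data: 719 rows, 8 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(719, 8))
y = (X[:, 1] + rng.normal(scale=1.0, size=719) > 0.5).astype(int)

# test_size=0.2 on 719 rows yields 575 training and 144 test instances.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),  # assumption: Gaussian variant
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```

The scores printed here come from synthetic data and say nothing about the results reported in this article.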
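To make these measures concrete, here is a small sketch that derives them from the four cells of a confusion matrix. The counts in the example are hypothetical and are not taken from any of the models in this article. AUC is left out because it requires predicted scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # share of diabetic patients correctly detected
    specificity = tn / (tn + fp)   # share of non-diabetic patients correctly cleared
    precision = tp / (tp + fp)     # share of positive predictions that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for a 144-instance test set.
m = classification_metrics(tp=40, fp=15, tn=75, fn=14)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity is the measure tied most directly to false negatives: every missed diabetic patient lowers it.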

K-NN (K-Nearest Neighbor)

I chose K = 5 as the number of nearest neighbors for this algorithm. The test set contains 144 instances.

  1. Accuracy: 0.75
  2. Specificity: 0.86
  3. Sensitivity: 0.63
  4. F1 Score: 0.64
  5. Precision: 0.66
  6. AUC: 0.81
  7. False Negatives: 22

Random Forest

Below are the evaluation measures for Random Forest:

  1. Accuracy: 0.77
  2. Specificity: 0.88
  3. Sensitivity: 0.51
  4. F1 Score: 0.57
  5. Precision: 0.65
  6. AUC: 0.85
  7. False Negatives: 12

Naïve Bayes

  1. Accuracy: 0.78
  2. Specificity: 0.83
  3. Sensitivity: 0.71
  4. F1 Score: 0.72
  5. Precision: 0.72
  6. AUC: 0.85
  7. False Negatives: 16

Logistic Regression

  1. Accuracy: 0.74
  2. Specificity: 0.95
  3. Sensitivity: 0.37
  4. F1 Score: 0.51
  5. Precision: 0.80
  6. AUC: 0.82
  7. False Negatives: 33

Comparison

Accuracy

As the bar graph below shows, Naïve Bayes has the highest accuracy of the four algorithms (0.78), meaning it correctly classified more instances than the others. Logistic Regression has the lowest accuracy, at 0.74. Thus, I recommend Naïve Bayes to researchers who are trying to predict diabetes, and I would not use Logistic Regression for this task.
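The ranking can be reproduced from the accuracy and sensitivity values listed for each algorithm. This sketch only sorts those reported numbers; it does not retrain anything. Sensitivity is included because it reflects how many diabetic patients a model misses.

```python
# Accuracy and sensitivity as reported in the result lists of this article.
reported = {
    "K-NN": {"accuracy": 0.75, "sensitivity": 0.63},
    "Random Forest": {"accuracy": 0.77, "sensitivity": 0.51},
    "Naive Bayes": {"accuracy": 0.78, "sensitivity": 0.71},
    "Logistic Regression": {"accuracy": 0.74, "sensitivity": 0.37},
}


def rank_by(metric):
    """Return the model names ordered from best to worst on `metric`."""
    return sorted(reported, key=lambda name: reported[name][metric], reverse=True)


by_accuracy = rank_by("accuracy")
by_sensitivity = rank_by("sensitivity")
print("by accuracy:   ", by_accuracy)
print("by sensitivity:", by_sensitivity)
```

Naïve Bayes leads on both measures, while Random Forest ranks second on accuracy but third on sensitivity, which is why accuracy alone is not enough when false negatives matter.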

False Negatives

The chart below compares the false negatives. As the graph shows, Naïve Bayes has fewer false negatives (16) than the other algorithms. Logistic Regression has the most, with 33. The false negatives for K-NN and Random Forest are both 21. Thus, we can conclude that Naïve Bayes is a good choice for predicting diabetes.

Final Thoughts

In predicting diabetes, Naïve Bayes performs better than the other algorithms, with an accuracy of 78%. According to this analysis, a higher number of pregnancies is associated with a higher chance of developing diabetes, and the chance of developing diabetes is higher after pregnancy. Women with high glucose, blood pressure, and insulin have a higher chance of developing diabetes. Middle-aged women are more likely to have diabetes. These findings can help the medical community understand diabetes in more depth. More research needs to be done, because many lives are at stake.

