Analyzing Pima-Indian-Diabetes dataset

Ali Ashraf
Mar 12, 2021

using Classification Techniques


Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. Sometimes your body doesn’t make enough — or any — insulin or doesn’t use insulin well. Glucose then stays in your blood and doesn’t reach your cells.

According to the WHO, about 422 million people worldwide have diabetes. Because diabetes affects such a large population across the globe, patient data is collected continuously and covers various attributes such as age, gender, symptoms, insulin levels, blood pressure, blood glucose levels, and weight. We are working with the Pima Indians Diabetes Dataset (PIDD), extracted from the University of California, Irvine (UCI) machine learning repository.


PIDD consists of several medical parameters and one dependent (outcome) parameter with binary values. The dataset covers female patients and is described as follows:

It has 9 columns: 8 independent parameters and one outcome parameter. There are 768 uniquely identified observations, of which 268 are positive for diabetes (1) and 500 are negative for diabetes (0).

1. Pregnancies : Number of times pregnant

2. Glucose: Oral Glucose Tolerance Test result

The glucose tolerance test is a lab test to check how your body moves sugar from the blood into tissues like muscle and fat. The test is often used to diagnose diabetes.

How the test is performed

The most common glucose tolerance test is the oral glucose tolerance test (OGTT). Before the test begins, a sample of blood will be taken. You will then be asked to drink a liquid containing a certain amount of glucose (usually 75 grams). Your blood will be taken again every 30 to 60 minutes after you drink the solution.

3. BloodPressure: Diastolic Blood Pressure values in (mm Hg)

The diastolic reading, or the bottom number, is the pressure in the arteries when the heart rests between beats. This is the time when the heart fills with blood and gets oxygen.

This is what your diastolic blood pressure number means:

  • Normal: Lower than 80
  • Stage 1 hypertension: 80–89
  • Stage 2 hypertension: 90 or more
  • Hypertensive crisis: 120 or more
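The thresholds above can be turned into a small lookup. This is only a sketch: the function name `diastolic_category` is hypothetical, and it assumes the bands are read from the highest threshold down, since "120 or more" also satisfies "90 or more".

```python
def diastolic_category(mm_hg: float) -> str:
    """Map a diastolic reading (mm Hg) to one of the categories listed above."""
    # Check the highest threshold first, because a reading of 120 or more
    # would otherwise be caught by the "90 or more" band.
    if mm_hg >= 120:
        return "Hypertensive crisis"
    if mm_hg >= 90:
        return "Stage 2 hypertension"
    if mm_hg >= 80:
        return "Stage 1 hypertension"
    return "Normal"


# Example readings chosen for illustration only.
readings = [72, 85, 95, 125]
categories = [diastolic_category(r) for r in readings]
print(categories)
```

Note that a reading of 0 in the dataset is a missing value rather than a real measurement, so it should be handled before applying a lookup like this.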

Most people with diabetes will eventually have high blood pressure.

4. SkinThickness: Triceps skin fold thickness in (mm)

Skinfold thickness is measured so that the total amount of body fat can be predicted. The triceps skinfold is necessary for calculating the upper arm muscle circumference. Its thickness gives information about the fat reserves of the body, whereas the calculated muscle mass gives information about the protein reserves.

For adults, the standard normal values for triceps skinfolds are 2.5mm (men) or about 20% fat; 18.0mm (women) or about 30% fat. Measurements of half these values or less represent about the 15th percentile and can be considered either borderline or fat depleted. Values over 20mm (men) and 30mm (women) represent about the 85th percentile and can be considered excess fat.

5. Insulin: 2-Hour serum insulin (mu U/ml)

Insulin is a hormone that helps move blood sugar, known as glucose, from your bloodstream into your cells.

2-hour Serum Insulin: Greater than 150 mu U/ml relates to insulin therapy

Insulin therapy is a critical part of treatment for people with type 1 diabetes and also for many with type 2 diabetes. The goal of insulin therapy is to keep your blood sugar levels within a target range.

6. BMI: Body mass index

The Body Mass Index (BMI) provides a simple yet accurate method of assessing whether a patient is at risk from being either overweight or underweight. However, a proportionally greater lean body mass and/or skeletal frame size can contribute to apparent excess body weight. Many athletes, for example, would be considered ‘overweight’, yet skin-fold tests show a sub-normal amount of adipose tissue. BMI can easily be calculated by dividing the patient’s weight (kg) by the square of their height (meters).

BMI= weight(kg)/[height(m)]²
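As a quick worked example of the formula, here is a minimal sketch. The helper name `bmi` and the sample weight and height are made up for illustration.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# A 70 kg patient who is 1.75 m tall (illustrative values).
example = bmi(70.0, 1.75)
print(round(example, 2))  # 22.86
```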

7. DiabetesPedigreeFunction: Diabetes pedigree function

The Diabetes Pedigree Function provides some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gives us an idea of the hereditary risk one might have of developing diabetes mellitus.

8. Age: Age in years

9. Outcome: Class 1 indicates a person with diabetes and 0 indicates a person without it.

Data Visualization:

Let's start exploring the dataset by looking at every feature and the outcome, first through summary statistics and then through plots.

These statistics are generated using the df.describe() method:

  • count tells us the number of non-empty rows in a feature.
  • mean tells us the mean value of that feature.
  • std tells us the standard deviation of that feature.
  • min tells us the minimum value of that feature.
  • 25%, 50%, and 75% are the percentiles/quartiles of each feature. This quartile information helps us detect outliers.
  • max tells us the maximum value of that feature.
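To make the statistics listed above concrete, here is a small sketch on toy values (they are not rows from PIDD). The 1.5 × IQR fence used to flag outliers is a common rule of thumb and an assumption on our part, not something the dataset prescribes.

```python
import pandas as pd

# Toy values for illustration only; these are not PIDD records.
toy = pd.DataFrame({
    "Glucose": [85, 90, 110, 120, 150],
    "BMI": [22.0, 25.5, 28.1, 30.4, 35.0],
})

# count, mean, std, min, 25%, 50%, 75% and max for every numeric column.
summary = toy.describe()
print(summary)

# Use the quartiles to build outlier fences (assumed rule: 1.5 * IQR).
q1 = summary.loc["25%", "Glucose"]
q3 = summary.loc["75%", "Glucose"]
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = toy[(toy["Glucose"] < lower) | (toy["Glucose"] > upper)]
print(lower, upper, len(outliers))
```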

The following histograms help us visualize each variable on its own. Below, we can see the distribution of every parameter and of the outcome.

histo = df.hist(figsize = (10,10))

Data Pre-Processing:

Before normalization, first identify and remove any noisy or incomplete data points.

Zero counts, shown as the shape of the rows where each column is 0: BloodPressure: (35, 9), BMI: (11, 9), Insulin: (374, 9), Glucose: (5, 9), SkinThickness: (227, 9)
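Counts like these can be obtained by filtering each column for zeros and reading the shape of the result. The sketch below uses a toy frame (not actual PIDD rows), so the numbers differ from the ones above.

```python
import pandas as pd

# Toy rows for illustration only; not actual PIDD records.
df_toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

cols = ["BloodPressure", "BMI", "Insulin", "Glucose", "SkinThickness"]

# Shape of the subset of rows where the column equals 0: (rows, columns).
zero_shapes = {c: df_toy[df_toy[c] == 0].shape for c in cols}
print(zero_shapes)

# The same information as a plain count of zeros per column.
zero_counts = (df_toy[cols] == 0).sum()
print(zero_counts)
```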

There are two approaches here: either remove the rows with missing values completely, or replace the missing values with the mean or median. With the first approach we would lose about 50% of our dataset, leaving the model with too little data to train on. With the second approach, we would be filling in very crucial parameters such as Glucose and Blood Pressure, which affect the outcome the most.

With these considerations in mind, we will use a hybrid approach: remove the rows with missing values in the parameters that affect the outcome the most, and replace the missing values in the others using the mean or median.

## The median is the middle point of a number set: half the numbers are above the median and half are below
## replace null values with the median
# pre-process invalid Blood Pressure, BMI & Glucose values
# blood pressure & glucose are critical for determining diabetes, so they should not be invalid
# their null fraction is very small, so it is better to remove the invalid entries
df_copy = df_copy[(df_copy['BloodPressure'] != 0) & (df_copy['BMI'] != 0) & (df_copy['Glucose'] != 0)]
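The hybrid approach can be sketched end to end as follows. This is an illustration on toy rows, not the exact code behind the results in this article. Two things are assumptions: treating SkinThickness and Insulin as the columns to impute, and computing the median over the non-zero values only.

```python
import pandas as pd

# Toy rows for illustration only; not actual PIDD records.
df_toy = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1],
    "SkinThickness": [35, 29, 0, 23, 31],
    "Insulin":       [100, 0, 0, 94, 168],
    "Outcome":       [1, 0, 1, 0, 1],
})

critical = ["Glucose", "BloodPressure", "BMI"]  # rows with zeros are dropped
imputed = ["SkinThickness", "Insulin"]          # zeros are replaced (assumed columns)

# Step 1: keep only rows where every critical column is non-zero.
clean = df_toy[(df_toy[critical] != 0).all(axis=1)].copy()

# Step 2: replace remaining zeros with the median of the non-zero values.
for col in imputed:
    median = clean.loc[clean[col] != 0, col].median()
    clean[col] = clean[col].mask(clean[col] == 0, median)

print(clean)
```

Computing the median after dropping the invalid rows keeps the fill value consistent with the data the model will actually see.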

Now that we have cleaned our data, let's divide it into a 70% training set and a 30% testing set using scikit-learn's train_test_split function.

from sklearn.model_selection import train_test_split
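A minimal sketch of the 70/30 split is shown below. The placeholder arrays stand in for the cleaned features and labels. The 724-row size is an assumption (768 rows minus the rows dropped during cleaning), and `random_state` and `stratify` are our own choices for reproducibility and class balance, not settings confirmed by the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the cleaned dataset (assumed 724 rows, 8 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(724, 8))
y = rng.integers(0, 2, size=724)

# 70% training, 30% testing; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (506, 8) (218, 8)
```

With 724 rows, a 30% test split yields 218 test instances, which matches the test-set size reported in the results.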

Classification Models:

We can now train our models. We will be using 5 different classification algorithms. Since these models are readily available in sklearn, the training process is quite easy and takes only a few lines of code.

Naïve Bayes

from sklearn.naive_bayes import GaussianNB
Accuracy : 0.7752293577981652
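For readers who want to see the full fit and predict cycle, here is a hedged sketch using GaussianNB. It runs on synthetic data from `make_classification` rather than PIDD, so the accuracy it prints will not match the figure above.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the cleaned dataset (assumption: 724 rows, 8 features).
X, y = make_classification(n_samples=724, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Fit the Gaussian Naive Bayes model and score it on the held-out 30%.
nb = GaussianNB()
nb.fit(X_train, y_train)
predicted_nb = nb.predict(X_test)
accuracy_nb = metrics.accuracy_score(y_test, predicted_nb)
print("Accuracy :", accuracy_nb)
```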

K-Nearest Neighbor

# KNN implementation
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)
predicted_knn = knn.predict(X_test)
cm_knn = metrics.confusion_matrix(y_test, predicted_knn)
accuracy_knn = metrics.accuracy_score(y_test, predicted_knn)
# k=3  (Accuracy : 0.7385321100917431)
# k=5  (Accuracy : 0.7522935779816514)
# k=11 (Accuracy : 0.7706422018348624)
# k=13 (Accuracy : 0.7614678899082569)
# Best at k=11
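The comparison across values of k can be automated with a short loop. The sketch below again uses synthetic data as a stand-in, so its best k may differ from the one reported here. Choosing k on the test set is done only to mirror the experiment; a separate validation set or cross-validation would give a less optimistic estimate.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the cleaned dataset (assumption: 724 rows, 8 features).
X, y = make_classification(n_samples=724, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train one model per candidate k and record its test accuracy.
accuracy_by_k = {}
for k in (3, 5, 11, 13):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy_by_k[k] = metrics.accuracy_score(y_test, knn.predict(X_test))

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, "best k:", best_k)
```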

Decision Tree

# decision tree
from sklearn import metrics, tree
d_tree = tree.DecisionTreeClassifier()
d_tree.fit(X_train, y_train)
predicted_tree = d_tree.predict(X_test)
accuracy_tree = metrics.accuracy_score(y_test, predicted_tree)
Accuracy : 0.6972477064220184

Logistic Regression

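# Logistic regression
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)
predict_lg = logisticRegr.predict(X_test)
accuracy_lg = metrics.accuracy_score(y_test, predict_lg)
Accuracy : 0.7752293577981652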

Linear Discriminant Analysis

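# Linear discriminant analysis
from sklearn import metrics
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
predict_lda = lda.predict(X_test)
accuracy_lda = metrics.accuracy_score(y_test, predict_lda)
Accuracy : 0.7798165137614679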


After training on the training set, the models were compared based on the performance of each classification algorithm. The accuracy measures reported above are compared below.
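One way to produce such a side-by-side comparison is to loop over the five classifiers with the same split. This is a sketch on synthetic data, so its ranking need not match the results above. The settings `max_iter=1000` and `random_state=0` are assumptions added to keep the run stable.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned dataset (assumption: 724 rows, 8 features).
X, y = make_classification(n_samples=724, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbor": KNeighborsClassifier(n_neighbors=11),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
}

# Fit every model on the same training data and score it on the same test data.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = metrics.accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate.
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:30s} {acc:.4f}")
```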

The table below shows, for each classification algorithm, the number of instances classified correctly and incorrectly on the 30% test data (218 instances). This confusion matrix describes the performance of each classification model and makes the performance of the algorithm easy to visualize.
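The counts of correctly and incorrectly classified instances can be read directly off a confusion matrix: the diagonal holds the correct predictions and everything else is an error. The labels below are hypothetical and serve only to show the arithmetic.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for ten test instances.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

correct = int(np.trace(cm))          # diagonal: correctly classified
incorrect = int(cm.sum() - correct)  # off-diagonal: misclassified
accuracy = correct / cm.sum()
print(correct, incorrect, accuracy)  # 8 2 0.8
```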

For future work, the same method could be applied to many other machine learning classification algorithms to find the most accurate one. It could also be applied to various other disease and medical datasets.

GitHub Repository
