Exploratory Data Analysis (EDA) and Classification on the PIMA Indian Diabetes Dataset

Ayushi Aggarwal
Published in crossML Blog · 11 min read · Sep 27, 2022

What we are going to do …

  1. What is EDA and why is it a necessary part of data analysis?
  2. Understanding the PIMA Indian Diabetes Dataset.
  3. EDA on the dataset, which includes univariate analysis, bivariate analysis, and handling missing values and outliers.
  4. Classification of the data using the best-performing classification algorithm.

WHAT IS EDA

EDA is the first and most important step in preparing data for further analysis: it is used to discover patterns, check assumptions, find missing data, and spot anomalies and outliers. It is essentially about getting insights from the data, often using statistical graphs and other data visualization methods.

We can use a statistical model for regression or classification, depending on the type of data, but that comes at a later stage. EDA is about seeing what the data can tell us even before modelling, so that modelling gives better results.

Once EDA is complete and insights are drawn, the data can be used for modelling.

Understanding the Dataset

We have the PIMA Indian Diabetes dataset (here is the link to the dataset), which is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains information about 768 women of PIMA Indian heritage, all at least 21 years of age. The outcome tested was diabetes, for which 268 women tested positive and 500 tested negative.

We have one target variable:

Outcome: 1 (if the woman has diabetes) and 0 (if not)

We have 8 explanatory variables (all numeric), which are:

Pregnancies: Number of times pregnant.

Glucose: Two-hour plasma glucose concentration in an oral glucose tolerance test (OGTT), after 75 g anhydrous glucose, in mg/dl.

BloodPressure: Diastolic blood pressure in mmHg.

SkinThickness: Triceps skin fold thickness in mm.

Insulin: Two-hour serum insulin in mu U/ml.

BMI: Body mass index in kg/m2.

DiabetesPedigreeFunction: A function that represents how likely a woman is to get the disease, extrapolated from her ancestors' history.

Age: Age in years.

Now that we have an understanding of the dataset, let us get started with working on it.

GitHub Link

Here is the LINK to the GitHub repository with the complete code.

Performing EDA

Importing the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

Reading the data frame and viewing the first 5 rows

data = pd.read_csv('/home/crossml/Downloads/data_b.csv')
df = data.copy(deep=True)
data.head(5)

Checking the shape of our data

data.shape

(768, 9)

Checking for null values

data.isnull().sum()
Null-value count for each feature

The null-value check alone does not tell the whole story here: as the summary below shows, the missing values in this dataset are actually encoded as zeros rather than as NaN.
Next we generate descriptive statistics, i.e. a summary of the data including central tendency, dispersion, quartiles, counts and the shape of each feature's distribution (NaN values are excluded).

data.describe()
Summary of the dataframe

Here, the minimum value in the Glucose, BloodPressure, SkinThickness, Insulin and BMI columns is 0, which is not physiologically possible; hence the zeros represent missing values.
Now we replace the 0's with NaN so that they can be treated as missing values.

cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols] = df[cols].replace({'0': np.nan, 0: np.nan})

Now we look at the summary of the data again

df.describe()
Summary of the dataset

Visualizing the number of missing values with a bar graph

import missingno as msno
msno.bar(df)

Insulin has the maximum number of missing values, followed by triceps skin thickness.
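To complement the bar chart with exact counts, a quick check on the same df (after the zeros were replaced with NaN):

# Count of missing values per feature, largest first
df.isnull().sum().sort_values(ascending=False)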

Correlation Matrix

corr = df.corr()
sns.heatmap(corr, annot=True, square=True)
#plt.yticks(rotation=0)
plt.show()
Correlation Plot

We see there is no strong correlation between the features. The strongest correlations are:

  • Glucose & Insulin (0.59): a spike in insulin signals to the liver that blood glucose is also high.
  • Age & Pregnancies (0.54): older women tend to have had a higher number of pregnancies, which is expected.
  • Glucose & Outcome (0.48): women with higher glucose levels tend to have higher insulin levels and to have diabetes.
  • SkinThickness & BMI (0.63): women with higher skin fold thickness have a higher BMI (and are probably overweight/obese).

Negative correlation:

  • DiabetesPedigreeFunction & Pregnancies (-0.031)

In general, a correlation coefficient above 0.7 between two features indicates the presence of multicollinearity; since no pair here comes close to that threshold, we can conclude that our data is free from multicollinearity (a quick check is sketched below).
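As a sanity check on this claim, here is a minimal sketch that lists any feature pairs whose absolute correlation exceeds 0.7, reusing the corr matrix computed above (an empty result means no multicollinearity):

# Keep only the upper triangle so each pair appears once, then filter by |correlation| > 0.7
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = upper.stack()
print(high_corr[high_corr.abs() > 0.7])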

Grouping the features by the Outcome

df.groupby(‘Outcome’).mean()
  • Diabetic women tend to have a higher number of pregnancies and higher glucose, blood pressure, skin thickness, insulin, BMI and diabetes pedigree function values than non-diabetic women, and they are also likely to be older.
  • Both groups have an average BMI well above the normal range (18.5-25), which indicates obesity.
  • Women with diabetes are more likely to have an ancestral history of diabetes.
  • Women with diabetes have an average insulin level above the normal range (16-166 mu U/mL), whereas those without diabetes have an average insulin level within the normal range.
  • Both groups have an average glucose level above the normal range (<= 100 mg/dL). This might indicate that some non-diabetic women are at risk of developing diabetes in the future, especially the ones with higher insulin levels.

Line plot for the same

df.groupby(‘Outcome’).mean().T.plot(figsize=(12,4))

The line plot helps us visualize the same thing: the mean value of each feature is higher for diabetic women than for non-diabetic ones.

Barplot for the same

# Relation between each feature and the outcome variable, shown with bar plots
plt.figure(figsize=(20,10))
for i, col in enumerate(set(df.columns) - {"Outcome"}):
    plt.subplot(2, 4, i+1)
    sns.barplot(data=df, x="Outcome", y=col)
    plt.xlabel("Outcome", fontsize=15)
    plt.xticks(fontsize=10)
    plt.ylabel(col, fontsize=15)
    plt.yticks(fontsize=10)

Boxplots for the same

plt.figure(figsize=(20,20))
for i, col in enumerate(set(df.columns) - {'Outcome'}):
    plt.subplot(4, 4, i+1)
    sns.boxplot(data=df, x='Outcome', y=col)
    plt.xlabel('Outcome', fontsize=15)
    plt.xticks(fontsize=10)
Box plots wrt outcome

Just to make sure our data isn't imbalanced, we plot a count plot of the output variable.

sns.countplot(x='Outcome', data=df)
No of diabetic and non-diabetic women

Hence, the dataset isn't severely imbalanced; a quick numeric check is shown below.
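The class proportions can also be confirmed numerically:

# Share of each class in the target: roughly 65% non-diabetic vs 35% diabetic
print(df['Outcome'].value_counts(normalize=True))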

Plotting the distribution of the data (univariate plots)

Now we have to impute the missing values, but first we look at the distribution of each feature to decide how to impute it.
The imputation method should be chosen based on the distribution of the data: roughly normal or skewed (whether right-skewed or left-skewed).

plt.figure(figsize=(20,25))
for i, col in enumerate(set(df.columns) - {'Outcome'}):
    plt.subplot(6, 4, i+1)
    sns.distplot(df[col])  # deprecated in newer seaborn; sns.histplot(df[col], kde=True) is the modern equivalent
    plt.xlabel(col, fontsize=15)
    plt.xticks(fontsize=15)
  • The distributions of BloodPressure, SkinThickness and BMI are roughly normal, while all the other features are skewed (a quick numeric check is sketched below).
  • Hence for the roughly normal features we can replace the missing values with the mean, and for the rest with the median.
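As a quick numeric check of this decision, we can look at the sample skewness of each feature: values near 0 suggest a roughly symmetric distribution (the mean is a reasonable fill value), while larger absolute values suggest using the median.

# Skewness of each feature, target excluded; near-zero means roughly symmetric
print(df.drop(columns='Outcome').skew().sort_values())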
Filling the missing values accordingly:

df1 = df.copy(deep=True)

# Roughly normal distributions: fill with the mean
for column in ['BloodPressure', 'BMI']:
    df1[column] = df1[column].fillna(df1[column].mean())

# Skewed distributions: fill with the median
for column in ['Pregnancies', 'Glucose', 'DiabetesPedigreeFunction', 'Age']:
    df1[column] = df1[column].fillna(df1[column].median())

Since SkinThickness and Insulin have a large number of missing values, we use a KNN imputer for these features, as filling them with the mean/median would introduce a lot of outliers in the data.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5, weights='distance', metric='nan_euclidean')
imputed_data = imputer.fit_transform(df1)
df2 = pd.DataFrame(imputed_data, columns=df1.columns)

Boxplots to check for outliers (after filling the missing values)


plt.figure(figsize=(15,8))
for i, col in enumerate(set(df2.columns) - {'Outcome'}):
    plt.subplot(2, 4, i+1)
    sns.boxplot(data=df2, x=col)
    plt.xlabel(col, fontsize=15)
    plt.xticks(fontsize=10)

The distribution of Age is highly right-skewed, which means most of the women in the dataset are young.

Also, we see outliers in every feature.
Outliers are values within a dataset that differ greatly from the others, i.e. they are either much larger or much smaller than the rest.
We have to remove or replace the outliers to get better accuracy.

def detect_outliers(df2):
    rows = []
    for col in list(set(df2.select_dtypes(include=np.number).columns) - {'Outcome'}):
        q1 = df2[col].quantile(0.25)
        q3 = df2[col].quantile(0.75)
        iqr = q3 - q1
        low = q1 - 1.5 * iqr
        high = q3 + 1.5 * iqr
        n = df2.loc[(df2[col] < low) | (df2[col] > high)].shape[0]

        # Cap the outliers at the IQR fences
        df2.loc[df2[col] < low, col] = low
        df2.loc[df2[col] > high, col] = high
        rows.append({'Feature': col,
                     'No. of Outliers': n,
                     'Handled?': bool(df2[col].between(low, high).all())})
    return pd.DataFrame(rows, columns=['Feature', 'No. of Outliers', 'Handled?'])

detect_outliers(df2)
Replacing outliers

Now, since each outlier has been replaced, we make boxplots for each feature again after treating the outliers.

plt.figure(figsize=(15,8))
for i, col in enumerate(set(df2.columns) - {'Outcome'}):
    plt.subplot(2, 4, i+1)
    sns.boxplot(data=df2, x=col)
    plt.xlabel(col, fontsize=15)
    plt.xticks(fontsize=10)
Boxplot after treating outliers

Hence, the data is now free from outliers.

Bivariate plots

Pair plot: a pair plot allows us to visualize the pairwise relationships between variables as well as the distribution of each variable.

sns.pairplot(data=df2, kind='scatter')
  • There is a strong linear relationship between SkinThickness and BMI (they also had the highest correlation).
  • Insulin and Glucose also show a clear linear relationship (which was visible in the correlation matrix too).

Now, to see the relationship between two variables with respect to a third variable, 'Outcome', we use the 'hue' parameter.

sns.pairplot(data=df2, hue='Outcome')

Observations from the pairplot :

  • Diabetic women tend to have higher values for every feature, i.e. they are older, more obese, have had more pregnancies and have higher blood pressure, glucose and insulin levels.
  • Also, looking at the plots of glucose against the other features, we see that a high glucose level is a key factor for diabetes regardless of the other factors, i.e. if a woman has a high glucose level, she is very likely to have diabetes.

KDE plot: a kernel density estimation plot depicts the probability density function of the data variables.

sns.pairplot(data=df2, kind='kde')

Joint plot: a joint plot comprises three charts in one. The center shows the bivariate relationship between the x and y variables, while the top and right-side plots show their univariate distributions.

sns.jointplot(x='Glucose', y='Insulin', data=df2)

It shows the linear relationship between Glucose and Insulin, as well as their univariate distributions; both are skewed to the right.

Now, we make the joint plot with respect to the variable 'Outcome' with the help of parameter 'hue'.

sns.jointplot(x='BMI', y='SkinThickness', data=df2, hue='Outcome')
  • It shows the linear relationship between BMI and triceps skin thickness.
  • Diabetic women are likely to be more obese and to have higher triceps skin thickness; the univariate distributions show that both features are skewed.

Checking feature importance

We can check the importance of each feature by using an Extra Trees Regressor.

from sklearn.ensemble import ExtraTreesRegressor

x = df2.drop("Outcome", axis=1)
y = df2.Outcome
model = ExtraTreesRegressor()
model.fit(x, y)
print(model.feature_importances_)
# Output: [0.09280622 0.25732228 0.08154691 0.08971294 0.12989174 0.12811934 0.09325013 0.12735045]

# Visualizing the feature importances with a bar graph
feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': x.columns})
plt.figure(figsize=(5, 5))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('Feature importances')
plt.show()

We can see from the bar graph that the glucose level has the highest impact in predicting diabetes, followed closely by Insulin, BMI and Age.

PCA

Principal Component Analysis (PCA) is a dimensionality-reduction method that converts a set of correlated variables into a set of uncorrelated variables, called principal components, while reducing the dimensionality of the dataset.

# Scaling is required before applying PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
scaled_data = scaler.fit_transform(x)
scaled_data = pd.DataFrame(scaled_data)

pca = PCA(n_components=4)
principalComponents = pca.fit_transform(scaled_data)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2', 'PC3', 'PC4'])
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))

Explained variation per principal component: [0.32522962 0.18029869 0.14600216 0.11959445]
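To decide how many components are worth keeping, it helps to look at the cumulative explained variance; a minimal sketch reusing the fitted pca object:

# Cumulative share of variance captured as components are added
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.show()

With four components we retain roughly 77% of the total variance (0.325 + 0.180 + 0.146 + 0.120).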

Classification

Now we apply models to the data.

For modelling, we first have to split our data into training and testing sets.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

We import GridSearchCV, which is used for selecting the best model and the best hyperparameters. It searches for the best parameter values from the given lists of parameters using cross-validation.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

Giving the list of models and parameters for the decision tree and random forest classifiers. Tree-based models such as decision trees and random forests do not require feature scaling, so we use the unscaled data here.

model_params = {
    "decision_tree": {
        'model': DecisionTreeClassifier(random_state=0),
        'params': {
            'max_features': ['sqrt', 'log2'],  # 'auto' was removed in recent scikit-learn versions
            'max_depth': [3, 4, 5, 6],
            'criterion': ['gini', 'entropy', 'log_loss']
        }
    },
    "random_forest": {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [10, 50, 100],
            'max_features': ['sqrt', 'log2'],
            'max_depth': [3, 4, 5, 6],
            'criterion': ['gini', 'entropy', 'log_loss']
        }
    }
}

scores_scaled = []
for mn, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(x_train, y_train)
    scores_scaled.append({
        'model': mn,
        'best score': clf.best_score_,
        'best params': clf.best_params_
    })

df3 = pd.DataFrame(scores_scaled, columns=["model", "best score", "best params"])
pd.set_option('display.max_colwidth', None)
df3.sort_values(by=['best score'], ascending=False)

Out of these, the random forest model gives the better accuracy.

Now we check the accuracy of the other models, which should be applied to the scaled data. For this, we first split the scaled data into training and testing partitions.

# Splitting the dataset into the Training set and Test set for scaled data
from sklearn.model_selection import train_test_split
x_train1, x_test1, y_train1, y_test1 = train_test_split(scaled_data, y, test_size=0.2, random_state=0)

Giving the list of models and parameters for SVM, logistic regression and gradient boosting. SVM and logistic regression are sensitive to feature scale, so we use the scaled data for these models.

model_params2 = {
    "svm": {
        'model': SVC(random_state=0),
        'params': {
            'C': [0.1, 1, 10, 100, 1000],
            'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
            'kernel': ['rbf', 'linear']
        }
    },
    'logistic_regression': {
        'model': LogisticRegression(),
        'params': {
            # Invalid penalty/solver combinations simply get a NaN score and a warning
            'penalty': ['l2', 'elasticnet', None],
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
            'C': [1, 5, 10]
        }
    },
    "gradient_boosting": {
        'model': GradientBoostingClassifier(),
        'params': {
            'max_depth': [3, 4, 5],
            'n_estimators': [5, 10, 50, 100],
            'criterion': ['friedman_mse', 'squared_error']  # 'mse' was removed in recent scikit-learn versions
        }
    }
}

scores_scaled2 = []
for mn, mp in model_params2.items():
    clf2 = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf2.fit(x_train1, y_train1)
    scores_scaled2.append({
        'model': mn,
        'best score': clf2.best_score_,
        'best params': clf2.best_params_
    })

df4 = pd.DataFrame(scores_scaled2, columns=["model", "best score", "best params"])
df4.sort_values(by=['best score'], ascending=False)

Since random forest achieved the highest cross-validated accuracy (76.22%) out of all five algorithms, we choose the random forest classifier for our data.

model_final = RandomForestClassifier(criterion= 'gini',max_depth= 4,max_features='log2',n_estimators=100)
model_final.fit(x_train, y_train)
y_pred1 = model_final.predict(x_test)
print('Accuracy of model for testing data is ',accuracy_score(y_test,y_pred1))
Output: Accuracy of model for testing data is 0.7857142857142857

For the testing data, we got an accuracy of 78.57%.
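Since classification_report and f1_score were already imported, we can also look beyond plain accuracy at per-class precision, recall and F1 on the same test predictions:

# Per-class precision, recall and F1 for the held-out test set
print(classification_report(y_test, y_pred1))
print('F1 score (positive class):', f1_score(y_test, y_pred1))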

Heatmap

A heatmap is a visual representation of data in the form of a map or diagram in which data values are represented as colours. Here we use it to display the confusion matrix of the final model.

sns.heatmap(pd.DataFrame(confusion_matrix(y_test,y_pred1)),annot=True)
plt.show()

Cross validation to find mean accuracy

from sklearn.model_selection import cross_val_score
score = cross_val_score(model_final, x, y, cv=5)
accuracy_rate = []
accuracy_rate.append(score.mean())
print('Average accuracy of the final model is ',accuracy_rate)
Average accuracy of the final model is [0.7565826330532213]

Average accuracy of the final model is 75.65%.

Summary

I first described the variables in the dataset to understand what each one represents.

Then I performed exploratory data analysis: a summary of the dataset, histograms to see the distribution of each variable, boxplots to check for outliers, line plots, count plots and bar plots.
Then I checked the data for missing values and replaced them with appropriate measures.

I plotted pair plots to see the relationship between each pair of variables, and joint plots to see univariate and bivariate relationships simultaneously.

I checked the importance of each feature using an Extra Trees Regressor, and applied PCA to get a set of uncorrelated variables that explain the variation in the dataset.

Finally, to apply the best classification model, I used GridSearchCV to search for the model, along with its parameters, with the best accuracy.
The random forest classifier gave the best accuracy for the given dataset, with an average accuracy of 75.65%.
