Predictive Modelling of the Pima Indians Diabetes Database

Murali Ambekar · Published in The Startup · Jun 30, 2020 · 6 min read

This article walks through the step-by-step modelling of the Pima Indians Diabetes Database from www.kaggle.com in order to predict diabetic patients using ML/DL models.

Step 1: Importing libraries

We will import pandas for data handling, numpy for matrix computation, and matplotlib and seaborn for visualisation. mlmodels is a custom class used to train different ML models on the data.
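The imports described above might look like this (the import path for the author's custom mlmodels class is an assumption, since it ships with the article's own repo):

```python
import numpy as np                # matrix computation
import pandas as pd               # data handling
import matplotlib.pyplot as plt   # visualisation
import seaborn as sns             # statistical plots

# mlmodels is the author's custom helper class; this import path
# is an assumption:
# from mlmodels import mlmodels
```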

Step 2: Importing the Data

As stated earlier, the dataset is obtained from www.kaggle.com as a CSV file and is imported with a pandas function. Unused columns are removed, and rows with missing values are dropped using the dropna function.
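A sketch of the loading step follows. To keep it self-contained, a tiny inline sample stands in for the Kaggle CSV; in practice you would call `pd.read_csv('diabetes.csv')` on the downloaded file (the filename is an assumption):

```python
import io
import pandas as pd

# Tiny inline sample standing in for the Kaggle CSV.
csv_data = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

df = pd.read_csv(csv_data)   # pd.read_csv('diabetes.csv') in practice
df = df.dropna()             # drop any rows with missing values
print(df.shape)
```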

Step 3: Data exploration

We can observe that there are 8 predictive variables and one categorical output. There are 768 rows, and there are no null values. The columns are described below:

Pregnancies: The pregnancy history of the patient, i.e. the number of times the patient was pregnant.

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.

Blood Pressure: Diastolic blood pressure (mm Hg).

Skin thickness: Triceps skin fold thickness (mm).

Insulin level: 2-Hour serum insulin (mu U/ml).

Body mass index (BMI): Body mass index (weight in kg/(height in m)²)

Diabetes Pedigree Function: A measure of the diabetes history of relatives and the genetic relationship of those relatives to the patient.

Age: Age in years.

Outcome: The dependent variable; ‘0’ indicates a non-diabetic person and ‘1’ indicates a diabetic person.
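The observations above (768 rows, no nulls) come from quick pandas checks; a sketch, using a small stand-in dataframe in place of the one loaded in Step 2:

```python
import pandas as pd

# Stand-in for the loaded dataframe; the full dataset has 768 rows.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

print(df.shape)            # (rows, columns) -- (768, 9) for the full data
print(df.isnull().sum())   # per-column null counts -- all zero
print(df.describe())       # summary statistics of each column
```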

Test the balance of the data: We will now check the class balance. Although there are more non-diabetic people (0) than diabetic patients (1), the data is still suitable for modelling because there is a sufficient number of 1’s.
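The balance check is a single `value_counts` call; the counts used below match the dataset's known split (500 non-diabetic, 268 diabetic):

```python
import pandas as pd

# Stand-in for the Outcome column; the real dataset has
# 500 zeros (non-diabetic) and 268 ones (diabetic).
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()
print(counts)                                   # 0: 500, 1: 268
print(f"share of 1's: {counts[1] / len(outcome):.2%}")
```

Roughly 35% positives is enough for a classifier to learn both classes without resampling.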

Correlation of data: Correlation between the variables is very helpful in identifying the important features to include in the model. If two input variables are positively correlated with each other and have a similar relationship with the dependent variable, we can drop one of them; this is useful for eliminating columns (input variables) and reducing the computational cost.
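A seaborn heatmap over `df.corr()` is the usual way to visualise this; a sketch on a small synthetic stand-in dataframe (column choices and values are illustrative):

```python
import matplotlib
matplotlib.use("Agg")          # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in with a toy Outcome tied to Glucose.
rng = np.random.default_rng(0)
glucose = rng.normal(120, 30, 100)
df = pd.DataFrame({
    "Glucose": glucose,
    "BMI": rng.normal(32, 6, 100),
    "Outcome": (glucose > 120).astype(int),
})

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation matrix")
```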

It can be observed that Glucose is highly correlated with Outcome, whereas SkinThickness seems to be poorly correlated with it. Glucose, BMI, Age, Pregnancies and, to some extent, Insulin appear to contribute to model building.

Violin plot: It helps in understanding which predictors are useful for separating the classes of the output variable. A violin plot can be generated with the seaborn library using the following code:
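The original code image is not reproduced here; the sketch below shows one common way to build this plot, standardising the predictors and melting them to long form so all features share one axis (the stand-in data and column names are assumptions):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in data; in the article X holds the 8 predictors and y the Outcome.
rng = np.random.default_rng(1)
X = pd.DataFrame({"Glucose": rng.normal(120, 30, 60),
                  "BMI": rng.normal(32, 6, 60)})
y = pd.Series(rng.integers(0, 2, 60), name="Outcome")

# Standardise so all features share one scale, then melt to long form.
data = pd.concat([y, (X - X.mean()) / X.std()], axis=1)
melted = pd.melt(data, id_vars="Outcome",
                 var_name="feature", value_name="value")

sns.violinplot(x="feature", y="value", hue="Outcome",
               data=melted, split=True)
plt.xticks(rotation=90)
```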

It can be observed that Glucose, BMI and, to a small extent, Age can help the model distinguish between diabetic and non-diabetic persons.

Swarm plot: This is another seaborn function, similar to the violin plot, but it gives more insight into which variables can help the model classify the data. Here is the code:
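As with the violin plot, the original code image is missing; the same long-form data works for `sns.swarmplot` (stand-in data and names are assumptions):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in predictors (X) and outcome (y), standardised and melted
# to long form exactly as for the violin plot.
rng = np.random.default_rng(2)
X = pd.DataFrame({"Glucose": rng.normal(120, 30, 60),
                  "BMI": rng.normal(32, 6, 60)})
y = pd.Series(rng.integers(0, 2, 60), name="Outcome")

data = pd.concat([y, (X - X.mean()) / X.std()], axis=1)
melted = pd.melt(data, id_vars="Outcome",
                 var_name="feature", value_name="value")

# One dot per observation, so class overlap is visible per feature.
sns.swarmplot(x="feature", y="value", hue="Outcome", data=melted)
plt.xticks(rotation=90)
```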

Again it is observable that Glucose and BMI are really significant, and to some extent Blood Pressure, Diabetes Pedigree Function, Insulin, Age, Pregnancies and, interestingly, Skin Thickness are helpful in model building.

Step 4: Data Pre-processing & Feature Engineering

At this stage we will preprocess the data: split it into dependent and independent variables as well as training and test sets. We will also handle missing data by replacing zeros with the column mean values. Finally, we will standardise the data using the StandardScaler class from sklearn. Standardisation rescales each feature to a unit-free value with zero mean and unit variance.

Data preprocessing and feature scaling.
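A sketch of these steps, again with a tiny inline sample standing in for the Kaggle CSV (the exact columns treated as "zero means missing" and the 80/20 split are assumptions consistent with common treatments of this dataset):

```python
import io
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny inline sample standing in for the Kaggle CSV.
df = pd.read_csv(io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,94,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
))

# Split into independent (X) and dependent (y) variables.
X = df.drop(columns="Outcome")
y = df["Outcome"]

# Zeros in these columns really mean "missing"; replace them
# with the mean of the non-zero values.
for col in ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]:
    X[col] = X[col].replace(0, X.loc[X[col] != 0, col].mean())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardise using the training-set statistics only.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```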

Step 5: Model Selection

To find the best algorithm for building a model, the custom class mlmodels was imported earlier. This class fits different machine learning algorithms to the data and returns the k-fold cross-validated accuracy score, mean as well as standard deviation, for comparing the models.
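The mlmodels class itself is not shown in the article; a minimal version of what it does can be sketched with sklearn's `cross_val_score` (the particular classifiers and synthetic stand-in data are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data (8 features).
X_train, y_train = make_classification(n_samples=200, n_features=8,
                                       random_state=0)

# Fit several classifiers and report 10-fold cross-validated accuracy.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=10)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.4f} std={scores.std():.4f}")
```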

From the above table, the SVM classifier (SVC) appears to perform best, with a mean accuracy of 75.88% and a standard deviation of 5.71. To improve performance, we can run a grid search using the sklearn library to find the optimal parameters.

Grid search for determination of optimal parameters

GridSearchCV from sklearn is a tool that performs hyperparameter tuning of the model to achieve the best possible result. In the code below, you can see a list of parameter dictionaries through which GridSearchCV iterates to find the best parameters for the model. In this example, GridSearchCV iterates through Support Vector Classifier (SVC) parameters such as the regularisation parameter (C), kernel and gamma to determine the best combination. The best parameters were found to be C: 1 and kernel: ‘linear’.
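A sketch of that search (the specific C and gamma values in the grid are assumptions; the article only names the tuned parameters and the winning combination):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data.
X_train, y_train = make_classification(n_samples=200, n_features=8,
                                       random_state=0)

# List of parameter dictionaries: one grid for the linear kernel,
# one for rbf (which also takes gamma).
param_grid = [
    {"C": [0.25, 0.5, 1], "kernel": ["linear"]},
    {"C": [0.25, 0.5, 1], "kernel": ["rbf"], "gamma": [0.1, 0.5, 0.9]},
]

grid = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)   # {'C': 1, 'kernel': 'linear'} in the article
print(grid.best_score_)
```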

ANN Modelling

Let us try building an ANN model for this data; we may get a better result. I am using the keras library for this purpose: the activation functions are ‘relu’ and ‘softmax’, the default ‘Adam’ optimiser is chosen and the loss function is ‘sparse_categorical_crossentropy’.
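A sketch of such a network follows. The layer sizes are illustrative assumptions; the article only fixes the activations, optimiser and loss:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(8,)),               # 8 predictors
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),  # classes 0 and 1
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training as in the article (epochs ranged from 1000 to 3000):
# model.fit(X_train, y_train, epochs=1000, verbose=0)
```

Because the loss is sparse_categorical_crossentropy, y can stay as integer labels 0/1 while the softmax output has one unit per class.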

Unfortunately, the maximum accuracy with epochs ranging from 1000 to 3000 is found to be only 72%, which is less than the accuracy of the SVC.

Therefore, we will choose SVC model as our final model for saving and prediction.

Step 6: Saving the model and prediction using the saved model

Finally, we will save the model in pickle format and load it again for testing.
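A sketch of the save/load round trip. To stay self-contained it fits a stand-in SVC on synthetic data and pickles it in memory; the article writes the tuned model to a .pkl file on disk instead:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Stand-in data and model; in the article the tuned SVC is
# fitted on the scaled diabetes training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = SVC(C=1, kernel="linear").fit(X[:150], y[:150])

# Serialise and restore the model (the article uses a .pkl file;
# pickle.dumps/loads keeps this sketch file-free).
blob = pickle.dumps(model)
loaded = pickle.loads(blob)

accuracy = loaded.score(X[150:], y[150:])
print(f"accuracy on held-out data: {accuracy:.2%}")
```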

We can see that the loaded model is able to predict for the X_test data with an accuracy of 80.05%! This means the model is reliable and can be used to predict the probability of a patient being diabetic.

Conclusion

This article presented a step-by-step modelling of the Pima Indians Diabetes Database from Kaggle. Importing, analysing and modelling the data were covered, along with model selection, hyperparameter tuning and saving/loading the model. The SVC algorithm was found to perform best. You can find the code at https://github.com/muralimambekar/diabetes_model.

Thanks for reading.
