A New Approach to Learn! Diabetes Prediction Using a Custom Pipeline.

Hey there everyone, I am Atul; I hope you'll keep me in your brain cells after this post. Starting today, I'm beginning a series of blogs that will answer the questions below, for starters:

  1. What is Data Science?
  2. Why do we need Data Science?
  3. How do I approach a Data Science Problem?

These are three essential questions that I feel should run in the veins of every aspiring Data Scientist, but that's just one man's opinion!

My Motive:

  1. Perform a full hands-on walkthrough to give a gist of how things work in a Data Science problem.
  2. Explain the hands-on work using some great analogies, and show how the work we do can help in the real world.
  3. Finally, move to generalisable answers to the questions above.

Problem Statement:

An XYZ Medical Institute is running a survey to study the pattern of diabetic people out there. At a basic level, diabetic patients tend to have high blood sugar and high blood pressure, but the condition becomes more dangerous when a patient's cholesterol level is also high. So, the institute wants to build a simple Machine Learning model to identify such patients and propose it at the board meeting.


  1. Data Understanding

Let's load the data and look at its shape, info and describe output. We need these to get the gist and essence of the dataset. What we learn:

i. We have 768 rows and 9 columns (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age and Outcome).

[Figure: Info -> Shape -> Describe]

ii. We also learn, using the info() command, that there are no null values. Quite handy!

iii. Using describe(), we can see that the mean age of the people recorded is 33 years. This is just one of the EDA inferences we extracted; many more can be drawn. For example: bin the Age column and check Blood Pressure and Insulin against the bins, or look at the Outcome variable against Age. Take these as WIP tasks for your motivation!

iv. The target variable is Outcome; the rest are the independent variables from which we will try to model something.

v. Now, with EDA, we looked at the conversion rate, that is, out of all the people, how many turn out to be diabetic patients. Why? Because after modelling we can compare against this rate to find out how well the model is working!

[Figure: Checking Conversion Rate]
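The first-look steps above can be sketched roughly like this. A small synthetic frame stands in for the real CSV so the sketch is self-contained; the file path and loading call are assumptions, not the post's actual code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Pima diabetes data; in the post this would be
# something like: df = pd.read_csv("diabetes.csv")  # hypothetical path
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "Glucose": rng.integers(70, 200, n),
    "BloodPressure": rng.integers(50, 110, n),
    "Age": rng.integers(21, 70, n),
    "Outcome": rng.integers(0, 2, n),
})

print(df.shape)       # (768, 9) on the real dataset
df.info()             # dtypes and non-null counts: confirms no nulls
print(df.describe())  # summary stats, e.g. mean Age is ~33 on the real data

# Conversion rate: what share of the records are diabetic (Outcome == 1)?
conversion_rate = df["Outcome"].mean() * 100
print(f"Conversion rate: {conversion_rate:.1f}%")
```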

2. Data Manipulation

i. To eliminate outliers, we first need to know which features are actually stretching out. The best way is to look column by column for relations and outliers, but if you have a large number of features, you can draw a pair plot and directly spot the features that have outliers.

ii. Next, to eliminate outliers, we can use either the Z-score method or the IQR method. I have faced both of these as an interview question: which one would you choose, and why?

[Figure: Outlier Checking]

To my understanding, both are quite useful; it's just that IQR is more robust than the Z-score method. The Z-score method uses the mean (and standard deviation) of the feature, so if a feature contains a large number of outlier values, those statistics get skewed and the Z-score method might not give the best result; hence we prefer IQR.

[Figure: Outliers Removal]
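A minimal sketch of the IQR removal, on hypothetical Insulin values; the column name and the 1.5×IQR fence are the usual convention, not taken from the post's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"Insulin": rng.normal(100, 15, 500)})
df.loc[:4, "Insulin"] = 900  # inject a few extreme values as outliers

# IQR method: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["Insulin"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["Insulin"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]

print(len(df), "->", len(df_clean))  # the injected outliers are dropped
```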

iii. Now what? Let's check correlation, since we are here for modelling anyway.

[Figure: Correlation HeatMap]
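A rough sketch of the correlation check, again on a synthetic stand-in frame; the heatmap call is left as a comment since it needs seaborn, and the constructed relationship between Glucose and Outcome is purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in frame: Outcome is built to correlate with Glucose
rng = np.random.default_rng(2)
glucose = rng.normal(120, 30, 200)
outcome = (glucose + rng.normal(0, 30, 200) > 130).astype(int)
df = pd.DataFrame({
    "Glucose": glucose,
    "Age": rng.integers(21, 70, 200),
    "Outcome": outcome,
})

corr = df.corr()
# For the heatmap shown in the post (requires seaborn):
# import seaborn as sns; sns.heatmap(corr, annot=True, cmap="coolwarm")
print(corr["Outcome"].sort_values(ascending=False))
```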

3. Modelling

Let's have a recap: we checked the features for null values and outliers and drew some EDA inferences, which completes the data-preparation step. Note: the data is quite clean from the source, hence no rocket science has been done.

Now for modelling: split the data into independent and target variables -> scale the features -> split into TRAIN and TEST -> fit the model -> check the evaluation metric.

One of the most important things is to fix an evaluation metric for the problem at hand. For our binary classification problem, we choose the F1 score as our metric, as it strikes a balance between precision and recall, letting us handle both false positives and false negatives. If you're still not sure, start from the confusion matrix and tune the model on the sub-metric you want to be more precise about.

The next snippet shows the scaling and splitting, and fitting a logistic regression model on the dataset.

[Figure: Modelling is this much only!]
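The scale-split-fit flow could look roughly like this; `make_classification` is a stand-in for the real CSV, and note I fit the scaler on the train split only (the post scales before splitting, but fitting on train avoids leakage).

```python
from sklearn.datasets import make_classification  # stand-in for the real CSV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: 768 samples, 8 features, like the Pima set
X, y = make_classification(n_samples=768, n_features=8, random_state=42)

# Split first, then fit the scaler only on the train split (avoids leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```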

Cool, right? But how do we check the performance of the model?

[Figure: Confusion Matrix]
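A tiny, purely illustrative confusion-matrix check; the labels here are made up, not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, classification_report

# Toy labels and predictions, purely for illustration
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)                         # rows = actual [0, 1]; cols = predicted [0, 1]
print(f1_score(y_true, y_pred))   # 0.75 for these toy labels
print(classification_report(y_true, y_pred))
```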

Now, here is a task. Let me know in the comments what other ways there are to improve the F1 score for the diabetic class, as we don't want to miss patients who are diabetic but identified as non-diabetic. That is a false negative, i.e. a Type II error (failing to reject the null hypothesis when it is in fact FALSE).

Which features are pulling their weight in the model?

[Figure: Feature Importance for Prediction]

4. XGBoost and Building a Pipeline

Now, from here we will try XGBoost and build a custom pipeline of models consisting of K-Nearest Neighbors, Decision Tree Classifier, SVM with an RBF kernel, SVM with a linear kernel, and Gaussian Naive Bayes. We will select the model(s) based on mean accuracy and standard deviation, as we want a better model closing in on the F1 score.


As far as the XGBoost implementation is concerned, it needs the data to be transformed into a DMatrix and then processed for modelling. I'll discuss this in a fresh post.

We find the XGBoost model overfitting too much, and hence we have to drop this idea. It is not always the case that a complex model solves the problem better; sometimes it just makes khichdi (a mashed-up mess). Yes, you heard it right! 😛

Building the pipeline needs three things: libraries, declarations, and a loop over cross_val_score.

[Figure: Importing Libraries]

Then we define two lists, one for the classifiers and one for the classifier labels.

[Figure: Lists of Classifiers and Labels]

Now we loop over the CV score. CV is the cross-validation technique where the data is trained on k-1 folds and tested on the kth fold. That's it, in brief.

[Figure: CV scores and that pipeline!]
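The whole pipeline can be sketched like this, under the assumption of synthetic stand-in data; the classifier settings are library defaults, not necessarily the post's exact configuration.

```python
from sklearn.datasets import make_classification  # stand-in for the real data
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=768, n_features=8, random_state=42)

# The two lists: classifiers and their labels
classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=42),
    SVC(kernel="rbf"),
    SVC(kernel="linear"),
    GaussianNB(),
]
labels = ["KNN", "Decision Tree", "SVM (RBF)", "SVM (Linear)", "Gaussian NB"]

# Loop over cross_val_score: 5-fold CV, reporting mean accuracy and std dev
results = {}
for clf, name in zip(classifiers, labels):
    scores = cross_val_score(clf, X, y, cv=5)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```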

So the final models we choose are SVM with a linear kernel and Gaussian Naive Bayes, as they perform better than KNN, the Decision Tree and SVM with the RBF kernel.

For full code access, here is the repo link: Diabetes Repo

Now, the confusion matrix and F1 scores for both models.

SVM Results:

[Figure: SVM with Linear Kernel]

GNB Results:

[Figure: Gaussian Naive Bayes]

Now, how do we decide which model is the optimal one among them? The model that is consistent across precision, recall and F1 score, and that is the SVM with a linear kernel. It has an 81% weighted score across all three metrics, making it a good fit for the problem statement we started with.

One of my colleagues asked for my opinion on which F1 averaging should be chosen, and from this scenario it is quite clear that the weighted average is the one to go with.

Why not micro or macro? Micro-averaging pools true positives, false positives and false negatives across classes, so the majority class dominates it, while macro-averaging weighs every class equally and therefore ignores class imbalance; hence weighted is better suited.
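A toy illustration of the three averaging modes on an imbalanced label set (the labels below are made up, not from our model):

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: 6 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

# Compare the three averaging modes on the same predictions
for avg in ("micro", "macro", "weighted"):
    print(avg, round(f1_score(y_true, y_pred, average=avg), 3))
```

Here macro is dragged down by the small positive class, while weighted reflects the class proportions.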

5. Conclusion

We need Data Science and Machine Learning to learn and extract crucial information from historical data, which can be used for the greater good. It is like learning from past experiences to make future decisions.

Now, approaching a Data Science problem starts with finding the purpose your analysis is going to serve or solve. After that, classify it as a regression or classification problem, list the EDA activities you want to perform, list the models for that type of problem, and begin!

These kinds of solutions can help in developing kits that are much quicker at finding whether a person is diabetic or not, instead of just reporting a person's sugar level.

I hope I brought something to the table, and I feel this post can be a blueprint for some peers out there looking to answer some ground-level questions.

Going further, now that we have seen the dataset, EDA techniques, modelling and summarisation, I'll be discussing each of these models technically and giving away some 3 to 5 interview questions on each learning algorithm.

The Knowledge Chest has just been unlocked. Stay Tuned, Stay Safe!


GitHub Repo for more

LinkedIn for staying in touch

Handling Imbalanced Techniques

Thank You🙏

Written by

Data Scientist @WIPRO and Tech Enthusiast. Trying to make this world a better place.
