# A New Approach to Learning: Diabetes Prediction Using a Custom Pipeline

Hey there everyone, I am Atul; I hope you'll keep me in your brain cells after this post. Starting today, I'm writing a series of blogs that will answer the following questions for starters:

- What is **Data Science**?
- Why do we need **Data Science**?
- How do I approach a **Data Science** problem?

These are three very essential questions that I feel should run in the veins of every aspiring Data Scientist, but that's just one man's opinion!

My motive:

- Perform a **full hands-on** exercise to give a gist of how things work in a Data Science problem.
- Then explain the hands-on work using some great analogies, and show how the **work done** by us can help in the real world.
- Finally, move to a **generalisable** answer to the questions above.

**Problem Statement:**

An XYZ Medical Institute is trying to run a survey to study the patterns of **diabetic** people out there. At a basic level, sugar patients tend to have high blood sugar and high blood pressure, but the situation becomes more dangerous when a patient's **cholesterol level** is also high. So, the institute wants to build a simple **Machine Learning model** to identify such patients and propose it at the **board meeting**.

**Approach:**

**Data Understanding**

Let's load the data and look at its **shape**, **info**, and **describe** output. We need these to get the gist and essence of the dataset. Here is what we learn:

i. We have **768** rows and **9** columns (**Pregnancies, Blood Pressure, Insulin, Glucose, Skin Thickness, BMI, Age, DiabetesPedigreeFunction and Outcome**).

ii. Then, using the **info()** command, we also learn that there are no null values. Quite handy!

iii. Using **describe()**, we can see that the mean age of the people recorded is **33 years**. This is just one of the EDA inferences we extracted; many more are possible. For example: **bin the Age column and check Blood Pressure and Insulin against the bins, or look at the Outcome variable against the Age variable**. Take these as WIP tasks for your motivation!
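The inspection steps above can be sketched as follows. This is a minimal sketch, not the original notebook code: a tiny synthetic DataFrame stands in for the real 768-row diabetes CSV, and the column names and bins are illustrative.

```python
import pandas as pd

# Tiny stand-in for the diabetes dataset (the real one has 768 rows, 9 columns)
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

print(df.shape)       # (rows, columns)
df.info()             # dtypes and non-null counts -> spot missing values
print(df.describe())  # summary stats, e.g. the mean of Age

# The suggested WIP task: bin Age and check Blood Pressure per bin
df["AgeBin"] = pd.cut(df["Age"], bins=[20, 30, 40, 50],
                      labels=["20s", "30s", "40s"])
print(df.groupby("AgeBin", observed=True)["BloodPressure"].mean())
```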

iv. The target variable is **Outcome**; the rest are the independent variables from which we will try to model something.

v. Next, with EDA, we looked at the conversion rate, that is, **out of all the people in the dataset, how many turn out to be diabetic patients**. Why? If needed, we can compare this rate after modelling to see how well the model is working.
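The conversion rate is just the share of positive Outcomes. A minimal sketch, using a short stand-in Series rather than the real Outcome column:

```python
import pandas as pd

# Stand-in for the Outcome column (1 = diabetic, 0 = non-diabetic)
outcome = pd.Series([1, 0, 1, 0, 1, 0, 0, 0])

# Proportion of each class -> the "conversion rate" is the share of 1s
rates = outcome.value_counts(normalize=True)
diabetic_rate = outcome.mean()  # for a 0/1 column, the mean IS the rate
print(rates)
print(f"Conversion rate: {diabetic_rate:.1%}")
```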

2. **Data Manipulation**

i. To eliminate outliers, we first need to know which **features** are actually stretching out. The best way is to look column by column to find relations and outliers, but if you have a large number of **features**, you can draw a **pair plot** and directly spot the features that have outliers.

ii. Next, to eliminate **outliers**, we can use either the **Z-score methodology** or the **IQR methodology**. I have been asked about both of them in an interview: which one to choose, and why?

To my personal understanding, both are quite useful; it is just that **IQR** is more **robust** than the **Z-score** method. Since the Z-score method relies on the mean of the feature, a feature containing a large number of outlier values will skew that mean, so the Z-score method might not give us the best result; hence we prefer **IQR**.
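A minimal sketch of both filters, on made-up numbers, that also demonstrates the robustness point: the extreme value inflates the mean and standard deviation enough that the Z-score filter lets it through, while the IQR filter catches it.

```python
import numpy as np

def iqr_mask(x, k=1.5):
    # Keep values inside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x >= q1 - k * iqr) & (x <= q3 + k * iqr)

def zscore_mask(x, thresh=3.0):
    # Keep values within `thresh` standard deviations of the mean
    z = (x - x.mean()) / x.std()
    return np.abs(z) < thresh

data = np.array([25.0, 27.0, 26.0, 28.0, 24.0, 90.0])  # 90 is an outlier

print(data[iqr_mask(data)])     # drops 90
print(data[zscore_mask(data)])  # 90 survives: the outlier inflated the std
```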

iii. Now what? -> Let's check correlation, since we are here for modelling anyway.
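Checking correlation against the target is a one-liner with pandas. A sketch on a synthetic stand-in frame; with the real dataset you would call `df.corr()` on all nine columns:

```python
import pandas as pd

# Stand-in for a few columns of the diabetes dataset
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

corr = df.corr()
# Sort features by how strongly they move with the target
print(corr["Outcome"].sort_values(ascending=False))
```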

3. **Modelling**

Let's have a recap: we checked the features for null values and outliers and drew some EDA inferences, which completes the **Data Preparation** step. Note: **the data from the source is already quite suitable, hence no ROCKET SCIENCE has been done.**

Now for modelling: **split the data into independent and target variables -> scale the features -> split into TRAIN and TEST sets -> perform modelling -> check the evaluation metric.**

One of the most important things is to fix an **evaluation metric** for the **classification or regression problem. For our binary classification problem, we choose the F1 score** as our metric, as it strikes a balance between **precision and recall**, letting us handle both false positives and false negatives. If you are still not sure, use the **CONFUSION MATRIX** as the metric and tune the model on its sub-metrics to be more precise.

The next snippet shows scaling, splitting, and fitting a **Logistic Regression model** on the dataset.
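The original snippet isn't reproduced here, so below is a minimal sketch of the same scale -> split -> model -> evaluate flow, with `make_classification` standing in for the diabetes features and Outcome:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix

# Synthetic stand-in for the 768-row, 8-feature diabetes data
X, y = make_classification(n_samples=768, n_features=8, random_state=42)

# Scale the features, then split (order follows the post's flow)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)

f1 = f1_score(y_test, pred)
cm = confusion_matrix(y_test, pred)
print("F1 score:", f1)
print(cm)
```

One caveat worth knowing: fitting the scaler on the full data before splitting leaks test information into training; fitting it on the train split only is the stricter practice.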

Cool, right? But how do we check the performance of the model?

Now, here is the task. Let me know in the comments how else we can improve the F1 score for the **diabetic** class: we don't want to miss patients who are diabetic but identified as non-diabetic, since that is a false negative, i.e. a **Type II error (failing to reject the null hypothesis when it is in fact FALSE).**

Which features are doing the heavy lifting in the modelling?

4. **XGBoost and Building a Pipeline**

Now, from here we will try **XGBoost** and build a custom pipeline of models consisting of **'K-nearest neighbors', 'Decision Tree classifier', 'SVM classifier with RBF kernel', 'SVM classifier with linear kernel'** and **'Gaussian Naive Bayes'**, and we will select the model(s) based on mean accuracy and standard deviation, as we want a better model closing in on the F1 score.

As far as the XGBoost implementation is concerned, it needs the data to be transformed into a **DMatrix** and then processed for modelling. I'll discuss this in a fresh post.

We find the **XGBoost** model to be overshooting (overfitting) too much, hence we have to drop this idea. A complex model cannot always solve a problem in a better manner; sometimes it just makes **KHICHDI**. Yes, you heard it right!**😛**

Building the pipeline needs three things: **libraries, declarations, and looping over cross_val_score**.

Then we define two lists, for the classifiers and the classifier labels respectively.

Now we loop over the CV score. CV is the cross-validation technique where the data is split into **k** folds, and in each round the model is trained on **k-1** folds and tested on the remaining fold. That's it in brief.
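The three pieces above (libraries, the two declaration lists, the `cross_val_score` loop) can be sketched as follows; synthetic data stands in for the prepared diabetes features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Stand-in for the scaled diabetes features and Outcome
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Declarations: one list of classifiers, one list of labels
classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
    SVC(kernel="rbf"),
    SVC(kernel="linear"),
    GaussianNB(),
]
labels = ["KNN", "Decision Tree", "SVM (RBF)", "SVM (linear)", "Gaussian NB"]

# Loop over cross_val_score: 5-fold CV, report mean and std per model
results = {}
for clf, name in zip(classifiers, labels):
    scores = cross_val_score(clf, X, y, cv=5)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```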

So the final models we choose are **SVM with linear kernel and Gaussian Naive Bayes**, as they perform better than KNN, the Decision Tree, and SVM with RBF kernel.

For full code access, here is the repo link: Diabetes Repo

Now the **confusion matrix and F1 scores** for both models.

SVM Results:

GNB Results:

Now, how do we decide which model is the optimal one among them? -> The model that is consistent across **precision, recall and F1 score**, and that is the **SVM model with linear kernel**. It has an 81% weighted score across all three metrics, making it a feasible fit for the problem statement we started with.

One of my colleagues asked which averaging of the **F1 score** should be chosen, and from this scenario it is clear that the **weighted average** is the one that should be valued.

Why not **MICRO or MACRO?** MICRO aggregates the **true positives, false positives and false negatives** globally across classes, while MACRO averages the per-class scores equally and so doesn't account for class imbalance; hence WEIGHTED, which weights each class by its support, is more suited.
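The difference is easy to see on a small imbalanced example (made-up labels, six non-diabetic vs. two diabetic): macro treats both classes equally, while weighted leans toward the majority class's per-class F1.

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: 6 non-diabetic (0) vs. 2 diabetic (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

micro = f1_score(y_true, y_pred, average="micro")       # global TP/FP/FN
macro = f1_score(y_true, y_pred, average="macro")       # classes weighted equally
weighted = f1_score(y_true, y_pred, average="weighted") # classes weighted by support

print(f"micro={micro:.3f}  macro={macro:.3f}  weighted={weighted:.3f}")
```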

5. **Conclusion**

We need **Data Science and Machine Learning** to learn and extract crucial information from historical data, which can be put to use for the greater good. It is like learning from past experiences to make future decisions.

Now, to approach a **Data Science problem**, first find out what purpose your analysis is going to serve or solve. After that, classify it as a **regression or classification** problem, list the EDA activities you want to perform, list the models suited to that type of problem, and begin!

Solutions like these can help in developing kits that are much quicker at finding whether a person is diabetic or not, instead of just reporting the person's sugar level.

I hope I brought something to the table, and I feel this post can be a **blueprint** for some peers out there looking to answer some ground-level questions.

Going further, now that we have seen the dataset, EDA techniques, modelling and summarisation, I'll be discussing each of these models **technically** and giving away some 3 to 5 interview questions for each learning algorithm.

**The Knowledge Chest has just been unlocked. Stay Tuned, Stay Safe!**

Resources:

LinkedIn for staying in touch

Handling Imbalanced Data Techniques

**Thank You**🙏