“Exploring the End-to-End Process of Diabetes Prediction Machine Learning Project”

Vishal Shelar
6 min read · Jul 2, 2023


Table of contents :-

· 1. Introduction :-
· 2. Problem Statement :-
· 3. Dataset Description :-
· 4. Data Pre-processing :-
· 5. Feature Engineering :-
· 6. Model Training & Testing :-
· 7. Model Deployment :-

1. Introduction :-

In this blog, we will explore a machine learning project that aims to predict diabetes using a dataset called “diabetes”.

Diabetes prediction is important because it helps us identify people who are at risk of developing the disease. By accurately predicting diabetes, we can take early action to prevent or manage it effectively. This can lead to better health outcomes and help people make healthier choices.

Throughout this blog, we will go through the step-by-step process of building a machine learning project. We will start by cleaning and organizing the “diabetes” dataset. Then, we will scale the features, train and tune a logistic regression model, evaluate it on held-out data, and save it for deployment.

2. Problem Statement :-

The main goal of this diabetes prediction project is to create a machine learning model that predicts whether an individual has diabetes. We will use the health measurements in the dataset, such as glucose level, blood pressure, BMI, and age, to train a classifier and measure how accurately it separates diabetic from non-diabetic cases.

3. Dataset Description :-

The dataset used for the Diabetes Prediction project was sourced from Kaggle, a popular platform for data science and machine learning enthusiasts.

Attribute Information :-

The dataset for this diabetes prediction project contains the following columns:

  1. Pregnancies: This column represents the number of times a person has been pregnant.
  2. Glucose: This column represents the blood sugar level (glucose concentration) measured in the person’s body.
  3. Blood Pressure: This column represents the blood pressure level of the person in millimeters of mercury (mmHg).
  4. Skin Thickness: This column represents the skin thickness measured in millimeters.
  5. Insulin: This column represents the insulin level in the person’s body measured in micro-units per milliliter (μU/ml).
  6. BMI (Body Mass Index): This column represents the individual’s body mass index, which is a measure of body fat based on height and weight.
  7. DiabetesPedigree: This column represents the diabetes pedigree function, which provides an indication of the genetic influence of diabetes based on family history.
  8. Age: This column represents the age of the person in years.
  9. Outcome: This column indicates the presence or absence of diabetes, where 1 represents the presence and 0 represents the absence.
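
A quick way to confirm that a loaded file matches this description is to compare its columns against the expected names and look at the class balance of Outcome. The sketch below is only illustrative: the helper name `summarize_columns` and the four-row sample frame are made up for this example and are not part of the project code.

```python
import pandas as pd

def summarize_columns(df, expected, target="Outcome"):
    """Return the expected columns missing from df and the share of each class in target."""
    missing = [c for c in expected if c not in df.columns]
    balance = df[target].value_counts(normalize=True).to_dict()
    return missing, balance

# Tiny made-up frame, only to show the call; not real patient data
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})
missing, balance = summarize_columns(sample, ["Glucose", "BMI", "Age", "Outcome"])
print(missing)   # columns that were expected but not found
print(balance)   # fraction of rows per Outcome value
```

On the real file, a strongly uneven class balance would be a reason to look at precision and recall in addition to accuracy.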

4. Data Pre-processing :-

#Let's start with importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

#read the data file
data = pd.read_csv('../Dataset/diabetes.csv')
data.head()

Output :-

# Dataset Description
data.describe()

Output :-

# Checking for null values :-

data.isnull().sum()

Output :- There are no null values

# We can see that a few rows have a value of 0 in the Glucose, Insulin, Skin Thickness, BMI and Blood Pressure columns. That’s not possible, right? A quick search confirms that a living person cannot have a value of 0 for these measurements, so the zeros most likely stand in for missing readings.

Let’s deal with that. We can either remove such rows or simply replace the zeros with the mean of their respective columns.

# BMI, blood pressure, glucose, insulin and skin thickness cannot really be zero, so let's fix that
# by replacing the zero values with the mean of the column
data['BMI'] = data['BMI'].replace(0,data['BMI'].mean())
data['BloodPressure'] = data['BloodPressure'].replace(0,data['BloodPressure'].mean())
data['Glucose'] = data['Glucose'].replace(0,data['Glucose'].mean())
data['Insulin'] = data['Insulin'].replace(0,data['Insulin'].mean())
data['SkinThickness'] = data['SkinThickness'].replace(0,data['SkinThickness'].mean())
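
One caveat with the replacement above: each column mean is computed while the zeros are still present, so the fill value is pulled toward zero. A possible alternative, sketched here on a made-up column rather than the real data, is to mark the zeros as missing first so that the mean is taken over the valid readings only. The helper name `replace_zeros_with_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

def replace_zeros_with_mean(df, columns):
    """Treat 0 as missing in the given columns and fill it with the mean of the non-zero values."""
    out = df.copy()
    for col in columns:
        valid = out[col].replace(0, np.nan)     # zeros become NaN
        out[col] = valid.fillna(valid.mean())   # mean() skips NaN by default
    return out

# Made-up values, only for illustration
toy = pd.DataFrame({"BMI": [30.0, 0.0, 20.0, 40.0]})

with_zeros = toy["BMI"].replace(0, toy["BMI"].mean()).tolist()
without_zeros = replace_zeros_with_mean(toy, ["BMI"])["BMI"].tolist()
print(with_zeros)     # [30.0, 22.5, 20.0, 40.0]  (mean includes the zero)
print(without_zeros)  # [30.0, 30.0, 20.0, 40.0]  (mean of valid readings only)
```

In a stricter setup the fill value would also be computed from the training split only, after the train/test split, so that no information from the test rows reaches the model.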

5. Feature Engineering :-

# Scaling :-

#segregate the dependent and independent variable
X = data.drop(columns = ['Outcome'])
y = data['Outcome']

# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=0)


## Standard scaling (standardization)
def scaler_standard(X_train, X_test):
    # fit the scaler on the training data only, then apply it to both splits
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # return the fitted scaler as well so it can be saved for deployment
    return X_train_scaled, X_test_scaled, scaler


X_train_scaled, X_test_scaled, scaler = scaler_standard(X_train, X_test)
X_train_scaled

Output :-
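
As a side note on what the scaling step does: standardization rescales each feature to z = (x - mean) / std, where the mean and standard deviation come from the training split only, and the test split is transformed with those same statistics. A small check on made-up numbers (not the diabetes data) illustrates this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # made-up training feature
test = np.array([[2.0], [4.0]])           # made-up test feature

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

# The same formula by hand (NumPy's std uses the population formula, like StandardScaler)
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
print(train_scaled.mean(), train_scaled.std())
```

The scaled training column ends up with mean 0 and standard deviation 1, while the test column generally does not, because it is scaled with the training statistics rather than its own.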

6. Model Training & Testing :-

# Logistic Regression :-

log_reg = LogisticRegression()

log_reg.fit(X_train_scaled,y_train)

Output :-

## Hyperparameter Tuning
## GridSearch CV
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
# parameter grid: 'newton-cg' and 'lbfgs' support the l2 penalty but not l1,
# so the l1 penalty is paired with 'liblinear' only
parameters = [
    {'penalty' : ['l2'], 'C' : np.logspace(-3,3,7), 'solver' : ['newton-cg', 'lbfgs', 'liblinear']},
    {'penalty' : ['l1'], 'C' : np.logspace(-3,3,7), 'solver' : ['liblinear']},
]
logreg = LogisticRegression()
clf = GridSearchCV(logreg,                  # model
                   param_grid = parameters, # hyperparameters
                   scoring='accuracy',      # metric for scoring
                   cv=10)                   # number of folds

clf.fit(X_train_scaled,y_train)

Output :-
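
After fitting, the search object exposes which combination won and its mean cross-validated score. The following sketch runs on synthetic data from `make_classification` with a reduced grid, both assumptions made so the snippet is self-contained; on the real data the same attributes would be read from `clf`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 7)},  # same C range, default penalty and solver
    scoring="accuracy",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)            # winning hyperparameters
print(round(demo_search.best_score_, 3))   # mean cross-validated accuracy of the winner
best_model = demo_search.best_estimator_   # refit on all of X_demo with the best parameters
```

Calling `predict` on the search object, as done in this post, delegates to that refitted best estimator.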

y_pred = clf.predict(X_test_scaled)
# Confusion Matrix
conf_mat = confusion_matrix(y_test,y_pred)
conf_mat

Output :-

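# scikit-learn returns [[TN, FP], [FN, TP]] for labels 0 and 1 (rows = actual, columns = predicted)
true_negative = conf_mat[0][0]
false_positive = conf_mat[0][1]
false_negative = conf_mat[1][0]
true_positive = conf_mat[1][1]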
Accuracy = (true_positive + true_negative) / (true_positive +false_positive + false_negative + true_negative)
Accuracy

Output :-

Precision = true_positive/(true_positive+false_positive)
Precision

Output :-

Recall = true_positive/(true_positive+false_negative)
Recall

Output :-

F1_Score = 2*(Recall * Precision) / (Recall + Precision)
F1_Score

Output :-
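
For reference, scikit-learn lays out the binary confusion matrix with actual labels as rows and predicted labels as columns, so with labels 0 and 1 the cells are [[TN, FP], [FN, TP]]. The sketch below uses made-up labels to show the unpacking and cross-checks the hand formulas against `precision_score`, `recall_score` and `f1_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels, only for illustration (1 = diabetic, 0 = not diabetic)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)        # 3 1 1 3
print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

The last two lines should print matching values, which is a cheap way to catch a swapped cell when unpacking the matrix by hand.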

7. Model Deployment :-

In my diabetes prediction project, I used logistic regression as the sole machine learning algorithm. The model achieved an accuracy of 79.68% on the test set.

To proceed with deployment, I saved the fitted scaler and the trained logistic regression model to pickle files so that they can be loaded later without retraining. Each pickle file stores the fitted object together with its learned parameters.

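#saving the fitted scaler and the model
import pickle
with open('../Model/standardScalar.pkl','wb') as file:
    pickle.dump(scaler,file)   # the fitted StandardScaler from the scaling step

with open('../Model/modelForPrediction.pkl','wb') as file:
    pickle.dump(log_reg,file)  # log_reg is the untuned model; clf.best_estimator_ holds the tuned one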
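
At prediction time the two saved objects are loaded back and applied in the same order: scale first, then predict. The sketch below is a minimal illustration under stated assumptions: it trains a throwaway scaler and model on synthetic data and writes them to a temporary directory, and the helper `predict_from_files` and the file names are hypothetical rather than taken from the project code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def predict_from_files(scaler_path, model_path, rows):
    """Load a pickled scaler and model, then scale and classify the given rows."""
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    return loaded_model.predict(loaded_scaler.transform(rows))

# Throwaway stand-ins for the fitted scaler and model (synthetic data, 8 features)
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression().fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "scaler.pkl")   # hypothetical file name
    model_path = os.path.join(tmp, "model.pkl")     # hypothetical file name
    with open(scaler_path, "wb") as f:
        pickle.dump(demo_scaler, f)
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)
    reloaded_pred = predict_from_files(scaler_path, model_path, X_demo[:5])

# Predictions from the reloaded objects should match the in-memory ones
direct_pred = demo_model.predict(demo_scaler.transform(X_demo[:5]))
print(np.array_equal(reloaded_pred, direct_pred))
```

The same scale-then-predict order applies whatever front end collects the input values. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.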

For deployment, I chose Streamlit, a user-friendly framework that simplifies the creation of interactive web applications. Streamlit offers an intuitive platform to demonstrate the predictive power of the logistic regression model in a visually appealing and easy-to-use manner.

Here is the link to access the code for Diabetes prediction project :-

Here is the link to access the live demo of Diabetes prediction project : -

Thank you for taking the time to read my blog. Your support and engagement mean the world to me. I sincerely appreciate your interest in my project and hope that it has provided you with valuable insights. Your continued readership and feedback inspire me to keep sharing knowledge and striving for excellence. Thank you for being a part of this journey.
