Diabetes Prediction Using Linear Regression

A guided data science project

Eulene
5 min readNov 15, 2023

Introduction

Diabetes is a chronic metabolic disorder that poses a significant health challenge globally. Being able to make an early and accurate diagnosis is pivotal for effective management and prevention of complications.

In this project, our main aim is to develop a robust predictive model for diagnosing diabetes based on a variety of diagnostic measurements.

About data set

The data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases and has been curated with some specific constraints.

All subjects in the data set are female, at least 21 years old, and belong to the Pima Indian heritage. This targeted selection ensures a focused exploration of diabetes prediction within a specific demographic, enhancing the model’s applicability to a well-defined population.

You can access this data set from Kaggle: https://www.kaggle.com/code/gopalj/diabetes-prediction-using-python/input

Objectives

  1. Classify whether someone has diabetes or not from given features
  2. To determine which features are the most indicative of diabetes

Project Outline

  1. Import libraries and load the data
  2. Explore the data
  3. Correlation Matrix Visualization
  4. Train model: logistic regression
  5. Test and evaluate the model
  6. Conclusion

Import required libraries

Pandas: Simplifies data analysis and manipulation with Data Frames and Series.

NumPy : Powerful for numerical computing, handling large arrays, and offering high-level math functions.

Matplotlib: Your go-to for creating diverse plots like lines, scatter plots, bars, and histograms.

Seaborn: Enhances Matplotlib for stylish and informative statistical data visualization.

Import libraries for prediction

Sklearn: Short for Scikit-learn, a Python machine learning library offering tools for data analysis and model development.

Train Test Split: A technique to divide data into training and testing sets, crucial for evaluating model performance.

Logistic Regression: A method utilizing the logistic function to predict the probability of a binary outcome.

Accuracy: A metric gauging the percentage of correctly classified instances in a classification model.

Load data set

Explore the data

Check for general info on the data set :

This reveals that :

  • the table has 768 rows and 9 columns
  • The table has two data types: 7 integer columns and 2 float columns

Check for null values

It is observed that there are no null values in the data set.

Correlation

A correlation matrix is a graphical representation of the correlation coefficients between different variables in a data set. The correlation coefficient measures the strength and direction of a linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no correlation.

Create a correlation data frame

Create a correlation visualization matrix

Observation

Since our main focus is the outcome, we will observe how independent variables affect the dependent variable (outcome).
It is observed that glucose, BMI, age, and pregnancies have substantial effects on the outcome.

Train the model

Split the data into train and test

x stores all the independent variables

y stores the dependent variables (what we are trying to predict )

Train-test splits the data into two parts: 20% for testing and 80% for training, helping us see how accurate our predictions are.

Model training

In this case, I use logistic regression.

Logistic regression is a statistical method used for binary classification, which means predicting one of two possible outcomes. It models the relationship between a dependent variable (usually representing whether an event occurs or not) and one or more independent variables by estimating probabilities using the logistic function. The logistic function transforms a linear combination of input features into values between 0 and 1, representing the probability of the event happening.

Model Testing

Using the test data, predict the outcome of the test values and test model accuracy .

Observation

The accuracy of the model is approximately 78%

Conclusion

After analyzing the comprehensive set of patient records, I developed a machine learning model; logistic regression with an accuracy of 78% , to precisely forecast whether individuals in the dataset have diabetes.

Furthermore, valuable insights were gained from the data through in-depth analysis and visualization techniques. Showing that some of the factors that highly influence the occurrence of diabetes include: glucose levels, age, and BMI.

--

--