Predicting Diabetes in Women: A Machine Learning Approach, Episode 1

Divya Chandana

Published in The Deep Hub

5 min read · Feb 11, 2024

Harnessing Machine Learning Tools for Diabetes Prediction

Introduction

Why is it important to safeguard women’s health from diabetes? Approximately 15 million women in the US are grappling with diabetes, a condition that can lead to severe health issues like heart attacks, strokes, and kidney failure. Understanding and addressing this pressing issue is crucial now more than ever. In this blog, we’ll dive into the realm of diabetes prediction for women, utilizing machine learning tools to address this significant health concern.

Context

The subset of the dataset we’re using originates from the National Institute of Diabetes and Digestive and Kidney Diseases. It’s carefully curated, focusing on females, with the primary objective of predicting diabetes based on specific diagnostic measurements.

Data Attributes

The dataset contains various medical terms. Let’s break down what each attribute means:

Pregnancies: Number of times a patient has been pregnant
Glucose: Plasma glucose concentration after a 2-hour oral glucose tolerance test
BloodPressure: Diastolic blood pressure (in mm Hg)
SkinThickness: Triceps skin fold thickness (in mm)
Insulin: 2-hour serum insulin levels (in µU/mL)
BMI (Body Mass Index): Calculated body mass index using weight and height (in kg/m²)
DiabetesPedigreeFunction: Diabetes pedigree function, a score that summarizes the patient’s family history of diabetes
Age: Age of the patient
Outcome: A class variable indicating whether the patient has diabetes (1 for diabetic, 0 for non-diabetic)

Understanding Data

We’ll begin by loading the data into a DataFrame and checking its basic information. The dataset contains 768 rows, and every column is either an integer or a float, so there is no categorical data.

Using df.describe(), we can gain descriptive statistics for numerical columns, including count, mean, standard deviation, minimum, maximum, and quartile values.

Next, we’ll check for null values in the dataset. Fortunately, there aren’t any.
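The inspection steps above can be sketched as follows. This is a minimal example, not the original notebook code: the inline CSV holds only a few illustrative rows so the snippet runs on its own, and the file name `diabetes.csv` mentioned in the comment is an assumption.

```python
import io

import pandas as pd

# A few illustrative rows so the snippet is self-contained. With the real
# data you would call something like pd.read_csv("diabetes.csv")
# (the file name is an assumption).
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
)
df = pd.read_csv(sample_csv)

df.info()                        # column dtypes and non-null counts
summary = df.describe()          # count, mean, std, min, quartiles, max
null_counts = df.isnull().sum()  # number of missing values per column

print(summary.loc[["mean", "min", "max"], ["Glucose", "BMI"]])
print(null_counts)
```

On the full dataset, `df.shape` should report the 768 rows mentioned above.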

Moving on to plots, box plots help us understand the distribution of the data and identify outliers.

In positively skewed data, the distance from the median to the maximum is greater than the distance from the median to the minimum, which suggests the presence of outliers at the high end.
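To make the skewness and outlier idea concrete, here is a small sketch using the 1.5 × IQR rule that box plots conventionally apply to draw their whiskers. The values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical values with a long right tail (not taken from the dataset).
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 20, 90], name="Insulin")

skewness = values.skew()  # a value above 0 indicates a longer right tail

# Box-plot rule: points more than 1.5 * IQR beyond the quartiles are outliers.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"skewness={skewness:.2f}, bounds=({lower}, {upper})")
print(outliers.tolist())
```

Here the median is 15, so the maximum (90) sits far further from it than the minimum (10), and 90 is the only point flagged as an outlier.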

The glucose histogram shows a clear association between blood sugar levels and the likelihood of diabetes. However, the box plots for blood pressure overlap substantially between the diabetic and non-diabetic groups, which makes blood pressure hard to interpret on its own.

Further analysis involves comparing these features between diabetic and non-diabetic groups to uncover potential patterns and insights. It’s worth noting the presence of zeros in the glucose, blood pressure, and BMI attributes. Such readings are not physiologically plausible, which points to missing or inconsistent data. Additionally, the dataset contains a considerable number of outliers, which will be addressed through normalization.
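A quick way to quantify the suspicious zeros is to count them per column. The sketch below uses a handful of illustrative rows rather than the real data.

```python
import pandas as pd

# Illustrative rows only; in practice df would be the full dataset.
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# A zero is not a plausible reading for any of these measurements,
# so each zero most likely stands in for a missing value.
suspect_columns = ["Glucose", "BloodPressure", "BMI"]
zero_counts = (df[suspect_columns] == 0).sum()
print(zero_counts)
```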

Feature Engineering

We enhance the dataset by introducing new features such as age group, blood pressure group, BMI group, and is-pregnant group. These features aim to provide extra context and potentially improve the model’s accuracy.
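One way to derive such grouped features is with `pd.cut`. This is only a sketch: the article does not state its thresholds, so every bin edge, label, and new column name below is an assumption chosen for illustration.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "Age": [22, 35, 52, 67],
    "BloodPressure": [60, 85, 95, 72],
    "BMI": [17.5, 24.0, 28.3, 36.1],
    "Pregnancies": [0, 2, 5, 0],
})

# All bin edges, labels, and column names are assumptions, not the
# thresholds used in the original analysis.
df["AgeGroup"] = pd.cut(
    df["Age"], bins=[0, 30, 45, 60, 120],
    labels=["young", "adult", "middle-aged", "senior"],
)
df["BPGroup"] = pd.cut(
    df["BloodPressure"], bins=[0, 80, 90, 200],
    labels=["normal", "elevated", "high"], right=False,
)
df["BMIGroup"] = pd.cut(
    df["BMI"], bins=[0, 18.5, 25, 30, 100],
    labels=["underweight", "normal", "overweight", "obese"], right=False,
)
# 1 if the patient has been pregnant at least once, else 0.
df["IsPregnant"] = (df["Pregnancies"] > 0).astype(int)

print(df[["AgeGroup", "BPGroup", "BMIGroup", "IsPregnant"]])
```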

The bar graphs show some notable trends, such as similar diabetes risk levels for adults and middle-aged individuals. Moreover, women who have been pregnant show a higher ratio of diabetes than those who have never been pregnant, although it is important to note that not all women who have been pregnant develop diabetes.

Looking at the correlations, glucose has the highest correlation with the outcome (0.47), indicating a positive relationship between glucose levels and the likelihood of diabetes. Other features, such as BMI and age, also show moderate correlations with the outcome.

Further examination of internal feature correlations reveals moderate correlations between age and pregnancies, as well as between BMI and skin thickness. These findings suggest potential relationships between these factors, which may contribute to predicting diabetes in women.
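Both kinds of correlation discussed above, feature-to-outcome and feature-to-feature, come from the same correlation matrix. Here is a minimal sketch on a few illustrative rows, so the numbers will not match the 0.47 reported for the full dataset.

```python
import pandas as pd

# Illustrative rows only; the real analysis uses all columns and all rows.
df = pd.DataFrame({
    "Glucose": [85, 89, 110, 137, 148, 183],
    "Age": [31, 21, 30, 33, 50, 32],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

corr = df.corr()  # Pearson correlation by default

# Correlation of each feature with the target, strongest first.
outcome_corr = corr["Outcome"].drop("Outcome").sort_values(ascending=False)
print(outcome_corr)

# Feature-to-feature correlations are the off-diagonal entries.
print(corr.loc["Glucose", "Age"])
```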

Feature Importance

Glucose appears to be the most important feature, followed by BMI, diabetes pedigree function, and age.

The newly created features (age group, blood pressure group, BMI group, and is-pregnant group) seem less influential, so we consider removing them to improve model performance.
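The article does not say how feature importance was computed. One common option, shown here as an assumption, is the impurity-based `feature_importances_` attribute of a fitted Random Forest. The data below is synthetic and built so that only `Glucose` drives the label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data: only "Glucose" influences the label here.
X = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 7, n),
    "AgeGroup": rng.integers(0, 4, n),
})
y = (X["Glucose"] + rng.normal(0, 10, n) > 130).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Features with very low importance, like the unrelated columns in this toy example, are natural candidates for removal.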

Normalization

Data normalization, specifically scaling, is performed to ensure that all features contribute equally to the model’s predictions. This preprocessing step helps improve model convergence and performance.
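A minimal scaling sketch follows, assuming scikit-learn’s `StandardScaler` (the article does not name the scaler it used). The scaler is fitted on the training split only, so no information from the test set leaks into preprocessing. The data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic two-feature data on different scales (glucose-like, BMI-like).
X = rng.normal(loc=[120, 32], scale=[30, 7], size=(200, 2))
y = rng.integers(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```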

Modelling

Several machine learning models, including Decision Tree, Random Forest, Gradient Boosting, and Support Vector Classifier (SVC), are trained and evaluated.
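Training and comparing these four model families might look like the sketch below. It uses synthetic data from `make_classification` and default hyperparameters, both of which are assumptions, so the scores will differ from the article’s results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the eight diagnostic features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVC": SVC(),
}

# Fit each model on the training split and score it on the held-out split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in sorted(test_accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.4f}")
```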

Cross-Validation

When evaluated on the test data, Random Forest achieves the highest accuracy, 74.68%.

Best Random Forest Test Accuracy: 0.7468
Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}
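Output in this form typically comes from a cross-validated grid search. The sketch below assumes `GridSearchCV` was used, which the article does not confirm. The grid contains the reported best values, but the other candidate values and the synthetic data are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the scores will not match the article's.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Candidate values are assumptions; only the parameter names come from
# the reported output.
param_grid = {
    "max_depth": [5, 10],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training split
    scoring="accuracy",
)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)  # uses the refitted best model
print(f"Best Random Forest Test Accuracy: {test_accuracy:.4f}")
print(f"Best Parameters: {search.best_params_}")
```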

Re-Check Data

Further analysis is conducted to identify potential improvements in the model’s performance. Strategies such as handling zero values and hyperparameter tuning for Random Forest are explored to enhance model accuracy.
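One common way to handle the zero values is to treat them as missing and impute the column median. The article does not state which strategy it used, so treat this as an assumption. The rows are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1],
})

cols = ["Glucose", "BMI"]

# Step 1: mark implausible zeros as missing.
cleaned = df[cols].replace(0, np.nan)

# Step 2: fill each gap with that column's median over the non-missing values.
cleaned = cleaned.fillna(cleaned.median())

print(cleaned)
```

In a full pipeline, the medians would be computed on the training split only and then reused for the test split.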

Result

The model’s accuracy improves from 74.68% to 78.57% after fine-tuning and adjustments. A detailed confusion matrix illustrates the model’s performance in predicting true positives, false positives, true negatives, and false negatives.

Test set evaluation (after hyperparameter tuning):
Best Random Forest Test Accuracy: 0.7857
Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 50}

Confusion Matrix:
[[84 15]
 [18 37]]

True Positive (TP): 37 cases were correctly predicted as diabetic.
False Positive (FP): 15 cases were incorrectly predicted as diabetic when they were actually non-diabetic.
True Negative (TN): 84 cases were correctly predicted as non-diabetic.
False Negative (FN): 18 cases were incorrectly predicted as non-diabetic when they were actually diabetic.
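The four counts above are enough to recompute the reported accuracy and to derive precision, recall, and F1, which the article does not report. A short sketch in plain Python:

```python
# Counts taken from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 84, 15, 18, 37

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of predicted diabetics, how many truly are
recall = tp / (tp + fn)        # of actual diabetics, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}")    # 0.7857
print(f"precision={precision:.4f}")  # 0.7115
print(f"recall={recall:.4f}")        # 0.6727
print(f"f1={f1:.4f}")                # 0.6916
```

The recall of about 0.67 means roughly one in three diabetic cases in the test set was missed.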

Conclusion

While significant improvements have been made in predicting diabetes using machine learning techniques, there are still areas of concern and room for improvement, particularly regarding false predictions (18 false negatives and 15 false positives on the test set). Further research and refinement of the predictive models are still needed.

Stay tuned for Episode 2, which will cover future research using new models, advanced feature engineering techniques, and enhancements in diabetes prediction for women’s health. Our goal is to ensure better outcomes and quality of life for individuals at risk of diabetes.