An Exploration of the Diabetes Dataset

Umesha Sanjali
8 min read · May 28, 2024


Welcome, fellow data enthusiasts! In this post I share my journey of exploring data and making predictions using R. Using the Diabetes dataset, I will explore the intriguing fields of data analysis and predictive modelling.

In this article, you will learn:

How to investigate data and get it ready for analysis.

Techniques for creating interactive visualizations with R Shiny.

How to identify patterns within the data and understand their significance.

The process of making predictions using regression models.

How to extract valuable insights by fusing visualisations with predictions.

Exploring Data

First, I got the data ready for analysis. To begin my exploration, I downloaded the diabetes dataset from kaggle.com, where it was published by Mr. Akshay Dattatray Khare. The link to the dataset is attached below. The dataset includes several attributes related to diabetes patients, such as age, BMI, blood pressure, and glucose levels.

Diabetes Dataset (kaggle.com)

First, I imported the dataset into an R script and loaded the packages, including “ggplot2”, “datasets”, and others. Then I generated a summary of the dataset and checked it for missing values.

Here’s a preview of the first steps I took to get the data ready for analysis:

Here are the results of the analysis:

Identifying any missing values in the dataset:

Confirming that there are no missing values

Let’s go through the R code step by step to explain the details and results:

### Loading Libraries

These lines load the necessary libraries for data manipulation (dplyr) and visualization (ggplot2).

### Loading the Dataset

This line reads the dataset from a CSV file located at the specified path and stores it in a variable named diabetes_data.
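As a rough sketch of this loading step, the snippet below assumes the Kaggle file was saved as `diabetes.csv` in the working directory. That file name is hypothetical, since the original path is not shown here. So that the sketch runs even without the file, it falls back to a tiny made-up data frame that uses the column names listed in the summary later in this post.

```r
# Hypothetical file name; adjust to wherever the Kaggle CSV was saved.
csv_path <- "diabetes.csv"

if (file.exists(csv_path)) {
  # Read the CSV into a data frame named diabetes_data.
  diabetes_data <- read.csv(csv_path)
} else {
  # Made-up stand-in rows (NOT real patient data), only so the sketch runs.
  diabetes_data <- data.frame(
    Pregnancies              = c(2, 0, 5, 1),
    Glucose                  = c(110, 95, 160, 130),
    BloodPressure            = c(70, 64, 80, 72),
    SkinThickness            = c(20, 0, 35, 25),
    Insulin                  = c(0, 0, 150, 90),
    BMI                      = c(28.5, 22.1, 35.4, 31.0),
    DiabetesPedigreeFunction = c(0.35, 0.20, 0.80, 0.45),
    Age                      = c(29, 22, 51, 34),
    Outcome                  = c(0, 0, 1, 1)
  )
}

# Show column names and types to confirm the data loaded as expected.
str(diabetes_data)
```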

### Displaying the First Few Rows of the Dataset

This function call displays the first six rows of the diabetes_data dataframe.

### Summarizing the Dataset

This function call provides summary statistics for each variable in the dataset. The output includes the minimum, first quartile, median, mean, third quartile, and maximum values for each variable.
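The preview, summary, and missing-value steps described above might look roughly like this in base R. The data frame here is a small made-up stand-in, not the Kaggle data, and the names `na_per_column` and `total_na` are illustrative.

```r
# Made-up stand-in for diabetes_data (illustrative values only).
diabetes_data <- data.frame(
  Glucose = c(110, 95, 160, 130, 88, 142, 101, 175),
  BMI     = c(28.5, 22.1, 35.4, 31.0, 24.3, 33.8, 26.7, 38.2),
  Age     = c(29, 22, 51, 34, 25, 47, 31, 58),
  Outcome = c(0, 0, 1, 1, 0, 1, 0, 1)
)

head(diabetes_data)     # first six rows
summary(diabetes_data)  # min, quartiles, median, mean, max for each column

# Count NA values in each column, then across the whole data frame.
na_per_column <- colSums(is.na(diabetes_data))
print(na_per_column)
total_na <- sum(is.na(diabetes_data))
print(total_na)
```

A total of zero here only means that no cell is coded as `NA`; it does not rule out missing values recorded in some other way.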

Interesting Findings

During my exploration, I discovered some fascinating information. For example, I found that age, BMI, and blood pressure were significant predictors of diabetes. Additionally, I observed a strong correlation between glucose levels and the diabetes outcome. The majority of the patients with diabetes fall within the 30–50 age range, and higher BMI values are prevalent among patients with diabetes.

Let’s go through the interpretation of the Summary. Through this, I discovered some interesting facts.

Pregnancies : The number of pregnancies ranges from 0 to 17, with a median of 3.

Glucose : Glucose concentration levels range from 0 to 199, with a median of 117.

BloodPressure : Blood pressure levels range from 0 to 122, with a median of 72.

SkinThickness : Skin thickness measurements range from 0 to 99, with a median of 23.

Insulin : Insulin levels range from 0 to 846, with a median of 30.5.

BMI : Body mass index ranges from 0 to 67.1, with a median of 32.

DiabetesPedigreeFunction : This score ranges from 0.078 to 2.42, with a median of 0.3725.

Age : Ages of the participants range from 21 to 81, with a median age of 29.

Outcome : This is a binary variable indicating diabetes status, with values 0 (no diabetes) or 1 (diabetes). The median is 0, indicating that more than half of the participants do not have diabetes.

This detailed summary provides an overview of the dataset’s characteristics, which is crucial for further analysis and modeling.
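One detail in the summary may deserve a follow-up check: Glucose, BloodPressure, SkinThickness, Insulin, and BMI all have a minimum of 0, which is not a plausible measurement for a living patient. A possible reading, and it is only an assumption, is that 0 stands in for "not recorded" in those columns. The sketch below counts such zeros and optionally recodes them as `NA`. The rows are made up, and `suspect_cols` is an illustrative name.

```r
# Made-up stand-in rows; zeros are placed deliberately to show the check.
diabetes_data <- data.frame(
  Glucose       = c(110, 0, 160, 130, 88),
  BloodPressure = c(70, 64, 0, 72, 0),
  SkinThickness = c(20, 0, 35, 0, 0),
  Insulin       = c(0, 0, 150, 90, 0),
  BMI           = c(28.5, 22.1, 0, 31.0, 24.3)
)

# Columns where a value of 0 is ASSUMED to mean "not recorded".
suspect_cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Number of zero entries per column.
zero_counts <- colSums(diabetes_data[suspect_cols] == 0)
print(zero_counts)

# Optionally recode those zeros as NA so summaries and models treat them as missing.
cleaned <- diabetes_data
cleaned[suspect_cols] <- lapply(
  cleaned[suspect_cols],
  function(x) replace(x, x == 0, NA)
)
print(colSums(is.na(cleaned)))
```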

Working with R Shiny

Creating Interactive Graphs and Charts

I built an interactive Shiny web application to make my exploration more participatory. The app lets users visualize various aspects of the diabetes dataset, such as the distribution of glucose levels and the relationship between BMI and diabetes. Its dynamic graphs and charts let me interactively explore how glucose levels and age relate to diabetes outcomes.

I followed these steps in the Shiny web application to explore how glucose levels and age affect diabetes outcomes:

  1. Loading the Data
  2. Setting up the UI (User Interface)
  3. Defining Server Logic
  4. Incorporating Plotting Libraries
  5. Running the Shiny web app and getting the output

Steps in the Shiny web application

Output of the Shiny web app

In the app, we can select an age range using the slider input, and the app filters the dataset to include only individuals within that range. The main panel displays a histogram of glucose levels for the filtered age group, with the bars filled according to the diabetes outcome (for example, different colors for diabetic and non-diabetic individuals). As we change the age range with the slider, the histogram updates in real time to reflect the glucose levels for the new age range.
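An app with this behaviour could be sketched as follows. This is not the original app code: the input and output IDs (`age_range`, `glucose_hist`), the default slider range, and the simulated data are all assumptions made for illustration. It needs the shiny and ggplot2 packages.

```r
library(shiny)
library(ggplot2)

# Simulated stand-in for diabetes_data (made-up values, not the Kaggle file).
set.seed(1)
n <- 200
diabetes_data <- data.frame(
  Age     = sample(21:81, n, replace = TRUE),
  Glucose = round(rnorm(n, mean = 120, sd = 30)),
  Outcome = rbinom(n, 1, 0.35)
)

# UI: an age-range slider in the sidebar and a histogram in the main panel.
ui <- fluidPage(
  titlePanel("Glucose levels by age range"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("age_range", "Age range:",
                  min = min(diabetes_data$Age),
                  max = max(diabetes_data$Age),
                  value = c(30, 50))
    ),
    mainPanel(plotOutput("glucose_hist"))
  )
)

# Server: filter by the selected ages, then redraw the histogram reactively.
server <- function(input, output) {
  filtered <- reactive({
    subset(diabetes_data,
           Age >= input$age_range[1] & Age <= input$age_range[2])
  })

  output$glucose_hist <- renderPlot({
    ggplot(filtered(), aes(x = Glucose, fill = factor(Outcome))) +
      geom_histogram(binwidth = 10, color = "black") +
      labs(x = "Glucose", y = "Count", fill = "Outcome")
  })
}

app <- shinyApp(ui = ui, server = server)

# Only launch the app in an interactive session (e.g. RStudio).
if (interactive()) runApp(app)
```

Because the histogram reads from the `filtered()` reactive, moving the slider is enough to trigger a redraw; no explicit refresh logic is needed.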

What Users Can Do

With this app, we can interactively investigate how age and glucose levels relate to diabetes outcomes.

The following is the output from the analysis I did with R Shiny. It shows how age and glucose levels relate to diabetes outcomes.

Recognizing Patterns

I found several patterns as I looked more closely at the data. For example:

· Glucose Levels : Higher glucose levels are more frequently associated with a positive diabetes outcome.

· Age and BMI : Older individuals with higher BMI are at greater risk.

These patterns identify important areas that warrant additional research and possible points of intervention.

Here are some visualizations that illustrate these points:

For that, I used a boxplot of glucose levels by outcome and a linear plot of BMI vs. age.

Boxplot of Glucose Levels by Outcome

Linear Plot of BMI vs. Age.
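Two plots of this kind could be produced with ggplot2 roughly as below. The data are simulated, and the choice to put age on the x-axis and BMI on the y-axis, with a fitted straight line from `geom_smooth(method = "lm")`, is an assumption about what the linear plot showed.

```r
library(ggplot2)

# Simulated stand-in data (made-up values, not the Kaggle file).
set.seed(42)
n <- 150
outcome <- rbinom(n, 1, 0.35)
diabetes_data <- data.frame(
  Outcome = outcome,
  Glucose = round(rnorm(n, mean = 110 + 30 * outcome, sd = 25)),
  Age     = sample(21:81, n, replace = TRUE),
  BMI     = round(rnorm(n, mean = 32, sd = 6), 1)
)

# Boxplot: one box of glucose values per outcome group.
box_plot <- ggplot(diabetes_data, aes(x = factor(Outcome), y = Glucose)) +
  geom_boxplot() +
  labs(title = "Boxplot of Glucose Levels by Outcome",
       x = "Outcome (0 = no diabetes, 1 = diabetes)",
       y = "Glucose")

# Scatter of BMI against age with a fitted straight line.
line_plot <- ggplot(diabetes_data, aes(x = Age, y = BMI)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Linear Plot of BMI vs. Age", x = "Age", y = "BMI")

# Draw the plots when running interactively.
if (interactive()) {
  print(box_plot)
  print(line_plot)
}
```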

I also used scatter plots, a pie chart, and a histogram for data visualization. I followed these steps in the Shiny web application to draw those charts and plots.

Scatter Plots and Pie Chart :

ScatterPlot of Pregnancies vs Glucose
ScatterPlot of Blood Pressure vs Glucose
ScatterPlot of Skin Thickness vs Glucose
ScatterPlot of Age vs Glucose
ScatterPlot of BMI vs Glucose

Scatter Plots : We can select a variable to plot against glucose levels. The scatter plot will show the relationship between the selected variable and glucose levels for the filtered age group.
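The "select a variable, plot it against glucose" idea can be sketched outside Shiny as a small helper function. `plot_vs_glucose()` is a hypothetical name, the data are simulated, and placing the selected variable on the x-axis with glucose on the y-axis is an assumption. In a Shiny app, the `var` argument would typically come from a `selectInput()`.

```r
library(ggplot2)

# Simulated stand-in data (made-up values, not the Kaggle file).
set.seed(7)
n <- 120
diabetes_data <- data.frame(
  Pregnancies   = rpois(n, 3),
  BloodPressure = round(rnorm(n, 72, 12)),
  SkinThickness = round(runif(n, 7, 60)),
  Age           = sample(21:81, n, replace = TRUE),
  BMI           = round(rnorm(n, 32, 6), 1),
  Glucose       = round(rnorm(n, 120, 30))
)

# Hypothetical helper: scatter plot of a chosen column against Glucose,
# restricted to rows whose Age lies inside age_range.
plot_vs_glucose <- function(data, var, age_range = c(21, 81)) {
  stopifnot(var %in% names(data))
  filtered <- data[data$Age >= age_range[1] & data$Age <= age_range[2], ]
  ggplot(filtered, aes(x = .data[[var]], y = Glucose)) +
    geom_point(alpha = 0.6) +
    labs(title = paste("ScatterPlot of", var, "vs Glucose"),
         x = var, y = "Glucose")
}

# Example: BMI vs glucose for ages 30 to 50.
bmi_plot <- plot_vs_glucose(diabetes_data, "BMI", age_range = c(30, 50))
if (interactive()) print(bmi_plot)
```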

No Diabetes vs. Diabetes

Pie Chart : Displays the distribution of diabetes presence (No Diabetes vs. Diabetes) within the filtered age group.
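A pie chart like this can be built in ggplot2 by drawing a single stacked bar and wrapping it in polar coordinates. The sketch below uses simulated data and a fixed 30 to 50 age filter as a stand-in for the slider value; the `Status` column is an illustrative addition.

```r
library(ggplot2)

# Simulated stand-in data (made-up values, not the Kaggle file).
set.seed(3)
diabetes_data <- data.frame(
  Age     = sample(21:81, 100, replace = TRUE),
  Outcome = rbinom(100, 1, 0.35)
)

# Keep one age group (assumed range) and label the outcome for the legend.
filtered <- subset(diabetes_data, Age >= 30 & Age <= 50)
filtered$Status <- factor(filtered$Outcome,
                          levels = c(0, 1),
                          labels = c("No Diabetes", "Diabetes"))

# One stacked bar of counts, turned into a pie with coord_polar().
pie_chart <- ggplot(filtered, aes(x = "", fill = Status)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "No Diabetes vs. Diabetes", fill = "Status") +
  theme_void()

if (interactive()) print(pie_chart)
```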

Histogram :

The final result is a histogram that visualizes the distribution of glucose levels in the diabetes_data dataset. The histogram will have the following characteristics :

The x-axis represents glucose levels, divided into bins of width 10.
The y-axis represents the frequency of observations within each bin.
Bars are filled with a blue color and have black borders.
The title of the plot is “Distribution of Glucose Levels”.
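A histogram with these characteristics could be written in ggplot2 as sketched here. The glucose values are simulated, and the axis label text is an assumption; the bin width, fill, border colour, and title follow the list above.

```r
library(ggplot2)

# Simulated stand-in for the Glucose column (made-up values).
set.seed(11)
diabetes_data <- data.frame(Glucose = round(rnorm(300, mean = 120, sd = 30)))

glucose_hist <- ggplot(diabetes_data, aes(x = Glucose)) +
  geom_histogram(binwidth = 10,      # bins of width 10 on the x-axis
                 fill = "blue",      # blue bars
                 color = "black") +  # black borders
  labs(title = "Distribution of Glucose Levels",
       x = "Glucose Level",
       y = "Frequency")

if (interactive()) print(glucose_hist)
```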

Taken together, these plots help show how factors other than glucose relate to diabetes.

Making Predictions (Using Regression Models)

I used a regression model to predict diabetes outcomes based on factors such as glucose level, BMI, age, and blood pressure.

Here are the steps:

· Building the Model : I used the lm() function to build the regression model.

· predictions holds the predicted values from this model.

· diabetes_data$Outcome holds the actual outcome values from the dataset.

· (predictions - diabetes_data$Outcome) calculates the errors (residuals) between the predicted and actual values.

· ^2 squares each residual. Squaring makes all error values positive and weights larger errors more heavily.
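Put together, the steps above might look like the following sketch. The data are simulated, and the final step of averaging the squared errors into a mean squared error (`mse`) is an assumption: the steps mention squaring the residuals but do not say how they were combined.

```r
# Simulated stand-in data (made-up values, not the Kaggle file).
set.seed(123)
n <- 300
glucose <- round(rnorm(n, 120, 30))
bmi     <- round(rnorm(n, 32, 6), 1)
age     <- sample(21:81, n, replace = TRUE)
bp      <- round(rnorm(n, 72, 12))
prob    <- plogis(-8 + 0.04 * glucose + 0.06 * bmi + 0.02 * age - 0.005 * bp)

diabetes_data <- data.frame(
  Glucose = glucose, BMI = bmi, Age = age, BloodPressure = bp,
  Outcome = rbinom(n, 1, prob)
)

# Build the linear regression model with lm().
model <- lm(Outcome ~ Glucose + BMI + Age + BloodPressure, data = diabetes_data)

# Predicted values for every row of the dataset.
predictions <- predict(model, newdata = diabetes_data)

# Residuals squared, so positive and negative errors do not cancel out.
squared_errors <- (predictions - diabetes_data$Outcome)^2

# Assumed final step: average the squared errors (mean squared error).
mse <- mean(squared_errors)
print(mse)
```

Note that `lm()` fitted to a 0/1 outcome can produce predictions below 0 or above 1, which is one reason a logistic model is often preferred for this kind of target.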

Insights from Predictions

According to the regression model, BMI and glucose level are important indicators of diabetes outcomes. The model’s predictions can support early diagnosis and intervention.

Here are the results of the regression model:

Putting It All Together

By integrating the logistic regression model with the Shiny app, users can see real-time predictions based on their input. Together, predictive analytics and interactive visualisation provide a potent tool for data-driven decision making.
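One way such an integration could work is sketched below: a logistic model is fitted once at start-up, and the app calls `predict()` whenever an input changes. This is not the original app. The data are simulated, and the input IDs, default values, and the helper `predict_probability()` are hypothetical. It needs the shiny package.

```r
library(shiny)

# Simulated stand-in data (made-up values, not the Kaggle file).
set.seed(99)
n <- 300
glucose <- round(rnorm(n, 120, 30))
bmi     <- round(rnorm(n, 32, 6), 1)
age     <- sample(21:81, n, replace = TRUE)
bp      <- round(rnorm(n, 72, 12))
prob    <- plogis(-8 + 0.04 * glucose + 0.06 * bmi + 0.02 * age - 0.005 * bp)
diabetes_data <- data.frame(
  Glucose = glucose, BMI = bmi, Age = age, BloodPressure = bp,
  Outcome = rbinom(n, 1, prob)
)

# Fit the logistic regression once, before the app starts.
model <- glm(Outcome ~ Glucose + BMI + Age + BloodPressure,
             data = diabetes_data, family = binomial)

# Hypothetical helper: predicted probability of diabetes for one patient.
predict_probability <- function(glucose, bmi, age, bp) {
  new_patient <- data.frame(Glucose = glucose, BMI = bmi,
                            Age = age, BloodPressure = bp)
  unname(predict(model, newdata = new_patient, type = "response"))
}

ui <- fluidPage(
  titlePanel("Diabetes probability (sketch)"),
  sidebarLayout(
    sidebarPanel(
      numericInput("glucose", "Glucose:", value = 120, min = 0),
      numericInput("bmi", "BMI:", value = 30, min = 0),
      numericInput("age", "Age:", value = 35, min = 21),
      numericInput("bp", "Blood pressure:", value = 72, min = 0)
    ),
    mainPanel(textOutput("prediction"))
  )
)

# The text output re-renders whenever any of the four inputs changes.
server <- function(input, output) {
  output$prediction <- renderText({
    p <- predict_probability(input$glucose, input$bmi, input$age, input$bp)
    paste0("Predicted probability of diabetes: ", round(100 * p, 1), "%")
  })
}

app <- shinyApp(ui = ui, server = server)
if (interactive()) runApp(app)
```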

According to the logistic regression results, each unit increase in glucose level increases the log-odds of having diabetes, as does each unit increase in BMI and each additional year of age. Each unit increase in blood pressure slightly decreases the log-odds of having diabetes.
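Log-odds statements like these are read directly off the coefficients of a logistic regression, so the sketch below fits one with `glm(..., family = binomial)` and prints them. The data are simulated, so the numbers and signs it prints are not the article's results. Exponentiating a coefficient turns it into an odds ratio, which is often easier to interpret.

```r
# Simulated stand-in data (made-up values, not the Kaggle file).
set.seed(2024)
n <- 400
glucose <- round(rnorm(n, 120, 30))
bmi     <- round(rnorm(n, 32, 6), 1)
age     <- sample(21:81, n, replace = TRUE)
bp      <- round(rnorm(n, 72, 12))
prob    <- plogis(-8 + 0.04 * glucose + 0.06 * bmi + 0.02 * age - 0.005 * bp)
diabetes_data <- data.frame(
  Glucose = glucose, BMI = bmi, Age = age, BloodPressure = bp,
  Outcome = rbinom(n, 1, prob)
)

# Logistic regression: coefficients are on the log-odds scale.
logit_model <- glm(Outcome ~ Glucose + BMI + Age + BloodPressure,
                   data = diabetes_data, family = binomial)

# Change in log-odds per one-unit increase in each predictor.
log_odds <- coef(logit_model)

# Odds ratios: above 1 raises the odds, below 1 lowers them.
odds_ratios <- exp(log_odds)

print(round(cbind(log_odds, odds_ratios), 4))
```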

From this analysis, I learned the importance of thorough data preparation, the effectiveness of interactive visualisation in revealing insights, and the value of predictive models in decision making.

I learned how to prepare data for analysis and uncover hidden insights, and how to use R Shiny to create dynamic, interactive graphs and charts. I also came to understand the significance of identifying patterns in data and their implications, and how to use regression models to make predictions.

Using the Diabetes dataset in R, I set out on a voyage of data exploration and prediction for this article. I learned a lot about diabetes risk factors by using interactive visualisations, looking for patterns, and using regression models to make predictions. With the resources and understanding you’ve gained from this article, I encourage you to embark on your own data exploration journey and uncover the insights hidden in your data.

Happy analyzing!
