Analyzing the risk of becoming diabetic using Data Science

Karthik Sundar
Published in Geek Culture
4 min read · Sep 18, 2021

I have always wondered whether someone could foretell the health conditions we might end up with, given information about our lifestyle. If someone could predict that I might become diabetic in a few years, I might well hit the gym, eat consciously, and improve my habits sooner.

Turns out we can! Since we live in a world of Big Data, we can use various features to predict the risk of getting diabetes.

I found a wonderful dataset on Kaggle (link), uploaded by UCI. It has records of the metabolic state (such as glucose level and BMI) of both diabetic and non-diabetic patients.

Using this data, I found answers to the following questions, which have been on my mind for a long time.

1. Are people with high glucose levels more vulnerable to diabetes?

2. There is a general notion that overweight people are at high risk of diabetes. Is it true?

3. Are older people more prone to diabetes?

4. Is diabetes generally hereditary?

5. Does the number of pregnancies experienced increase the risk of getting diabetes?

In this article let us try to answer these questions.

Overview

I calculated the importance of a few features that can explain a person's risk of getting diabetes. The result is shown in the image below.

Feature Importance

As we can see, factors like glucose level, BMI (Body Mass Index), and number of pregnancies have a strong effect on the risk of diabetes. DiabetesPedigreeFunction is a measure of how prone we are to diabetes given our hereditary factors. The higher the DiabetesPedigreeFunction, the higher the risk.
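The article does not say which model produced the feature importance plot above, so here is a minimal sketch of one common way to get such a ranking: fit a random forest with scikit-learn and read its `feature_importances_`. The choice of `RandomForestClassifier`, the target column name `Outcome`, and the synthetic stand-in data are all assumptions for illustration. With the real Kaggle file you would pass the loaded DataFrame instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def feature_importance(df: pd.DataFrame, target: str = "Outcome") -> pd.Series:
    """Fit a random forest and return its importances, largest first.

    `target` is assumed to be a 0/1 column (1 = diabetic).
    """
    X = df.drop(columns=[target])
    y = df[target]
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


# Synthetic stand-in data, NOT the real dataset: the label is driven
# by Glucose only, so Glucose should come out on top of the ranking.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 6, n),
    "Age": rng.integers(21, 80, n),
})
toy["Outcome"] = (toy["Glucose"] + rng.normal(0, 10, n) > 130).astype(int)

importances = feature_importance(toy)
print(importances)
```

Impurity-based importances like these always sum to 1, so they are best read as a relative ranking rather than as an absolute measure of effect.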

1. Are people with high glucose levels more vulnerable to diabetes?

From the data we find that:

Average glucose level of a diabetic patient: 142.16

Average glucose level of a non-diabetic patient: 110.71

Thus, in this dataset, diabetic patients have a markedly higher average glucose level, which suggests that people with high glucose levels are more prone to diabetes.
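Averages like the two above are per-group means, and they take only a few lines with pandas. The sketch below assumes the label column is called `Outcome` (1 for diabetic, 0 for non-diabetic). The helper name `mean_by_outcome` and the four-row toy table are made up for illustration and are not the real data.

```python
import pandas as pd


def mean_by_outcome(df: pd.DataFrame, feature: str, target: str = "Outcome") -> pd.Series:
    """Return the mean of `feature` for each value of `target`."""
    return df.groupby(target)[feature].mean()


# Toy values for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "Glucose": [150, 140, 100, 110],
    "Outcome": [1, 1, 0, 0],
})

means = mean_by_outcome(toy, "Glucose")
print(means.loc[1])  # mean glucose of the diabetic rows: 145.0
print(means.loc[0])  # mean glucose of the non-diabetic rows: 105.0
```

The same helper works for any numeric column, for example `mean_by_outcome(df, "BMI")`.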

2. There is a general notion that overweight people are at high risk of diabetes. Is it true?

Since BMI is the standard measure of overweight and obesity, it can answer our question.

Average BMI of a diabetic patient: 35.38

Average BMI of a non-diabetic patient: 30.88

So the general notion holds. The data suggest that a person with a high BMI (in other words, an overweight person) is at higher risk of diabetes.

3. Are older people more prone to diabetes?

We can infer from the data that as we grow older, we become more vulnerable to diabetes.
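One way to check a trend like this, sketched here under assumptions, is to bin patients by age and compute the share of diabetic patients in each bin. The column names `Age` and `Outcome`, the bin edges, and the toy rows are all illustrative choices rather than something taken from the original analysis.

```python
import pandas as pd


def diabetes_rate_by_age(df: pd.DataFrame, bins: list,
                         age_col: str = "Age", target: str = "Outcome") -> pd.Series:
    """Share of diabetic patients (mean of the 0/1 target) per age bin."""
    # right=False makes each bin include its left edge: [20, 30), [30, 40), ...
    age_groups = pd.cut(df[age_col], bins=bins, right=False)
    return df.groupby(age_groups, observed=False)[target].mean()


# Toy values for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "Age":     [22, 25, 28, 35, 38, 45, 50, 62],
    "Outcome": [0,  0,  1,  0,  1,  1,  0,  1],
})

rates = diabetes_rate_by_age(toy, bins=[20, 30, 40, 50, 70])
print(rates)
```

Because `Outcome` is 0 or 1, the mean within a bin is simply the fraction of diabetic patients in that bin. Grouping directly on a count column such as the number of pregnancies works the same way without the binning step.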

4. Is diabetes generally hereditary?

As discussed before, DiabetesPedigreeFunction is a measure of how prone we are to diabetes given our hereditary factors. The higher the DiabetesPedigreeFunction, the higher the risk.

Average DiabetesPedigreeFunction value of a diabetic patient: 0.55

Average DiabetesPedigreeFunction value of a non-diabetic patient: 0.43

Since DiabetesPedigreeFunction reflects hereditary attributes, we can infer that a person with a family history of diabetes is more prone to it than a person without one.

5. Does the number of pregnancies experienced increase the risk of getting diabetes?

We can infer from the data that as the number of pregnancies experienced by a patient increases, the risk of getting diabetes also increases.

Conclusion

In this article, we analyzed the diabetes dataset and inspected various features that can explain the risk of becoming diabetic. The following is a summary of what we have done.

  1. Gathered the diabetes dataset from Kaggle.
  2. Trained a machine learning model to generate the feature importance.
  3. Collected statistics about how glucose level, BMI, age, DiabetesPedigreeFunction, number of pregnancies experienced, etc. influence the risk of diabetes.
  4. Plotted the required graphs to analyze how the above features affect the risk of becoming diabetic.

With more data, more features, and more complex algorithms, we can model the risk of getting diabetes even more accurately.

Thank you for reading. Start working out, eat consciously, and reduce the possibility of becoming diabetic. :)

The entire codebase can be found here: GitHub link
