Testing K-Means Clustering on a Diabetes dataset — a real-life use case

Published in Geek Culture · 5 min read · Aug 12, 2022

Image from Yale Medicine — AI in Medicine

Innovation and technology-led solutions have long played a crucial role in healthcare. Whether in diagnosis or clinical trials, healthcare professionals have relied on technology to improve treatment and patient satisfaction. At the heart of these technological solutions sits data containing vital information. The last two decades have seen growing adoption of tools such as body sensors and smart watches, giving scientists access to high volumes of data packed with valuable information.

This accumulation of data has led many industries to adopt data-centred solutions such as machine learning (ML). The outputs from ML models have improved markedly thanks to the richness and volume of available data. Each machine learning algorithm relies on a different statistical method and produces a different output when applied to a dataset. Overall, the problem context dictates which algorithm is best suited. Depending on the nature of the problem, one can combine several models to reach a clearer conclusion, an approach known as ensemble modelling.

In this blog, I will explore the role of different variables in a diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, and whether K-means clustering can separate patients into two clusters, Diabetic and Non-diabetic, based on the measurements.

The Model

K-means clustering is a popular algorithm for generating clusters, or subsets, within a dataset. The clusters are homogeneous, meaning data points within each cluster share similar properties and have been grouped based on similar features. Unlike supervised algorithms, K-means clustering is an unsupervised machine learning algorithm that takes data without labels and makes inferences entirely from the dataset. For the maths underpinning the algorithm, I would encourage checking out the links above; in this blog, my focus is on the use case of K-means clustering in a healthcare setting.
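
To make the idea concrete, here is a minimal sketch of the standard K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is written in Python with NumPy rather than the R used for the analysis in this post, and it runs on made-up 2-D points, not the diabetes data. The function names `assign` and `kmeans` are illustrative, not taken from any library.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid (Euclidean distance) for each row of X."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(X, k=2, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Move each centroid to the mean of its points (keep it if it has none).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the clustering has converged
        centroids = new_centroids
    return assign(X, centroids), centroids

# Toy data: two well-separated blobs of synthetic points (assumption, not real data).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])
labels, centroids = kmeans(X, k=2)
```

Note that the algorithm never sees any labels: the cluster IDs it returns (0 and 1 here) are arbitrary and only gain meaning when compared with known outcomes afterwards.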

Using this technique, patients in this dataset can be grouped based on differences between the assessed biomarkers. The dataset contains eight diagnostic measurements: number of pregnancies, glucose levels, blood pressure, skin thickness, insulin levels, BMI, diabetes pedigree function and age.

After running the model in R, I generated a confusion matrix to explore how well K-means clustering distinguished diabetic from non-diabetic individuals based on the variation in the biomarkers.
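
A confusion matrix here is simply a cross-tabulation of the known outcome against the cluster each individual was assigned to. The following Python sketch shows the idea; the original analysis was run in R, and the labels and cluster assignments below are invented purely for illustration. The helper name `confusion_matrix` is my own.

```python
from collections import Counter

def confusion_matrix(actual, clusters):
    """Count (actual label, cluster id) pairs as a nested dict."""
    counts = Counter(zip(actual, clusters))
    rows = sorted(set(actual))
    cols = sorted(set(clusters))
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Made-up outcomes (1 = diabetic, 0 = non-diabetic) and cluster ids.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
clusters = [0, 0, 1, 1, 1, 0, 1, 0]
matrix = confusion_matrix(actual, clusters)
# matrix[label][cluster] is the number of individuals with that label in that cluster.
```

Because K-means cluster IDs are arbitrary, one still has to decide which cluster corresponds to which outcome before reading accuracy off the table.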

Although the overall model results are poor, the algorithm is better at identifying the attributes associated with diabetes. It correctly identified 64% of diabetic individuals compared to 16% of non-diabetic ones. It is vital to highlight that the model is constrained by the dataset, meaning a restricted dataset will lead to imprecise results, and this dataset is heavily constrained. We’re limited to a single ethnic group and one gender, and the overall size of the dataset is small.
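
Assuming the percentages above are per-class rates (the share of each actual group that landed in its matching cluster), they can be read off a confusion matrix as sketched below in Python. The counts and the cluster-to-group mapping are hypothetical and do not reproduce the results of this analysis; `per_class_rate` is an illustrative helper name.

```python
def per_class_rate(matrix, label_to_cluster):
    """Share of each actual class that landed in its matched cluster."""
    rates = {}
    for label, row in matrix.items():
        total = sum(row.values())
        rates[label] = row[label_to_cluster[label]] / total if total else 0.0
    return rates

# Hypothetical counts {actual group: {cluster id: n}}, not the article's data.
matrix = {"diabetic": {1: 6, 2: 4}, "non_diabetic": {1: 15, 2: 5}}
# Assumed pairing of each group with the cluster meant to represent it.
mapping = {"diabetic": 1, "non_diabetic": 2}
rates = per_class_rate(matrix, mapping)
```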

Diabetes — The interplay between different factors

The results from the model were far off when compared to the actual data in the confusion matrix. However, aside from the limitations of the dataset, the complex nature of type 2 diabetes adds an extra layer of complication when building models to identify the contribution of various factors.

Type 2 diabetes is diagnosed based on hyperglycaemia (high glucose level in the blood). The disease has a complex aetiology and is mainly the result of excessive, yet reversible, fat buildup in the liver and pancreas. This worsens the hepatic response to insulin (a hormone regulating blood sugar levels), resulting in increased production of glucose. Over the years, various studies have mapped out different subtypes of type 2 diabetes. The unanimous conclusion of these studies is that type 2 diabetes is polygenic and involves lifestyle choices and environmental factors, with genetic predisposition dictating the baseline severity of the disease.

When analysed based on individual factors, BMI and High Blood Pressure showed some level of relationship with the diabetic individuals in our dataset. Evidence from clinical data suggests that hypertension and type 2 diabetes are comorbidities. The interplay between their pathophysiological features may lead to further complications. Furthermore, a study involving 12,550 adults aged between 45 and 64 showed that type 2 diabetes was 2.5 times as likely to develop in patients suffering from hypertension, casting light on the role of insulin resistance in the pathogenesis of hypertension.

Body mass index (BMI) estimates whether a person’s weight is proportionate to their height. A BMI between 18.5 and 25 is considered normal. In this dataset, 84% of the individuals (Diabetic and Non-Diabetic) have a BMI above 25, categorising most of the sample as overweight or obese. Studies carried out in primary care settings reveal that 60% to 76% of overweight or obese patients suffer from hypertension. Evidence also suggests that pro-inflammatory cytokines released from adipose tissues (body fat) may promote vascular insulin resistance and inflammation.
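
For reference, BMI is weight in kilograms divided by height in metres squared. The short Python sketch below computes it and the share of a sample above the threshold of 25 used here. The sample values are invented for illustration and are not taken from the dataset, and the function names are my own.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def share_above(values, threshold=25.0):
    """Fraction of values strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)

# Made-up BMI values, for illustration only.
sample_bmis = [22.0, 27.5, 31.2, 24.9, 33.6]
overweight_share = share_above(sample_bmis)
example_bmi = bmi(70, 1.75)  # a 70 kg person who is 1.75 m tall
```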

Quest for a better understanding of the biology underpinning disease development

As can be seen in the plot, BMI and High Blood Pressure are not strong predictors of diabetes in this dataset. The sample size and diversity are in part responsible for this output. The dataset consists only of females belonging to one ethnic group, which makes the sample narrow, though valuable for studying biomarkers specific to this group. In contrast to the findings of the studies above, which involved large and diverse samples, BMI and High Blood Pressure do not show a strong correlation with the development of type 2 diabetes here.
This explains why K-means clustering struggled to group the data points into Diabetic and Non-Diabetic based on the variation in these biomarkers. The nature of the disease also makes it difficult to pinpoint which specific factors are involved in the initial stages of the disease and how these factors might influence disease severity. The limitations of the dataset further complicate the identification of the factors involved, since we can only explore the variation in females belonging to the Native American ethnic group. Lastly, the sample size is small. When working with a complex condition such as type 2 diabetes, an algorithm would need a large and diverse dataset to identify the key components responsible for differences between diabetic and non-diabetic individuals.

I wrote this blog to showcase how K-means clustering can offer additional insight alongside other in silico methods on a real-life healthcare dataset. Although the model results are not jaw-dropping or exciting, they highlight the need for good quality data, which is essential for machine learning’s advancement in healthcare. My biggest takeaway from working on this project was the realisation of the immense value machine learning offers in understanding disease development. Complex conditions such as type 2 diabetes present a unique opportunity for machine learning algorithms to identify molecules and factors far more swiftly than conventional scientific methods, which generally take years. The future of machine learning within healthcare is promising. However, the rate of adoption and advancement is subject to the accessibility and usability of data.

Written by Marshall Petros