DATA STORIES | PREDICTIVE ANALYTICS | KNIME ANALYTICS PLATFORM

ML for Diabetes Prevention with KNIME

#1 Place in the ML challenge jointly organized by the University of Milan-Bicocca & KNIME

Vittorio Haardt
Low Code for Data Science

--

Co-author: Luca Porcelli

Photo by Towfiqu barbhuiya on Unsplash.

Diabetes is a widespread chronic disease affecting millions of people worldwide. It is caused by the body’s inability to regulate glucose levels in the blood and can result in serious complications, such as heart disease, vision loss, lower-limb amputation, and kidney disease. Affected patients experience a significant reduction in quality of life and a decrease in life expectancy. As of 2018, 34.2 million Americans have diabetes, with 88 million having prediabetes. Many are unaware of their risk, and the disease disproportionately affects lower socioeconomic groups.

Additionally, diabetes often represents a considerable financial cost for a country’s health care system. Indeed, the cost of diabetes is estimated at $327 billion for diagnosed cases and $400 billion for undiagnosed and prediabetic cases.

While there is currently no permanent cure, early diagnosis and a healthy lifestyle can minimize the risk of developing chronic diabetes, with beneficial effects both on a country’s population and on its health care system. To this end, AI and machine learning models that predict the onset of diabetes early on are crucial tools for public health officials and people who are at risk.

In the framework of a machine learning challenge jointly organized by the University of Milan-Bicocca and KNIME, we leveraged the power of predictive modeling to identify the risk of developing diabetes. The analysis yielded insights into the risk factors and showcased how a low-code tool like KNIME Analytics Platform can be used for data exploration, model training and deployment. We hope that the findings of this project may ultimately help healthcare professionals improve early diagnosis and reduce the negative impacts of this chronic disease on people’s lives.

Data Access & Preprocessing

The Dataset

To conduct our analysis, we utilized a Kaggle dataset that was sourced from the Diabetes Prediction Competition [AashiDutt (2022)].

The data provides valuable insights into the factors that can influence the development of diabetes and consists of 18 attributes: 17 represent various health factors, plus a binary target attribute, “Diabetes”, that needs to be predicted. In other words, 17 attributes serve as input features, and the goal is to use these features to accurately predict whether an individual has diabetes or not.

● age: 13-level age categories: 1 = 18–24, 9 = 60–64, 13 = 80 or older.

● sex: 0 = female, 1 = male.

● HighChol: 0 = no high cholesterol, 1 = high cholesterol.

● CholCheck: 0 = no cholesterol check in the past 5 years, 1 = cholesterol check in the past 5 years.

● BMI: Body Mass Index.

● Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no, 1 = yes.

● HeartDiseaseorAttack: Coronary heart disease (CHD) or myocardial infarction (MI): 0 = no, 1 = yes.

● PhysActivity: Physical activity in the past 30 days, not including job: 0 = no, 1 = yes.

● Fruits: Consume fruit one or more times per day: 0 = no, 1 = yes.

● Veggies: Consume vegetables one or more times per day: 0 = no, 1 = yes.

● HyAlcoholConsump: Adult male: more than 14 drinks per week. Adult female: more than 7 drinks per week. 0 = no, 1 = yes.

● GenHlth: Would you say that in general your health is: (scale 1–5) 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor.

● MentHlth: Days of poor mental health in the past 30 days (scale 1–30).

● PhysHlth: Days of physical illness or injury in the past 30 days (scale 1–30).

● DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no, 1 = yes.

● Hypertension: 0 = no hypertension, 1 = hypertension.

● Stroke: 0 = no, 1 = yes.

● Diabetes: 0 = no diabetes, 1 = diabetes (target variable).

As can be seen from the description, most of the attributes in the dataset are binary or have Boolean values, indicating the presence or absence of certain health factors or conditions. The exceptions are age and GenHlth, which are ordinal attributes, MentHlth and PhysHlth, which count days on a 1–30 scale, and BMI, the only continuous attribute in the dataset.

The dataset is almost perfectly balanced between the two classes of the target attribute, an ideal condition for training a machine learning model. Indeed, the model has an equal number of examples from both classes to learn from, reducing the chance of biased learning and boosting model generalizability.

Preprocessing

The dataset is already fairly clean and well-prepared, suggesting that it underwent a previous cleaning process that removed noise and inconsistencies from the raw data.

Nevertheless, we performed a few preprocessing steps to handle missing values, treat outliers, check for near-zero variance and highly correlated features, and select the best subset of attributes for modeling.

In particular, we used the Missing Value node to identify missing values and found out that the dataset does not contain any missing records. Additionally, we checked for near-zero variance and highly-correlated features. None of the features had near-zero variance nor were they highly-correlated with one another (Figure 1). This means that none of the variables in the dataset is redundant.

Figure 1: Correlation matrix between attributes.
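For readers who prefer code, the same sanity checks can be sketched in a few lines of Python with pandas (the actual analysis used KNIME nodes such as Missing Value; the file name and thresholds below are illustrative assumptions, not taken from the original workflow):

```python
import pandas as pd

# Rough pandas equivalent of the preprocessing checks (missing values,
# near-zero variance, pairwise correlation). File name and thresholds
# are illustrative assumptions.
df = pd.read_csv("diabetes.csv")

# 1. Missing values per column (none were found in this dataset)
print(df.isna().sum())

# 2. Near-zero variance: columns whose variance is (almost) zero
variances = df.var(numeric_only=True)
print(variances[variances < 1e-3])

# 3. Pairwise linear correlation between attributes (cf. Figure 1)
corr = df.corr(numeric_only=True)
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)   # strongly correlated pairs
print(corr.where(high).stack())                  # empty if none are found
```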

After this, the next step was to analyze the presence of outliers in the data. This was done by creating box plots for each attribute. Box plots provide a graphical representation of the data distribution and help visually identify any outliers. We observed that the attribute BMI had many outliers. This information was crucial to understand the data distribution and the potential impact of these outliers on the models’ performance.

In particular, we observed that some values of BMI were > 80, which seemed unusual given that BMI is defined as:

BMI = weight (kg) / height² (m²)

To understand this strange phenomenon, we consulted with some domain experts. They found these values to be highly unusual and suggested removing values with a BMI > 50 and converting the attribute into an ordinal one. Following the experts’ advice, the observations with BMI < 15 or BMI > 50 were removed, and the attribute was mapped to a scale from 1 to 8 according to the following rules:

● BMI ≤ 16 “INANITION”

● 16 < BMI ≤ 17.50 “UNDERWEIGHT”

● 17.50 < BMI ≤ 18.50 “SLIGHTLY UNDERWEIGHT”

● 18.50 < BMI ≤ 25 “NORMAL”

● 25 < BMI ≤ 30 “OVERWEIGHT”

● 30 < BMI ≤ 35 “CLASS I OBESE”

● 35 < BMI ≤ 40 “CLASS II OBESE”

● BMI > 40 “CLASS III OBESE”

The transformation of the BMI attribute was suggested because BMI is a skewed index that, taken on its own, is not very informative in medical terms. It is known to misclassify subjects who are very short or tall, or those who are muscular. In recent times, new calculations of BMI, like the “new BMI”, have become preferred in the medical field. By transforming the BMI attribute into an ordinal one, more information can be obtained and the variability of the index is reduced. This provides a more informative and useful representation of the data.
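As a quick illustration, the outlier removal and the ordinal mapping described above can be sketched in Python as follows (the original workflow implements this with KNIME nodes; the file name and helper function are assumptions for this example):

```python
import pandas as pd

# Sketch of the BMI treatment described above: drop implausible values
# (BMI < 15 or BMI > 50) and map the remaining values to 8 ordinal classes.
# "diabetes.csv" and the function name are illustrative assumptions.
def bmi_class(bmi: float) -> int:
    upper_bounds = [16, 17.5, 18.5, 25, 30, 35, 40]   # class boundaries
    for level, upper in enumerate(upper_bounds, start=1):
        if bmi <= upper:
            return level
    return 8  # BMI > 40: "CLASS III OBESE"

df = pd.read_csv("diabetes.csv")
df = df[(df["BMI"] >= 15) & (df["BMI"] <= 50)]   # remove outliers
df["BMI"] = df["BMI"].apply(bmi_class)           # ordinal scale 1-8
```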

Finally, we checked for the optimal subset of attributes. To find it, we applied the Boruta method [Kursa and Rudnicki (2010)] to perform feature selection in an R Snippet node. The Boruta method works by creating “shadow attributes”, which are shuffled copies of the original features, and then comparing the importance of the original features with that of the shadow attributes. If a feature is found to be less important than the best shadow attribute, it is removed from the dataset. This process is repeated until all features have been evaluated. The final subset of features is considered to be the optimal set of attributes for modeling.
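To give an idea of the mechanism, here is a simplified Python sketch of a single Boruta-style iteration (the actual feature selection used the R Boruta package inside a KNIME R Snippet node; variable names follow the sketches above):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# One simplified Boruta-style iteration: build shuffled "shadow" copies of
# the features, fit a random forest, and flag real features whose importance
# does not exceed that of the best shadow feature. The full Boruta algorithm
# repeats this many times and applies a statistical test to the hit counts.
rng = np.random.default_rng(42)
X = df.drop(columns=["Diabetes"])     # df as prepared in the sketches above
y = df["Diabetes"]

shadows = X.copy()
for col in shadows.columns:
    shadows[col] = rng.permutation(shadows[col].values)
shadows.columns = ["shadow_" + c for c in X.columns]

X_full = pd.concat([X, shadows], axis=1)
forest = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_full, y)

importances = pd.Series(forest.feature_importances_, index=X_full.columns)
threshold = importances[shadows.columns].max()
print(importances[X.columns][importances[X.columns] <= threshold])  # removal candidates
```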

Figure 2: Attribute importance.

As a result of the feature selection process, no attributes were found to be less important than the “shadow attributes” (Figure 2). In other words, the feature selection process did not discard any feature: all of the attributes were considered important and were used for the classification task.

Model Training and Evaluation

To model our binary classification task, we trained and compared the performance of a broad range of algorithms, namely the Decision Tree, Naïve Bayes, Random Forest, Gradient Boosting, XGBoost and a set of stacked models. All these algorithms are available as KNIME nodes.

Additionally, to thoroughly compare the models’ performance, we did not simply apply the algorithms with their default settings. Instead, we conducted hyperparameter tuning and cross-validation. We relied on the Parameter Optimization Loop nodes to identify the best hyperparameters for each model using different search strategies (e.g., brute force, random search, etc.). We adjusted the number of iterations according to the computational requirements of the models, and made sure to obtain stable and robust predictions by using the X-Partitioner nodes for 10-fold cross-validation.

Here, for the sake of simplicity, we report only on the details of the model that performed best.

Gradient Boosted Trees

To solve our binary classification task, a Gradient Boosting model was included in the process of model comparison because it typically performs well thanks to its ability to effectively model complex relationships between the features and the target.

The parameters of the model were tuned using a random search with 1000 iterations, drawing from the following ranges of possible values: “number of models” [50, 150], “learning rate” [0.05, 2], “maximum depth” [1, 10], “minimum child size” [50, 200], and “data fraction” [0.1, 1].

To do that, we built a simple KNIME workflow where each relevant hyperparameter in the Gradient Boosted Trees Learner node is optimized and validated across different data partitions.

Figure 3: Perform hyperparameter optimization and 10-fold cross validation.
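As a rough point of reference outside KNIME, the same tuning strategy could be sketched with scikit-learn as follows; the parameter names are approximate equivalents of the KNIME Gradient Boosted Trees settings (n_estimators ≈ number of models, min_samples_leaf ≈ minimum child size, subsample ≈ data fraction), and X_train / y_train stand for the 70% training split:

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Random search over the same ranges as above, with 10-fold cross-validation
# and Log-Loss as the objective. This mirrors the KNIME Parameter Optimization
# Loop + X-Partitioner setup only approximately.
param_distributions = {
    "n_estimators":     randint(50, 151),     # "number of models" in [50, 150]
    "learning_rate":    uniform(0.05, 1.95),  # "learning rate" in [0.05, 2]
    "max_depth":        randint(1, 11),       # "maximum depth" in [1, 10]
    "min_samples_leaf": randint(50, 201),     # "minimum child size" in [50, 200]
    "subsample":        uniform(0.1, 0.9),    # "data fraction" in [0.1, 1]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=1000,               # as in the article; lower it for a quick test
    scoring="neg_log_loss",    # maximize negative Log-Loss = minimize Log-Loss
    cv=10,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```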

We found out that the best hyperparameters leading to top model performance were the following:

● MaxLevel: 3

● MinChildSize: 150

● DataFraction: 1

● N° Model: 140

● Learning Rate: 0.13

But how do we determine whether a model and the corresponding set of hyperparameters are actually better than the others? There are many options to measure and evaluate model performance. For this project, we decided to use Log-Loss as the main metric.

Log-Loss

The evaluation of the models built for the challenge was conducted using a separate test set and the Log-Loss metric. Log-Loss, also known as binary cross-entropy loss, is a widely used performance metric in machine learning for binary classification problems.

Log-Loss measures the accuracy of a classifier’s predicted probabilities by calculating the likelihood of these predictions being correct. In other words, it evaluates how well the predicted probabilities match the actual class labels. A lower value of the Log-Loss indicates better performance.

The Log-Loss for a single instance i is given by:

Log-Loss_i = −[ y_i · log(p_i) + (1 − y_i) · log(1 − p_i) ]

where y_i is the actual class label (0 or 1) and p_i is the predicted probability of class 1. The average Log-Loss over N instances is then given by:

Log-Loss = −(1/N) · Σ_{i=1..N} [ y_i · log(p_i) + (1 − y_i) · log(1 − p_i) ]

The calculation of the Log-Loss for an individual instance becomes problematic when a predicted probability is exactly 0 or 1, because the logarithmic function in the Log-Loss formula returns an infinite value when taking the logarithm of 0. To address this issue, a small positive value, close to 0 but still within the range of the system’s handling capabilities, is used in place of 0. This keeps the Log-Loss finite and computationally manageable, avoiding any potential system errors.

The substitute value chosen for 0 is 10^(-15), and therefore (1 − 10^(-15)) is used in place of 1. This substitution caps the per-instance Log-Loss at roughly 34.5, i.e., −log(10^(-15)), a relatively high value but still manageable and acceptable for the binary classification task at hand.

Using the formulas above with the necessary adjustments, we determined the best hyperparameters for each trained model and selected the best model. In KNIME Analytics Platform, we can effortlessly apply the probability adjustments using the Rule Engine node, compute the Log-Loss for individual instances using the Math Formula node, and compute the average Log-Loss using the GroupBy node.

Figure 4: Computing the Log-Loss formula.
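For reference, the same computation can be expressed in a few lines of numpy (a minimal sketch of the clipping, per-instance loss, and averaging steps; not the KNIME implementation itself):

```python
import numpy as np

EPS = 1e-15  # substitute for 0; 1 - EPS substitutes for 1

def mean_log_loss(y_true, p_pred):
    """Average binary Log-Loss with probability clipping."""
    p = np.clip(np.asarray(p_pred, dtype=float), EPS, 1 - EPS)
    y = np.asarray(y_true, dtype=float)
    per_instance = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return per_instance.mean()

# A confidently wrong prediction (p = 0 for a positive case) contributes
# about 34.5 on its own, but the value stays finite thanks to the clipping.
print(mean_log_loss([1, 0, 1], [0.9, 0.2, 0.0]))
```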

Although Log-Loss is used as the primary metric for evaluating models, other metrics such as accuracy and the AUC (area under the ROC curve) are also used to provide a more comprehensive overview of the binary classification problem.

Training Workflow

The KNIME workflow that we designed to train, evaluate and compare the most effective binary classifier is illustrated in Figure 5.

We started off by importing the dataset and checking it for class imbalance. Next, we divided the dataset into two partitions, with 70% used for training the models and the remaining 30% set aside for testing. After partitioning, we started to preprocess the data (i.e., missing value handling, check for near-zero variance, etc.). Note that data preprocessing is done after data partitioning to avoid the problem of data leakage.

Once the data was partitioned and preprocessed, different algorithms and models were trained and optimized on the training set and their performance was validated using 10-fold cross-validation. This helped determine the best parameters for each model. The hyperparameter search was performed by minimizing Log-Loss, as it was considered to be a key metric in evaluating model performance.

The model trained with the best hyperparameters was then applied to the test set. Besides Log-Loss, other performance metrics were also considered in the final evaluation phase. These included the area under the ROC curve and accuracy, which provided a more comprehensive view of the model’s performance.

The final model and algorithm were selected on the basis of a combination of these metrics, taking into account the overall performance in order to provide the most effective solution for the task at hand. For the sake of simplicity, the final workflow in Figure 5 includes the process covered for the best classifier only.

Figure 5: Training workflow.

Download the training workflow for free from the KNIME Community Hub.

Compare Model Performances

Model performance on the test set was the most critical aspect of the project. In general, most classifiers performed well on the training set, as they successfully learned the patterns and structures present in the data. However, the real evaluation of a model has to be performed on the test set, which also allows us to assess the model’s ability to generalize to new data.

Let’s have a closer look at the comparison of model performance. In Table 1, we can see that XGBoost and Gradient Boosting have the best performances in terms of Log-Loss.

Table 1: Comparing model performance by Log-Loss.

However, it is good practice not to base model evaluation on a single metric but rather on a combination of metrics, to make sure we get the full picture. For this reason, we also built the Receiver Operating Characteristic (ROC) curve using the ROC Curve (local) node.

The ROC curve provides a visual representation of the trade-off between the true positive rate (TPR) and the false positive rate (FPR) for different classification thresholds. It shows how well the classifier can separate the positive and negative classes. A perfect classifier has an ROC curve that goes straight up the left-hand side and then straight across the top. The area under the curve (AUC) measures how well the classifier is able to separate the classes.
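In code, the curve and its AUC boil down to a couple of calls (a sketch with scikit-learn and matplotlib; the article uses the ROC Curve (local) node instead, and y_test / p_test stand for the test labels and the predicted probabilities of class 1):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# ROC curve and AUC for one classifier; repeat per model to reproduce Figure 6.
fpr, tpr, _ = roc_curve(y_test, p_test)
auc = roc_auc_score(y_test, p_test)

plt.plot(fpr, tpr, label=f"Gradient Boosting (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.legend()
plt.show()
```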

Looking at the ROC curves in Figure 6, we can see that all the classifiers performed well, but XGBoost and Gradient Boosting outperformed all the other models.

Figure 6: ROC curve of various models.

Furthermore, we can see that the ROC curves of the two top-performing models intersect at some points, implying that the classifiers have a similar ability to separate the positive and negative classes, and that there is no big difference in their performances. This is also confirmed by the AUC values in the figure, where Gradient Boosting is ahead by only a few points.

Figure 7 provides a graphical representation of the intersecting ROC curves for the different models. From the figure, we can see that the ROC curves of the two best models intersect at several points, whereas ideally the ROC curve of the best model would lie consistently above all the others.

Figure 7: Intersecting ROC curves.

In Figure 7, we can also see that there is no optimal ROC curve for the entire interval. This implies that the models have different strengths and weaknesses, and there is no single model that is optimal for all scenarios.

After evaluating the different classifiers, we found that XGBoost and Gradient Boosting performed best in terms of accuracy, Log-Loss, ROC curve, and AUC. Overall, Gradient Boosting slightly outperformed XGBoost across all evaluation metrics.

Model Deployment

The deployment phase is a crucial stage in the life cycle of machine learning models, as it deals with the actual use of the model in production with the aim of generating predictions on new data.

In this phase, it is crucial to consider that the data to be predicted does not come with target labels, so it is not possible to use a scoring metric to evaluate the model’s performance.

To develop the deployment workflow, we started off by importing new unlabeled data. We then applied the same preprocessing steps that we carried out during training, and imported the trained model using the Model Reader node. Finally, we generated predictions on the unlabeled dataset using the Gradient Boosted Trees Predictor node and explored the results visually. In Figure 8, we can see that the model predicted the onset of diabetes in 59% of patients vs. 41% of patients who are not considered at risk.

Figure 8: Model predictions in the deployment phase.
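Outside KNIME, the same deployment logic could be sketched as follows (a hypothetical Python analogue: preprocess() stands for the training-time preprocessing, and the file names are placeholders, not artifacts of the original workflow):

```python
import joblib
import pandas as pd

# Hypothetical deployment sketch: load new unlabeled data, reuse the
# training-time preprocessing, load the persisted model, and predict.
new_data = pd.read_csv("new_patients.csv")       # placeholder file name
features = preprocess(new_data)                  # same steps as in training

model = joblib.load("gradient_boosting.joblib")  # placeholder model file
proba = model.predict_proba(features)[:, 1]      # probability of diabetes
prediction = (proba >= 0.5).astype(int)          # default 0.5 threshold

# Share of patients predicted at risk vs. not at risk (cf. Figure 8)
print(pd.Series(prediction).value_counts(normalize=True))
```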

It is worth noting that, if the trained model performed satisfactorily during training, the obtained predictions could be used to expand the training dataset with additional labeled observations. In this way, subsequent analyses can be performed and the trained model can be further improved.

Figure 9: Deployment workflow.

Download the deployment workflow for free from the KNIME Community Hub.

Data App: An Interactive Tool for Early Diagnosis

Training and deploying a machine learning model for diabetes prediction is meaningful only if it can be easily and intuitively consumed via a sleek and friendly UI. To this end, we developed a KNIME Data App with the goal of creating an impactful experience for both regular users and experts in the field. Our application is divided into four main pages, each aimed at providing a comprehensive overview of diabetes and its early diagnosis.

The first page of the application is a detailed introduction to diabetes with important information on the causes, symptoms and consequences. This page was developed to provide a complete overview of diabetes to enable every user to fully understand the nature of the disease.

Figure 10. Page 1 in the data app.

The second page of the application focuses on the model used for the diagnosis of diabetes, as well as the metrics used to evaluate the effectiveness of the model. This page was designed to provide users with a comprehensive overview of the diagnostic process and ensure proper understanding of how the predictions are generated and to what extent the model can be deemed reliable.

Figure 11. Page 2 in the data app.

The third page is the heart and soul of the application. Here users can input their data (e.g., demographics, habits, etc.) to determine if they are likely to develop diabetes or not. This page was designed to be easy to use and understand, so that users can use the application intuitively and quickly.

Figure 12. Page 3 in the data app.

Finally, on the last page of the application, users can observe the results obtained and, if necessary, modify the diagnostic threshold. Additionally, we provided an option that shows how to decrease the risk of developing diabetes through specific actions.

Figure 13. Page 4 in the data app.

In summary, the design of the data app focused on providing a comprehensive overview of diabetes and its diagnosis in order to offer a useful and informed experience to users. We are confident that the application will be of great help to anyone interested in the prevention and management of diabetes.

Download the DataApp workflow for free from the KNIME Community Hub.

Conclusion

In this project, we conducted extensive training, testing and evaluation of various models to determine the most effective approach for tackling the binary classification problem of early diabetes diagnosis. Our primary goal was to identify the model that could accurately predict the presence or absence of diabetes in patients as early as possible.

To achieve this objective, we employed a meticulous approach, which involved carefully managing the data, selecting the most appropriate models, and carrying out a thorough evaluation of the chosen models to ensure good performance. Log-Loss was the primary metric employed to score and rank the classifiers. Gradient Boosting was the selected model, for it demonstrated exceptional performance on the test set, outperforming all other classifiers. Hence, we concluded that the chosen model would perform well on unseen data. This means that it can also be relied upon to provide accurate and reliable predictions, an essential condition for developing an effective diabetes prevention tool.

Indeed, the creation of a powerful classifier was instrumental in developing a reliable diabetes prevention tool that people can use to take proactive steps to manage their health. By providing an accurate diagnosis and recommending precautionary measures, our web application could help people take the necessary steps to reduce the risk of developing diabetes.

To further improve our predictive tool, future work should focus on refining the model to increase its accuracy and reliability by means of exploring alternative modeling techniques, incorporating additional data sources, or conducting further testing and validation to ensure performance consistency across different populations and datasets.

Additionally, we could improve the UX of our web application by implementing user feedback mechanisms, streamlining the data collection process, and providing personalized recommendations. This could help increase user engagement and tool adoption.

Overall, we are optimistic about the potential impact of our project on people’s health and wellbeing but we acknowledge that there is room for improvement. We remain committed to exploring ways to enhance the accuracy, reliability, and usability of our tool to help people make informed decisions about their health.

References

● AashiDutt, S.G., 2022. Diabetes Prediction Competition (TFUG CHD Nov 2022). Kaggle.

● Kursa, M.B., Rudnicki, W.R., 2010. Feature Selection with the Boruta Package. Journal of Statistical Software, 36, 1–13.

● Wolpert, D.H., 1992. Stacked Generalization. Neural Networks, 5, 241–259.
