Prediction of Onset Diabetes using Machine Learning Techniques

Of the many chronic illnesses, diabetes ranks among the most prevalent, and the sharp rise in the number of diabetic patients has made it a major concern. Diabetes is caused by the body’s inability to produce insulin or to use the insulin that it produces.

The effects of diabetes mellitus include long-term damage, dysfunction, and failure of various organs (WHO). As a result, it has significantly increased mortality in patients. There are two main types of diabetes: Type I (T1) and Type II (T2). T1 occurs when the body is no longer able to produce insulin; it is common in childhood and is also known as juvenile diabetes. This form of diabetes is less common; only about 5–10% of people with diabetes have T1 (American Diabetes Association, 2010). T2 occurs when the body is unable to utilize the insulin it produces or does not produce enough insulin. In addition, there is another type, gestational diabetes, which develops during pregnancy. Too much glucose in the blood can damage the eyes, kidneys, and nerves. It can also cause heart disease, stroke, and insufficient blood flow to the legs. Being overweight, lack of exercise, family history, and stress increase the risk of diabetes. In Bangladesh, many people are not health conscious. There are 7.1 million cases of diabetes in the country, and the number keeps rising. Many people are unaware of the disease and do not get tested for it.

Diabetes has affected over 246 million people worldwide, a majority of them women. According to the WHO report, this number is expected to rise to over 380 million by 2025.

Data Mining Techniques to Predict Diabetes Risk

Naive Bayes (NB)

Naive Bayes classifiers assume that attributes are independently distributed. The method is considered space efficient, and it provides a simple approach, with clear semantics, to representing and learning probabilistic knowledge. It is called naive because it relies on two important simplifying assumptions: first, that the predictive attributes are conditionally independent given the class, and second, that no hidden attributes bias the prediction process. As a result, it is very fast both to train and to classify.
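To make the independence assumption concrete, here is a minimal sketch of a Gaussian Naive Bayes classifier in plain Python. It is not the Weka implementation used in the experiment; the function names, the two features (glucose and BMI), and the toy values are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small constant avoids division by zero
        model[label] = (prior, stats)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # Independence assumption: per-feature log likelihoods simply add up.
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: [glucose, BMI]; 1 = diabetic, 0 = not diabetic.
X = [[85, 22.0], [90, 24.5], [95, 23.0], [150, 33.0], [160, 35.5], [170, 31.0]]
y = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(X, y)
```

Training only needs one pass to collect means and variances, which is why the method is so quick. Calling predict_gaussian_nb(nb_model, [88, 23.0]) scores each class separately and returns the more likely one.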

Logistic Regression (LR)

Logistic regression is a probabilistic statistical classification model for analyzing a dataset in which one or more independent variables determine an outcome. In logistic regression, the dependent variable is binary (dichotomous); that is, it contains only data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, nonpregnant, etc.). Logistic regression estimates the coefficients of a formula that predicts a logit transformation of the probability that the characteristic of interest is present.
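The logit relationship can be sketched in a few lines of Python. The intercept and coefficients below are hypothetical values picked for illustration, not coefficients fitted in this study, and the two features (glucose and BMI) are likewise assumptions.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability; the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

def predict_probability(intercept, coefficients, features):
    """Probability of class 1 given a linear formula on the features."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(z)

# Hypothetical coefficients for [glucose, BMI]; not taken from the article.
intercept = -8.0
coefficients = [0.04, 0.09]
patient = [150, 33.0]

p = predict_probability(intercept, coefficients, patient)
predicted_class = 1 if p >= 0.5 else 0  # 1 = diabetes positive
```

Here logit(p) recovers the linear score intercept + 0.04 * glucose + 0.09 * BMI, which is the sense in which the model is linear in the log-odds rather than in the probability itself.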

Multilayer Perceptron (MLP)

MLP is one of the most widely used neural network classification algorithms. This classifier uses backpropagation to learn a network that classifies instances. Its main drawback is that its predictions are difficult for a human to understand and explain. The MLP used in this experiment consisted of four layers: one input layer, two hidden layers, and one output layer.
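As a rough picture of the four-layer structure, the sketch below runs a forward pass through a network with one input layer, two hidden layers, and one output unit. The layer sizes (8, 5, 3, 1), the sigmoid activation, and the random weights are assumptions made for illustration; the article does not state them, and the backpropagation training step is left out.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_inputs, n_units, rng):
    """One fully connected layer: a list of (weights, bias) pairs, one per unit."""
    return [([rng.uniform(-1, 1) for _ in range(n_inputs)], rng.uniform(-1, 1))
            for _ in range(n_units)]

def forward(layers, inputs):
    """Feed the inputs through each layer in turn and return the final activations."""
    activations = inputs
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations

rng = random.Random(0)
# Assumed sizes: input, hidden 1, hidden 2, output.
layer_sizes = [8, 5, 3, 1]
# Four layers of units are connected by three sets of weights.
mlp_layers = [make_layer(n_in, n_out, rng)
              for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# Untrained network, so the output is only a placeholder score between 0 and 1.
output = forward(mlp_layers, [0.5] * 8)
```

Even in this tiny network the prediction depends on dozens of interacting weights, which gives a feel for why MLP output is hard to explain.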


K-Nearest Neighbor (IBK)

In Weka, the nearest neighbor classification algorithm is known as IBK (IB stands for Instance-Based, and K specifies the number of neighbors to examine). IBK is a useful data mining technique that uses past data instances with known output values to predict the unknown output value of a new data instance. It is often accurate but can be slow at prediction time, and it generally performs well for large values of K.
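The nearest neighbor idea can be sketched in plain Python as follows. This is not Weka's IBK code; the function name, the Euclidean distance, the two features, and the toy data are all assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training rows."""
    # Every prediction scans the whole training set, which is why the
    # method can be slow at prediction time.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: [glucose, BMI]; 1 = diabetic, 0 = not diabetic.
X = [[85, 22.0], [90, 24.5], [95, 23.0], [150, 33.0], [160, 35.5], [170, 31.0]]
y = [0, 0, 0, 1, 1, 1]

low_risk = knn_predict(X, y, [92, 24.0], k=3)
high_risk = knn_predict(X, y, [155, 34.0], k=3)
```

In practice the features would usually be scaled first, since an attribute with a large numeric range such as glucose otherwise dominates the distance.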

Decision Tree (J48) and Random Forest

The decision tree algorithm was originally defined as C4.5; Weka’s classifiers package has its own version, known as J48, which is an optimized implementation of C4.5. Random Forest consists of many decision trees and combines two ideas: bagging and the random selection of features. It is regarded as one of the most accurate learning algorithms available and produces a highly accurate classifier. It can handle large amounts of data and runs efficiently without variable deletion.
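The two ingredients of Random Forest, bagging and random feature selection, can be illustrated with a deliberately simplified sketch. To stay short it swaps full C4.5/J48 trees for one-split decision stumps and supports only 0/1 labels; the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for high_label in (0, 1):  # label predicted when the value exceeds the threshold
            errors = sum(
                (high_label if row[feature] > threshold else 1 - high_label) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, high_label)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: draw a bootstrap sample of the same size, with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Random feature selection: each stump sees one randomly chosen feature.
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(
        high_label if row[feature] > threshold else 1 - high_label
        for feature, threshold, high_label in forest
    )
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [glucose, BMI]; 1 = diabetic, 0 = not diabetic.
X = [[85, 22.0], [90, 24.5], [95, 23.0], [150, 33.0], [160, 35.5], [170, 31.0]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
```

Because each stump is trained on a different sample and feature, individual stumps can be wrong while the majority vote stays stable, which is the intuition behind the ensemble.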

There is no cure for diabetes, but early detection can reduce long-term complications and cost. Millions of people in the world have diabetes, and many of them do not even know it. The ability to predict diabetes early plays an important role in choosing an appropriate treatment strategy for the patient. However, the prediction accuracy of current machine learning algorithms is often low. Thus, this article applied several machine learning algorithms to predict whether an individual was diabetes positive or not, and analyzed the data to improve prediction accuracy. LR performed the best among all 10 classifiers. Further analysis of the attributes and of different feature selection combinations is required to achieve better accuracy. The outcomes might help the care process in low-resource settings and support preventive care for diabetes patients.

So, what are you waiting for? Get your hands on this awesome tutorial and get started with machine learning in no time. This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition, including diabetes onset detection with deep learning and grid search.

Projects you will learn in this tutorial:

  • Stock Market Clustering
  • Breast cancer malignancies
  • Diabetes onset detection
  • Credit card fraud detection
  • Predicting board game reviews

Back us now on Kickstarter and grab an amazing early-bird deal!