Improved Solutions For Diabetes by Team 3

Warri
Budding Data Scientists
Jun 3, 2018

Denzel Chia 3S2 (Leader), Guan Lin 3S1, David Lim 3S2

Introduction:

With the recent rise in our standards of living, there is little doubt that most of the human population will face diseases caused by the overconsumption of nutrients, one of which is diabetes. With over 422 million people having diabetes, 1 in 3 adults being obese and over 1.6 million people dying from diabetes [1], our project aims to use data science to discover the main causes of diabetes, present these findings to the general public and medical institutions, find improved solutions that help address diabetes on a global scale, and inform the general public on how to avoid it. Should this project be successful, its scope will be extended to find improved solutions to other healthcare issues.

Literature Review:

Hippisley-Cox, J., Coupland, C., Robson, J., Sheikh, A., & Brindle, P. (2009). Predicting risk of type 2 diabetes in England and Wales: prospective derivation and validation of QDScore. BMJ, 338(mar17 2). doi:10.1136/bmj.b880

The study looks into the risk of type 2 diabetes in England and Wales for people of different ethnic groups aged between 25 and 79.

The paper presents a new risk prediction algorithm for assessing the risk of developing type 2 diabetes in a very large and unselected population drawn from family practice, with appropriate weightings for ethnicity and social deprivation. The algorithm (QDScore) is based on variables that are readily available in patients' electronic health records, so that it can be implemented cost-effectively.

The authors identified an open cohort of patients aged 25–79 years at the study entry date and calculated the crude incidence rates of type 2 diabetes according to age, ethnic group, and deprivation in fifths. Using a Cox proportional hazards model, they estimated the coefficients and hazard ratios associated with each potential risk factor. Fractional polynomials were then used to model non-linear risk relations. Interactions between each variable and age, and between smoking and deprivation, were tested, and significant interactions were included in the final model. Multiple imputation was used to replace missing values for smoking status and body mass index. The model was fitted to each of the multiply imputed datasets, and Rubin's rules were used to combine the estimates of effects and their standard errors, allowing for the uncertainty caused by missing data.
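The pooling step can be illustrated with a minimal sketch of Rubin's rules. The function name `pool_rubin` and all numbers below are made up for illustration and are not taken from the paper; the sketch assumes each imputed dataset has already produced one coefficient (a log hazard ratio) and its standard error.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimated on each imputed dataset
    variances: the squared standard error of each of those estimates
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)            # average of the m estimates
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical log hazard ratios and standard errors from 5 imputed datasets.
log_hr = [0.50, 0.54, 0.46, 0.52, 0.48]
se = [0.10, 0.11, 0.10, 0.09, 0.10]

pooled_log_hr, pooled_se = pool_rubin(log_hr, [s ** 2 for s in se])
hazard_ratio = math.exp(pooled_log_hr)  # a Cox coefficient exponentiates to a hazard ratio
print(round(hazard_ratio, 3), round(pooled_se, 4))
```

The pooled standard error is larger than the average of the individual standard errors because the between-imputation spread is added on top, which is how the extra uncertainty from the missing data is carried into the final estimate.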

In the article, the authors concluded that there is a marked difference in the age standardised incidence rates of type 2 diabetes by deprivation, with a more than twofold difference for women when comparing the most deprived fifth with the most affluent fifth. Age standardised rates were also found to be significantly higher for men in every ethnic group compared with the white reference group, except for Chinese men. In women, age standardised incidence rates were higher for every group compared with the white reference group. These findings are then combined into the QDScore, which provides a simple method to assess the risk of diabetes.
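To show what an age standardised rate means, here is a minimal sketch assuming direct standardisation, where each age band's rate is weighted by a fixed standard population. The age bands, weights, case counts and the function name `age_standardised_rate` are all hypothetical; the paper's exact procedure may differ.

```python
def age_standardised_rate(cases, person_years, standard_weights):
    """Directly standardised incidence rate per 1,000 person-years.

    Each age band's crude rate (cases / person-years) is weighted by that
    band's share of a standard population, then the weighted rates are summed.
    """
    if not (len(cases) == len(person_years) == len(standard_weights)):
        raise ValueError("all inputs must cover the same age bands")
    total_weight = sum(standard_weights)
    rate = 0.0
    for c, py, w in zip(cases, person_years, standard_weights):
        rate += (c / py) * (w / total_weight)
    return rate * 1000


# Hypothetical age bands: 25-44, 45-64, 65-79, with made-up standard weights.
weights = [0.5, 0.3, 0.2]

# Made-up counts for the most affluent and the most deprived fifth.
rate_affluent = age_standardised_rate([10, 40, 30], [10000, 8000, 3000], weights)
rate_deprived = age_standardised_rate([30, 60, 20], [15000, 6000, 1000], weights)

print(rate_affluent, rate_deprived, rate_deprived / rate_affluent)
```

Because both groups are weighted by the same standard population, the ratio of the two rates reflects a difference in risk rather than a difference in age structure between the groups.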

To ensure that their conclusions were valid, the authors validated the algorithm in a separate sample of practices and found that the QDScore has good discrimination and explains approximately 50% of the total variation in time to diagnosis of diabetes. The D statistic, a measure of discrimination appropriate for survival-type data, was higher for the QDScore algorithm than in some other studies. In addition, interactions between the variables and the risk of diabetes were tested, and only the significant interactions were included. In order to ensure there were sufficient data to establish a trend, multiple imputation was used to fill in missing body mass index and other data.

Nonetheless, the authors assumed that patients who were given insulin before the age of 35 had type 1 diabetes, while the others had type 2 diabetes. This might affect the results, as some patients with type 2 diabetes may have been given insulin before the age of 35.

Despite this, the authors' work gave us a new method to predict the risk of type 2 diabetes in a very large and unselected group, with an appropriate weighting for each factor. The algorithm also estimates the risk of diabetes from variables such as age, ethnic group, social deprivation and body mass index.

Overall, the article enabled us to gain insight into the use of data analysis tools in healthcare and how they can be used to assess the risk of chronic diseases such as diabetes. The article also provides insight into some data analysis methods, such as using fractional polynomials to model non-linear relations between a variable and risk.
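As a rough illustration of the fractional polynomial idea, the sketch below tries each power from the conventional set (-2, -1, -0.5, 0, 0.5, 1, 2, 3, where 0 stands for the natural logarithm) and keeps the one whose transformed variable fits best. It uses ordinary least squares on made-up data as a stand-in; the paper fits these terms inside a Cox model, which is not reproduced here. The names `fp_transform`, `fit_line` and `best_fp1` are our own.

```python
import math

# Conventional power set for fractional polynomials; 0 means ln(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_transform(x, power):
    """Apply one fractional polynomial power to a positive value."""
    if x <= 0:
        raise ValueError("fractional polynomials need positive values")
    return math.log(x) if power == 0 else x ** power


def fit_line(t, y):
    """Least-squares fit of y = intercept + slope * t; also returns the RSS."""
    n = len(t)
    t_bar, y_bar = sum(t) / n, sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    rss = sum((yi - (intercept + slope * ti)) ** 2 for ti, yi in zip(t, y))
    return intercept, slope, rss


def best_fp1(x, y):
    """Return (power, intercept, slope, rss) for the best first-degree term."""
    best = None
    for p in FP_POWERS:
        t = [fp_transform(xi, p) for xi in x]
        intercept, slope, rss = fit_line(t, y)
        if best is None or rss < best[3]:
            best = (p, intercept, slope, rss)
    return best


# Made-up data: an outcome that rises with the logarithm of body mass index.
bmi = [18, 22, 26, 30, 34, 38]
outcome = [2 + 3 * math.log(b) for b in bmi]

power, intercept, slope, rss = best_fp1(bmi, outcome)
print(power, round(intercept, 3), round(slope, 3))
```

A straight line in body mass index would miss the flattening at high values, while the selected transform captures it with a single extra term.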

Methods:

For our methodology, we will first research the factors that cause diabetes and look for solutions that address them. Afterwards, we will compare our ideal solution with the measures currently imposed by governments, then use data mining with tools such as Tableau and RStudio to find the best solution. Should the project go well, we would extend its scope to other healthcare problems, look for common factors among the best solutions and create an algorithm based on them.

For data sources, we would use websites such as www.data.gov and www.moh.gov.sg to obtain local data regarding diabetes. We would also obtain data from websites such as www.data.worldbank.org and www.who.int to view the current impact of diabetes on a global scale, and look at measures imposed by other countries through other sources that are yet to be confirmed.

In order to sort the data gathered, we would use tools such as RStudio and Tableau to analyse, sort and shortlist data, while using Python for web scraping when needed.
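As a rough idea of what the Python web scraping step could look like, the sketch below pulls the rows of an HTML table into a list of records using only the standard library. The class name `TableParser` is our own, and the HTML snippet and its values are made-up placeholders rather than real statistics; a real page would first need to be downloaded and would likely need extra handling.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Placeholder HTML standing in for a downloaded page (values are made up).
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Prevalence (%)</th></tr>
  <tr><td>2010</td><td>8.1</td></tr>
  <tr><td>2015</td><td>8.9</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)

# Treat the first row as the header and turn the rest into records.
header, *records = parser.rows
table = [dict(zip(header, record)) for record in records]
print(table)
```

Records in this shape can be written out as CSV and then loaded into RStudio or Tableau for sorting and shortlisting.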

As for the roles of each member in the group, Denzel would compile and shortlist data, Guan Lin would come up with interpretations of the causes and effects, and David would search for data from various sources.

Here is our current timeline:

[Figure: project timeline]