The Big Data Murse
Before my rather unexpected decision to spend the summer studying data science at General Assembly, I had planned on starting nursing school this fall. The most fulfilling job of my life, and the one that originally led me to go into healthcare, was the summer I spent taking care of a developmentally delayed child with Type I diabetes. I had to control his diet, monitor his blood sugar level, give him insulin shots, and constantly be on guard for other factors that might influence his condition. Switching careers was a big choice. However, once I learned how widely data science is applied across the healthcare industry, my decision was made. I was going to combine my interests and be a data scientist working in the healthcare field.
So when our teacher Matt announced an all-day hackathon (hack + marathon) making use of the famed NHANES data set, I was excited. The NHANES (National Health and Nutrition Examination Survey) data set is the gold standard in the medical field. It goes back to 1971, and participants fill out surveys on everything from socioeconomic status to lifestyle habits, as well as receive lab tests, dental examinations, and physiological tests. Because the data covers such a long span of years, it allows us to view trends over time, something that is often missing from other data sets. It was also the first time that I had really worked with data that had any sort of medical element to it. Once my team and I downloaded the data set, we quickly realized that we would have to zero in on one particular problem or we would spend all eight hours just learning the layout of the data set, which was incredibly detailed and nonsensically labeled. I suggested to my team that we work with the diabetes information, since I had some subject-matter knowledge.
We set our goal as determining which socioeconomic and health factors are strong indicators of whether an individual will develop Type II diabetes. Most years of the data set didn’t record whether a person had Type I or Type II diabetes, and for the purposes of our study the two are totally different. We therefore decided to drop anyone who was diagnosed with diabetes before the age of 18, as that is usually indicative of Type I diabetes. We worked our way through the data science workflow, checking the integrity of our data and looking for correlations between our many variables. We then built a logistic regression model to estimate how much of an impact the factors we chose would have on our outcome. Since Type II diabetes is largely due to an individual’s lifestyle choices, we kept in mind that many factors would be strongly multicollinear. If age is a determining factor in BMI and BMI is a factor in diabetes, placing both of these in your model will lead to issues, so where we felt it was necessary we made interaction terms, which for the purposes of a model essentially combine the two factors into one.
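To make those steps concrete, here is a minimal sketch in plain Python. It is not the code our team wrote at the hackathon. The rows, the field names (age_at_diagnosis, bmi, and so on), the scaling constants, and the hand-rolled gradient descent fit are all assumptions made up for illustration, and a real analysis would use the actual NHANES variables and a proper statistics library. It shows the three ideas from the paragraph above: dropping anyone diagnosed before 18, building an age-times-BMI interaction term, and fitting a logistic regression.

```python
import math

# Hypothetical survey rows. Field names and values are invented for this
# sketch and do not come from the NHANES files.
records = [
    {"age": 34, "bmi": 22.5, "diabetes": 0, "age_at_diagnosis": None},
    {"age": 61, "bmi": 33.0, "diabetes": 1, "age_at_diagnosis": 55},
    {"age": 47, "bmi": 29.0, "diabetes": 0, "age_at_diagnosis": None},
    {"age": 58, "bmi": 35.5, "diabetes": 1, "age_at_diagnosis": 49},
    {"age": 25, "bmi": 24.0, "diabetes": 1, "age_at_diagnosis": 9},
    {"age": 70, "bmi": 27.0, "diabetes": 1, "age_at_diagnosis": 66},
    {"age": 52, "bmi": 31.0, "diabetes": 0, "age_at_diagnosis": None},
    {"age": 39, "bmi": 26.0, "diabetes": 0, "age_at_diagnosis": None},
]


def keep_for_study(rec):
    """Drop anyone diagnosed before 18, a rough proxy for Type I diabetes."""
    dx = rec["age_at_diagnosis"]
    return dx is None or dx >= 18


def features(rec):
    """Intercept, scaled age, scaled BMI, and their product (the interaction)."""
    age = (rec["age"] - 50) / 10  # arbitrary centering/scaling for stability
    bmi = (rec["bmi"] - 28) / 5
    return [1.0, age, bmi, age * bmi]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Predicted probability of diabetes for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def log_loss(w, X, y):
    """Average negative log-likelihood; lower means a better fit."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(w, xi)
        total -= math.log(p) if yi == 1 else math.log(1.0 - p)
    return total / len(X)


def fit_logistic(X, y, lr=0.1, steps=500):
    """Fit logistic regression weights with plain batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


cohort = [r for r in records if keep_for_study(r)]
X = [features(r) for r in cohort]
y = [r["diabetes"] for r in cohort]
weights = fit_logistic(X, y)

print("rows kept:", len(cohort))
print("weights (intercept, age, bmi, age*bmi):", [round(v, 3) for v in weights])
```

With only a handful of invented rows the fitted weights mean nothing medically. The point is just the shape of the workflow: filter the cohort, engineer the features, then fit.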
Our model identified race as a key factor in developing the disease. Relative to our reference group, being African American gave you a 20 percent greater risk of developing diabetes. Similarly, being Native American put you at nearly twice the risk of the reference group. I would guess that this is most likely due to a mixture of genetics, cultural habits, and socioeconomic factors. Households with incomes below 35,000 dollars had a 25 percent greater likelihood of developing diabetes than those making more than that. On the opposite side of the income scale, those making above 75,000 dollars had a 10 percent lower likelihood of developing it. I would hypothesize that this is most likely due to the wider variety of healthy eating options available to families with more disposable income. A BMI above thirty, the threshold for obesity, naturally had a huge impact, nearly tripling a person’s risk versus those who were not obese. Hypertension together with age, as well as several other factors that are generally determined by a person’s eating and exercise habits, were determining factors as well.
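For anyone wondering how percentages like these come out of a logistic regression, here is a small sketch. It assumes the figures above are odds ratios, meaning the exponentiated model coefficients, which is the usual way to read a logistic model but is an assumption here. The helper names are hypothetical, and the coefficients it prints are backed out from the percentages quoted above rather than copied from our actual model output.

```python
import math


def percent_change_in_odds(coef):
    """Turn a logistic regression coefficient into a percent change in odds."""
    return (math.exp(coef) - 1.0) * 100.0


def coef_from_percent(pct):
    """Inverse of the above: the coefficient implied by a percent change."""
    return math.log(1.0 + pct / 100.0)


# Percent changes quoted in the text; the labels are shorthand for this sketch.
reported = {
    "African American vs. reference group": 20,
    "household income below $35,000": 25,
    "household income above $75,000": -10,
}

for label, pct in reported.items():
    coef = coef_from_percent(pct)
    print(f"{label}: coefficient ~ {coef:+.3f}, odds ratio ~ {math.exp(coef):.2f}")
```

A positive coefficient raises the odds and a negative one lowers them. Strictly speaking, an odds ratio is not quite the same thing as a change in risk, although the two are close when the outcome is uncommon.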
It was really cool to quantify these things, which I had always heard about, by actually doing my own analysis on the data. To top it off, our team won the hackathon, and nothing says victory like a free round of drinks on the instructors.