Applying Data Science to Epidemiology: modeling cholera outbreaks for the period 2010–2015

Miranda Gibbons
Healthcare in America
4 min read · May 30, 2017

As a final project for my data science boot camp, I decided to combine the skills I had learned during the course with my background in biology. At first, I wanted to look at biological agents that had been evaluated in bioterrorism threat assessments, but I found little to no data on which to base a project. I then considered modeling outbreaks of Yersinia pestis, the bacterium responsible for plague. Plague is a fascinating disease that has caused highly fatal epidemics throughout history. Outside of epidemic periods, however, very few cases are recorded, so insufficient data stood in my way yet again (I would really like to model plague at a later date, given more data). I finally decided to look at cholera, a bacterial infection that is far more common in the developing world and is prevented most effectively by access to clean water.

Cholera is an intestinal infection caused by several strains of the bacterium Vibrio cholerae. Its main symptom is severe diarrhea with accompanying dehydration, and the disease becomes fatal when treatment is not sought and dehydration progresses. Globally, there are approximately three to five million cases a year, of which 40,000 to 130,000 are fatal. As mentioned above, access to clean water and better sanitation are the main avenues of prevention. A vaccine also confers about six months of immunity, and oral rehydration packets are the WHO-approved treatment. Natural disasters, such as the 2010 earthquake in Haiti and the floods in Pakistan that same year, often precipitate larger and more fatal outbreaks of cholera, because people are displaced and existing water infrastructure cannot cope with the aftermath.

I retrieved data on the number of cholera cases and cholera deaths for each country by year from the World Health Organization (WHO). These datasets covered the 1950s through 2015, but not every country was represented in every year. I restricted my analysis to 2010 through 2015 to capture the most accurate picture of cholera at a national and yearly level. I then used the World Bank’s databank to add health and population statistics, such as each country's population by year, the percent of the population with access to improved sanitation, and the number of physicians per 1,000 citizens. I found climate data, specifically average temperature and total rainfall by month for each country, through another World Bank data API. I combined these datasets so that each row represented one country in one year. Unfortunately, a great deal of the health data was missing: most of the values for the number of physicians and community health workers in each country were absent, as were those for the amount of money allotted to healthcare. Had these measures been available, I would have expected them to correlate with the number of cholera cases in a given area. However, many of my explanatory variables were correlated with one another, so including measures such as life expectancy and improved sanitation may compensate for the missing predictors.
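
To make the combining step concrete, here is a minimal sketch of one way to collapse monthly climate rows to yearly values and join everything on country and year. It is not the code I ran for the project: the field names, the helper functions `yearly_climate` and `combine`, and every value in the toy records are invented for illustration, and the real WHO and World Bank files are laid out differently.

```python
from collections import defaultdict
from statistics import mean

# Toy records with invented values; the real WHO and World Bank files differ.
cases = [
    {"country": "A", "year": 2010, "cases": 500, "deaths": 10},
    {"country": "A", "year": 2011, "cases": 300, "deaths": 4},
    {"country": "B", "year": 2010, "cases": 20, "deaths": 0},
]
indicators = [
    {"country": "A", "year": 2010, "population": 1_000_000, "sanitation_pct": 25.0},
    {"country": "A", "year": 2011, "population": 1_020_000, "sanitation_pct": 26.0},
    {"country": "B", "year": 2010, "population": 5_000_000, "sanitation_pct": None},
]
monthly_climate = [
    {"country": "A", "year": 2010, "month": m, "temp_c": 25.0 + m % 2, "rain_mm": 100.0}
    for m in range(1, 13)
]


def yearly_climate(rows):
    """Collapse monthly rows to one entry per (country, year)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["country"], row["year"])].append(row)
    return {
        key: {
            "mean_temp_c": mean(r["temp_c"] for r in group),    # average temperature
            "total_rain_mm": sum(r["rain_mm"] for r in group),  # summed rainfall
        }
        for key, group in grouped.items()
    }


def combine(case_rows, indicator_rows, climate_by_key):
    """Left-join indicators and climate onto the case records."""
    indicator_by_key = {(r["country"], r["year"]): r for r in indicator_rows}
    combined = []
    for row in case_rows:
        key = (row["country"], row["year"])
        merged = dict(row)
        # A missing source leaves None instead of dropping the country-year.
        ind = indicator_by_key.get(key, {})
        merged["population"] = ind.get("population")
        merged["sanitation_pct"] = ind.get("sanitation_pct")
        clim = climate_by_key.get(key, {})
        merged["mean_temp_c"] = clim.get("mean_temp_c")
        merged["total_rain_mm"] = clim.get("total_rain_mm")
        combined.append(merged)
    return combined


table = combine(cases, indicators, yearly_climate(monthly_climate))
```

Keeping every case record and filling the gaps with `None` makes the missing health statistics visible in the combined table, so the decision about what to do with them can be made later and explicitly.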

I was interested in both the number of cholera cases and the number of cholera deaths, each scaled by population. I decided to build a model predicting the former, partly because of my missing predictors. For each country and year, I expressed the number of cholera cases per 1,000 people. I then split this rate into three categories: a country/year received a 1 if its rate was below the “average” global rate of cholera (less than 0.17 cases per 1,000 people), a 2 if its rate was “average” (between 0.17 and 0.55 cases per 1,000 people), and a 3 if its rate was above that range. These categories were extremely unbalanced; that is, they did not each contain one third of the observations. Class 1 made up 81% of the observations, class 2 made up 11%, and class 3 made up 8%. Any model predicting which category a country falls into for a given year needs to take this imbalance into account. At first, I simply weighted each class to balance the predictions, but I ultimately found that I needed to weight class 3 more heavily for my model to classify observations most accurately and precisely.
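
The binning and weighting can be sketched in a few lines. This is an illustration under stated assumptions rather than my project code: the function names are hypothetical, the handling of rates that fall exactly on a cutoff is my guess, the label list is invented to mirror the 81% / 11% / 8% split, and the extra factor applied to class 3 is a placeholder, not the value I settled on. The balancing formula, n_samples / (n_classes * class_count), is the heuristic scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter

LOW_CUTOFF = 0.17   # cases per 1,000 people, from the text
HIGH_CUTOFF = 0.55  # cases per 1,000 people, from the text


def rate_per_1000(cases, population):
    """Cholera cases per 1,000 people for one country-year."""
    return cases / population * 1000


def outbreak_class(rate):
    """Map a rate to class 1 (below average), 2 (average), or 3 (above)."""
    # Which class receives a rate of exactly 0.17 or 0.55 is an assumption.
    if rate < LOW_CUTOFF:
        return 1
    if rate <= HIGH_CUTOFF:
        return 2
    return 3


def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


# Invented label list that mirrors the reported 81% / 11% / 8% split.
labels = [1] * 81 + [2] * 11 + [3] * 8
weights = balanced_weights(labels)

# Hypothetical extra emphasis on class 3; the factor 2.0 is illustrative only.
weights[3] *= 2.0
```

With balanced weights alone, the rarest class already counts roughly ten times as much per observation as the most common one; the final multiplication shows where a further manual boost to class 3 would go.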

I built a number of models with different explanatory variables and compared their performance to determine which would best fit the data. Ultimately, a Random Forest Classifier outperformed Linear Discriminant Analysis and a Naïve Bayes Classifier in accuracy by several percentage points. I then fine-tuned which of my available measures to include as predictors. Ranked feature importance showed that life expectancy, mean annual temperature, access to improved sanitation, access to clean water, and the percent of the population practicing open defecation were the most effective predictors of the size of a cholera outbreak. My final model achieved 81% accuracy, capturing 84% of the class 1 observations, 78% of the class 2 observations, and 57% of the class 3 observations. The performance of this model, and of the many inferior iterations before it, shows how difficult it is to predict particularly severe outbreaks, and I hope in further analyses to tease out the factors that precipitate them.
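
A quick piece of arithmetic shows why an 81% headline figure can sit alongside weak performance on severe outbreaks. Overall accuracy is the average of the per-class capture rates (recall), weighted by how common each class is. The sketch below assumes the evaluation set had the same class mix as the full dataset, which I have not verified here; the function names are hypothetical.

```python
# Class shares and per-class capture rates (recall) as reported in the text.
class_share = {1: 0.81, 2: 0.11, 3: 0.08}
recall = {1: 0.84, 2: 0.78, 3: 0.57}


def overall_accuracy(share, recall_by_class):
    """Accuracy as the share-weighted average of per-class recall."""
    return sum(share[c] * recall_by_class[c] for c in share)


def macro_recall(recall_by_class):
    """Unweighted mean of per-class recall, treating every class equally."""
    return sum(recall_by_class.values()) / len(recall_by_class)


accuracy = overall_accuracy(class_share, recall)
balanced = macro_recall(recall)
print(f"accuracy: {accuracy:.3f}, macro recall: {balanced:.3f}")
```

Under that assumption the weighted average comes to about 0.81, matching the reported accuracy, while the unweighted mean is 0.73. Class 1 dominates the headline number, so the 57% capture rate for class 3 barely moves it.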

Miranda Gibbons
Healthcare in America

Data Scientist | Biologist | Aspiring Jeopardy Contestant