Dengue Fever Prediction

Published in Analytics Vidhya

5 min read · Mar 6, 2020

Dengue viruses are spread to people through the bite of an infected Aedes species (Ae. aegypti or Ae. albopictus) mosquito. Dengue is common in more than 100 countries around the world. Forty percent of the world’s population, about 3 billion people, live in areas with a risk of dengue. Dengue is often a leading cause of illness in these areas.

I’ll be using data from San Juan, Puerto Rico and Iquitos, Peru to predict the total number of dengue fever cases for each week. Let’s start by looking at the total cases of dengue plotted over time.

As we can see above, we have 18 years’ worth of data for San Juan (1990–2007) but only 10 years for Iquitos (2000–2009). To account for this imbalance, I first held out a validation set from the training data (to avoid leakage) and then split the data into 2 groups based on which city it belonged to. It’s also hard to see any clear pattern in the plot above, so I engineered a “month” feature to get a better understanding of when infections are most likely to occur. This turned out to be the most important feature in the San Juan data, as shown in the plot further down.
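To make the city split and the “month” feature concrete, here is a minimal sketch in plain Python. The rows are made up, and the column names (`city`, `week_start_date`, `total_cases`) and city labels are assumptions for illustration, not taken from the data described above. In pandas, the same idea would typically use `Series.dt.month` and `DataFrame.groupby`.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: column names and values are assumptions for illustration.
rows = [
    {"city": "sj", "week_start_date": "1990-07-02", "total_cases": 10},
    {"city": "sj", "week_start_date": "1990-07-09", "total_cases": 14},
    {"city": "sj", "week_start_date": "1990-11-05", "total_cases": 30},
    {"city": "iq", "week_start_date": "2000-08-05", "total_cases": 2},
    {"city": "iq", "week_start_date": "2000-08-12", "total_cases": 4},
]


def add_month(row):
    """Return a copy of the row with an integer month (1-12) parsed from its date."""
    month = date.fromisoformat(row["week_start_date"]).month
    return {**row, "month": month}


def split_by_city(rows):
    """Group rows into one list per city label."""
    by_city = defaultdict(list)
    for row in rows:
        by_city[row["city"]].append(row)
    return dict(by_city)


def mean_cases_by_month(rows):
    """Average weekly cases for each month that appears in the rows."""
    cases = defaultdict(list)
    for row in rows:
        cases[row["month"]].append(row["total_cases"])
    return {month: sum(v) / len(v) for month, v in sorted(cases.items())}


by_city = split_by_city([add_month(r) for r in rows])
monthly = {city: mean_cases_by_month(city_rows) for city, city_rows in by_city.items()}
print(monthly)
```

Averaging by month within each city is one simple way to see when infections tend to peak before any model is trained.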

The evaluation metric I’ll be using for my models is mean absolute error.

I chose it because it penalizes large errors less harshly than metrics like mean squared error (MSE), which squares each error. This matters here because the plot above shows large spikes in the number of infections: we still want to anticipate these spikes as well as possible, but a handful of outbreak weeks should not dominate the score.
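As a rough illustration of that difference, the sketch below computes both metrics by hand on made-up numbers where a single week contains a missed spike. The values are hypothetical and only meant to show how each metric reacts to one large error.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical weekly case counts: three quiet weeks and one outbreak week.
actual = [10, 12, 11, 100]
predicted = [10, 12, 11, 20]  # the model misses the spike by 80 cases

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(mae)  # 20.0
print(mse)  # 1600.0
```

The single missed spike contributes linearly to MAE but quadratically to MSE, which is why one outbreak week can swamp an MSE score.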

This plot gives us a better understanding of our data and shows that infections in San Juan start to rise around July and decline starting in November, while in Iquitos we see infections begin to rise in August…