Predicting Dengue Cases in San Juan, Puerto Rico and Iquitos, Peru

Step 1: Clean and Explore

I’ve recently decided to enter a data prediction competition to sharpen my skills as a Data Scientist. Over the next few weeks, I hope to post my progress and my rationale for each step of the data analysis and machine learning process.

The competition is hosted by Driven Data; the full description can be found here. With a background in global public health, I was immediately interested in a project that would predict the number of cases of Dengue Fever. A description of each feature (all of which are either satellite imagery measurements of vegetation or weather variables) can be found here.

I will post snippets of code and visualizations throughout these posts. Comments and suggestions are always welcome.

Cleaning: Missing and Imputations

  1. Split data into separate cities. We don’t want to impute data from one city to the missing cells of the other city.
  2. Find what data is missing in each city and see if there are any patterns.
  3. Impute something into those missing data cells. Because the data is time-dependent, imputation will be done with respect to closest available data by date.
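The three steps above can be sketched roughly as follows. This is only an illustration on a tiny made-up frame, not the competition data: the values are invented, and the column names `city`, `week_start_date`, `ndvi_ne`, and `station_avg_temp_c` are assumed here for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature table (all values are made up).
df = pd.DataFrame({
    "city": ["sj", "sj", "sj", "iq", "iq", "iq"],
    "week_start_date": pd.to_datetime(
        ["2000-01-01", "2000-01-08", "2000-01-15"] * 2),
    "ndvi_ne": [0.1, np.nan, 0.3, 0.2, 0.25, np.nan],
    "station_avg_temp_c": [25.0, 26.0, np.nan, np.nan, 27.0, 27.5],
})

# 1. Split by city so values are never imputed across cities.
df_sj = df[df["city"] == "sj"].sort_values("week_start_date").copy()
df_iq = df[df["city"] == "iq"].sort_values("week_start_date").copy()

# 2. Count missing values per column in each city.
missing_sj = df_sj.isna().sum()
missing_iq = df_iq.isna().sum()
print(missing_sj, missing_iq, sep="\n\n")

# 3. Fill each gap from the closest earlier observation (forward fill).
filled_sj = df_sj.ffill()
filled_iq = df_iq.ffill()
```

Note that a forward fill cannot fill a gap at the very start of a series, because there is no earlier observation to copy from. In this toy example the first `station_avg_temp_c` value for Iquitos stays missing.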
Missing Data from San Juan and Iquitos

In Iquitos, we can see that station_avg_temp_c and station_diur_temp_rng_c both tend to be missing on the same observations. Also, there are a few observations where the entire row of measurements (not including dates) is missing, and a few where only the weather measurements are gone.

In San Juan, ndvi_ne and sometimes ndvi_nw are missing. There is a large block around row 225 where all satellite data is missing. However, the San Juan weather data is more complete than the Iquitos weather data.

## Impute into Missing values
## see: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
## see: https://pandas.pydata.org/pandas-docs/stable/missing_data.html
## front fill and back fill allow for date-sensitive filling
## .ffill() replaces fillna(method='ffill'), which newer pandas versions deprecate
df_sj = df_sj.ffill()
df_iq = df_iq.ffill()

Pandas really makes life easy sometimes. However, it is important to note a few things:

  • First, we do not want to simply drop rows with missing data: there is valuable information in the other columns, and we need to test our predictions against the actual values when evaluating the model. Time-sensitive models also depend on information from prior observations, so dropping rows would break the time series.
  • Second, I used front fill, which uses the prior observation to impute. As we saw with the missing data visualization, some features had multiple missing values in a row (week after week). Repeating the same prior value across all of them may be problematic. We could have used backfill, which takes the following observation to impute, but that would pose the same problem.
  • A better method could be either to impute the average of the prior and following observations, or to fill the first missing value from the front and the last from the back, repeating until the gap is filled.
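The alternatives in the last bullet might look something like this in pandas. This is a sketch on a made-up series with a three-week gap; `fill_from_both_ends` is a hypothetical helper written for this illustration, not a pandas function.

```python
import numpy as np
import pandas as pd

# Toy weekly series with three missing weeks in a row (values are made up).
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

front = s.ffill()   # repeat the prior observation across the gap
back = s.bfill()    # repeat the following observation across the gap

# Option A: average of the front-filled and back-filled values.
average = (front + back) / 2


def fill_from_both_ends(series):
    """Hypothetical helper: fill one value from the front, one from the
    back, and repeat until the gap is closed."""
    out = series.copy()
    while out.isna().any():
        remaining = out.isna().sum()
        out = out.ffill(limit=1).bfill(limit=1)
        if out.isna().sum() == remaining:
            break  # nothing left to copy from (e.g. an all-NaN series)
    return out


# Option B: alternate between front and back fills.
both_ends = fill_from_both_ends(s)

print(pd.DataFrame({"front": front, "back": back,
                    "average": average, "both_ends": both_ends}))
```

Option A gives every value in the gap the same midpoint, while Option B keeps each missing week equal to its nearest observed neighbour. Which one helps the model is an empirical question.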

The model scores will tell which method is best. In future work, I will try all of these imputation methods and select the best-performing one.

Explore: Visualizing the Target

Big picture analysis first. What am I trying to predict? What does it look like over time? What trends pop out?

Cases in San Juan and Iquitos over time
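# Plot cases over time for each city
import matplotlib.pyplot as plt

for i in ['iq', 'sj']:
    data = df[df['city'] == i]
    data.plot(figsize=(12, 5))  # DataFrame.plot opens a new figure per city
    plt.title(str(i))
    plt.xlabel("Year, Week of Year")
    plt.ylabel("Number of Cases")
plt.show()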

A few immediate observations:

  • The spikes in the time-series plots are obvious outbreaks. Beyond the obvious public health reasons, these will be important to predict because the scoring metric for this competition is Mean Absolute Error. Predicting only the cyclical trend of dengue would miss the spikes and produce a large MAE (although the penalty would be smaller than under Mean Squared Error).
  • The seasonal cycle of dengue (peaks and valleys) is more obvious visually in San Juan than in Iquitos. This could be due partly to the fact that there are more cases in San Juan than in Iquitos (see the y-scale).
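To make the point about the scoring metric concrete, here is a small sketch with invented numbers: a prediction that follows the usual weekly level but misses a single outbreak week. The counts are made up purely for illustration.

```python
import numpy as np

# Made-up weekly case counts: a steady baseline with one outbreak week.
actual = np.array([10, 12, 11, 150, 12, 10], dtype=float)
# A prediction that captures the baseline but not the outbreak.
seasonal_only = np.array([10, 12, 11, 12, 12, 10], dtype=float)

errors = actual - seasonal_only
mae = np.mean(np.abs(errors))   # 138 / 6 = 23.0
mse = np.mean(errors ** 2)      # 138**2 / 6 = 3174.0

print(f"MAE: {mae:.1f}, MSE: {mse:.1f}")
```

A single missed outbreak dominates both scores, but squaring the error makes the penalty far harsher under MSE than under MAE.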
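# Plot distribution of weekly case counts for each city
for i in ['iq', 'sj']:
    data = df[df['city'] == i]
    data.hist(bins=100, figsize=(12, 5))
    plt.title(str(i))
    plt.xlabel("Number of Cases")   # histogram x-axis is the case count
    plt.ylabel("Number of Weeks")   # y-axis is how many weeks had that count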

A few observations from these two visualizations of dengue cases in both cities:

  • The weekly case counts are not normally distributed. Each distribution peaks very close to 0, and Iquitos has many more weeks with 0 cases than with any other count. The time-series graph for Iquitos shows a long period (1990 to ~2001) with 0 or very few cases. Perhaps the features will explain why this period is so different from the rest.
  • High values far to the right are part of the peaks in the cyclical trend or the outbreaks themselves. These rare events will be more difficult to predict.
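The same observations can be checked numerically rather than by eye. The sketch below uses a toy frame with invented counts, and `total_cases` is an assumed name for the target column.

```python
import pandas as pd

# Toy labels frame (values are made up; `total_cases` is an assumed name).
df = pd.DataFrame({
    "city": ["iq"] * 6 + ["sj"] * 6,
    "total_cases": [0, 0, 0, 1, 2, 30, 3, 5, 4, 6, 8, 90],
})

summary = df.groupby("city")["total_cases"].agg(
    zero_share=lambda s: (s == 0).mean(),  # fraction of weeks with no cases
    median="median",
    mean="mean",
    skew="skew",                           # > 0 indicates a long right tail
)
print(summary)
```

A mean well above the median together with a positive skew is the numerical signature of the long right tail seen in the histograms.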

Next Steps

In my next post, I plan to post more exploratory analysis of the features themselves. Before I can even begin modelling, it is important to know how the features interact and correlate with each other.