DengAI Competition Entry

Introduction

The goal of this competition is to accurately predict dengue fever outbreaks in two Peruvian cities: San Juan and Iquitos. The forecasting initiative was launched by the Obama Administration in 2015 and spans several federal agencies ¹. Dengue fever outbreaks, spread by mosquitoes, effect millions of people per year with epidemic spikes every 3–5 years that lead to significant loss of life². Better forecasts of outbreaks can aid policy-makers in efforts to allocate resources that combat the disease and head off outbreaks before they reach epidemic proportions.

The data provided consists of weekly dengue cases as well as a host of weather and climate variables measured by the US Navy, NOAA, and others. First, several exploratory steps are taken to assess data quality and attempt to identify any promising interactions between dengue cases and potential explanatory variables. Then a variety of models are attempted and evaluated based on mean absolute error (MAE) with the winning model to be submitted in the DengAI Challenge hosted by DrivenData. Code for all calculations and visuals can be found on github.

Exploratory Analysis

The data provided consists of weekly measurements of various ecological and climate data points for two cities in Peru. Fields starting in “ndvi” are satellite measurements of vegetation in proximity to each city ³. Several fields measure temperature, precipitation, humidity, and temperature range. Several of these variables exhibit strong collinearity because they measure the same target construct (ex: `station_max_temp_c` and `reanalysis_max_air_temp_k`). In those cases, one variable is excluded from the analysis.

Almost all of the missing data involves vegetation measurements around San Juan (for more on those measurements, read this). Sometimes the absence of data or the presence of an outlier can signal some risk that may impact the variable to be predicted. For example, perhaps faulty sensors that monitor vegetation levels signal larger governance problems that make it more difficult to quell the spread of dengue fever. New columns of data are added that flag missing values and outliers so they may be included in predictive models. Outliers are then deleted and all missing values are imputed using regression trees (from the mice package in R).

The resulting variable distributions for San Juan are presented below (a similar plot for Iquitos is here). This visual presents a scatter plot for each combination of variables, a density plot for each variable, and the correlation of each combination of variables. For example, weekofyear explains 28% of the variance in the number of fever cases. The correlation value is in the top-right corner of the grid, a scatter plot of the relationship is in the bottom-left, and distributions of each variable are in the top-left and bottom-right corners. Some variables exhibit obvious seasonality and there are still some collinearity problems, but mostly variables look decently distributed for use in models.

Decomposition is an exploratory technique for separating out seasonal fluctuation and trends from a time series. The charts below show separate line graphs for the overall data, the seasonal pattern in the data, overall trends, and whatever is left.

Decomposition of total fever cases for each city demonstrate clear seasonality corresponding to summers, which makes sense for a mosquito-borne virus. There may be a slight long term upward trend of outbreaks in Iquitos and a slight decrease over time in San Juan, but those trends are weak at best. The bottom chart in this series, remainder, is the variation in the time series that is not explained by long term trend or seasonality. These charts illustrate the unexpected peaks and troughs in fever cases.

Model Development

Four models are developed for this project — Ordinary Least Squares (OLS) regression, Negative Binomial regression, Holt-Winters Seasonal forecasting, and ARIMA forecasting.

OLS Regression

The first method attempted is an Ordinary Least Squares multiple linear regression model. A bi-directional step-wise variable selection algorithm is used to select variables that minimize the Akaike Information Criterion (AIC), a metric that seeks to measure the balance between predictive accuracy and overfitting (a phenomenon where the model works really well for the data it knows about and poorly for new data), for each city separately. Then linear models are fit using the resulting parameters.

Iquitos
The AIC minimizing linear regression model for Iquitos uses vegetation levels, dew point, humidity, temperature ranges, and the presence of precipitation outliers to explain variation in the number of fever cases. But the model doesn’t explain much with Adj. R² = .06. The model fits the typical seasonal trends nicely, but does little to call out the epidemics from 2011–2015. The model does suggest that higher dew point temperatures and vegetation southwest of the city predict increased fever incidence, while higher temperature ranges and humidity levels predict lower incidences.

Left: fitted vs. observed values for weekly fever cases in Iquitos using a linear regression model. Right: OLS regression coefficients for the AIC minimizing model in Iquitos.

Analysis of model residuals (the difference between modeled and observed values) illustrates clear violations of the assumptions that residuals should be normally distributed and have homoscedastic relationships with predictors. The model tends to have trouble when dew point and humidity are higher and when precipitation is not an outlier. Additionally, model performance is worse when vegetation levels are lower.

Distribution of model residuals (bottom row) compared to independent variables in the model

San Juan

The AIC minimizing linear regression model for San Juan explains 43% of the variation in the data with eight variables. Predicted values catch some of the spikes in fever incidence, but also raise some false alarms (such as the two over-predictions in 1993 and 1994). Additionally, the fitted values dip below zero at several points in the time series, which is unfortunately impossible for a count of fever cases. Perhaps a negative binomial regression model will be more appropriate given the distribution of the data available and the type of data being modeled.

Left: fitted vs. observed values for weekly fever cases in San Juan using a linear regression model. Right: OLS regression coefficients for the AIC minimizing model in San Juan.

The fitted values already illustrate serious problems with OLS regression applied to prediction of fever cases in San Juan, but residuals plots also illustrate problems with the assumption of homoscedasticity of residuals across the various predictor variables.

Distribution of model residuals (bottom row) compared to independent variables in the model

Negative Binomial Regression

OLS regression relies on a set of assumptions that the distribution of weekly fever cases does not necessarily meet. A more appropriate model would be a generalized linear model like Poisson regression or it’s cousin, negative binomial regression. Both methods are more appropriate for counts variables than OLS ⁴. Again, a bi-directional stepwise variable selection technique is used to minimize AIC across all possible models we could develop with this data set ⁵. This time the regression coefficient represents the log marginal change in total cases. Exponentiated coefficients (to reverse the logarithm) and standard errors are presented for each city.

Iquitos
The negative binomal model tracks actual fever incidence similarly to the pattern in the linear regression model and uses the same coefficients. Again, dew point and southwest vegetation drive the largest marginal effects, but unlike linear regression, all coefficient estimates are positive now

The residuals look a bit better than those from the OLS regression model. While residuals are not normally distributed, there are less clear relationships between errors in the model and the model’s input variables.

San Juan

This time many of the missing data and outlier flags are included in the AIC minimizing negative binomial model. Removing such flags resulted in a model that essentially just tracks the seasonality of fever outbreaks but misses all epidemic spikes. Removing just missing data flags fits the data slightly better. So maybe the absence of data does not really have any relationship with fever outbreaks. The biggest driver of fever outbreaks in this case is the presence of outlier-level temperature ranges.

In this model, residuals increased with increased temperatures and lower precipitation levels.

Holt-Winters Seasonal Method

Exponential smoothing means some number of past values are averaged to project one value into the future. Older values are down-weighted in this average. The Holt-Winters method contains functions for exponential smoothing of trend, level, and seasonality in a time series. This method is promising due to the obvious summer seasonality in fever incidence for each city ⁶. For implementation in the forecastR package, the data has to be condensed to the month level ⁷. Again, models are trained separately for each city, but in this model no external independent variables are considered.

Iquitos
Holt-Winters appears to fit the data quite well in Iquitos with the exception of a few early values that dip below zero and a lag. The pattern of fitted values traces that of the observed values, but slightly after the actual events. That’s like a weather man saying a hurricane is about to make landfall only after days of sustained winds and rain — helpful and probably accurate, but earlier notice is what we’re looking for! Residuals are mostly clustered around zero, but there are some outliers in cases of large spikes in fever incidence.

San Juan
The Holt-Winters model performs similarly in San Juan as it does in Iquitos. This makes sense because the model does not take external variables (like vegetation, temperature, etc) into account so environmental differences between the cities doesn’t impact the Holt-Winters predictions the way they have in previous models. Again, the residuals are mostly normally distributed but have heavy tails on either end of the distribution.

Seasonal ARIMA Models

ARIMA, which stands for Autoregression with Integrated Moving Averages, combines regression of lagged variables with moving averages to produce forecast estimates ⁸. Since this is essentially a regression process, the same independent variables from the OLS models are used as external regressors in the ARIMA models for each city.

Iquitos
The auto.arima() function in R returns an ARIMA(1,0,4) model following a stepwise model selection process. This process is not comprehensive, so there is a chance a better ARIMA model could be developed with more thorough explorations of the possible models, but this appears to work nicely from visual inspection of fitted vs. observed values. The timeline of fitted values traces the observed values as well as the Holt-Winters models did, but without the time lag. Residual values, as a result, are more tightly distributed. There are still outlier events (like the outbreak in 2011), but the model gets closer to fitting those outcomes than not.

San Juan

An ARIMA(1,1,1) model minimizes AIC for San Juan. Again, the fitted values trace the observed values much more nicely than any other model tested. Additionally, the errors produced by this model are fairly small and normally distributed.

Model Evaluation and Selection

Four model types have been attempted: OLS Regression, Negative Binomial Regression, Holt-Winters Seasonal Smoothing, and ARIMA. The DengAI competition mandate is to minimize mean absolute error (MAE). Holt-Winters made the least accurate predictions in each city likely because of the inherent lag in the forecasts from that method. Linear regression made a poor showing in San Juan, which makes sense because the linear regression model really only captured the seasonality of the data and San Juan saw some large spikes in fever outbreaks on several occasions. Failing to capture the epidemics in any way led to large errors there.

The final two models are a split decision. In Iquitos, the ARIMA model performed best whereas the negative binomial regression model best minimized MAE in San Juan. Recall that the initial AIC minimizing negative binomial model included several missing data flags, which were then removed manually to achieve a more sensible fit. It could be the case that missing data did not signal anything in the data and missing data flags should have been excluded from all models. San Juan could also just be harder to make any predictions for due to the large epidemic in 2011.

For practical application, deploying the negative binomial model in San Juan would require accurately forecasting all independent variables and then applying the model going forward. With the ARIMA models, since they were built for time series forecasting, R already has the functionality build in to make those projections. For that reason, ARIMA is the preferred model in San Juan as well as Iquitos despite the moderate improvement in accuracy.

Limitations and Future Work

One glaring limitation of the data is the lack of information on governance and infrastructure. Since dengue fever is transmitted by mosquitoes, it makes sense that a large amount of variance in fever incidence can be explained by environmental variables that make life easier or harder for mosquitoes. That said, data on the number and quality of medical service providers, access to hospitals, the economic health of each city, and other governance and lifestyle factors could be helpful links. Additionally, this project did not attempt an exhaustive list of forecasting methods. Future work could involve a more exhaustive search for the AIC minimizing ARIMA model and development and testing of GARCH, neural network, vector autoregression, and other models from the course.

Conclusion

The goal of this project is to accurately project the spread of dengue fever in two cities in Peru. Four models were fit for each city and evaluated based om mean absolute error. The ARIMA model is used for each city to compile a submission for DengAI.

Citations

¹ https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/page/81/
² https://obamawhitehouse.archives.gov/blog/2015/06/05/back-future-using-historical-dengue-data-predict-next-epidemic
³ https://earthobservatory.nasa.gov/Features/MeasuringVegetation/measuring_vegetation_2.php
⁴ Hoffmann, John P. Generalized Linear Models: an Applied Approach. Pearson/Allyn & Bacon, 2004.
⁵ Zhao, Lili; Wu, Weisheng; Feng, Dai; Jiang, Hui; Nguyen, XuanLong. Bayesian Analysis of RNA-Seq Data Using a Family of Negative Binomial Models. Bayesian Anal., advance publication, 8 April 2017. doi:10.1214/17-BA1055. https://projecteuclid.org/euclid.ba/1491616976
⁶ Gelper, S., Fried, R., & Croux, C. (2010). Robust forecasting with exponential and Holt–Winters smoothing. Journal of Forecasting, 29(3), 285–300.
https://stackoverflow.com/questions/28550208/why-ets-function-in-r-is-not-fitting-a-seasonal-model
⁸ Jarrett, J., & Kyper, E. (n.d.). ARIMA Modeling with Intervention to Forecast and Analyze Chinese Stock Prices. International Journal of Engineering Business Management, 3, International Journal of Engineering Business Management, 2011, Vol.3.
⁹ Hyndman, R.J. and Khandakar, Y. (2008) “Automatic time series forecasting: The forecast package for R”, Journal of Statistical Software, 26(3).