Predicting the number of dengue cases 8 weeks ahead

YS Koh
10 min read · Jan 8, 2020


(This article was written for IoT Datathon 3.0 and prepared by the group “upper outlier”: Chua Xin Ying, Koh Yen Sin, Yang Qian Yu, Dr Zeenathnisa D/O Mougammadou Aribou and Zhang YunJue.)

What is dengue fever

Dengue fever is a mosquito-borne disease transmitted by the Aedes mosquito, whose life cycle is affected by temperature and rainfall (Figure 1). In periods of high temperature, the egg-laying time of the Aedes mosquito decreases, which leads to an increase in the number of Aedes mosquitoes. In addition, the infection is more prevalent in urban areas. As a country with a tropical climate and a metropolitan setting, Singapore has been plagued by dengue fever over the years. As of 28 December 2019, 15,999 dengue fever cases had been reported, five times more than in 2018, and the prevalence is still rising.

Figure 1: Dependency chart

61 dengue clusters were reported to be active, as shown in Figure 2.

Figure 2: Dengue clusters as shown in purple

Why is dengue a problem in Singapore

With climate change and the surge in global temperatures, Singapore is expected to experience more dengue cases in the future. Given the large social, economic, and health burden of dengue, it is important to forecast the number of dengue fever incidences accurately so that resources can be deployed in time to prevent the number of incidences from reaching the epidemic threshold of 190. With this in mind, we sought to build a model that forecasts the number of dengue cases 8 weeks ahead using open-source datasets, such as the weekly dengue fever incidence, weather data and population data.

Data Pre-processing

The datasets given for the challenge were the number of dengue cases per week, the daily temperature and daily rainfall captured by the weather stations, and the mid-year annual population. In addition, monthly mean relative humidity data were extracted from https://data.gov.sg/. Based on the data given, the following variables related to temperature and rainfall were generated:

As most of the weather stations did not have data prior to 2010, the data used to create the model start from 2010 onwards. Population data were interpolated to weekly values, and since the population data were collected mid-year, the data used to generate the model end in June 2019. Monthly mean relative humidity was also interpolated to weekly values.

The dataset was split into a training dataset (2010 to 2016) and a test dataset (2017 to June 2019). The number of dengue cases was normalized using the following formula:

Note: the median, maximum and minimum values were taken from the training dataset

For the test dataset in particular, the data were normalized using the median, maximum and minimum number of dengue cases from the training dataset. This was done because the prediction generated by the model is scaled using the training dataset.
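The exact formula is given in the image above; a median-centred min-max scaling consistent with the note would look like the sketch below (the helper name and numbers are illustrative, not the real data):

```python
def normalize(x, median, x_min, x_max):
    """Median-centred min-max scaling; the statistics come from the
    training dataset only, so the test set reuses the same constants."""
    return (x - median) / (x_max - x_min)

# Training-set statistics (made-up weekly counts for illustration)
train = [120, 250, 842, 90, 310]
median, lo, hi = 250, 90, 842

train_scaled = [normalize(v, median, lo, hi) for v in train]
# A test value is scaled with the *training* constants, never its own
test_scaled = normalize(647, median, lo, hi)
```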

Evaluation of models

We relied on three metrics, in decreasing order of importance, to determine a good model: a lag correlation of 0 or less, predictions that follow the trend of the actual values, and a test loss lower than the persistence loss. Firstly, a lag correlation of 0 or less implies that the model does not simply echo the current values when producing its prediction. Secondly, a model that can forecast the trend is important: the aim of forecasting is to predict the trend rather than the exact values, and knowing about an upward trend is meaningful when deciding whether to implement vector control strategies. With that in mind, we ranked a test loss lower than the persistence loss last, even though a lower test loss implies the model predicts better than merely carrying forward the current number of dengue cases.

The persistence loss, the mean squared error when the current value of dengue cases is used to predict the cases 8 weeks ahead, was calculated using the following formula:

The persistence loss on the test dataset is 0.00457, which serves as the benchmark for assessing whether the models' predictions improve on it.
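In code, the persistence benchmark described above amounts to scoring the current value as the 8-weeks-ahead forecast (a sketch on the normalized series; the function name is ours):

```python
def persistence_loss(series, horizon=8):
    """Mean squared error when the current value is used as the
    forecast for `horizon` steps ahead."""
    errors = [(series[t + horizon] - series[t]) ** 2
              for t in range(len(series) - horizon)]
    return sum(errors) / len(errors)
```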

Exploratory data analysis

Figure 3 shows the trends for the number of dengue cases, mean temperature, mean rainfall, mean relative humidity and population from 2010 to June 2019. For dengue cases, there were 4 peaks with more than 600 cases. These were revealed in Figure 4 to be in the following weeks: the 25th week of 2013 (842 cases), the 27th week of 2014 (819 cases), the 3rd week of 2016 (624 cases) and the 28th week of 2019 (647 cases). Mean temperature shows a cyclical pattern despite the noise, while mean relative humidity and mean rainfall do not reveal any clear pattern. Population depicts a relatively linear increase.

Figure 3: Trend for dengue cases, mean temperature, mean rainfall, mean relative humidity and population
Figure 4: The number of dengue cases for the various years

Cross-correlation was also calculated to determine the strength of correlation between the lagged values of each variable and the target prediction (dengue cases 8 weeks ahead). The table below shows the lagged periods of each variable that correlate strongly with the target prediction, calculated using the ccf() function in R. This allowed us to identify potential features that are good predictors for the model.
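The underlying computation is just a Pearson correlation between a shifted copy of one series and the other, analogous to what R's ccf() reports at each lag (a sketch; the function name is ours):

```python
def lagged_corr(x, y, lag):
    """Pearson correlation between x at time t and y at time t + lag,
    i.e. how well x predicts y `lag` steps later."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

Computing this over a range of lags for each weather variable against the 8-weeks-ahead case counts identifies which lagged windows are worth turning into features.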

XGBoost

Using the dataset, we utilized the XGBoost algorithm to generate our model. XGBoost is an ensemble learning technique that uses the concepts of boosting and bagging. Bagging combines many decision trees to give an accurate and stable prediction by averaging the results of all the trees (Figure 5). Its aim is to reduce variance, meaning the prediction stays consistent if another model is generated from the same training dataset.

Figure 5: Illustration of bagging concept

Boosting, on the other hand, builds decision trees iteratively, such that each subsequent tree learns from and corrects the residual errors of its predecessor trees. The initial decision trees are weak learners with high bias, meaning the difference between the actual and predicted values is large. By combining the important information from all the decision trees, XGBoost generates an overall strong learner that reduces both bias and variance.

Figure 6: Illustration of boosting concept
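The residual-fitting idea behind boosting can be shown with a deliberately toy example: each round fits the simplest possible learner to the current residuals and adds it with shrinkage (real boosting fits a decision tree per round; this sketch and its names are ours):

```python
def boost_constant(y, n_rounds=50, lr=0.3):
    """Toy boosting loop: each round's 'weak learner' is just the mean
    of the residuals, added back with a learning rate (shrinkage).
    The residuals shrink geometrically as rounds accumulate."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        learner = sum(residuals) / len(residuals)  # weakest possible learner
        pred = [p + lr * learner for p in pred]
    return pred
```

With a constant learner the ensemble can only converge to the mean of the targets, which is exactly why real boosting uses trees: each tree can correct different residuals in different regions of the feature space.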

To select the features for prediction, we employed the Python package “tsfresh”, an acronym for “Time Series FeatuRe extraction on basis of Scalable Hypothesis tests”. It automatically extracts features that characterize a time series, such as maximum and minimum values, and filters them based on hypothesis testing. Such a function is appealing because we do not have to hand-craft and test features many times to obtain a good model.
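To make the idea concrete, here are hand-computed versions of a few of the summary statistics tsfresh extracts automatically over a window of the series (the function name is ours; `abs_energy` is one of tsfresh's standard feature names):

```python
def summary_features(window):
    """A few window summaries of the kind tsfresh extracts
    automatically: maximum, minimum, mean, and absolute energy
    (sum of squared values)."""
    return {
        "maximum": max(window),
        "minimum": min(window),
        "mean": sum(window) / len(window),
        "abs_energy": sum(v * v for v in window),
    }

# e.g. weekly case counts over a 4-week lagged window
feats = summary_features([120, 250, 310, 180])
```

tsfresh computes hundreds of such features per window and then keeps only those whose association with the target survives a significance test.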

Using the XGBoost function in Python, the feature importances were generated as shown in Figure 7. The training loss (0.0453) and the test loss (0.0544) are visualized in Figure 8. As the two values are close, the model is unlikely to have overfitted the training data. Figures 9 and 10 show the predictions from the model on the training and test datasets. While the XGBoost model is able to forecast the peaks at the correct timing, there appears to be a lag in the later part of the test prediction. This is further supported by the lag correlation plot (training and test lag = 1) in Figure 11.

Figure 7: Feature Importance plot from tsfresh package
Figure 8: The graph of loss function from training and test dataset
Figure 9: Prediction from the training dataset. Original is the actual values 8 weeks after. Baseline is the forecast by using the number of dengue cases. Predicted is the forecast from the model.
Figure 10: Prediction from the test dataset. Original is the actual values 8 weeks after. Baseline is the forecast by using the number of dengue cases. Predicted is the forecast from the model.
Figure 11: Lag correlation plot
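The lag diagnostic behind Figure 11 can be reproduced by finding the lag at which the predictions correlate most strongly with the actual values (a sketch; the function name is ours):

```python
def best_lag(actual, predicted, max_lag=8):
    """Return the lag at which the predictions correlate most strongly
    with the actual values. A best lag > 0 means the model is
    effectively echoing recent observations rather than forecasting."""
    def corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)
    scores = {}
    for lag in range(max_lag + 1):
        if len(actual) - lag < 3:  # need enough overlap to correlate
            break
        scores[lag] = corr(actual[: len(actual) - lag], predicted[lag:])
    return max(scores, key=scores.get)
```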

Improving the model above

As the model above appears to lag on the test dataset prediction, we attempted to manipulate the available features manually to see whether we could reduce the lag. Using the xgb.train() function in R, we included the following features in the final model:

To further reduce overfitting, L2 regularization was used and the learning rate was set to 0.03. In addition, we forecast the difference between the current dengue cases and the dengue cases 8 weeks ahead, as this gives a better forecast, especially since the spread of dengue depends on weather seasonality. The difference is later added back to the current dengue cases to generate the training and test prediction plots for easier interpretation.
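The differencing step described above can be sketched as follows (helper names are ours): the model is trained on the change over the 8-week horizon, and the forecast count is recovered by adding the predicted change back to the current count.

```python
def to_diff_target(cases, horizon=8):
    """Build the differenced target: cases `horizon` weeks ahead
    minus the current cases."""
    return [cases[t + horizon] - cases[t]
            for t in range(len(cases) - horizon)]

def from_diff_prediction(current_cases, predicted_diff):
    """Add the predicted difference back to the current count to
    recover the forecast number of cases for plotting."""
    return current_cases + predicted_diff
```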

The figure below shows the variable importance plot. The top 5 variables are: dengue fever/population, dengue fever/population 1 week before, mean relative humidity, the average maximum temperature 11 to 20 weeks before, and dengue fever/population 4 weeks before.

Figure 12: Variable importance plot

The training loss is 0.00148 and the test loss is 0.00460. The close proximity of the two values indicates that the model is unlikely to overfit. Even though the test loss is slightly higher than the persistence loss and Figure 14 shows that the forecast is not perfect, the plot does show that the model can predict an increasing trend to a certain extent, particularly in the later part of the test prediction graph. In addition, the lag correlation of both the training and test datasets is 0, as determined by the ccf() function in R.

Figure 13: Training dataset prediction
Figure 14: Test dataset prediction

Evaluation of our final model

One great advantage of tree-based models is that they reveal feature importance, and our models were then restricted to the features most critical for predicting dengue cases. We also attempted to use neural networks for this forecasting problem; although we were able to beat the persistence loss, the predictions continued to have a lagged correlation of 1–2 weeks. Furthermore, neural networks cannot delineate feature importance the way tree-based models do. Hence, we chose to proceed with the tree-based model.

Of the two models above, we chose the second as our final model. Even though its test loss is marginally higher than the persistence loss, it forecasts the trend without lag, which allows us to determine whether the number of cases will be high during a given period. In both models, we observed that it is difficult to obtain a test loss lower than the persistence loss, and a good prediction of the trend is sufficient to support decisions on keeping cases below the threshold of 190.

Although our final model can forecast a potential increase in dengue cases, it is not without limitations, and there are risks in accepting the forecast. Firstly, the dataset is quite small, so it was difficult to set aside a validation dataset to show that the model can forecast the trend on novel data. Our model could have been over-optimized for the test dataset, especially since we tried many times to generate a model. Secondly, the features generated may not be sufficient for the forecast, which may be why our model did not achieve a test loss lower than the persistence loss. Other studies have shown that population density and traffic conditions also contribute to the spread of dengue fever; future models may consider utilizing this information if it becomes available.

In conclusion, our final model is able to predict the upward trend of dengue cases in Singapore and, importantly, without lag. An upward trend in dengue cases can be predicted 8 weeks in advance, and this information can be combined with cost-benefit analyses by domain experts to determine the epidemic threshold before vector control strategies are implemented.

(Side-note: in the process, we developed an app using this algorithm to visualize the variable importance plot and the model. It was created with RShiny and is hosted at https://koh-yen-sin131.shinyapps.io/XGBoost_dengue/. The results may take a while to load, and the features used in the second model are included by default. Another app using random forest was also created, as its variable importance plot, and a quick check of whether the features could produce a decent model, can be generated more quickly: https://koh-yen-sin131.shinyapps.io/random_forest_dengue/. We want to show that feature selection is indeed not an easy process.)
