Noble Tomy Padiyara, NUS Business Analytics (co-sponsored by the School of Computing and NUS Business School), National University of Singapore
Tingyu Qu, NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore
Dengue infection rates in Singapore have risen abruptly in recent years. The underlying cause and carrier of the disease are the dengue virus and the Aedes mosquito respectively, and the abrupt spikes and dips in dengue rates in South East Asia, and in Singapore specifically, have been studied by a number of researchers.
Because many factors are involved in the development, propagation and increase of dengue virus concentration, such as abrupt changes in climate and population growth, developing a time-sensitive dengue prediction system is still a global challenge. A better forecasting model that captures the underlying relationships between climatic factors and dengue is therefore needed.
Our study aims to develop a time-sensitive dengue forecasting model that can predict dengue cases 8 weeks or more in advance with high accuracy, based on historical dengue cases, regional temperature and rainfall samples, and population information in Singapore.
1. Domain Knowledge About the Dengue Issue in Singapore
1.1 Mosquito Life Cycle
According to studies from the National Environment Agency of Singapore, under optimal conditions in Singapore the breeding cycle of a mosquito takes only around 7–10 days, after which the adult mosquito starts to take blood from people, lays its eggs, and the cycle repeats.
Thus, if the window length of the input features matches the mosquito life cycle, the periodicity of dengue outbreaks might be predicted with higher accuracy.
1.2 Temperature and Rainfall
Many studies highlight the impact of temperature on infectious diseases such as dengue[2–4]. It is expected that the variation of maximum and minimum temperature plays a crucial role in the increase of dengue cases. The cumulative rainfall, on the other hand, shows high correlations with the seasonal peak of dengue[4–5].
Thus, in feature engineering we look at the maximum temperature and the sum of the rainfall in each selected window. We also input the mean of both temperature and rainfall over a longer window to filter out noise.
1.3 Non-linear Effects of Mean Temperature and Rainfall
Several studies show that the non-linear impacts of climate on dengue risk should be considered for future dengue forecasting[6–7].
In this project, we look at the impact of the interaction between temperature and rainfall on dengue cases by using the product of the weekly temperature and rainfall in each region as an input feature. Compared with using higher-order terms of the climate data, the cross-product of temperature and rainfall offers a relatively simple way to reflect the non-linear effect of climate on dengue cases.
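As a minimal sketch of this interaction feature, the Python snippet below multiplies weekly temperature by weekly rainfall for one region. The function name and the sample values are hypothetical and are not taken from the project's Smojo code.

```python
def interaction_feature(temperature, rainfall):
    """Element-wise product of two weekly series of equal length."""
    if len(temperature) != len(rainfall):
        raise ValueError("series must have the same length")
    return [t * r for t, r in zip(temperature, rainfall)]


# Hypothetical weekly values for one region (degrees Celsius, millimetres).
weekly_temp = [31.0, 32.0, 30.0]
weekly_rain = [10.0, 0.0, 25.0]
print(interaction_feature(weekly_temp, weekly_rain))
```

A hot but dry week contributes nothing to this feature, which is the kind of joint effect that neither series captures alone.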
Both the population size and growth rate have an important influence on dengue cases[8–9]. In Singapore, the frequency and magnitude of dengue epidemics have increased significantly over the past 40 years; therefore, it is very important to understand how the population drives the rapid increase of dengue cases in Singapore.
In this project, we look at both the mean and derivative of population during certain periods and select the best features after rigorous experimentation.
2. Data Extraction and Engineering
2.1 Data Extraction
The raw data were provided by Terra AI and were compiled from data published by the National Environment Agency and the Ministry of Health.
Both data extraction and modelling in this project use the Smojo programming language developed by Terra AI, which is accessible to all users and powers data cleaning, fusion, reporting and analytics work.
The data set contains 20 temperature profiles and 62 rainfall profiles (at daily intervals), 1 population profile (at yearly intervals) and 1 dengue-case profile (at weekly intervals) collected in Singapore. All the temperature and rainfall data are aggregated from daily to weekly intervals, using the dengue-case profile as the driver. The population data are interpolated linearly from yearly to weekly intervals, again using the dengue-case profile as the driver. As a result, each profile contains 1038 weekly values in a time-series sequence running from 15/Jan/2000 to 23/Nov/2019.
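The alignment described above can be sketched in plain Python. This is an illustration, not the Smojo pipeline. It assumes that each week covers the 7 days starting at a dengue reporting date and that each yearly population figure is dated 1 January. All names and the population figures are hypothetical.

```python
from datetime import date, timedelta


def aggregate_weekly(daily, week_starts, reducer):
    """Reduce daily {date: value} records to one value per week.

    Days without a record are skipped; a week with no records gives None.
    """
    weekly = []
    for start in week_starts:
        days = [start + timedelta(days=i) for i in range(7)]
        values = [daily[d] for d in days if d in daily]
        weekly.append(reducer(values) if values else None)
    return weekly


def interpolate_yearly(yearly, when):
    """Linearly interpolate yearly {year: value} figures to a given date."""
    year = when.year
    if year + 1 not in yearly:
        return float(yearly[year])
    start, end = date(year, 1, 1), date(year + 1, 1, 1)
    fraction = (when - start).days / (end - start).days
    return yearly[year] + fraction * (yearly[year + 1] - yearly[year])


# Hypothetical inputs: 1 mm of rain every day for two weeks.
first_week = date(2000, 1, 15)
rain = {first_week + timedelta(days=i): 1.0 for i in range(14)}
weeks = [first_week, first_week + timedelta(days=7)]
print(aggregate_weekly(rain, weeks, sum))

population = {2000: 4000.0, 2001: 4100.0}  # made-up figures
print(interpolate_yearly(population, date(2000, 7, 2)))
```

Passing `sum` as the reducer suits rainfall, while `max` or a mean would suit temperature. The source does not state which reducer was used for each profile.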
2.2 Data Preparation
The percentage of missing values in each profile was tabulated. A number of features, especially the rainfall profiles, had a large share of missing values, up to sixty percent, as shown in Figure 1.
Figure 1 | Percentage of the missing values in all the samples of the climate profiles.
2.2a Basic Correlation analysis
The Pearson correlation of each feature with the label did not show any feature with a strong, direct correlation with the label. However, there is a clear seasonality in both the temperature and rainfall profiles, as expected from our domain knowledge.
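For reference, the Pearson coefficient can be computed as in the sketch below. The toy series are made up and only illustrate why a weak Pearson value does not rule out a non-linear dependence, which is the motivation for the engineered features used later.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / (spread_x * spread_y)


# A perfectly linear relationship versus a purely quadratic one.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(pearson(x, [2 * v + 1 for v in x]))
print(pearson(x, [v * v for v in x]))
```

The quadratic series depends entirely on `x`, yet its Pearson coefficient is zero because the dependence is not linear.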
2.2b Up Sampling and Reasons
Because of the large number of missing values and the limited records of dengue cases, we suspected that the data might not be sufficient for analysis, especially in the earlier records. The temperature profiles were therefore up sampled using the “spline” method, which conserves the seasonality and replaces all missing values with good approximations. The same approach does not approximate the rainfall profiles well, but they can be fitted well using a backward moving average, which gives comparable rainfall data for each station. In both cases other methods such as the mean, median and mode were also tried, and the interpolations described above gave the values most comparable to the prevailing climate of Singapore.
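One way to read "backward moving average" is a trailing mean over the most recent available values. The sketch below implements that reading in plain Python. The window length of 4 and the function name are assumptions, and the project's actual procedure may differ.

```python
def fill_with_trailing_mean(series, window=4):
    """Replace None entries with the mean of up to `window` preceding values.

    Values filled earlier count as preceding values, so long gaps are
    bridged. Leading gaps with no history stay None.
    """
    filled = []
    for value in series:
        if value is None:
            history = [v for v in filled[-window:] if v is not None]
            value = sum(history) / len(history) if history else None
        filled.append(value)
    return filled


# Hypothetical weekly rainfall (mm) with two missing readings.
rainfall = [2.0, 4.0, None, 6.0, None]
print(fill_with_trailing_mean(rainfall, window=2))
```

Because rainfall is bursty rather than smooth, a local average of this kind avoids the overshoot that a spline can produce between two wet weeks.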
2.2c No More Missing Values
Using the approach above, the missing values in all temperature and rainfall profiles were replaced with good approximations, and the whole data set was resampled to a weekly sequence for analysis. The stationarity of the time series was also analysed and confirmed using the Augmented Dickey–Fuller test and the Kwiatkowski–Phillips–Schmidt–Shin test.
2.3 Feature Importance using Random Forest
A random forest classifier using the Gini criterion was used to assess the features, and the resulting feature importance scores were plotted and marked.
Across different trials involving hyperparameter tuning, the temperature features overwhelmingly dominated the feature importance list, showing that temperature features are more significant than rainfall features for dengue prediction, as shown in Figure 2. As indicated by our analysis and illustrated in various scientific journals[2–4, 6–7], gradual rises in temperature over a few weeks correlate closely with dengue outbreaks, so the temperature profiles should carry more weight in feature engineering. The population of Singapore is the highest-ranked feature; however, domain knowledge indicates that the Singapore population has always been increasing. Hence, despite its ranking, the population cannot be considered the most important feature, since it does not explain the swings in dengue cases. The weight of this feature should therefore be penalized in feature engineering.
Figure 2 | Analysis of feature importance. From left to right, the first 20 columns refer to the 20 temperature profiles from different regions in Singapore, then the following 62 columns refer to the rainfall profiles and the last column refers to the population.
From the feature importance plot, it is evident that population shows up as an important metric for dengue outbreaks. This aligns with existing literature, which overwhelmingly suggests that fast-growing urban agglomerations are hot spots for dengue outbreaks.
Further, the feature importance plot also points at the relatively higher importance of temperature parameters over rainfall parameters.
2.4 Feature Engineering
When using a neural network, the ‘curse of dimensionality’ should always be kept in mind to avoid incorrect analysis. It is therefore imperative to select an appropriate number of features before modelling.
We therefore had to decide how many features to select. This was done in several steps.
First, we decided that rainfall features cannot be excluded from the neural network, since both temperature and rainfall are important factors in the dengue propagation cycle. Any selection of fewer than 20 features is therefore ruled out, because the 20 most important features in the feature importance plot are all temperature features.
Secondly, we did a cartographic analysis of Singapore and concluded that, because it is a small island, the variation in rainfall recorded at weather stations across Singapore is minimal. We therefore grouped together seven important rainfall features that best represent the whole of Singapore.
Thirdly, we normalized the entire data set to values between -1 and 1, since neural networks perform best with normalized data. The training and testing data sets cover the periods 2003–2017 and 2018–2019 respectively.
Fourth, instead of sending every temperature and rainfall feature to the neural network, we created two windows of data. The major window spans 13 weeks and lets the neural network capture the seasonality of the data. The sub-window spans 4 weeks to mimic the entire breeding and propagation cycle of the Aedes aegypti mosquito.
Deep Learning Neural Net Flow Chart
Finally, each window consists of a single temperature feature, which is the maximum among the 20 temperature features, and a single rainfall feature, which is the sum of the 7 rainfall features. In addition, the mean of the temperature and rainfall interaction (their product, see Section 1.3) is included as another feature to mimic the non-linear relationship between temperature, rainfall and dengue propagation. Instead of the population as a number, the population growth is taken as a feature, since existing literature already points to fast-growing urban agglomerations as potential dengue flash points.
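The normalisation and per-window reductions described in these steps can be sketched in plain Python. The function names, the station data and the scaling bounds are all hypothetical. The source does not say whether the bounds for the -1 to 1 scaling were taken from the training period only or from the whole data set.

```python
def scale_to_unit_range(value, lo, hi):
    """Map the interval [lo, hi] linearly onto [-1, 1]."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def window_max_temperature(temp_by_station, first, last):
    """Highest reading across all stations for weeks first..last (inclusive)."""
    return max(max(series[first:last + 1]) for series in temp_by_station)


def window_total_rainfall(rain_by_station, first, last):
    """Rainfall summed over all stations for weeks first..last (inclusive)."""
    return sum(sum(series[first:last + 1]) for series in rain_by_station)


# Hypothetical data: 2 temperature and 2 rainfall stations, 4 weeks each.
temps = [[30.0, 31.5, 32.0, 30.5],
         [29.5, 33.0, 31.0, 30.0]]
rains = [[5.0, 0.0, 12.0, 3.0],
         [4.0, 1.0, 10.0, 2.0]]

hottest = window_max_temperature(temps, 0, 2)
wettest = window_total_rainfall(rains, 0, 2)
print(hottest, wettest)
print(scale_to_unit_range(hottest, lo=25.0, hi=35.0))
```

Reducing 20 temperature stations to one maximum and 7 rainfall stations to one sum per window keeps the input dimension small, which is the point of the feature selection steps above.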
Our neural network is a differencing model: the output of the neural network is added to the average of the cases at T-0, T-1, T-2 and so on up to T-8 to make a prediction 8 weeks in advance. We average the values from T-0 to T-8 to reduce the noise in the series while keeping the high and low values intact. We are thus able to compare the values 8 weeks later with a noise-reduced, averaged time series equivalent to the T-4 values.
We used the Smojo programming language for our modelling. The meaning of each input feature is summarized below.
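A minimal sketch of this differencing scheme, in Python for illustration, is given here. The names `baseline` and `forecast` are hypothetical, and `network_output` stands in for whatever the trained network returns.

```python
def baseline(cases, t, span=9):
    """Mean of the `span` most recent weekly values up to and including week t."""
    if t < span - 1:
        raise ValueError("not enough history before week t")
    window = cases[t - span + 1:t + 1]
    return sum(window) / len(window)


def forecast(cases, t, network_output):
    """Differencing-style forecast: smoothed baseline plus the predicted change."""
    return baseline(cases, t) + network_output


# On a straight-line series the 9-week mean equals the value at week t - 4.
cases = [10.0 + week for week in range(20)]
print(baseline(cases, 8), cases[4])
print(forecast(cases, 8, network_output=2.5))
```

The network therefore only has to learn the change relative to a smoothed recent level, which is usually an easier target than the raw case count.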
The code with instructions in Smojo are as follows.
A:9:7 MEAN \ label, avg for noise removal
A:8 \ test, should not be smoothed
A:0:-8 MEAN \ y0, avg for noise removal
A:0:-12 \ feature dengue-sg
A:0:-12 MEAN \ feature dengue-sg-avg
B:0:-2 MAX \ feature temp-max
B:-3:-5 MAX \ feature temp-max
B:-6:-8 MAX \ feature temp-max
B:-9:-12 MAX \ feature temp-max
B:0:-12 MEAN \ feature temp-avg
B:0:-12 RANGE \ feature temp-range
C:0:-2 SUM \ feature rainfall-avg
C:-3:-5 SUM \ feature rainfall-avg
C:-6:-8 SUM \ feature rainfall-avg
C:-9:-12 SUM \ feature rainfall-avg
C:-0:-12 MEAN \ feature rainfall-avg
D:0:-2 MEAN \ feature temp-rainfall-avg
D:-3:-5 MEAN \ feature temp-rainfall-avg
D:-6:-8 MEAN \ feature temp-rainfall-avg
D:-9:-12 MEAN \ feature temp-rainfall-avg
E:0:-12 DIFF DIFF MEAN \ feature population rate
3. Training and Prediction
3.1 Neural Network Specifications
After repeated trials and hyperparameter tuning, we found that a network with 5 layers works best for our model. If the network is too shallow, the model does not fit the actual data well, while if the network is too deep, the model tends to memorise the training samples rather than generalise to good predictions. We find that the optimum layer-by-layer shrinking factor is around ⅔ and that the ideal number of perceptrons in the first layer is around 30 to 40 for our model.
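To make the shrinking factor concrete, the sketch below lists layer widths for a first layer of 36 perceptrons, an arbitrary value inside the 30 to 40 range. It assumes that all 5 layers shrink by the same factor and ignores how the final output unit is attached, which the text does not specify.

```python
def layer_sizes(first=36, shrink=2 / 3, layers=5):
    """Widths of successive layers, each about `shrink` times the previous."""
    sizes = [first]
    for _ in range(layers - 1):
        sizes.append(max(1, round(sizes[-1] * shrink)))
    return sizes


print(layer_sizes())
```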
3.2 Test Loss
The persistence loss on our normalized dengue cases is 0.0082. The best test loss from 50 repeats is 0.0045, as shown in Figure 3.
As the figure shows, we beat the persistence loss by a relative margin of 45%. We achieved a margin of more than 45% in subsequent tests but limited ourselves to this test loss, because we had to make a choice: improving the test loss further would mean forfeiting the “zero” lag claim, and ensuring zero lag means we cannot improve the loss margin beyond 45%.
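The 45% figure follows from the two losses quoted above. The sketch below also shows one common definition of a persistence baseline, which carries the current value forward over the forecast horizon and scores it with mean squared error. Whether the project used exactly this definition is an assumption.

```python
def mse(predicted, actual):
    """Mean squared error between two equal-length sequences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def persistence_loss(series, horizon=8):
    """Loss of the naive forecast that value(t + horizon) equals value(t)."""
    return mse(series[:-horizon], series[horizon:])


def relative_margin(persistence, model):
    """Fraction by which the model loss undercuts the persistence loss."""
    return (persistence - model) / persistence


print(round(relative_margin(0.0082, 0.0045), 3))
```

A model only adds value over the naive forecast when this margin is positive.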
Judgement Call based on Problem Specification
Our judgement call was that “zero” lag is more important than a reduction of a few percentage points in test loss. The reasoning is that if we can accurately predict the spike in a dengue outbreak with zero lag, the authorities can focus and use their resources optimally in that precise time period.
The training and test predictions vs actual data are shown in Figure 4.
Figure 4 | Train prediction and test prediction. The model fits almost all of the actual training data, while on the test data it predicts the timing of the peaks and valleys almost accurately despite the slightly erroneous fitting.
3.4 Lag Correlation
Figure 5 | Lag correlations. Both the training and testing data show zero lag.
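A lag check of this kind can be sketched as follows: correlate the prediction with the actual series at several shifts and see which shift gives the highest correlation. The function names and the toy series are hypothetical. A forecast that merely echoes recent values would peak at a positive lag instead of zero.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                       * sum((b - mean_y) ** 2 for b in y))
    return cov / spread if spread else 0.0


def lag_correlations(actual, predicted, max_lag=3):
    """Correlation of predicted[t + lag] with actual[t] for each lag."""
    n = len(actual)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, p = actual[:n - lag], predicted[lag:]
        else:
            a, p = actual[-lag:], predicted[:n + lag]
        result[lag] = pearson(a, p)
    return result


def best_lag(actual, predicted, max_lag=3):
    """Lag at which the prediction lines up best with the actual series."""
    scores = lag_correlations(actual, predicted, max_lag)
    return max(scores, key=scores.get)


# Toy series: one prediction is on time, the other trails by two weeks.
actual = [0.0, 1.0, 4.0, 9.0, 3.0, 0.0, 2.0, 7.0, 1.0, 0.0]
late = [0.0, 0.0] + actual[:-2]
print(best_lag(actual, actual), best_lag(actual, late))
```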
3.5 Actual VS Model
As shown in Figure 6, the relationship between the actual data and our model output is almost linear, indicating accurate timing of the predicted peaks and valleys of dengue cases in Singapore.
Figure 6 | Actual VS model. The X axis corresponds to the training data and Y axis corresponds to the test data. Both the train and test models show a high correlation with actual data.
Our model was able to answer whether a dengue outbreak is likely and in which future time period such an outbreak is highly likely. The model can be further improved to pinpoint the locations of a potential dengue outbreak. However, this requires location-wise data from each part of Singapore, especially from clusters where the most dengue cases are reported. In our exercise, we were provided with only one dengue feature, which was not sufficient for an accurate location-wise analysis of a dengue outbreak.
5. Lessons Learned
It became quite evident that without sufficient domain knowledge, the analysis of time series data is unlikely to give fruitful results. We therefore consulted open research articles on dengue, rainfall and temperature to understand the problem and its associated factors.
In addition, we consulted the historical weather data archives of Singapore to understand how the average temperature has spiked over the years. For example, the El Niño phenomenon was responsible for brief spikes in average temperature in years such as 2000 and 2016. We also noted that the average temperature of Singapore in 2019 is actually lower than in 2016. Such information, coupled with other research articles, gave us a good starting point for our analysis.
Throughout the exercise, it quickly became evident that data preparation takes the most time and requires careful analysis of each feature. A one-size-fits-all approach never works on a complex data set where the features behave independently of each other or have a non-linear relationship.
The need for careful resampling of the data was also apparent, since the existing data set was insufficient to build a model for an 8-week forecast of dengue outbreaks. While up sampling, we learned that due care must be taken so that the resampled values do not end up being implausible. For example, when up sampled with method = “linear” and “order = 2”, the resampled temperatures often shot up to unrealistic values that do not occur in South East Asia. The need to conserve seasonality was evident while resampling and replacing missing values. Although the result is not perfect, seasonality was always taken into consideration, mindful of the changing seasons, rainfall patterns and temperature patterns in Singapore.
We really appreciated using Smojo as the main tool for all the neural network modelling and predictions. Data extraction through Smojo, including fusion, aggregation and interpolation, is quite efficient. Generating a long window with multiple time-series features in Smojo is also very convenient.
We really enjoyed the modelling process. Inspired by our domain knowledge, we selected windows with a length of 13 weeks, which reflects the seasonality of the climate variations in Singapore. We also created sub-windows with a length of 4 weeks to mimic the breeding cycle of a mosquito. We emphasized the average temperature and cumulative rainfall over a certain past period, while also looking at the interaction of the mean temperature and rainfall (their numerical product). Through many tests and comparisons, we found that our approach worked better than approaches without a deep understanding of the nature of the dengue problem in Singapore.
Last but not least, the entire exercise involved a lot of trial and error in data preparation, feature engineering and modelling, and through its ups and downs we learned how to approach end-to-end data analysis.
We hope that at least one input from our analysis can help the scientists and researchers who are currently working on dengue, and we look forward to feedback from experts in this domain.
Watts, Douglas M., et al. “Effect of temperature on the vector efficiency of Aedes aegypti for dengue 2 virus.” The American journal of tropical medicine and hygiene 36.1 (1987): 143–152.
Lambrechts, Louis, et al. “Impact of daily temperature fluctuations on dengue virus transmission by Aedes aegypti.” Proceedings of the National Academy of Sciences 108.18 (2011): 7460–7465.
Pinto, Edna, et al. “The influence of climate variables on dengue in Singapore.” International journal of environmental health research 21.6 (2011): 415–426.
Chen, Szu-Chieh, and Meng-Huan Hsieh. “Modeling the transmission dynamics of dengue fever: implications of temperature effects.” Science of the total environment 431 (2012): 385–391.
Xu, Hai-Yan, et al. “Statistical modeling reveals the effect of absolute humidity on dengue in Singapore.” PLoS neglected tropical diseases 8.5 (2014): e2805.
Naish, Suchithra, et al. “Climate change and dengue: a critical and systematic review of quantitative modelling approaches.” BMC infectious diseases 14.1 (2014): 167.
Esteva, Lourdes, and Cristobal Vargas. “A model for dengue disease with variable human population.” Journal of mathematical biology 38.3 (1999): 220–240.
Twiddy, S. Susanna, Edward C. Holmes, and Andrew Rambaut. “Inferring the rate and time-scale of dengue virus evolution.” Molecular biology and evolution 20.1 (2003): 122–129.
Struchiner, Claudio Jose, et al. “Increasing dengue incidence in Singapore over the past 40 years: population growth, climate and mobility.” PloS one 10.8 (2015): e0136286.