Prediction of epidemic disease dynamics using machine learning

Abhijit Gupta
Intel Student Ambassadors
Feb 1, 2019


Reliable predictions of infectious disease dynamics are valuable to public health organizations that plan interventions to reduce or prevent disease transmission. As data volumes grow in the healthcare and biomedical sectors, accurate analysis of that data can support early disease detection and better patient care, and the computational power now available makes it practical to use this ‘big data’ to predict and manage epidemic outbreaks. Our idea is to analyse the spread of epidemic diseases in villages and suburban areas, where healthcare might not be readily available. We want to build a machine learning model that predicts epidemic disease dynamics and tells us where the next outbreak is most likely to occur. Our approach takes into account the geography, climate and population distribution of an affected area, since these features contribute to epidemic disease dynamics. The model would help healthcare authorities take appropriate action: ensuring that enough resources are available to meet the need and, where possible, curbing the outbreak.

Link to GitHub repository

Use of Intel Technology: Intel Distribution for Python, Intel Parallel Studio XE, Intel VTune Amplifier, AWS c5.2xlarge Intel instance

Broad Objective:

  • Curb preventable disease-related suffering
  • Minimize the financial burden on governments and healthcare systems by giving them first-hand information about outbreak-prone areas and the causative agents behind the spread of an epidemic
  • Given an area where an epidemic outbreak has occurred, identify the areas most prone to the next outbreak and the features that contribute most to its spread

Topic of case study: 2015–2016 Zika virus epidemic

Why Zika?

● The Zika Data Repository maintained by the Centers for Disease Control and Prevention (CDC) contains publicly available data on the Zika epidemic, with enough data to build and test our model.

● Epidemics of infectious disease are generally caused by several factors including a change in the ecology of the host population, change in the pathogen reservoir or the introduction of an emerging pathogen to a host population.

● The feature vectors in our model are general enough to be adapted, with slight changes, to study any epidemic disease.

Implementation Details

We used Intel Distribution for Python* and the Python API for Intel® Data Analytics Acceleration Library (Intel® DAAL), named PyDAAL, to boost machine-learning and data analytics performance. Using the optimized scikit-learn* (scikit-learn with Intel DAAL) that ships with the distribution, we achieved good results on the prediction problem.

Data Sources

The web scraping scripts retrieve weather data, population density, vector agent population, economic profile, etc. and construct a DataFrame, which is then fed into our ML model.
  • Zika Data Repository maintained by the Centers for Disease Control and Prevention, which contains publicly available data on the Zika epidemic
  • Google Geolocation API for obtaining the latitude and longitude of places associated with an outbreak
  • Worldwide airport location data retrieved from Falling Rain
  • Weather data scraped by nearest airport code
  • Population density of different regions, extracted from a gridded map via NASA (SEDAC)
  • Vector agent (Aedes albopictus, Aedes aegypti) occurrences from The global compendium of Aedes aegypti and Ae. albopictus occurrence
  • GDP / GDP PPP data from the IMF World Economic Outlook
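
As a rough illustration of how a location can be matched to its nearest airport before weather data is looked up, here is a minimal sketch using the haversine formula. The coordinates are approximate and purely illustrative, and the column names (`airport_code`, `airport_dist_km`) are assumptions rather than the names used in the repository.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, NumPy-broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical sample rows; real data would come from the scraped sources.
places = pd.DataFrame({
    "location": ["place_a", "place_b"],
    "lat": [-22.9, 4.7],
    "lon": [-43.2, -74.1],
})
airports = pd.DataFrame({
    "code": ["GIG", "BOG", "MIA"],
    "lat": [-22.81, 4.70, 25.79],
    "lon": [-43.25, -74.15, -80.29],
})

# Distance matrix of shape (n_places, n_airports) via broadcasting.
dist = haversine_km(
    places["lat"].to_numpy()[:, None], places["lon"].to_numpy()[:, None],
    airports["lat"].to_numpy()[None, :], airports["lon"].to_numpy()[None, :],
)
nearest = dist.argmin(axis=1)
places["airport_code"] = airports["code"].to_numpy()[nearest]
places["airport_dist_km"] = dist.min(axis=1)
```

The resulting airport code could then serve as the key for fetching weather records, and the distance itself can be kept as a feature.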

The repository contains Jupyter notebooks that implement methods pertinent to cleaning and munging data.

The evaluation outcome is the likelihood that an area will have an outbreak.


Preprocessing and adjusting class imbalance:

Data pre-processing covers the transformations applied to the data before it is fed to the algorithm. Since a few of the variables in the dataset are categorical, they must be converted to numerical variables. For the Zika cases reported in the CDC database in particular, a large class imbalance became apparent during preliminary analysis.

This was due in part to the fact that most locations did have outbreaks and most of these outbreaks were ongoing (present at all dates) throughout the span of the available data.

To remedy this, we built and tested two frameworks for balancing the classes and making prediction easier:

Framework A: Locations used in the non-outbreak class were those that had never had an outbreak, and the feature information from the first available date was used for this data. The outbreak class was defined as a location that had an outbreak at any time during the analyzed dates. For these locations, features from two different dates were tested: those from the date at which the outbreak began and those from the date at which the outbreak had reached its maximum level (during the span of data collection). These two data sets were called framework_a_first and framework_a_max, respectively.

Framework B: Only data from locations that had an outbreak were used. For the non-outbreak class, features were taken from the first available date for each location, on the assumption that no outbreak was present then. For the outbreak class, features were taken from the time series of the same locations, at either the date the outbreak began or the date it reached its maximum level (during the span of data collection). These two data sets were called framework_b_first and framework_b_max, respectively.

We found that framework_a_max gave the best result of the four. Intuitively, the maximum is the point at which the disease can really be called an epidemic: once the spread crosses a threshold, it moves rapidly into nearby areas, which is our primary concern.
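
To make Framework A concrete, the following is a minimal sketch of how the two variants could be derived from a long-format table of case counts. The column names, the toy data and the rule "an outbreak is any date with more than zero cases" are assumptions made for illustration; the notebooks in the repository may define these differently.

```python
import pandas as pd

# Hypothetical long-format case counts: one row per (location, date).
cases = pd.DataFrame({
    "location": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "date": pd.to_datetime(["2016-01-01", "2016-02-01", "2016-03-01"] * 3),
    "zika_cases": [0, 3, 9, 0, 0, 0, 2, 7, 4],
})

def framework_a(df, pick="max"):
    """Return one row per location, labelled outbreak (1) or not (0)."""
    rows = []
    for _, grp in df.sort_values("date").groupby("location"):
        # Assumed outbreak rule: any date with a non-zero case count.
        positive = grp[grp["zika_cases"] > 0]
        if positive.empty:
            # Never had an outbreak: use the first available date.
            row = grp.iloc[0].copy()
            row["outbreak"] = 0
        else:
            if pick == "first":
                row = positive.iloc[0].copy()  # date the outbreak began
            else:
                row = positive.loc[positive["zika_cases"].idxmax()].copy()
            row["outbreak"] = 1
        rows.append(row)
    return pd.DataFrame(rows).reset_index(drop=True)

framework_a_first = framework_a(cases, pick="first")
framework_a_max = framework_a(cases, pick="max")
```

In a real pipeline each selected row would also carry the weather, population and airport features for that location and date.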

Feature Selection

Data sets may contain irrelevant or redundant features that make the machine-learning model more complicated. In this step, we remove irrelevant features that would increase run time, generate overly complex patterns, etc. The resulting subset of features is used for further analysis. Feature selection can be done with either the Random Forest or the XGBoost algorithm. In our project, XGBoost is used to select the features whose score is above a predefined threshold value. Our findings agree with the literature on the Zika epidemic [1]: temperature, rainfall, proximity to mosquito breeding areas, population density and vicinity to other places with large human populations (measured via airport_dist_large) play a significant role in the spread of the epidemic.
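
Threshold-based selection of this kind can be sketched with scikit-learn's SelectFromModel. To keep the example dependent on scikit-learn only, it uses the Random Forest option mentioned above rather than XGBoost; the synthetic data, the feature names and the "mean importance" threshold are assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 3 informative features out of 10.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = np.array([f"feature_{i}" for i in range(X.shape[1])])

forest = RandomForestClassifier(n_estimators=200, random_state=0)
# Keep features whose importance is at least the mean importance.
selector = SelectFromModel(forest, threshold="mean").fit(X, y)

selected = feature_names[selector.get_support()]
X_selected = selector.transform(X)
print("kept features:", list(selected))
```

An XGBoost classifier exposes feature importances in the same way, so it could presumably be passed to SelectFromModel in place of the forest.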

Data Split

Splitting the train and test data: The data is then split into train and test sets for further analysis, with 70% used for training and 30% for testing. The StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0) splitter in scikit-learn is used for the split. Stratified splitting is required to handle the class imbalance between Zika and non-Zika cases: it preserves the ratio of positive to negative cases from the full sample in both the train and test sets.
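
The split described above can be sketched as follows. The data is synthetic, with labels built to match the 86% positive share reported in this article; only the StratifiedShuffleSplit arguments are taken from the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in data with an 86% positive class share.
rng = np.random.default_rng(0)
y = np.array([1] * 860 + [0] * 140)
X = rng.normal(size=(1000, 5))

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
# split() yields (train indices, test indices); n_splits=1 gives one pair.
train_idx, test_idx = next(splitter.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(f"positive share: train={y_train.mean():.2f}, test={y_test.mean():.2f}")
```

Both subsets keep the same class ratio as the full sample, which is the property that matters when one class dominates.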

Model Building

scikit-learn with Intel DAAL

Balancing the data set

The dataset is highly imbalanced, with 86% of the data containing positive Zika cases. This imbalance is handled by the SMOTETomek (SMOTE + Tomek)* algorithm, which produces a resampled dataset that addresses the class imbalance. SMOTE artificially generates observations of the minority class from each element's nearest neighbors within that class to balance the training dataset, and Tomek links (pairs of nearest neighbors from opposite classes) are then removed, so the method combines over- and under-sampling.
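
The oversampling half of this idea can be illustrated with a simplified, hand-written interpolation step. This is not the SMOTETomek implementation itself: it omits the Tomek-link cleaning, and the function name, toy data and parameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    # k + 1 because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    # Column 0 is the point itself, so pick from columns 1..k.
    neighbour = neigh_idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[neighbour] - X_min[base])

rng = np.random.default_rng(1)
X_majority = rng.normal(loc=0.0, size=(86, 2))  # e.g. outbreak class
X_minority = rng.normal(loc=3.0, size=(14, 2))  # e.g. non-outbreak class

X_synthetic = smote_like_oversample(
    X_minority, n_new=len(X_majority) - len(X_minority)
)
X_minority_balanced = np.vstack([X_minority, X_synthetic])
```

Every synthetic point lies on a segment between two real minority points, so the new samples stay inside the region the minority class already occupies. Resampling of this kind is normally applied to the training split only.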

Model Building and Training

In this stage, machine-learning models are selected for training. All classifiers in scikit-learn use a fit(X, y) method to fit the model to the given training data X and training labels y. To compare performance, several classifiers are trained on the same data. Once a model is trained, it can be used for prediction. We tested AdaBoost, XGBoost, SVM, Multi-Layer Perceptron and Logistic Regression.


During this stage, the trained model predicts the output for a given input based on what it has learned. That is, given an unlabeled observation X, predict(X) returns the predicted label y.
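
The shared fit/predict interface makes it easy to loop over candidate models. The sketch below uses synthetic data and default hyperparameters, both of which are assumptions; XGBoost is left out so the example depends on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the prepared feature matrix and labels.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "SVM": SVC(random_state=0),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                # learn from labelled data
    predictions[name] = model.predict(X_test)  # label unseen observations
```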


To measure the performance of a model, various evaluation metrics are available. We used accuracy, precision, and recall to choose the best model for the problem.
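
For reference, here is how the three metrics are computed on a small hand-made example. The labels are invented for illustration and are not results from the project.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = outbreak, 0 = no outbreak.
# Counts: 5 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 8 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 5 / 6
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 5 / 6
```

On an imbalanced dataset, accuracy alone can look good for a model that mostly predicts the majority class, which is why precision and recall are reported alongside it.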


We obtained excellent scores for the best estimator, the XGBoost classifier, on both stratified five-fold cross-validation (0.96) and test-set accuracy (0.95). Other relevant metrics are presented below.

Prediction Accuracy: 95%
ROC with 5-fold cross validation
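
A stratified five-fold evaluation of this kind can be sketched with cross_val_score. The example uses synthetic data and scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, so the numbers it produces are unrelated to the scores reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (roughly 86% positive).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.14, 0.86], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = GradientBoostingClassifier(random_state=0)  # stand-in for XGBoost

# One score per fold; the same folds are reused for both metrics.
accuracy_per_fold = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
auc_per_fold = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"accuracy: {accuracy_per_fold.mean():.3f}")
print(f"ROC AUC:  {auc_per_fold.mean():.3f}")
```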

Code Optimization

Intel Distribution for Python offers packages such as NumPy and scikit-learn accelerated with Intel® Math Kernel Library (Intel® MKL), along with PyDAAL, to boost machine learning (ML) and data analytics performance. We also made use of Dask: Dask.distributed is a lightweight library for distributed computing in Python.

We used Dask in our core program, which ran GridSearchCV to tune the hyperparameters of the best estimator model, XGBoost (Gradient Tree Boosting). We chose XGBoost because it is a distributed gradient boosting library designed to be highly efficient, flexible and portable.

The XGBoost classifier is run with the parameter n_jobs=-1 (to use the maximum number of threads).

To implement it, we launched a Dask.distributed client on our local machine:

import dask
from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV, RandomizedSearchCV

client = Client()  # no arguments: start a local cluster on this machine

This offers a substantial improvement in performance when we have to run a complex pipeline that applies a series of transformations to the input data (normalisation, PCA, etc.).
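
A grid search over such a pipeline might look like the sketch below. To stay self-contained it uses scikit-learn's own GridSearchCV and a GradientBoostingClassifier stand-in; the parameter grid and data are invented. The dask_ml class imported above is intended to be used the same way, so swapping the import is assumed (not verified here) to be the only change needed.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),  # normalisation
    ("pca", PCA()),               # dimensionality reduction
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# Hypothetical grid; keys follow the "<step>__<parameter>" convention.
param_grid = {
    "pca__n_components": [4, 8],
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [2, 3],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

Every parameter combination refits the whole pipeline on each fold, which is why the cost grows quickly and distributing the search pays off.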

We tried to enable threading composability between two or more thread-enabled libraries. Threading composability can accelerate programs by avoiding inefficient thread allocation (called oversubscription), which occurs when there are more software threads than available hardware resources.

A substantial improvement is observed when a task pool such as the ThreadPool from the standard library, or a library such as Dask or Joblib, executes tasks that call compute-intensive functions of NumPy/SciPy/PyDAAL and others, which are in turn parallelized using Intel MKL and/or Intel® Threading Building Blocks (Intel® TBB).

All scripts were run with the flag -m tbb, which enables Intel Threading Building Blocks.

Example: python -m tbb /path/to/your/code
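
The nested-parallelism situation described above can be reproduced with a few lines: an outer thread pool whose tasks each call a NumPy routine that may itself be multi-threaded. The function name, matrix size and pool size are assumptions chosen for illustration, and whether the inner call actually runs in parallel depends on how NumPy was built.

```python
from multiprocessing.pool import ThreadPool

import numpy as np

def largest_singular_value(seed):
    """Compute-heavy task: NumPy hands the SVD to its linear algebra backend."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(size=(200, 200))
    return np.linalg.svd(matrix, compute_uv=False)[0]

# Outer level of parallelism: four worker threads, each calling NumPy.
with ThreadPool(4) as pool:
    results = pool.map(largest_singular_value, range(8))

print(len(results), "tasks finished")
```

Saved under a hypothetical name such as nested_demo.py, the script runs unchanged with plain python or with python -m tbb nested_demo.py; the latter assumes the TBB module from Intel Distribution for Python is installed.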

Unlocking Composable Parallelism in Python via Intel TBB

• Enabling threading composability between two or more thread-enabled libraries

• Threading composability accelerates programs by avoiding oversubscription when there are more software threads than available hardware resources.

• Support for multiprocessing (Dask scheduler), which helps multiple processes talk to each other to coordinate the total number of threads

Comparison of Dask + Intel TBB + Intel MKL* accelerated code with non-optimised code

Recommendation and Future direction

  • Given that our proof of concept worked well, we wish to extend it by adding new parameters such as symptomatic data from social media, lifestyle, population dynamics, etc.
  • We wish to study and predict other epidemic outbreaks, such as Nipah or leptospirosis in flood-affected areas.
  • We can additionally ask government and healthcare institutions for more data that is not currently available in the public domain.
  • A crucial but often ignored problem for ML models is the lack of sufficient structured data. In the healthcare sector in particular, much of the available data is ‘unstructured’ or plain text and not directly amenable to ML.
  • In the near future, we will use a bidirectional LSTM, implemented in Intel TensorFlow*, for Natural Language Processing (NLP). It would be used to build a classifier of outbreak factors for diseases that don’t have much structured data available.