Disease Outbreaks

Nirosha Bugatha
SFU Professional Computer Science
14 min read · Apr 20, 2020

Navaneeth M, Nirosha Bugatha, Kunal Niranjan Desai, Arjun Mahadevan

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

Motivation and Background :

When reports of the coronavirus outbreak started coming in from Wuhan, China, in January 2020, we became interested in learning about outbreaks. We even discussed the possibility of a disease outbreak causing something like a zombie apocalypse and the end of the world. As an initial simulation, we downloaded ‘Plague Inc’ (a simulation/strategy game based on infecting the world with a pathogen) and ran trials with different combinations of bacteria and virus profiles to compare the rate of spread. More than 65 diseases have caused outbreaks in the past, and we wanted to explore all of them and build a publicly available, informative data product. The dashboard summarizes the epidemiological profile of each disease, including its pathogen, source, symptoms, pathogen host, etc. In addition, we wanted to gain insights into the fatalities and spread of the current COVID-19 crisis compared to past outbreaks. Looking back from our current crisis, we could never have predicted that COVID-19 would actually shut down the world.

Problem Statement :

Our first objective was to understand the Case Fatality Ratio (CFR) for different diseases, grouped by country and date. The second objective was to understand the factors that affect the spread of a disease (transmission rate and recovery rate). Through our project, we answer the following questions:

  1. What were the most fatal outbreaks experienced around the world in each country?
  2. How does COVID-19 compare to past outbreaks?
  3. Do some countries have more experience handling outbreaks?
  4. Can we model the spread of any disease?

Answering these questions is challenging. Although a flood of data on the COVID-19 outbreak is currently updated from several sources, we did not have organized data on past outbreaks. The second challenge was that epidemiology was a whole new domain for us: we had to identify features, a disease modelling approach, and evaluation metrics that help model the spread of a disease during a pandemic. Moreover, consolidated data from past outbreaks, presented in a dashboard for easy viewing and reference, would be useful to the general public. We followed an organized approach to solve these challenges; our data science pipeline is explained in the following section.

Data Science Pipeline :

Data Collection : While we had limited organized data on past outbreaks, we had a vast collection of online news articles and pandemic reports cataloged on the WHO and CDC websites. These articles also included a detailed profile for each disease, covering pathogen, host, sources, medium of spread, symptoms, and incubation period. Our data collection task had the following three sub-tasks.

  1. Web scraping with the Python library ‘Beautiful Soup’ to extract disease outbreak incidents by date and country, from 1996 to 2020. From each article we extracted the date, country, disease name, and description, and wrote them to a .csv file.
  2. Entity extraction with SpaCy, which we used extensively for the NLP task of pulling entities from the news articles and text descriptions. We automated the extraction of cumulative deaths and reported cases for each outbreak from the web articles. In addition, we populated an independent spreadsheet with detailed information about 67 diseases that have resulted in an outbreak in the past. This included pathogen, pathogen host, source, transmission mode, symptoms, vaccination and incubation period.
Figure: SpaCy Extraction
Figure: Disease Insights
Figure: Disease Details

  3. Daily updates of COVID-19 information through an API from Kaggle (and the Johns Hopkins website), giving a detailed record of reported cases, deaths and recovered cases.
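To give a feel for the extraction step in sub-task 2, here is a minimal sketch that pulls case and death counts out of a description sentence. It deliberately swaps in plain regular expressions from the Python standard library as a simplified stand-in for the SpaCy entity extraction; the function name `extract_counts`, the patterns, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical patterns: a number followed, within a few words, by
# "cases" or "deaths". A regex stand-in, not the SpaCy pipeline itself.
NUMBER = r"(\d[\d,]*)"
CASES_RE = re.compile(NUMBER + r"\s+(?:[A-Za-z]+\s+){0,3}?cases", re.IGNORECASE)
DEATHS_RE = re.compile(NUMBER + r"\s+(?:[A-Za-z]+\s+){0,3}?deaths", re.IGNORECASE)


def _to_int(text):
    """Convert a matched number such as '1,250' to an integer."""
    return int(text.replace(",", ""))


def extract_counts(description):
    """Return (cases, deaths) found in the text, with None where nothing matched."""
    cases = CASES_RE.search(description)
    deaths = DEATHS_RE.search(description)
    return (
        _to_int(cases.group(1)) if cases else None,
        _to_int(deaths.group(1)) if deaths else None,
    )


# Made-up sentence in the style of an outbreak report (not real data).
sample = "A total of 1,250 suspected cases including 87 deaths were reported."
print(extract_counts(sample))
```

Patterns like these break easily on real reports with varied wording, which is one reason a trained entity recognizer is the more practical choice for articles spanning 1996 to 2020.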

Data cleaning : Once the data was consolidated into .csv files containing date, disease, country, cumulative reported cases, and cumulative deaths, it was grouped and formatted using the Pandas and NumPy libraries for model development and dashboard visualization.

Visualization : Power BI is a powerful dashboarding tool that offers a range of features, from loading data and interfacing with cloud platforms to transporting data for easy visualization and, most importantly, publishing the dashboard for online viewing. Once we had all the data for each country and a profile for each disease, we used Power BI to create a visually appealing dashboard that combines the two kinds of information to study each disease. Our dashboard has an ‘All diseases’ tab, grouped by disease, that displays all countries that had an outbreak of that disease. It also displays a profile for each disease, including transmission medium, pathogen name, pathogen host, pathogen source, incubation period, symptoms, etc. A table shows the breakdown of total deaths and reported cases by country, and a line plot shows the year of occurrence. When a particular disease is selected, all the information about it can be viewed in a concise and understandable way. The ‘COVID-19’ tab of the dashboard is dedicated to COVID-19 data and displays infected cases, deaths and recovered cases over the past three months of the outbreak.

Figure: All Diseases Dashboard
Figure: COVID-19 Dashboard

Detecting Anomalies : Since most of our data for other outbreaks was compiled using an entity-extraction NLP model, we used the Case Fatality Ratio (CFR) metric to detect anomalies. About 2 percent of the data had errors, due to factors such as different authors of the news articles, different reporting time periods, and varied formatting of the articles over the years from 1996 to 2020. These errors were fixed manually in the spreadsheet, and we checked our CFR calculations against the approximate CFR values published for each disease on the WHO and CDC websites.
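A CFR-based check of this kind can be sketched as follows. The record layout, the `tolerance` value, and the sample rows are assumptions made for illustration; the reference ranges simply reuse the rough CFR figures quoted in this post and are not authoritative values.

```python
# Reference CFR ranges as fractions. These reuse the rough figures quoted
# in this post and are illustrative only.
REFERENCE_CFR = {"Ebola": (0.70, 0.90), "Cholera": (0.03, 0.07)}


def cfr(deaths, cases):
    """Case fatality ratio as a fraction; None when there are no cases."""
    return deaths / cases if cases else None


def flag_anomalies(records, tolerance=0.15):
    """Return records whose CFR is impossible or far from the reference range."""
    flagged = []
    for rec in records:
        ratio = cfr(rec["deaths"], rec["cases"])
        if ratio is None or ratio > 1.0:
            flagged.append(rec)  # no cases, or more deaths than cases
            continue
        bounds = REFERENCE_CFR.get(rec["disease"])
        if bounds and not (bounds[0] - tolerance <= ratio <= bounds[1] + tolerance):
            flagged.append(rec)  # plausible ratio, but unlike published values
    return flagged


# Hypothetical extracted rows, not real outbreak figures.
rows = [
    {"disease": "Ebola", "cases": 100, "deaths": 80},
    {"disease": "Ebola", "cases": 80, "deaths": 100},      # swapped columns
    {"disease": "Cholera", "cases": 1000, "deaths": 600},  # implausibly high
]
for rec in flag_anomalies(rows):
    print(rec)
```

Rows flagged this way are candidates for manual review rather than automatic correction, which matches how the errors were handled in the spreadsheet.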

Modelling : To understand the severity of an outbreak we used three metrics: the Case Fatality Ratio (CFR), the Transmission Rate (Beta), and the Recovery Rate (Gamma). We had two independent models: one for the CFR, and one for the transmission and recovery rates.

To model the case fatality ratio, each disease occurrence was grouped by country, and for each outbreak incident a cluster was formed using a window of 3 times the incubation period. In calculating the CFR, we used the following rule: if no new cases were reported in a country for a period of 3 times the incubation period, then the outbreak of that disease has ended in that country. If new cases appeared after this period, they were considered a second wave of the same outbreak.
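The clustering rule can be sketched in a few lines. This is a minimal illustration under assumptions: the function name `cluster_outbreaks`, the input format (a list of report dates for one disease in one country), and the sample dates are hypothetical, and a gap of exactly 3 times the incubation period is treated here as still belonging to the same cluster.

```python
from datetime import date, timedelta


def cluster_outbreaks(report_dates, incubation_days):
    """Group report dates into clusters separated by more than 3x the incubation period."""
    max_gap = timedelta(days=3 * incubation_days)
    clusters = []
    for day in sorted(report_dates):
        if clusters and day - clusters[-1][-1] <= max_gap:
            clusters[-1].append(day)  # still the same outbreak
        else:
            clusters.append([day])    # quiet period exceeded: start a new cluster
    return clusters


# Hypothetical report dates for one disease in one country (incubation: 7 days).
reports = [date(2019, 1, 1), date(2019, 1, 10), date(2019, 1, 25), date(2019, 3, 1)]
for outbreak in cluster_outbreaks(reports, incubation_days=7):
    print(outbreak[0], "to", outbreak[-1], f"({len(outbreak)} reports)")
```

With a 7-day incubation period the quiet window is 21 days, so the first three reports form one cluster and the March report starts a new one.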

Equation: Case Fatality Ratio
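Written out, and assuming the usual definition, which matches the ratio of deaths to total reported cases that we compute per outbreak cluster, the CFR expressed as a percentage is:

```latex
\mathrm{CFR} = \frac{\text{cumulative deaths in the outbreak}}{\text{cumulative reported cases in the outbreak}} \times 100\%
```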

To model the spread of COVID-19, we used a time-dependent SIR model. The parameters Beta and Gamma were learned from the COVID-19 data using the first 45 days of the outbreak in each country. The estimated transmission rate and recovery rate were then used to predict the Susceptible (S), Infected (I), and Recovered (R) cases by solving the S, I and R differential equations. The details of the model are provided in the methodology section below.

Model Evaluation & Inference : The models were evaluated against true data published in the literature, WHO sources and other data sources for COVID-19. The SIR model was evaluated by comparing its predictions to the actual reported cases on a hold-out test dataset.

Methodology :

Fatal Outbreaks between 1996 and 2020 : Each disease incident in a country was clustered into an outbreak that starts with the first reported case in that country and ends when no new cases are reported for a period of 3 times the incubation period. Once the incidents for each disease and country were clustered, we calculated the ratio of deaths to total reported cases. In total, we had over 475 outbreak clusters covering 67 diseases and 220 countries. The video below shows the most fatal outbreaks for each year between 1996 and 2020, including country, disease, CFR, and deaths across the world.

Video: Fatal Outbreaks between 1996 and 2020

Most Fatal Outbreaks since 1996 : The table below lists some of the most fatal outbreaks we have seen. Diseases like Ebola do not spread to a vast population, but they have fatal consequences, with a high CFR and death toll: about 70% to 90% of infected people die of the disease. Diseases without a vaccine also rank high until a vaccine is developed; an example is SARS in 2003.

Outbreaks like Dengue and Cholera do not cause a large number of deaths relative to cases; their CFR is low, on the order of 3% to 7% and similar to COVID-19, but they infect a large population.

Example: A Dengue outbreak in Pakistan in 2019 infected more than 47,000 people and a Cholera outbreak in Zimbabwe in 2009 infected over 79,000 people.

It should be noted that the same disease can have different CFRs in different countries, depending on each country’s infrastructure and ability to manage the outbreak.

Example: The Influenza outbreak in Madagascar had a death toll of 357 people in 2002 with a CFR of 84.4%, and the avian flu had a CFR of 81% in Indonesia in 2008. The table shows an example for the current COVID-19 fatality ratio for different countries.

Countries that experienced most outbreaks : While countries like Congo and Gabon have had an outbreak every year, for some countries COVID-19 is the first outbreak. The bar plot shows the countries that have seen the most outbreaks since 1996. Congo has had the most, with more than 37 outbreaks in total, followed by China, which has had close to 24 outbreaks including COVID-19. While all the outbreaks in Congo have been fatal, China has had a large number of low-fatality Influenza and Avian Flu type outbreaks. China has also witnessed several viral outbreaks in the past, including Avian Flu and SARS, the latter caused by a coronavirus.

SIR model : The SIR model is a compartmental epidemic model that describes the dynamics of an infectious disease. The model divides the population into three compartments: S for Susceptible, I for Infected and R for Recovered.

Susceptible is the group of people who are vulnerable to the infection. Infected is the group of people who currently have the disease; they can pass it to susceptible people and recover after a specific period. Recovered people gain immunity, so they are no longer susceptible to the same illness. The SIR model mathematically formulates the transition of individuals between these ‘compartments’, which capture each individual's infection status, leading to significant insights.

The model describes the number of people in each compartment using the ordinary differential equations shown below. β is a parameter controlling the rate of disease transmission through exposure; it is determined by the chance of contact and the probability of disease transmission. γ is a parameter expressing the rate at which an infected person recovers and moves into the resistant phase.

Equation: Relation of SIR with the population parameter n.
Equation: Differential Equation for Susceptibility(S)
Equation: Differential Equation for Infectivity(I)
Equation: Differential Equation for Recoverability(R)
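For reference, the standard SIR equations matching these captions can be written as follows. This assumes the textbook form, with n the total population, and uses I for the infected compartment as in the definitions above.

```latex
\begin{aligned}
S(t) + I(t) + R(t) &= n \\
\frac{dS(t)}{dt} &= -\frac{\beta\, S(t)\, I(t)}{n} \\
\frac{dI(t)}{dt} &= \frac{\beta\, S(t)\, I(t)}{n} - \gamma\, I(t) \\
\frac{dR(t)}{dt} &= \gamma\, I(t)
\end{aligned}
```

The three derivatives sum to zero, so the total population n stays constant over time.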

A traditional SIR model neglects the time-varying nature of the transmission rate and recovery rate, so we employ a time-dependent SIR model, in which both rates are functions of time t. We use a rolling FIR (Finite Impulse Response) window of 3 days to capture the time-varying dynamics of the infectious disease over a certain period. We then used machine learning methods such as Ridge Regression and Gradient Boosting Regressor to learn the transmission rate and recovery rate. For COVID-19, the data was split into 45 days for training and 10 days for testing. For other outbreaks, the train-test split was decided based on the data points available for the particular outbreak. We always kept a hold-out set to test our model predictions.

Equation: Time Dependent Formulation of Beta(Transmission Rate)
Equation: Time Dependent Formulation of Gamma(Recovery Rate)
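A common way to write the time-dependent formulation is to replace the derivatives with daily differences and solve for the two rates. The block below is a sketch that assumes this standard discrete-time form, with I(t) the number of currently infected people and n the total population.

```latex
\begin{aligned}
S(t+1) - S(t) &= -\frac{\beta(t)\, S(t)\, I(t)}{n} \\
I(t+1) - I(t) &= \frac{\beta(t)\, S(t)\, I(t)}{n} - \gamma(t)\, I(t) \\
R(t+1) - R(t) &= \gamma(t)\, I(t) \\[4pt]
\gamma(t) &= \frac{R(t+1) - R(t)}{I(t)} \\
\beta(t) &= \frac{n}{S(t)} \cdot \frac{[I(t+1) - I(t)] + [R(t+1) - R(t)]}{I(t)}
\end{aligned}
```

The expression for β(t) follows from adding the second and third difference equations. Early in an outbreak S(t) is close to n, so the factor n / S(t) is approximately 1.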

A time-dependent SIR model is more powerful for tracking the spread and control of an infectious disease and for predicting its future trend. For example, a constant β does not take into account the measures taken by governments or the public to reduce the transmission rate. Moreover, each country will have its own β, depending on how it handles the crisis.

Equation: Regression for Predicting Transmission Rate
Equation: Regression for Predicting Recovery Rate
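To make the FIR idea concrete, the sketch below fits a ridge-regularised linear predictor of the form x(t) ≈ a0 + a1·x(t-1) + a2·x(t-2) + a3·x(t-3) to a series of daily rate estimates and predicts the next value. It is a dependency-free stand-in for the Ridge Regression step (a library such as scikit-learn would be the usual choice); the helper names, the `alpha` value, and the sample series are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_fir_ridge(series, window=3, alpha=1e-3):
    """Fit x(t) ~ a0 + a1*x(t-1) + ... + aJ*x(t-J) with a ridge penalty on a1..aJ."""
    rows = [[1.0] + [series[t - j] for j in range(1, window + 1)]
            for t in range(window, len(series))]
    targets = series[window:]
    k = window + 1
    # Normal equations: (X^T X + alpha * I) w = X^T y, intercept left unpenalised.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    for i in range(1, k):
        xtx[i][i] += alpha
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict_next(series, weights):
    """One-step-ahead prediction from the most recent `window` values."""
    window = len(weights) - 1
    lags = [series[-j] for j in range(1, window + 1)]
    return weights[0] + sum(w * x for w, x in zip(weights[1:], lags))


# Hypothetical daily transmission-rate estimates (not real data).
beta = [0.40, 0.38, 0.37, 0.35, 0.34, 0.32, 0.31, 0.30, 0.28, 0.27]
weights = fit_fir_ridge(beta)
print(predict_next(beta, weights))
```

The same fit can be applied to the recovery-rate series. Predicting several days ahead means appending each prediction to the series and calling `predict_next` again, so errors compound over the horizon.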

Our model for predicting infected and recovered cases depends on two parameters: 1) the transmission rate and 2) the recovery rate. These two parameters are complex to model, as they depend on several factors, including the disease's epidemiological profile, disease management, government action, and each country's infrastructure.

Training our model on the first 45 days of the outbreak, we learn these two parameters for each country. Example plots of the model predictions for transmission rate and recovery rate are shown below for two countries: the USA and Canada. Our observation is that although the model predicts these two parameters fairly well against the data, the resulting prediction of spread differs significantly from the true number of infections.
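The link between the two rates and the predicted case counts can be illustrated with a small discrete-time sketch. It assumes daily steps, a closed population, and series of currently infected and cumulative recovered counts; the function names and the synthetic numbers are hypothetical and are not fitted COVID-19 values.

```python
def estimate_rates(infected, recovered, population):
    """Daily beta(t) and gamma(t) from active infections I(t) and recovered R(t)."""
    betas, gammas = [], []
    for t in range(len(infected) - 1):
        d_inf = infected[t + 1] - infected[t]
        d_rec = recovered[t + 1] - recovered[t]
        susceptible = population - infected[t] - recovered[t]
        gammas.append(d_rec / infected[t])
        betas.append(population * (d_inf + d_rec) / (susceptible * infected[t]))
    return betas, gammas


def simulate_sir(s0, i0, r0, betas, gammas):
    """Step the discrete SIR equations forward, one (beta, gamma) pair per day."""
    n = s0 + i0 + r0
    s, i, r = [s0], [i0], [r0]
    for beta, gamma in zip(betas, gammas):
        new_inf = beta * s[-1] * i[-1] / n  # newly infected today
        new_rec = gamma * i[-1]             # newly recovered today
        s.append(s[-1] - new_inf)
        i.append(i[-1] + new_inf - new_rec)
        r.append(r[-1] + new_rec)
    return s, i, r


# Synthetic round trip with hypothetical constant rates: simulate ten days,
# then recover the rates from the simulated series.
s, i, r = simulate_sir(990.0, 10.0, 0.0, [0.3] * 10, [0.1] * 10)
betas, gammas = estimate_rates(i, r, population=1000.0)
print(round(betas[0], 3), round(gammas[0], 3))
```

On real data the estimated rates vary from day to day, and any under-reporting in the infected counts feeds directly into both the estimated rates and the simulated trajectory.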

USA insights :

Transmission Rate :

The plot below shows the actual data, the model-predicted data (using the learned transmission rate) and a transmission rate that fits the data better. For certain countries we notice a wide gap between the actual and predicted transmission rates. This is because the reported data does not capture the true number of infections that are spreading. For example, in the United States on March 1 there were 70 reported cases of COVID-19, but the true figure may already have been around 1,500, which is not captured in the data.

Figure: Predicted Transmission Rate
Figure: Predicted Infections

Recovery Rate :

Similar trends can be observed for the recovery rate and its prediction. Recovered cases depend on infected cases and lag behind them, so a gap in the infection rate is reflected in the recovery rate as well. Hence, we notice a difference between the actual data and the model prediction for a future time period.

Figure: Predicted Recovery Rate
Figure: Predicted Recovery Numbers

Canada Insights :

Another important aspect of the transmission rate is that each country behaves differently, due to the complex nature of the transmission rate and how effectively a country manages an outbreak. The plot below shows the model prediction compared to the true data. We notice that the model prediction for Canada is much closer to the actual data than for the US. This is due to the difference in the spread of the virus in Canada compared to the US.

Figure: Predicted Transmission Rate
Figure: Predicted Infected Numbers
Figure: Predicted Recovery Rate
Figure: Predicted Recovered Numbers

Evaluation : Our evaluation consisted of two parts. For the CFR, we used WHO-published fatalities and pandemic reports to verify that our reported numbers match the published literature. For the transmission rate and recovery rate, the comparison is more complex because, as explained above, the data generation process during an outbreak has certain limitations. Our model results also show this limitation of data reporting and the challenges of accurately modelling transmission during an outbreak. Government officials, lawmakers and decision makers rely on reported data to make decisions, but during outbreaks it is prudent not to rely entirely on reported cases and deaths and to take action sooner. Spread and transmission differ for each disease and each country; as we are witnessing in the COVID-19 crisis, it is better to act sooner rather than later and safer to be cautious in decision making.

Data Product :

Dashboard

Lessons Learnt:

We were able to ask a question, search for the source of the data, identify an efficient way to collect it, detect anomalies, make manual corrections when needed, and use the data to model, compare results, and present complex information through a simple dashboard. Epidemiology was new to us and we had limited domain knowledge, but we were able to find resources to learn from and apply tools to create CFR and SIR models for COVID-19. We were happy to work on a current topic that has a significant impact on society. It was also an interesting lesson that traditional machine learning tools and techniques may not be very helpful when data is limited; fundamental statistics and technical principles, however, remain useful in developing models and solutions for a wide variety of problems. Our project followed a complete data science pipeline using all aspects of what we learned in our course, from data generation and model development to presentation in a dashboard.

Summary :

Our goal is to provide a consolidated view of data on all outbreaks between 1996 and 2020. Though several news articles are published in the WHO and CDC news sections, there were limited resources offering a consolidated view of all outbreaks around the world. Our dashboard summarizes all past outbreaks, including disease information, occurrences, deaths and reported cases. We integrated the COVID-19 data into an independent tab to track infected cases, recovered cases and fatalities for each country. While the data for other outbreaks came from the WHO and CDC, the COVID-19 data came from Kaggle and the Johns Hopkins Coronavirus Resource Center through an API.

To consolidate data from other outbreaks, we made extensive use of web scraping and the SpaCy NLP library in Python to extract entities, namely reported cases and deaths, for each country and each disease that has resulted in an outbreak. For the 67 diseases that have resulted in an outbreak in the past, we compiled a database of pathogen name, pathogen host, pathogen source, mode of transmission, common symptoms, vaccination (yes/no) and incubation period. We used the combined information to cluster the outbreaks over time and compute the case fatality ratio for each disease and country.

To model the spread of a disease, we used the COVID-19 data and set up a time-dependent SIR model to track the variation in the transmission rate and recovery rate, which are complex parameters determined by several factors, including government action to contain the spread. We used the learned transmission rate and recovery rate to predict the growth of infected cases by solving the ordinary differential equations of the SIR model. The model results show a gap between the actual spread and the reported cases during the initial phase of the disease. This gap could be due to various reasons, such as insufficient testing, under-reporting, long incubation periods and low fatality ratios for a particular virus. Our results support the view that during an outbreak it is prudent to take action sooner, as the actual spread can be quite different from the reported cases.

Our data is compiled and published in the CS GitLab repository and on Kaggle. We are open to feedback, questions and suggestions.
