Predicting the Outbreak of COVID-19 Pandemic using Machine Learning

Upasana Priyadarshiny
Edureka
Published in
14 min readMay 22, 2020
Predicting the Outbreak of COVID-19 using Machine Learning— Edureka

On day one, no one you know is sick. It feels like a normal day. But then one day, a few people you know are sick and suddenly, you see everyone is sick and it will feel like it happened so instantly. Everything looks fine until it isn’t. This is the paradox of pandemics. In this article, we shall analyze the outbreak of COVID-19 using Machine Learning.

Following is the outline of all you’re going to learn today:

1. What is COVID-19?

2. How does a Pandemic Work?

3. Case Study: Analysing the Outbreak of COVID-19 using Machine Learning

  • Problem Statement
  • Part 1: Analysing the present condition in India
  • Part 2: Is the trend Similar to Italy, Wuhan & South Korea?
  • Part 3: Exploring Worldwide Data?
  • Part 4: Forecasting Total Number of Cases Worldwide

4. Conclusion

What is COVID-19?

The Problem

Corona Virus disease (COVID-19) is an infectious disease caused by a newly discovered virus, which emerged in Wuhan, China in December of 2019.

Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment. Older people and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.

The COVID-19 virus spreads primarily through droplets of saliva or discharge from the nose when an infected person coughs or sneezes, so you might have heard caution to practice respiratory etiquette (for example, by coughing into a flexed elbow).

How does a Pandemic Work?

To understand this better let’s look at a small riddle.

There’s a glass slide held under a microscope that consists of a specific germ. This germ has a property to double every day. So on the first day, there’s one, on the second day there are two, on the third day there four and the fourth day eight, and so on.

On the 60th day, the slide is full. So on which day is the slide half full?

Day 59. But of course, you knew that.

But on which day, is the slide 1% full?

Surprisingly, not until the 54th day!

What it means that the slide goes from being 1% full to 100% in less than a week and hence, displays a property called exponential growth. And this is also how a Pandemic works. The outbreak is fairly unnoticeable in the beginning, then, once it reaches a significant value, the growth to maxima is extremely quick.

But it cannot go on forever. The virus will eventually stop finding people to infect and ultimately will slow down the count. This is called logistic growth and the curve is known as a sigmoid.

Now every point in the curve will give you to running total of cases of the current day. But if you delve a little into statistics, you’ll discover that by plotting the slope of each day, you shall get the new cases per day. There are fewer new cases right at the beginning and at the end, with a sharp rise in the stages in between.

As you can see, the peak of the curve may greatly overwhelm our healthcare systems, which is the number of resources available to us for the care of affected individuals at any given point in time.

Since we can’t really help the total number of individuals affected by the pandemic, the best solution is to flatten the curve so as to bring down the total number of cases, at any given point in time, as close to the healthcare line as possible.

This spreads the duration of this whole process a little longer, but since the healthcare system can tend to the number of cases at any given point in time, the casualties are way lower.

The Solution

Social Distancing. The logic here is, the virus can’t infect bodies if it cannot find bodies to infect!

World Leaders in all affected countries announced quarantines and lock-downs to keep their folks safe and away from anything or anyone that could infect them, all big social events were postponed and all major sports leagues canceled as well.

On March 24, the Indian Prime Minister announced that the country would go under a lock-down to combat the spread of the virus, until further notice. Infections are rapidly rising in Italy, France, Germany, Spain, the United Kingdom, and the United States. It has has a massive impact on the global economy and stock markets

The outbreak of COVID-19 is developing into a major international crisis, and it’s starting to influence important aspects of daily life.

For example:

  • Travel: Complete lock-down no domestic or international flights are allowed in India until decided by the Ministry of Civil Aviation.
  • Grocery stores: In highly affected areas, people are starting to stock up on essential goods leading to a shortage of essentials.

Case Study: Analysing the Outbreak of COVID 19 using Machine Learning

Problem Statement

We need a strong model that predicts how the virus could spread across different countries and regions. The goal of this task is to build a model that predicts the spread of the virus in the next 7 days.

🔴NOTE: The model was built on a test dataset updated till April,’20. But you can access the source to these datasets at the ‘John Hopkins University Coronavirus Resource Centre’ which gets updated on a daily basis, so you can run this model for the date you prefer.

Tasks to be performed:

  1. Analyzing the present condition in India
  2. Is this trend similar to Italy/South Korea/ Wuhan
  3. Exploring the world wide data
  4. Forecasting the worldwide COVID-19 cases using Prophet

Before we begin with the model, let’s first import the libraries that we need. Consider this a Step 0 if you may.

# importing the required libraries
import pandas as pd
# Visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import folium
from folium import plugins
# Manipulating the default plot size
plt.rcParams['figure.figsize'] = 10, 12
# Disable warnings
import warnings
warnings.filterwarnings('ignore')

Here we import a few important libraries that we shall use throughout the model. is an extremely fast and flexible data analysis and manipulation tool and allows you to allow you to store and manipulate tabular data. We also import visualization libraries such as matplotlib, seaborn, and plotly.

And finally, we determine the default plot size and disable warnings in our module.

Part 1: Analysing the present condition in India

So, how did it actually start in India?

The first COVID-19 case was reported on 30th January 2020 when a student arrived in Kerala, India from Wuhan, China. Just in the next 2 days, Kerela reported 2 more cases. For almost a month, no new cases were reported in India, however, on 2nd March 2020, five new cases of coronavirus were reported in Kerala again and since then the cases have only been rising.

1.1 Reading the Datasets

First, we’re going to start out by reading our datasets by creating a data frame using Pandas.

# Reading the datasets
df= pd.read_excel('/content/Covid cases in India.xlsx')
df_india = df.copy()
df
# Coordinates of India States and Union Territories
India_coord = pd.read_excel('/content/Indian Coordinates.xlsx')
#Day by day data of India, Korea, Italy and Wuhan
dbd_India = pd.read_excel('/content/per_day_cases.xlsx',parse_dates=True, sheet_name='India')
dbd_Italy=pd.read_excel('/content/per_day_cases.xlsx',parse_dates=True, sheet_name="Italy")
dbd_Korea=pd.read_excel('/content/per_day_cases.xlsx',parse_dates=True, sheet_name="Korea")
dbd_Wuhan=pd.read_excel('/content/per_day_cases.xlsx',parse_dates=True, sheet_name="Wuhan")df.drop(['S. No.'],axis=1,inplace=True) df['Total cases'] = df['Total Confirmed cases (Indian National)'] + df['Total Confirmed cases ( Foreign National )'] total_cases = df['Total cases'].sum() print('Total number of confirmed COVID 2019 cases across India till date (22nd March, 2020):', total_cases)

1.2 Analysing COVID19 Cases in India

So, here we’re going to play around with the data frame and create a new attribute called ‘Total Case’.

This attribute is the total number of confirmed cases (Indian National + Foreign National)

df.drop(['S. No.'],axis=1,inplace=True)df['Total cases'] = df['Total Confirmed cases (Indian National)'] + df['Total Confirmed cases ( Foreign National )']total_cases = df['Total cases'].sum()print('Total number of confirmed COVID 2019 cases across India till date (22nd March, 2020):', total_cases)

We are also going to highlight our data according to its geographical location in India.

df.style.background_gradient(cmap='Reds')

As you might have guessed, the redder the cell the bigger the value. So, the darker cells represent a higher number of affected cases and the lighter ones show otherwise.

1.3 Number of Active COVID-19 cases in affected State/Union Territories

#Total Active  is the Total cases - (Number of death + Cured)df['Total Active'] = df['Total cases'] - (df['Death'] + df['Cured'])total_active = df['Total Active'].sum()print('Total number of active COVID 2019 cases across India:', total_active)Tot_Cases = df.groupby('Name of State / UT')['Total Active'].sum().sort_values(ascending=False).to_frame()Tot_Cases.style.background_gradient(cmap='Reds')

1.4 Visualising the spread geographically

Next, we shall use Folium to create a zoomable map corresponding to the number of cases in different geographies.

df_full = pd.merge(India_coord,df,on='Name of State / UT')map = folium.Map(location=[20, 70], zoom_start=4,tiles='Stamenterrain')for lat, lon, value, name inzip(df_full['Latitude'], df_full['Longitude'], df_full['Total cases'], df_full['Name of State / UT']):folium.CircleMarker([lat, lon], radius=value*0.8, popup = ('<strong>State</strong>: ' + str(name).capitalize() + '''<strong>Total Cases</strong>: ' + str(value) + ''),color='red',fill_color='red',fill_opacity=0.3 ).add_to(map)map

1.5 Confirmed vs Recovered figures

Next, we are going to use Seaborn for visualization.

f, ax = plt.subplots(figsize=(12, 8))data = df_full[['Name of State / UT','Total cases','Cured','Death']]data.sort_values('Total cases',ascending=False,inplace=True)sns.set_color_codes("pastel")sns.barplot(x="Total cases", y="Name of State / UT", data=data,label="Total", color="r")sns.set_color_codes("muted")sns.barplot(x="Cured", y="Name of State / UT", data=data, label="Cured", color="g")# Add a legend and informative axis labelax.legend(ncol=2, loc="lower right", frameon=True)ax.set(xlim=(0, 35), ylabel="",xlabel="Cases")sns.despine(left=True, bottom=True)

1.6 The Rise of the Coronavirus cases

Next, you’re going to use Plotly to obtain graphs depicting the trends of the rise of coronavirus cases across India.

#This cell's code is required when you are working with plotly on colabimport plotlyplotly.io.renderers.default = 'colab'# Rise of COVID-19 cases in Indiafig = go.Figure()fig.add_trace(go.Scatter(x=dbd_India['Date'], y = dbd_India['Total Cases'], mode='lines+markers',name='Total Cases'))fig.update_layout(title_text='Trend of Coronavirus Cases in India (Cumulative cases)',plot_bgcolor='rgb(230, 230, 230)')fig.show()import plotly.express as pxfig = px.bar(dbd_India, x="Date", y="New Cases", barmode='group', height=400)fig.update_layout(title_text='Coronavirus Cases in India on daily basis',plot_bgcolor='rgb(230, 230, 230)')fig.show()

Part 2: Is the trend Similar to Italy, Wuhan & South Korea?

At this point, India had already crossed 500 cases. It still is very important to contain the situation in the coming days. The numbers of coronavirus patients had started doubling after many countries hit the 100 marks, and almost starting increasing exponentially.

2.1 Cumulative cases in India, Italy, S.Korea, and Wuhan

# import plotly.express as pxfig = px.bar(dbd_India, x="Date", y="Total Cases", color='Total Cases', orientation='v', height=600,title='Confirmed Cases in India', color_discrete_sequence = px.colors.cyclical.IceFire)'''Colour Scale for plotlyhttps://plot.ly/python/builtin-colorscales/'''fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')fig.show()fig = px.bar(dbd_Italy, x="Date", y="Total Cases", color='Total Cases', orientation='v', height=600,title='Confirmed Cases in Italy', color_discrete_sequence = px.colors.cyclical.IceFire)fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')fig.show()fig = px.bar(dbd_Korea, x="Date", y="Total Cases", color='Total Cases', orientation='v', height=600,title='Confirmed Cases in South Korea', color_discrete_sequence = px.colors.cyclical.IceFire)fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')fig.show()fig = px.bar(dbd_Wuhan, x="Date", y="Total Cases", color='Total Cases', orientation='v', height=600,title='Confirmed Cases in Wuhan', color_discrete_sequence = px.colors.cyclical.IceFire)fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')fig.show()

From the visualization above, one can infer the following:

  • Confirmed cases in India is rising exponentially with no fixed pattern (Very less test in India)
  • Confirmed cases in Italy is rising exponentially with a certain fixed pattern
  • Confirmed cases in S.Korea is rising gradually
  • There has been almost a negligible number of confirmed cases in Wuhan a week.

2.2 Comparison between the rise of cases in Wuhan, S.Korea, Italy, and India

# import plotly.graph_objects as gofrom plotly.subplots import make_subplotsfig = make_subplots(rows=2, cols=2,specs=[[{}, {}],[{"colspan": 2}, None]],subplot_titles=("S.Korea","Italy", "India","Wuhan"))fig.add_trace(go.Bar(x=dbd_Korea['Date'], y=dbd_Korea['Total Cases'],marker=dict(color=dbd_Korea['Total Cases'], coloraxis="coloraxis")),1, 1)fig.add_trace(go.Bar(x=dbd_Italy['Date'], y=dbd_Italy['Total Cases'],marker=dict(color=dbd_Italy['Total Cases'], coloraxis="coloraxis")),1, 2)fig.add_trace(go.Bar(x=dbd_India['Date'], y=dbd_India['Total Cases'],marker=dict(color=dbd_India['Total Cases'], coloraxis="coloraxis")),2, 1)# fig.add_trace(go.Bar(x=dbd_Wuhan['Date'], y=dbd_Wuhan['Total Cases'],#                     marker=dict(color=dbd_Wuhan['Total Cases'], coloraxis="coloraxis")),2, 2)fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), showlegend=False,title_text="Total Confirmed cases(Cumulative)")fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')fig.show()

2.3 Trend after crossing 100 cases

# import plotly.graph_objects as gotitle = 'Main Source for News'labels = ['S.Korea', 'Italy', 'India']colors = ['rgb(122,128,0)', 'rgb(255,0,0)', 'rgb(49,130,189)']mode_size = [10, 10, 12]line_size = [1, 1, 8]fig = go.Figure()fig.add_trace(go.Scatter(x=dbd_Korea['Days after surpassing 100 cases'],y=dbd_Korea['Total Cases'],mode='lines',name=labels[0],line=dict(color=colors[0], width=line_size[0]),connectgaps=True))fig.add_trace(go.Scatter(x=dbd_Italy['Days after surpassing 100 cases'],y=dbd_Italy['Total Cases'],mode='lines',name=labels[1],line=dict(color=colors[1], width=line_size[1]),connectgaps=True))fig.add_trace(go.Scatter(x=dbd_India['Days after surpassing 100 cases'],y=dbd_India['Total Cases'],mode='lines',name=labels[2],line=dict(color=colors[2], width=line_size[2]),connectgaps=True))annotations = []annotations.append(dict(xref='paper', yref='paper', x=0.5, y=-0.1,xanchor='center', yanchor='top',text='Days after crossing 100 cases ',font=dict(family='Arial',size=12,color='rgb(150,150,150)'),showarrow=False))fig.update_layout(annotations=annotations,plot_bgcolor='white',yaxis_title='Cumulative cases')fig.show()

Part 3: Exploring Worldwide Data

The following code will give you tabular data about the location and status of confirmed cases by date.

df = pd.read_csv('/content/covid_19_clean_complete.csv',parse_dates=['Date'])df.rename(columns={'ObservationDate':'Date', 'Country/Region':'Country'}, inplace=True)df_confirmed = pd.read_csv("/content/time_series_covid19_confirmed_global.csv")df_recovered = pd.read_csv("/content/time_series_covid19_recovered_global.csv")df_deaths = pd.read_csv("/content/time_series_covid19_deaths_global.csv")df_confirmed.rename(columns={'Country/Region':'Country'}, inplace=True)df_recovered.rename(columns={'Country/Region':'Country'}, inplace=True)df_deaths.rename(columns={'Country/Region':'Country'}, inplace=True)df_deaths.head()df2 = df.groupby(["Date", "Country", "Province/State"])[['Date', 'Province/State', 'Country', 'Confirmed', 'Deaths', 'Recovered']].sum().reset_index()df2.head()
#Overall worldwide Confirmed/ Deaths/ Recovered cases
df.groupby('Date').sum().head()

Visualizing: Worldwide COVID-19 cases

confirmed = df.groupby('Date').sum()['Confirmed'].reset_index()deaths = df.groupby('Date').sum()['Deaths'].reset_index()recovered = df.groupby('Date').sum()['Recovered'].reset_index()fig = go.Figure()#Plotting datewise confirmed casesfig.add_trace(go.Scatter(x=confirmed['Date'], y=confirmed['Confirmed'], mode='lines+markers', name='Confirmed',line=dict(color='blue', width=2)))fig.add_trace(go.Scatter(x=deaths['Date'], y=deaths['Deaths'], mode='lines+markers', name='Deaths', line=dict(color='Red', width=2)))fig.add_trace(go.Scatter(x=recovered['Date'], y=recovered['Recovered'], mode='lines+markers', name='Recovered', line=dict(color='Green', width=2)))fig.update_layout(title='Worldwide NCOVID-19 Cases', xaxis_tickfont_size=14,yaxis=dict(title='Number of Cases'))fig.show()

Part 4: Forecasting Total Number of Cases Worldwide

In this segment, we’re going to generate a week ahead forecast of confirmed cases of COVID-19 using Prophet, with specific prediction intervals by creating a base model both with and without tweaking of seasonality-related parameters and additional regressors.

What is the Prophet?

The prophet is open source software released by Facebook’s Core Data Science team. It is available for download on CRAN and PyPI.

We use Prophet, a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

Why Prophet?

  • Accurate and fast: Prophet is used in many applications across Facebook for producing reliable forecasts for planning and goal setting. Facebook finds it to perform better than any other approach in the majority of cases. It fit models in Stan so that you get forecasts in just a few seconds.
  • Fully automatic: Get a reasonable forecast on messy data with no manual effort. The prophet is robust to outliers, missing data, and dramatic changes in your time series.
  • Tunable forecasts: The Prophet procedure includes many possibilities for users to tweak and adjust forecasts. You can use human-interpretable parameters to improve your forecast by adding your domain knowledge
  • Available in R or Python: Facebook has implemented the Prophet procedure in R and Python. Both of them share the same underlying Stan code for fitting. You can use whatever language you’re comfortable with to get forecasts.
from fbprophet import Prophet confirmed = df.groupby('Date').sum()['Confirmed'].reset_index() deaths = df.groupby('Date').sum()['Deaths'].reset_index() recovered = df.groupby('Date').sum()['Recovered'].reset_index()

The input to Prophet is always a data frame with two columns: ds and y. The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM: SS for a timestamp. The y column must be numeric and represents the measurement we wish to forecast.

confirmed.columns = ['ds','y']#confirmed['ds'] = confirmed['ds'].dt.dateconfirmed['ds'] = pd.to_datetime(confirmed['ds'])confirmed.tail()

4.1 Forecasting Confirmed COVID-19 Cases Worldwide with Prophet (Base model)

Generating a week ahead forecast of confirmed cases of COVID-19 using Prophet, with a 95% prediction interval by creating a base model with no tweaking of seasonality-related parameters and additional regressors.

m = Prophet(interval_width=0.95)m.fit(confirmed)future = m.make_future_dataframe(periods=7)future.tail()

The predicted method will assign each row in future a predicted value which it names that. If you pass on historical dates, it will provide an in-sample fit. The forecast object here is a new data-frame that includes a column with the forecast, as well as columns for components and uncertainty intervals.

#predicting the future with date, and upper and lower limit of y valueforecast = m.predict(future)forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

You can plot the forecast by calling the Prophet. plot method and passing in your forecast data frame.

confirmed_forecast_plot = m.plot(forecast)
confirmed_forecast_plot = m.plot(forecast)

4.2 Forecasting Worldwide Deaths using Prophet (Base model)

Generating a week ahead forecast of confirmed cases of COVID-19 using the Machine Learning library — Prophet, with a 95% prediction interval by creating a base model with no tweaking of seasonality-related parameters and additional regressors.

deaths.columns = ['ds','y']deaths['ds'] = pd.to_datetime(deaths['ds'])m = Prophet(interval_width=0.95)m.fit(deaths)future = m.make_future_dataframe(periods=7)future.tail()forecast = m.predict(future)forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()deaths_forecast_plot = m.plot(forecast)
deaths_forecast_plot = m.plot_components(forecast)

4.3 Forecasting Worldwide Recovered Cases with Prophet (Base model)

Generating a week ahead forecast of confirmed cases of COVID-19 using Prophet, with a 95% prediction interval by creating a base model with no tweaking of seasonality-related parameters and additional regressors.

recovered.columns = ['ds','y']recovered['ds'] = pd.to_datetime(recovered['ds'])m = Prophet(interval_width=0.95)m.fit(recovered)future = m.make_future_dataframe(periods=7)future.tail()forecast = m.predict(future)forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()recovered_forecast_plot = m.plot(forecast)
recovered_forecast_plot = m.plot_components(forecast)

Conclusion

This is a humble request to all our learners.

Don’t take your cough and cold lightly as you would. If you look at the data, the number of cases in India is rising just like in Italy, Wuhan, S.Korea, Spain, or the USA. We have crossed 100,000 cases already. Don’t let lower awareness and fewer test numbers ruin the health of our world.

Currently, India is a deadly and risky zone as there are very few COVID-19 test centers available. Imagine how many infected people are still around you and are infecting others unknowingly.

Let’s give a hand in fighting this pandemic at least by quarantining ourselves by staying indoors and protecting ourselves and others around us.

Originally published at https://www.edureka.co on May 22, 2020.

--

--