Forecasting COVID-19 cases in Real Time for different countries

Rucha Sawarkar
Published in Analytics Vidhya
6 min read · Nov 7, 2020

How to forecast the rise/decline of COVID-19 cases for a country or region?

On 11 February 2020, WHO announced a name for the new coronavirus disease: COVID-19.

This virus has taken the world by storm. After it was first detected in China, the coronavirus has spread to about 212 countries. The total number of cases has grown exponentially in countries around the globe. As of 15th Dec 2020, total reported cases had reached 73M, with 1.6M deaths.

In the current COVID-19 pandemic, the number of infected patients and the death toll are on the rise. Given the current rate of growth, how many cases can we expect in the coming days if no specific precautions are taken?

The figure below shows the total number of cases detected from Jan 2020 to Oct 2020 for some countries.

Total number of Cases for 4 countries from Jan to Oct 2020

The idea is to understand the pattern and predict the spread of the coronavirus infection by:

  1. Predicting the total number of cases in the days to come
  2. Determining the daily increase in cases and identifying the hotspots
  3. Showing how time series analysis can help in the current situation

We can use time series analysis to build a model that helps with the forecast. We use Python for the analysis. We start by loading the necessary packages and the dataset.

Let’s explore the dataset and perform some pre-processing steps. The dataset is available on this website and is updated daily. You can read it directly from the URL or use a downloaded copy.

import itertools
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt

# Option 1: read directly from the URL
url = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'
full_data = pd.read_csv(url)

# Option 2: read a downloaded copy instead
# url = r"…..\Covid19Data\owid-covid-data.csv"
# full_data = pd.read_csv(url)

If you take a look at the data, it contains country-wise records starting from 31st December 2019.

It contains the total number of cases, new cases, total deaths and new deaths. We can ignore the rest of the columns.

We will perform pre-processing steps to select a particular country and the relevant columns from the dataset.

The function below also converts the data into a time series. You can pass the name of any country for which you want to make predictions. Here I am using the “total_cases” column for the analysis. Generalized code for selecting any one of the four columns (‘total_cases’, ‘new_cases’, ‘total_deaths’, ‘new_deaths’) is uploaded at the GitHub link.

country = "India"

def data_for_country(country, data):
    data = data[["location", "date", "total_cases"]]  # select location, date and number of cases columns
    data = data[data["location"] == country]  # select particular country
    data = data[data["total_cases"] != 0]  # drop rows with zero cases
    # convert to time series data indexed by date
    data = data.set_index(pd.to_datetime(data["date"], format='%Y-%m-%d'))
    data = data.drop(columns=["location", "date"])
    data = data.resample('D').mean()
    # fill any missing days, backwards first and then forwards
    data["total_cases"] = data["total_cases"].bfill().ffill()
    return data

data = data_for_country(country, full_data)
data.head(6)

Out[16]:
total_cases
date
2019-12-31 1.0
2020-01-01 1.0
2020-01-02 1.0
2020-01-03 1.0
2020-01-04 1.0
2020-01-05 1.0
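
The generalized version mentioned above lives in the linked repository and is not reproduced here. As a rough sketch of the idea only, the function below takes the column name as a parameter. The name `series_for_country`, the validation check and the tiny demo frame are assumptions made for illustration, not the author's code. The zero-value filter is left out here because a column such as new_cases can legitimately be 0.

```python
import pandas as pd

# Columns the article lists as available for analysis.
ALLOWED_COLUMNS = ("total_cases", "new_cases", "total_deaths", "new_deaths")


def series_for_country(data, country, column="total_cases"):
    """Return a daily time series of `column` for one country (hypothetical helper)."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"column must be one of {ALLOWED_COLUMNS}")
    subset = data.loc[data["location"] == country, ["date", column]].copy()
    subset["date"] = pd.to_datetime(subset["date"], format="%Y-%m-%d")
    subset = subset.set_index("date")
    # One row per calendar day; gaps are filled backwards, then forwards.
    return subset.resample("D").mean().bfill().ffill()


# Tiny made-up frame, only to show the call; the real data comes from the CSV.
demo = pd.DataFrame({
    "location": ["India", "India", "India", "Brazil"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-04", "2020-03-01"],
    "total_cases": [3.0, 5.0, 28.0, 2.0],
    "new_cases": [0.0, 2.0, 23.0, 1.0],
    "total_deaths": [0.0, 0.0, 0.0, 0.0],
    "new_deaths": [0.0, 0.0, 0.0, 0.0],
})
new_cases_india = series_for_country(demo, "India", column="new_cases")
print(new_cases_india)
```

The missing day (2020-03-03 in the demo frame) is created by the resampling step and then filled from the following day.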

Now we will plot this data for understanding the trend.

def plot_Data(df, country):
    ts = df.iloc[:, 0]
    ts.plot(figsize=(15, 8), title='Daily Cases ' + country, fontsize=14, linestyle='dotted')
    plt.xlabel("Date", fontsize=10)
    plt.ylabel('Total cases', fontsize=10)
    plt.legend(loc='best')
    plt.show()

plot_Data(data,country)
Graph of total number of cases for India

To evaluate the model, we need to split the data into a training set and a validation set. Random splitting is not appropriate for time series data, because the order of the observations matters. Here the validation set is the most recent stretch of data: the number of held-out days equals today's day of the month.

from datetime import date

today = date.today().day  # day of the month, e.g. 7 on Nov 7
split_index = len(data) - today
train = data[0:split_index]
test = data[split_index:]
train.head(6)

Out[28]:
total_cases
date
2019-12-31 1.0
2020-01-01 1.0
2020-01-02 1.0
2020-01-03 1.0
2020-01-04 1.0
2020-01-05 1.0

test.head(6)

Out[29]:
total_cases
date
2020-10-31 8137119.0
2020-11-01 8184082.0
2020-11-02 8229313.0
2020-11-03 8267623.0
2020-11-04 8313876.0
2020-11-05 8364086.0
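
Because the split above depends on the date the code is run, the result changes from day to day. A minimal sketch of the same chronological split with an explicit number of held-out days is shown below. The helper name `split_last_n_days` and the demo series are hypothetical and exist only for illustration.

```python
import pandas as pd


def split_last_n_days(df, n_test_days):
    """Chronological split: the last n_test_days rows validate, the rest train."""
    if not 0 < n_test_days < len(df):
        raise ValueError("n_test_days must be between 1 and len(df) - 1")
    split_index = len(df) - n_test_days
    return df.iloc[:split_index], df.iloc[split_index:]


# Made-up daily series, for illustration only.
idx = pd.date_range("2020-10-01", periods=10, freq="D")
demo = pd.DataFrame({"total_cases": range(100, 110)}, index=idx)

train_demo, test_demo = split_last_n_days(demo, n_test_days=3)
print(train_demo.index[-1].date(), test_demo.index[0].date())
```

Every training date comes before every validation date, which is the property a random split would break.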

I have used the ARIMA model to forecast this data. An ARIMA model is specified by three order parameters: p, d and q.

p: order of the autoregressive (AR) term

d: order of differencing

q: order of the moving average (MA) term
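
To make the differencing order d concrete, the short sketch below applies it by hand to the six test-set totals for India shown earlier. With d = 1 the cumulative totals become daily increases, and with d = 2 they become the change in the daily increase. This is only an illustration of what differencing does, not part of the model code.

```python
import pandas as pd

# First six test-set totals for India, as printed earlier in this article.
total_cases = pd.Series(
    [8137119.0, 8184082.0, 8229313.0, 8267623.0, 8313876.0, 8364086.0],
    index=pd.date_range("2020-10-31", periods=6, freq="D"),
    name="total_cases",
)

# d = 1: first difference, i.e. the daily increase in cases.
daily_increase = total_cases.diff().dropna()

# d = 2: difference of the differences, i.e. how the daily increase changes.
change_in_increase = total_cases.diff().diff().dropna()

print(daily_increase)
print(change_in_increase)
```

Each round of differencing removes one observation from the start of the series, so six totals give five daily increases and four second differences.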

The COVID-19 total-case data shows no obvious seasonality, so we can set the seasonal parameters of the model to 0.

For prediction, we are considering different countries. As you can see in the figure above, each country's curve follows a different trend, so making the series stationary and then selecting p, d, q values by hand for each country would be a tedious task. To solve this, I have used a grid search that automatically selects the p, d, q values giving the lowest RMSE on the validation set.

p = d = q = range(0, 4)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(0, 0, 0, 0)]
params = []
rms_arimas = []
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            y_hat_avg = test.copy()
            mod = sm.tsa.statespace.SARIMAX(train.iloc[:, 0],
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            y_hat_avg['SARIMA'] = results.predict(start=test.index[0], end=test.index[-1], dynamic=True)
            rmse = sqrt(mean_squared_error(test.iloc[:, 0], y_hat_avg.SARIMA))
        except Exception:
            continue  # skip combinations that fail to fit
        # record parameters and score together so the two lists stay aligned
        params.append(param)
        rms_arimas.append(rmse)

data_tuples = list(zip(params, rms_arimas))
rms = pd.DataFrame(data_tuples, columns=['Parameters', 'RMS value'])
minimum = int(rms['RMS value'].idxmin())
parameters = params[minimum]
parameters

Out[40]: (3, 3, 2)

Depending upon the country and the date, you might get different values for parameters. Now we will test our model.

y_hat_avg = test.copy()
fit1 = sm.tsa.statespace.SARIMAX(train.total_cases,
                                 order=parameters,
                                 seasonal_order=(0, 0, 0, 0),
                                 enforce_stationarity=False,
                                 enforce_invertibility=False).fit()
# these dates match the test window of this run; use test.index[0] and test.index[-1] to generalize
y_hat_avg['SARIMA'] = fit1.predict(start="2020-10-31", end="2020-11-07", dynamic=True).astype(int)

plt.figure(figsize=(16, 8))
plt.plot(train['total_cases'], label='Train')
plt.plot(test['total_cases'], label='Test')
plt.plot(y_hat_avg['SARIMA'], label='SARIMA')
plt.title("ARIMA Forecast")
plt.legend(loc='best')
plt.show()

rms_arima = sqrt(mean_squared_error(test.total_cases, y_hat_avg.SARIMA))
print(rms_arima)

5312.48714956429
Graph of Train, Test and Predicted data

As you can see in the graph, the test data and our predicted data almost overlap. I have also compared this model's predictions with the actual data.

Comparison of Actual and Predicted Cases

The results are very similar. These model predictions were made on the 20th October 2020.
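
To put the RMSE of about 5,312 reported above in context, the sketch below writes the metric out by hand and relates the reported value to the size of the series. The `rmse` helper and the toy numbers are illustrative assumptions; only the reported RMSE and the first test-day total for India are taken from this article.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Toy numbers, not model output: the predictions are off by 3, 4 and 5 cases.
toy_actual = [100.0, 200.0, 300.0]
toy_predicted = [103.0, 196.0, 305.0]
print(rmse(toy_actual, toy_predicted))

# Size of the reported RMSE relative to the first test-day total for India.
reported_rmse = 5312.48714956429
first_test_total = 8137119.0
relative_error = reported_rmse / first_test_total
print(f"{relative_error:.4%}")
```

Relative to a cumulative total of more than 8 million cases, the reported error is well under 0.1%, which is consistent with the curves nearly overlapping in the graph.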

I have also created a generalized version of the model. Below is a snippet from it. You can get the full code at this link, experiment with different countries as input, and see the results.

Predictions for upcoming days with user selected country and number of days

Conclusion

In this tutorial, I have explained time series forecasting of COVID-19 cases for different countries in Python, walking through the code step by step.

Below are the features of the implemented model:

  1. Data is taken in real time directly from the website (updated daily)
  2. Country-wise analysis
  3. Gives prediction for the upcoming number of days starting from today
  4. Any change in the trend of the data is captured during training on real-time data

Thank you for the read. I sincerely hope you found it helpful and as always I am open to constructive feedback.

Drop me a mail at rsawarkar80@gmail.com

You can find me on LinkedIn.

Rucha Sawarkar
Analytics Vidhya

Data Scientist at 3K Technologies. Gold Medalist from NIT Raipur. Passionate about learning new technologies. Dream of helping people using my knowledge.