Forecasting the Trend of COVID-19 using Machine Learning (Time Series Analysis)

Nikhil Kumar Parashar
Published in ADGVIT
Jul 9, 2020 · 7 min read

Introduction:

COVID-19, the disease caused by the novel coronavirus SARS-CoV-2, was first identified in Wuhan, China. Since then, it has spread to every corner of the world, and scientists are still working to find a cure. Until then, it is on us to slow its spread by practising social distancing and good hygiene.

The World Health Organization (WHO) declared this outbreak a Public Health Emergency of International Concern (PHEIC).

Some Data Regarding COVID-19:

As of June 13, 2020, 19:34 hours IST, India stands at 3,12,838 total confirmed cases, 1,46,912 active cases, 1,56,972 recoveries, and 8,930 unfortunate deaths. On the positive side, India has had a high recovery rate of approximately 50%. But we have a long way to go. India ranks 4th in the number of confirmed cases, after the USA, Brazil, and the Russian Federation, which is nothing to celebrate.

Moving ahead, in this article we will see how to make predictions about the COVID-19 situation using Machine Learning, specifically Time Series Analysis.

LET’S LEARN HOW THE MACHINE LEARNS!

Can the model be of any help to us?

How much would you give to know in advance the dangers you are about to face? I personally would give a lot, and I know most of you would too. We have to agree that if we know what we are up against, or what is about to come, we can prepare better. So, why not here? This model can help us, and a lot of other people, predict the future state of the COVID-19 pandemic. Roughly accurate knowledge of the future situation can help the government plan further lockdowns, arrange more isolation wards if needed, and speed up the manufacturing of medicines, PPE kits, etc. accordingly.

Algorithm Involved:

The process we will employ is Time Series Analysis. The algorithm used is ARIMA, the model at the heart of the Box-Jenkins method and one of the most commonly used for time series data. ARIMA stands for Auto-Regressive Integrated Moving Average. It is described by three orders: p, the number of autoregressive lags; d, the number of times the series is differenced; and q, the number of moving-average lags. Many time series can be fitted to this model to better understand the data or to predict future points in the series.

The Process Involved:

We used sample data with a date-wise distribution of the total number of confirmed, deceased, and recovered cases. The data starts on January 30, 2020, and continues till June 10, 2020.

Before we start with the fitting of the data into the model, we import all the essential functions and modules, and the data.

(You can refer to the notebook through the following link: https://github.com/NikhilKP631197/ADG--ML-TASKS-2020/tree/master/Blog%201 )

Now, we move on to Data Processing.

To begin with, we convert the date column to date-time format using Python's datetime module. To do that, we first bring the strings in the date column into a parseable form and then use the strptime() function to convert them to datetime objects. Then we set the index of the data frame to the date column using the set_index() method provided by the pandas library. We also drop the columns we do not need.

Finally, we use the shape attribute to check the shape of our data frame, which in this case is 133 rows and 3 columns.
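The processing steps above can be sketched roughly as follows. This is only an illustration, not the notebook's code: the column names, the year-less date strings, and the three sample rows are assumptions made up for the example.

```python
from datetime import datetime

import pandas as pd

# Tiny made-up stand-in for the CSV. The column names and the "30 January "
# date style (no year, trailing space) are assumptions for illustration.
raw = pd.DataFrame({
    "date": ["30 January ", "31 January ", "01 February "],
    "dailyconfirmed": [1, 0, 0],
    "totalconfirmed": [1, 1, 1],
    "totaldeceased": [0, 0, 0],
    "totalrecovered": [0, 0, 0],
})


def prepare(df, year=2020):
    """Parse the date strings, index by date, keep the three cumulative columns."""
    out = df.copy()
    # Append the year so the string becomes parseable, then parse with strptime().
    out["date"] = [
        datetime.strptime(f"{text.strip()} {year}", "%d %B %Y")
        for text in out["date"]
    ]
    out = out.set_index("date")
    # Keep only the columns we want to forecast; everything else is dropped.
    return out[["totalconfirmed", "totaldeceased", "totalrecovered"]]


data = prepare(raw)
print(data.shape)  # (rows, columns)
```

On the real file, you would replace `raw` with the result of `pd.read_csv(...)` and adjust the column names to match.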

Now, it is time to do some visualization.

We will be using matplotlib to plot each of the columns against the dates.
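A minimal plotting sketch is shown here. The data is synthetic (a smooth made-up growth curve over 133 days), and `plot_column` and the column name are hypothetical names chosen for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 133 daily points of smooth growth.
dates = pd.date_range("2020-01-30", periods=133, freq="D")
data = pd.DataFrame(
    {"totalconfirmed": np.exp(0.09 * np.arange(133))},  # made-up values
    index=dates,
)


def plot_column(frame, column):
    """Plot one column of the data frame against its date index."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(frame.index, frame[column], label=column)
    ax.set_xlabel("Date")
    ax.set_ylabel("Cases")
    ax.set_title(f"{column} over time")
    ax.legend()
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    return fig


fig = plot_column(data, "totalconfirmed")
```

In a notebook the figure renders on its own; in a script, call `plt.show()` afterwards.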

The first graph shows the past trend of the total number of confirmed cases.

Next, we will visualize total deceased and recovered cases.

We find that the trends are clearly exponential.

Then, we divide the data frame into three data series as follows to work separately on each column.

Next, we check visually whether the time series data is stationary. This step is essential because ARIMA assumes that the series is stationary once it has been differenced, so we need to know whether, and how much, the data must be transformed before fitting. A time series is said to be stationary if its mean and standard deviation stay constant over time.

Stationarity can be checked visually or with the Dickey-Fuller test. Here, we use the visual method.

We use the rolling() function and the functions mean() and std() to find the rolling mean and rolling standard deviation of the data.
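As a small sketch of this step, the snippet below computes both rolling statistics on a synthetic series. The 7-day window is an assumption, since the article does not state the window length.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the confirmed-cases series (made-up values).
dates = pd.date_range("2020-01-30", periods=133, freq="D")
confirmed = pd.Series(np.exp(0.09 * np.arange(133)), index=dates, name="totalconfirmed")

WINDOW = 7  # assumed window length (one week)

rolling_mean = confirmed.rolling(window=WINDOW).mean()
rolling_std = confirmed.rolling(window=WINDOW).std()

# The first WINDOW - 1 entries are NaN because the window is not yet full.
print(rolling_mean.tail(3))
print(rolling_std.tail(3))
```

If both statistics keep climbing over time, as they do here, the series is not stationary.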

Next, we visualize the rolling mean and standard deviation with the original data.

Now, from the graph obtained, we can conclude that the data is not stationary, as neither its mean nor its standard deviation is constant with time. Hence, we cannot fit the raw data directly and make our predictions. Before we move to that, we have one more important thing to do: we need to make the data stationary and then visualize the trend, seasonality, and residual.

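To make the data stationary, we first convert the data to the logarithmic scale, which turns the exponential growth into a roughly linear trend, and then take the shift difference once to remove that trend.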
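A minimal sketch of the transformation, assuming a synthetic series of pure exponential growth: the log turns the curve into a straight line, and a single shift difference (the one the article refers to later when setting d to 1) then leaves a flat series.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the confirmed-cases series (made-up values).
dates = pd.date_range("2020-01-30", periods=133, freq="D")
confirmed = pd.Series(np.exp(0.09 * np.arange(133)), index=dates, name="totalconfirmed")

# Exponential growth becomes a straight line on the log scale.
log_confirmed = np.log(confirmed)

# One shift difference (value minus previous value) removes the linear trend.
log_diff = log_confirmed.diff().dropna()

print(log_diff.describe())
```

Note that the log is only defined for positive values, so a column that starts at zero would need those leading rows handled first.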

Now, to see THE THREE MUSKETEERS of any time series data, i.e., Trend, Seasonality, and Residual, we import another very useful function, seasonal_decompose(), from the module statsmodels.tsa.seasonal.

And now, we visualize.

Once we have visualized all the musketeers, it is about time we fit the data to the ARIMA model.

EXCITED YET? ;p

To begin with our final step, we first plot the AutoCorrelation Function (ACF) and Partial AutoCorrelation Function (PACF). From these graphs, we read off q (from the ACF) and p (from the PACF) as the lag at which each graph first drops to around zero. As for d, it is 1, as we have taken the shift difference only one time.

To plot the ACF and PACF graphs, we import the acf() and pacf() functions from the statsmodels.tsa.stattools module.

Now that we have the p, d, and q values, we can fit the data to the ARIMA model.

Now, we predict. ;)

Now, you can see in the graph that our model follows the shape of the data, even though it might vary a little in magnitude.

It is time to forecast. :D

BOOM! We have forecast the pandemic situation for the next 30 days, with a 95% confidence interval around the forecast. Don’t forget that the predictions are on a logarithmic scale, so they must be exponentiated to read them as case counts.

You can continue working with total deceased and recovered data in the same manner and predict their upcoming values.

The graphs for the Total Deceased Cases and Total Recovered Cases show the forecasts with a 95% confidence interval. The fitted models have RSS (residual sum of squares) values of 0.7194 and 1.6279, respectively.

Conclusion:

The models created above helped us forecast the upcoming pandemic state, with a 95% confidence interval, for confirmed, deceased, and recovered cases, with RSS values of 2.9614, 0.7194, and 1.6279, respectively. That is quite good considering that this is just sample data and not the actual data.

Forecasts like these can give us an idea of how the national situation may change, so that we can prepare better. Even though this pandemic has bound us to our homes, it doesn’t mean we just have to sit and do nothing. We can keep learning and contribute to society in various ways.

STAY HOME, STAY SAFE!

Important Links:

Link to the data used: https://www.kaggle.com/imdevskp/covid19-corona-virus-india-dataset?select=nation_level_daily.csv

Link to the GitHub repository for the Python notebook: https://github.com/NikhilKP631197/ADG--ML-TASKS-2020/tree/master/Blog%201

THANK YOU!!!
