Forecasting Daily New Confirmed COVID-19 Cases in Maldives — Part 2

Predicting Daily New Cases using ARIMA Models.

This second article will explain how to forecast daily new COVID-19 confirmed cases in Maldives. Check out my first article about forecasting using Simple Exponential Smoothing, Linear Trend Model, and Holt-Winters Smoothing here.

Introduction

The Box-Jenkins technique is a collection of procedures for identifying and estimating time series models of the AutoRegressive Integrated Moving Average (ARIMA) family. ARIMA models rely strongly on the data’s autocorrelation pattern. According to Jie (2021), three items are required to determine appropriate ARIMA models: the time series, the ACF plot, and the PACF plot. After analyzing the ACF and PACF plots and running a unit root test, the next step is to decide whether the time series needs to be differenced. Then, diagnostic checking is carried out to determine whether the fitted model is adequate. The criteria for diagnostic checking are the z-test for coefficient significance, residual analysis, and model selection criteria based on forecast error. Residual analysis will be carried out using the Ljung-Box test: if its p-value is greater than 0.05, it can be concluded that the model is adequate and can be used for forecasting.

Autocorrelation Analysis

Before Differencing

As mentioned in the previous article, the ACF plot dies down extremely slowly, and the autocorrelation remains significant for several lags, which indicates that the series is not stationary. In addition, the PACF plot dies down exponentially with oscillation.
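To make the "slowly decaying ACF" pattern concrete, here is a minimal sketch of the sample autocorrelation in plain Python. The series `trending` is made-up illustrative data, not the Maldives case counts, and `sample_acf` is a hypothetical helper name.

```python
def sample_acf(series, max_lag):
    """Return the sample autocorrelations r_1..r_max_lag of a list of numbers."""
    n = len(series)
    mean = sum(series) / n
    # Sum of squared deviations (lag-0 autocovariance times n).
    denom = sum((x - mean) ** 2 for x in series)
    acf = []
    for k in range(1, max_lag + 1):
        num = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
        acf.append(num / denom)
    return acf


# Made-up upward-trending series (illustration only, not the article's data).
trending = [float(t) for t in range(1, 41)]
acf_values = sample_acf(trending, 5)

# Approximate 95% significance bound drawn on ACF plots.
bound = 1.96 / len(trending) ** 0.5

for lag, r in enumerate(acf_values, start=1):
    print(f"lag {lag}: r = {r:.3f}, significant = {abs(r) > bound}")
```

For a trending series like this one, every lag stays well above the bound and shrinks only gradually, which is the behaviour described above for a non-stationary series.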

In addition, a KPSS test has been carried out on the undifferenced time series. From the results of the KPSS test, it can be concluded that differencing is needed to turn the non-stationary series into a stationary one.
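First differencing replaces each observation with its change from the previous day. A minimal sketch, assuming a made-up list `daily_cases` (not the article's data) and hypothetical helper names:

```python
def difference(series, lag=1):
    """Return the lag-differenced series: series[t] - series[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def undifference(first_value, diffs):
    """Rebuild the original levels from the first value and the differences."""
    levels = [first_value]
    for d in diffs:
        levels.append(levels[-1] + d)
    return levels


# Made-up daily counts (illustration only).
daily_cases = [12, 15, 19, 18, 25, 31, 30, 38]
first_diff = difference(daily_cases)
print(first_diff)  # [3, 4, -1, 7, 6, -1, 8]

# Differencing loses one observation but is fully reversible.
print(undifference(daily_cases[0], first_diff) == daily_cases)  # True
```

The differenced series is one observation shorter than the original, and forecasts made on it have to be cumulated back to the original scale.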

First Differencing

After first differencing, the ACF plot dies down in a sine-wave pattern and cuts off at lag one. The PACF plot, in turn, cuts off at lag two (it dies down with oscillation).

> Kwiatkowski–Phillips–Schmidt–Shin test on First Differencing

* Hypothesis:

• H0: The series is stationary.

• H1: The series is not stationary.

* Criteria:

• If the p-value is < 0.05, reject H0.

• If the value of the test-statistic is greater than the critical value, reject H0.

The p-value is greater than 0.1, which is above 0.05. In addition, the test statistic is smaller than the critical value (0.068 < 0.463). From these results, it can be concluded that the series is now stationary (H0 is not rejected).
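The decision rule for the KPSS test can be written out as a small sketch. The function name `kpss_decision` is hypothetical. The statistic (0.068) and critical value (0.463) are the ones reported above, and the p-value, reported only as "greater than 0.1", is represented here by its lower bound of 0.1.

```python
def kpss_decision(statistic, critical_value, p_value, alpha=0.05):
    """Apply the two KPSS rejection criteria; H0 is 'the series is stationary'."""
    reject_h0 = p_value < alpha or statistic > critical_value
    if reject_h0:
        return "reject H0: the series is not stationary"
    return "fail to reject H0: the series is treated as stationary"


# Values reported for the first-differenced series (p-value given as its lower bound).
print(kpss_decision(statistic=0.068, critical_value=0.463, p_value=0.1))
```

Note that the KPSS null hypothesis is stationarity, so a large p-value supports stationarity; this is the reverse of unit root tests such as the augmented Dickey-Fuller test.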

Building ARIMA Model

Based on the results of the autocorrelation analysis in the previous section, two ARIMA models will be proposed in this section. In the ACF and PACF plots after first differencing, the ACF cuts off at lag one (suggesting MA(1)) and the PACF cuts off at lag two (suggesting AR(2)). Therefore, the first proposed model is ARIMA (2, 1, 1). The second model is selected by the ‘auto.arima’ function, which searches for optimal p, d, and q values from the time-series data.

Using the ‘auto.arima’ function, the selected model is ARIMA (2, 1, 2), that is, p = 2, d = 1, and q = 2.
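Written out, the two candidate models look as follows, where $w_t = y_t - y_{t-1}$ is the first-differenced series and $\varepsilon_t$ is the error term. This assumes no constant (drift) term and the sign convention in which MA terms are added; both are assumptions, since the article does not state them.

```latex
\begin{aligned}
\text{ARIMA}(2,1,1):\quad & w_t = \phi_1 w_{t-1} + \phi_2 w_{t-2} + \varepsilon_t + \theta_1 \varepsilon_{t-1} \\
\text{ARIMA}(2,1,2):\quad & w_t = \phi_1 w_{t-1} + \phi_2 w_{t-2} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}
\end{aligned}
```

Here $\phi_1, \phi_2$ correspond to the coefficients labelled ar1 and ar2, and $\theta_1, \theta_2$ to ma1 and ma2.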

Summary of ARIMA Models

From the two summary figures, it can be concluded that the model generated by the ‘auto.arima’ function has better (lower) AIC, AICc, and BIC values than ARIMA (2, 1, 1).

Note: The lower the AIC, AICc, and BIC values, the better the ARIMA model.
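As a reminder of how these criteria trade off fit against complexity, here is a minimal sketch using their standard textbook definitions. The log-likelihoods, parameter counts, and sample size are invented for illustration; they are not the article's fitted values, and software packages may differ in how they count parameters and observations.

```python
import math


def information_criteria(log_likelihood, k, n):
    """AIC, AICc, and BIC from the log-likelihood, k estimated parameters, n observations."""
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction
    bic = k * math.log(n) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


# Hypothetical numbers: k counts the ARMA coefficients plus the error variance.
model_a = information_criteria(log_likelihood=-500.0, k=4, n=200)  # e.g. an ARIMA(2,1,1)
model_b = information_criteria(log_likelihood=-495.0, k=5, n=200)  # e.g. an ARIMA(2,1,2)

for name in ("AIC", "AICc", "BIC"):
    better = "B" if model_b[name] < model_a[name] else "A"
    print(f"{name}: A = {model_a[name]:.2f}, B = {model_b[name]:.2f}, lower: {better}")
```

In this made-up case the extra parameter improves the log-likelihood enough to outweigh its penalty, so the larger model has the lower value on all three criteria.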

Adequacy of Each ARIMA Model

To test the adequacy of the ARIMA models, the Ljung-Box test will be used. The hypotheses for the Ljung-Box test are as follows.

> Ljung-Box test

* Hypothesis:

• H0: Errors are independent (model is adequate).

• H1: Errors are not independent (model is not adequate).

* Criteria:

• If the p-value is < 0.05, reject H0.

• The histogram of the residuals follows a normal distribution.

• No trend and seasonality in the residuals plot.

• No correlation between residuals in the ACF plot.
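For reference, the Ljung-Box statistic is Q = n(n + 2) Σ r_k² / (n − k), summed over lags k = 1..h, and it is compared against a chi-square distribution. The sketch below computes it in plain Python. The residuals are made up, the function names are hypothetical, and reducing the degrees of freedom by the number of fitted ARMA coefficients is a common convention that is assumed here rather than taken from the article.

```python
import math


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)
    total = 0.0
    for k in range(1, max_lag + 1):
        num = sum((residuals[t] - mean) * (residuals[t - k] - mean) for t in range(k, n))
        r_k = num / denom  # lag-k residual autocorrelation
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


def chi2_sf(x, df):
    """Upper-tail chi-square probability via the incomplete-gamma power series."""
    if x <= 0:
        return 1.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    series_sum = term
    i = 1
    while abs(term) > 1e-15 * abs(series_sum):
        term *= t / (a + i)
        series_sum += term
        i += 1
    lower = series_sum * math.exp(-t + a * math.log(t) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def ljung_box_test(residuals, max_lag, n_fitted=0):
    """Return (Q, p-value); df = max_lag - n_fitted (assumed convention)."""
    df = max_lag - n_fitted
    if df <= 0:
        raise ValueError("max_lag must exceed the number of fitted coefficients")
    q = ljung_box_q(residuals, max_lag)
    return q, chi2_sf(q, df)


# Made-up residuals (illustration only).
resid = [0.4, -1.1, 0.7, 0.2, -0.5, 1.3, -0.8, 0.1, -0.3, 0.9,
         -1.2, 0.6, 0.0, -0.4, 1.0, -0.7, 0.3, -0.2, 0.5, -0.6]
q_stat, p_val = ljung_box_test(resid, max_lag=10, n_fitted=4)
print(f"Q = {q_stat:.3f}, p-value = {p_val:.4f}, adequate = {p_val > 0.05}")
```

A p-value above 0.05 means H0 (independent errors) is not rejected, which is the outcome a well-specified model should produce.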

Based on the residual plot for ARIMA (2, 1, 1), there is no trend or seasonality in the residuals (the mean value is 0). However, there are four significant spikes in the ACF plot of the residuals, so some slight correlation between residuals remains. The histogram of the residuals follows a normal distribution. In the Ljung-Box test, the p-value (0.025) is < 0.05, so H0 is rejected. This implies that the residuals do not follow a white noise process (the model is not adequate).

Based on the residual plot for the auto ARIMA model, there is no trend or seasonality in the residuals (the mean value is 0). There are two significant spikes in the ACF plot of the residuals, so there is very little, almost no, correlation between residuals. The histogram of the residuals follows a normal distribution. In the Ljung-Box test, the p-value (0.5209) is > 0.05, so H0 is not rejected. This implies that the residuals follow a white noise process (the model is adequate).

Significance of Parameter Coefficients

In the ARIMA (2,1,1) model, the coefficients of ar1 and ar2 are significant because the p-value of the z-test is less than 0.05. Meanwhile, the coefficient of ma1 is not significant because the p-value of the z-test is more than 0.05. It is therefore suggested to revise the model or propose a new ARIMA model.

In the auto ARIMA model, the coefficients of ar1, ar2, and ma2 are significant because the p-value of the z-test is less than 0.05, while the coefficient of ma1 is not significant because the p-value of the z-test is more than 0.05. Since ma1 is not significant but ma2 is, it is advised to keep both ma1 and ma2 in the model.
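The z-test used above divides each coefficient estimate by its standard error and converts the result to a two-sided p-value under the standard normal distribution. A minimal sketch follows; the estimates and standard errors are invented for illustration, since the article's fitted values are not given in the text.

```python
import math


def z_test(coefficient, std_error):
    """Return (z, two-sided p-value) for H0: the coefficient equals zero."""
    z = coefficient / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical (estimate, standard error) pairs, not the article's output.
estimates = {"ar1": (0.62, 0.10), "ar2": (-0.35, 0.09), "ma1": (0.08, 0.12)}

for name, (coef, se) in estimates.items():
    z, p = z_test(coef, se)
    print(f"{name}: z = {z:.2f}, p = {p:.4f}, significant = {p < 0.05}")
```

With these made-up numbers, ar1 and ar2 come out significant and ma1 does not, mirroring the pattern reported for ARIMA (2,1,1).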

Forecast Errors

The RMSE and mean error (ME) of the auto ARIMA model are lower than those of ARIMA (2,1,1). However, the MAE of the auto ARIMA model is slightly higher than that of ARIMA (2,1,1). From these results, it can be concluded that the auto ARIMA model, ARIMA (2,1,2), performs better overall than ARIMA (2,1,1).
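The three error measures are simple functions of the forecast errors (actual minus predicted). A minimal sketch with made-up numbers, where `forecast_errors` is a hypothetical helper name:

```python
import math


def forecast_errors(actual, predicted):
    """Return ME, MAE, and RMSE for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    me = sum(errors) / n                             # mean error (bias)
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean squared error
    return {"ME": me, "MAE": mae, "RMSE": rmse}


# Made-up values (illustration only).
actual = [10, 12, 15, 11]
predicted = [9, 14, 15, 10]
print(forecast_errors(actual, predicted))
```

Because positive and negative errors cancel in ME, it is best judged by how close it is to zero rather than by its raw size, whereas RMSE penalizes large individual errors more heavily than MAE does.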

Conclusions

Based on the summary, adequacy, and parameter coefficient significance of the two ARIMA models, it can be concluded that the auto ARIMA model, ARIMA (2,1,2), is better than ARIMA (2,1,1). In the next section, forecasting will be carried out using ARIMA (2,1,2).

Forecast using Best ARIMA Model

It can be seen that the ARIMA (2,1,2) model fits the training data well and produces reasonable forecast results.
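To show what forecasting from an ARIMA (2,1,2) model involves, the sketch below runs the forecast recursion by hand: it predicts the next difference from the last two differences and the last two residuals, then adds it to the previous level. The coefficients, data, and residuals are all invented, and the sketch assumes no drift term and MA terms that enter with a plus sign; in practice the forecasts come from the fitting software.

```python
def forecast_arima_212(y, residuals, phi, theta, horizon):
    """Point forecasts from an ARIMA(2,1,2) without drift.

    y         : observed levels (at least 3 values)
    residuals : in-sample one-step errors (at least 2 values)
    phi       : (phi1, phi2) AR coefficients
    theta     : (theta1, theta2) MA coefficients
    """
    w = [y[-2] - y[-3], y[-1] - y[-2]]  # last two first differences
    e = list(residuals[-2:])            # last two residuals
    level = y[-1]
    forecasts = []
    for _ in range(horizon):
        w_next = (phi[0] * w[-1] + phi[1] * w[-2]
                  + theta[0] * e[-1] + theta[1] * e[-2])
        level += w_next                 # undo the differencing
        forecasts.append(level)
        w.append(w_next)
        e.append(0.0)                   # future shocks have expected value zero
    return forecasts


# Hypothetical inputs (not the article's fitted model or data).
history = [100.0, 104.0, 110.0, 113.0]
past_residuals = [1.5, -2.0]
print(forecast_arima_212(history, past_residuals,
                         phi=(0.5, -0.2), theta=(0.3, 0.1), horizon=3))
```

After two steps the MA terms drop out, because the unknown future errors are replaced by zero, and the forecasts are driven by the AR part alone.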

All code is available here.

References
