Crash Course in Forecasting Participation: Time Series Analysis

Bowen Jin
Published in AI Skunks · 9 min read · Apr 10, 2023

Authors

Peichen Han, Bowen Jin

Data set and our target

The provided dataset contains information about water bodies in the Tuscany region of Italy. There are four different waterbodies included in the dataset: aquifers, rivers, lakes, and rainfall. For each waterbody, there are various features included, such as water levels, temperature, volume, and rainfall, among others. The data covers a period of several years, depending on the waterbody, and the frequency of the data varies from daily to monthly measurements. The dataset is intended for use in predicting future water levels or other features for each waterbody.

We will be using the Lake Bilancino dataset for our time series forecasting. Specifically, we will be forecasting the ‘Lake_Level’ feature, which represents the depth of the water level in the lake.

Target

We hope readers will learn how to plot time series and how ARIMA-based time series forecasting works.

Access Dataset

import pandas as pd

# load the Lake Bilancino data and work on a copy
Lake_input = pd.read_csv('../content/sample_data/Lake_Bilancino.csv')
Lake = Lake_input.copy()

print(Lake.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6603 entries, 0 to 6602
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 6603 non-null object
1 Rainfall_S_Piero 6026 non-null float64
2 Rainfall_Mangona 6026 non-null float64
3 Rainfall_S_Agata 6026 non-null float64
4 Rainfall_Cavallina 6026 non-null float64
5 Rainfall_Le_Croci 6026 non-null float64
6 Temperature_Le_Croci 6025 non-null float64
7 Lake_Level 6603 non-null float64
8 Flow_Rate 6582 non-null float64
dtypes: float64(8), object(1)
memory usage: 464.4+ KB
None
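Note that `Date` is read in as a plain `object` column. Before plotting, it helps to convert it to a datetime index so pandas treats the frame as a time series. A minimal sketch (the rows below are stand-ins with the same date format as the CSV; in practice use the loaded `Lake` frame, and note that `dayfirst=True` is an assumption about the file's date format):

```python
import pandas as pd

# Stand-in rows mimicking the CSV; replace with the loaded `Lake` frame.
Lake = pd.DataFrame({'Date': ['02/01/2004', '03/01/2004', '04/01/2004'],
                     'Lake_Level': [249.0, 249.1, 249.2]})

# Convert 'Date' (plain strings) to datetime and use it as the index,
# so pandas treats the frame as a time series.
Lake['Date'] = pd.to_datetime(Lake['Date'], dayfirst=True)
ts = Lake.set_index('Date').sort_index()
print(ts.index.min(), ts.index.max())
```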

Time series

For each subplot, the title indicates the corresponding feature name and the y-label indicates the unit of measurement. The plots show the overall trend and seasonality of each feature over time, as well as any irregular patterns or outliers.

The visualization provides insight into the characteristics of each feature, such as the periodicity and seasonality of 'Lake_Level', the sharp decrease in 'Flow_Rate' around 2011, and the increasing trend in 'Rainfall_S_Piero' and 'Rainfall_Mangona' after 2017. This information is useful for understanding the behavior of the dataset and selecting appropriate models for time series forecasting.
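A sketch of how such a grid of per-feature plots can be drawn. The frame `ts` below is a small synthetic stand-in with a datetime index; in practice it would be the loaded dataset:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in frame with a datetime index; replace with the real dataset.
idx = pd.date_range('2004-01-02', periods=200, freq='D')
rng = np.random.default_rng(0)
ts = pd.DataFrame({'Lake_Level': 249 + rng.normal(0, 0.1, 200).cumsum(),
                   'Flow_Rate': rng.uniform(0.5, 3.0, 200)}, index=idx)

# One subplot per numeric feature; the title is the feature name.
features = ts.select_dtypes('number').columns
fig, axes = plt.subplots(len(features), 1, figsize=(12, 3 * len(features)), sharex=True)
for ax, col in zip(axes, features):
    ax.plot(ts.index, ts[col])
    ax.set_title(col)
plt.tight_layout()
plt.show()
```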

Lake Bilancino

ARIMA Model Analysis

The basic idea of the ARIMA model is to regard the data sequence formed by the forecast object over time as a random sequence and use a certain mathematical model to approximate this sequence. This model, once identified, can predict future values from the past and present values of the time series. Modern statistical methods and econometric models have been able to help companies predict the future to some extent.

The basic steps of model building are as follows:

  1. Obtain the time series data of the observed system.
  2. Plot the data and determine whether it is a stationary time series. For a non-stationary series, apply differencing until it becomes stationary; the number of differencing operations is the parameter d. If the series is already stationary, d = 0.
  3. Compute the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the stationary series. Analyze the ACF and PACF plots to choose the autoregressive order p and the moving-average order q.
  4. Using the values of d, p, and q, construct the ARIMA model. Check the model by examining its residual plots to make sure they are consistent with white noise. If not, go back to step 3 and adjust p and q accordingly.

Note that below we use the ADF test, rather than the ACF, to check for stationarity.

Stationary check

What is stationary

When we talk about time series, there are two types of stationarity: strict-sense stationarity and wide-sense stationarity.
A strict-sense stationary time series requires that any two subsequences of the same length satisfy the same joint distribution; consequently, none of the sequence's statistical properties change when it is shifted in time. This is a very strict condition in practice, so we also have the weaker notion of wide-sense stationarity.
Wide-sense stationarity (WSS) only requires that the 1st moment (i.e. the mean) and the autocovariance do not vary with time, and that the 2nd moment is finite at all times. Any strictly stationary process with a finite mean and covariance is also WSS.

As long as a series is stationary, we can use past data to predict the future. For that, we need a way to detect whether a series is stationary: the augmented Dickey–Fuller test.
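Before running a formal test, a quick informal check of wide-sense stationarity is to compare rolling statistics over time: if the rolling mean or standard deviation drifts, the series is likely non-stationary. A sketch on a synthetic stand-in series (in practice this would be a column such as `Lake['Lake_Level']`; the 365-sample window is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Stand-in series; replace with a real column, e.g. Lake['Lake_Level'].
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(250.0, 0.5, 1000))

# Rolling mean and std as an informal WSS check: for a stationary series
# both should stay roughly flat over time.
roll_mean = series.rolling(window=365).mean()
roll_std = series.rolling(window=365).std()
print(roll_mean.dropna().iloc[0], roll_std.dropna().iloc[0])
```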

ADF

In statistics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis that a unit root is present in a time series sample. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity. It is an augmented version of the Dickey–Fuller test for a larger and more complicated set of time series models.

The augmented Dickey–Fuller (ADF) statistic, used in the test, is a negative number. The more negative it is, the stronger the rejection of the hypothesis that there is a unit root at some level of confidence.

In Python, you can use the existing implementation in statsmodels:

from statsmodels.tsa.stattools import adfuller

Take the temperature time series in this dataset as an example.

Lake = Lake.rename(columns={'Temperature_Le_Croci': 'tem', 'Date': 'date'})
# drop missing values first: adfuller cannot handle NaNs
tem_adf = adfuller(Lake['tem'].dropna().values)
tem_adf
(-4.607163579535117,
0.00012529899651088334,
16,
6008,
{'1%': -3.4314388974936767,
'5%': -2.8620211927157615,
'10%': -2.5670261364086677},
24542.31575729264)

The elements of tem_adf are, in order:

  1. the test statistic,
  2. the p-value,
  3. the number of lags used,
  4. the number of observations used,
  5. the critical values at the 1%, 5%, and 10% levels,
  6. icbest (the maximized information criterion).

If the p-value is at or below the significance level of 0.05, we reject the unit-root null hypothesis and consider the data stationary. For the temperature series, the p-value is 0.000125, which is less than 0.05, so it is already stationary. We will nevertheless discuss methods for making a time series stationary.

How to make time-series stationary

In general, there are two approaches to this problem:

  1. Apply a transformation: for example, a logarithm or a square root, to stabilize a non-constant variance.
  2. Difference: subtract the previous value from the current one.
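The first approach, a variance-stabilizing transformation, might look like this. We use `log1p` rather than a plain log because columns such as rainfall contain zeros (the values below are stand-ins for `Lake['Rainfall_S_Piero'].dropna()`):

```python
import numpy as np
import pandas as pd

# Stand-in rainfall values including dry (zero) days.
rain = pd.Series([0.0, 0.2, 5.5, 0.0, 12.3])

# log1p = log(1 + x): it maps 0 to 0, so zero-rainfall days survive the
# transform, whereas a plain log would give -inf.
rain_log = np.log1p(rain)
print(rain_log.tolist())
```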

Differencing

tem_diff = Lake['tem'].dropna().diff(1)
# skip the first value, which is NaN after differencing
tem_diff_adf = adfuller(tem_diff[1:].values)
tem_diff_adf
(-26.28195776196397,
0.0,
15,
6008,
{'1%': -3.4314388974936767,
'5%': -2.8620211927157615,
'10%': -2.5670261364086677},
24558.0834117153)

Our p-value < 0.05, so the differenced series is stationary and we do not need to difference again.

Choose value of p and q

For the classic baseline model, we will use ARIMA. ACF and PACF graphs: after the time series has been made stationary by differencing, the next step in fitting an ARIMA model is to determine whether AR or MA terms are needed to correct any autocorrelation that remains in the differenced series. By examining plots of the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the differenced series, we can tentatively determine the number of AR and/or MA terms required.

Autocorrelation function (ACF): the correlation between observations separated by a given lag. A sharp cutoff in the ACF after lag q suggests the moving-average order q (for example, if q = 3, the MA part of the model uses the three most recent forecast errors). The ACF therefore corresponds to the MA parameter q.

Partial autocorrelation function (PACF): the correlation at a given lag after removing the influence of shorter lags. A sharp cutoff in the PACF after lag p suggests the autoregressive order p (for example, if p = 3, the AR part uses the three previous values of the series). The PACF therefore corresponds to the AR parameter p.

The remaining parameter, d, is the number of differencing operations required to make the series stationary, which we determined in the previous section. Autocorrelation plots also help reveal seasonality.

In the plot below, the autocorrelation keeps decreasing as the lag increases, confirming the absence of a linear relationship between observations separated by large lags.

For an AR process, we expect the ACF to decay gradually while the PACF drops off sharply after p significant lags. For an MA process we expect the opposite: the ACF should drop off sharply after q lags, while the PACF shows a geometric or gradual decline.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(16, 8))
# calculate and plot the ACF and PACF of the differenced series
plot_acf(tem_diff[1:], lags=30, ax=ax[0])
plot_pacf(tem_diff[1:], lags=30, ax=ax[1])
plt.show()

When choosing the values of p and q, in addition to observing ACF and PACF, we can also use AIC and BIC.

The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

The Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

In this article, we use AIC to choose p and q. The following code finds the combination of p and q with the smallest AIC; to limit computation, we cap p at 6 and q at 4.

import statsmodels.api as sm

# 'data' is the stationary temperature series
sm.tsa.arma_order_select_ic(data, max_ar=6, max_ma=4, ic='aic')['aic_min_order']
(2, 3)

In summary, the ARIMA model's parameters p and q were determined using the Akaike information criterion (AIC), a reliable way to assess a model's quality relative to alternatives. Using the arma_order_select_ic function from the statsmodels library, we found that the data is best fit with p = 2 and q = 3. This allowed us to confidently select suitable parameters for our ARIMA model. Since the series is already stationary, d = 0.

Generate model

from statsmodels.tsa.arima.model import ARIMA

# order=(p, d, q): the AIC search gave p=2 and q=3, and the series is stationary, so d=0
model = ARIMA(data, order=(2, 0, 3)).fit()
model.summary()
SARIMAX Results
Dep. Variable: tem No. Observations: 6025
Model: ARIMA(2, 3, 0) Log Likelihood -15619.888
Date: Sat, 08 Apr 2023 AIC 31245.777
Time: 04:44:11 BIC 31265.886
Sample: 01-02-2004 HQIC 31252.759
- 06-30-2020
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 -0.9616 0.010 -95.594 0.000 -0.981 -0.942
ar.L2 -0.4725 0.010 -45.075 0.000 -0.493 -0.452
sigma2 10.4812 0.155 67.677 0.000 10.178 10.785
Ljung-Box (L1) (Q): 226.08 Jarque-Bera (JB): 291.34
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 1.08 Skew: 0.09
Prob(H) (two-sided): 0.08 Kurtosis: 4.06

Predict with model

predictions_ARIMA = pd.Series(model.fittedvalues, copy=True)
predictions_ARIMA
date
2004-01-02 0.000000
2004-01-03 13.000018
2004-01-04 0.291693
2004-01-05 -0.850156
2004-01-06 0.007128
...
2020-06-26 21.889392
2020-06-27 23.011570
2020-06-28 24.564236
2020-06-29 19.835548
2020-06-30 24.023303
Length: 6025, dtype: float64

Result Visualization

plt.figure(figsize=(40, 20))
plt.plot(data,label="data")
plt.xlabel('date',fontsize=12,verticalalignment='top')
plt.ylabel('tem',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()
plt.figure(figsize=(40, 20))
plt.plot(predictions_ARIMA,label="forecast")
plt.xlabel('date',fontsize=12,verticalalignment='top')
plt.ylabel('tem',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()
plt.figure(figsize=(40, 20))
plt.plot(predictions_ARIMA,label="forecast")
plt.plot(data,label="data")
plt.xlabel('date',fontsize=12,verticalalignment='top')
plt.ylabel('tem',fontsize=14,horizontalalignment='center')
plt.legend()
plt.show()

Advantages and Disadvantages of ARIMA, and result analysis

For the Lake Bilancino dataset, an ARIMA model was used to predict the lake level over time. The results show that the model captures the general trend of the original data but falls short of reproducing it accurately. This reflects the simplicity of ARIMA: as a linear model, it has limited ability to fit nonlinear relationships in the data. Even so, it provides useful insight into the trends and patterns in the lake level over time. In summary, the Lake Bilancino dataset is an effective case study for comparing time-series forecasting techniques in a real-world setting.
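The qualitative gap described above can be quantified with simple error metrics. A sketch, assuming `data` and `predictions_ARIMA` are pandas Series on the same index (the short series below are stand-ins):

```python
import numpy as np
import pandas as pd

# Stand-in series; in the article, `data` is the observed series and
# `predictions_ARIMA` holds the fitted values.
data = pd.Series([20.0, 21.0, 22.5, 21.5])
predictions_ARIMA = pd.Series([19.5, 21.2, 22.0, 22.0])

# Align on the common index, then compute RMSE and MAE.
obs, pred = data.align(predictions_ARIMA, join='inner')
err = obs - pred
rmse = float(np.sqrt((err ** 2).mean()))
mae = float(err.abs().mean())
print(f'RMSE={rmse:.3f}  MAE={mae:.3f}')
```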

Reference:

[1] ACEA Water Prediction Competition (data source). (n.d.). Retrieved from https://www.kaggle.com/c/acea-water-prediction/data

[2] Gupta, P. (n.d.). ARIMA Model — Time Series Forecasting (Python). Machine Learning Plus. Retrieved from https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/

[3] Mannocchi, F., Carleo, E., & Capparelli, G. (2020). Water Table Level Prediction Using Machine Learning Techniques: An Application to a Shallow Aquifer in Central Italy. Water, 12(7), 1965. https://doi.org/10.3390/w12071965

[4] Ziadat, F. M., Al-Ansari, N., Abdellatif, M., & Senan, A. (2015). Groundwater Level Prediction Using ARIMA Modeling Technique: A Case Study in Basrah Province. Environmental Earth Sciences, 74(8), 6879–6895. https://doi.org/10.1007/s12665-015-4725-5

[5] Mansoor, M. (2020, July 23). Time Series Forecasting using ARIMA models. Towards Data Science. Retrieved from https://towardsdatascience.com/time-series-forecasting-using-arima-models-6f33eb80a0d6
