Introduction to Time Series: Part 2 Forecasting

Dr. Vipul Joshi
Time Series Using Python
8 min read · Apr 19, 2024

Recap :

Part 1: https://medium.com/time-series-using-python/introduction-to-time-series-analysis-part-1-3ed1ce9cac98

Part 1.1: https://medium.com/time-series-using-python/introduction-to-time-series-analysis-part-1-1-stationarity-0e239c9cc039

In Part 1 of the Introduction to Time-Series, we covered:

  1. Explanation of stationarity and its importance in time series data.
  2. Techniques for checking stationarity, including rolling statistics and the Augmented Dickey-Fuller test.
  3. Visualizations to show changes in the data over time, using rolling means and standard deviations.
  4. Use of autocorrelation and partial autocorrelation functions to analyze time dependencies.
  5. Practical example using elevation gain data to demonstrate how to compute and interpret autocorrelations at different lags.

In this lab, we will attempt to forecast the future. Using our elevation gain example, we will aim to predict the elevation at t + 5, t + 10, and t + 20. We will then compare the predictions with the actual values and examine the accuracy of the model.

The SETUP

Let’s say we have walked for a total of 120 minutes. We will build a model that predicts the elevation gain. To do so, we will take the first 100 minutes as the training data and the next 20 minutes as the testing data, and use them to test out Auto-Regressive models :)

# In the previous part, we took the example of hiking up a mountain and measured the elevation gained per minute.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, acf
from statsmodels.graphics.tsaplots import plot_acf

minutes = np.arange(0, 120)
training_minutes = minutes[0:100]
testing_minutes = minutes[100 : 120]

elevation_gain = 1000 + 3*minutes + 50 * np.sin(2 * np.pi * minutes / 20) + np.random.normal(0, 10, size=120)

# Split data into training and testing sets
training_elevation_gain = elevation_gain[0 : 100]
testing_elevation_gain = elevation_gain[100: 120]

# Create DataFrame from training data for easier manipulation
df_train = pd.DataFrame(index = training_minutes, data = {"Elevation": training_elevation_gain})
df_test = pd.DataFrame(index = testing_minutes, data = {"Elevation" : testing_elevation_gain})
# Plot the elevation gain
plt.figure(figsize=(10, 5))
plt.plot(training_minutes, training_elevation_gain, label='Training Data', marker='o', linestyle='-')
plt.plot(testing_minutes, testing_elevation_gain, label='Testing Data', marker='*', linestyle='-')
plt.title('Elevation Gain as You Hike Up a Mountain')
plt.xlabel('Minutes')
plt.ylabel('Elevation (Feet)')
plt.legend()
plt.show()

Auto-Regressive Models

What exactly do auto-regressive models do in time series?

Auto-regressive (AR) models are a type of time series model that uses the past values of a variable to predict its future values. AR models are based on the assumption that the current value of a variable is linearly related to its past values.

The general form of an AR model is:

Y_t = ϕ_1 Y_{t−1} + ϕ_2 Y_{t−2} + … + ϕ_p Y_{t−p} + ϵ_t

where:

  • Y_t is the current value of the variable
  • Y_{t−1}, Y_{t−2}, …, Y_{t−p} are the past values of the variable
  • ϕ_1, ϕ_2, …, ϕ_p are the model parameters
  • ϵ_t is the error term
  • p is the ORDER of the MODEL (the number of lags)

Note: AutoReg estimates the coefficients for the lags using Ordinary Least Squares (OLS) regression. This method minimizes the sum of the squared differences between observed values and the values predicted by the model.
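To make that concrete, the OLS fit behind AutoReg can be reproduced by hand. The sketch below is illustrative only: it generates a small synthetic AR(1) series (not our hiking data) and shows that AutoReg(lags=1) and a plain least-squares regression of Y_t on [1, Y_{t−1}] recover essentially the same coefficients.

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Illustrative synthetic AR(1) series: y_t = 0.7 * y_{t-1} + noise
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.7 * y[t - 1] + rng.normal(0, 1)

# Fit with AutoReg (one lag, constant included by default)
ar1 = AutoReg(y, lags=1).fit()
print("AutoReg params [const, phi_1]:", ar1.params)

# Same fit by hand: OLS of y_t on [1, y_{t-1}]
X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print("Hand-rolled OLS [const, phi_1]:", beta)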

Example :

An auto-regressive model for elevation gain will look at the past 20 minutes of data to predict the elevation for the next 5, 10, and 20 minutes. The model assumes that the present value of the data has a linear relationship with the past data. Under this assumption, it tries to find the “most optimized” values of the model parameters ϕ_1, ϕ_2, …, ϕ_p.

We will attempt to train a model that uses the blue data points to predict the yellow data points.

BUT BEFORE we try any predictions, we have to make sure that our time series is stationary. In the previous lab, we established that DIFFERENCING makes the time series stationary. Let’s do that then!

# Step 1: Test for stationarity
# ADF Statistic is the output of the Augmented Dickey-Fuller test which tests for stationarity.
# p-value indicates the probability that the series has a unit root and is non-stationary.
# Critical values show the thresholds for the ADF statistic at which the null hypothesis can be rejected.
result = adfuller(df_train["Elevation"])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

The p-value is much greater than the usual 0.05 significance level, and the ADF statistic sits above every critical value. Thus, it is highly likely that our time series is non-stationary. This makes sense, because the average elevation is changing with time.
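In code, that conclusion boils down to a simple decision rule (a minimal sketch, reusing result from the ADF cell above; the 0.05 threshold is just the conventional significance level):

# Decision rule at the conventional 5% significance level
p_value = result[1]
if p_value < 0.05:
    print("Reject the unit-root null: the series looks stationary.")
else:
    print("Fail to reject the unit-root null: the series is likely non-stationary.")

Let’s now difference the time series and test whether the differenced series is stationary.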

# Dictionary to hold the results
import warnings
from statsmodels.tsa.stattools import adfuller, kpss

warnings.filterwarnings('ignore')
results = {}

# List of lags for differencing
lags = np.arange(1, 15)

# Function to perform ADF and KPSS tests on a (differenced) series
def perform_tests(data):
    adf_test = adfuller(data.dropna())
    adf_stat = adf_test[0]
    adf_p = adf_test[1]
    adf_critic = adf_test[4]
    kpss_test = kpss(data.dropna(), regression='c')
    # ADF: reject the unit-root null (stationary) if the statistic is below the 5% critical value
    adf_result = "Stationary" if adf_stat < adf_critic["5%"] else "Non-stationary"
    kpss_stat = kpss_test[0]
    # KPSS: fail to reject the stationarity null if the statistic is below the 5% critical value
    kpss_result = "Stationary" if kpss_stat < kpss_test[3]['5%'] else "Non-stationary"
    return {
        'ADF Statistic': adf_test[0],
        'ADF critic 5%': adf_critic["5%"],
        'KPSS Statistic': kpss_test[0],
        'KPSS p-value': kpss_test[1],
        'ADF_Result': adf_result,
        'KPSS_Result': kpss_result
    }

# Apply differencing at each lag and perform the tests
for lag in lags:
    lagged_data = df_train["Elevation"].diff(lag)
    results[f'Diff_{lag}'] = perform_tests(lagged_data)

# Let's see whether the time series became stationary post-differencing or not
results_df = pd.DataFrame(results).T
results_df

The above results suggest that DIFFERENCING OF ORDER 1 has rendered the time series stationary. Let us now apply a first difference to make our time series stationary and then re-verify the ADF criteria.

# First difference to remove the trend
training_elevation_gain_diff = np.diff(training_elevation_gain, n=1)

# Re-test for stationarity
result_diff = adfuller(training_elevation_gain_diff)
print('ADF Statistic: %f' % result_diff[0])
print('p-value: %f' % result_diff[1])
print('Critical Values:')
for key, value in result_diff[4].items():
    print('\t%s: %.3f' % (key, value))

# Step 2: Determine Lag Order
# Plot the Autocorrelation Function (ACF)
plot_acf(df_train["Elevation"], alpha=0.05)
plt.title('ACF for Training Elevation Gain')
plt.show()

Forecasting with AR model

We have established that differencing with order 1 makes the time series stationary. Further, lag 1 shows high autocorrelation, so we can use it as the lag order for the AR model. This means the model will try to fit the following equation and find the optimal value of ϕ_1:

Y_t = ϕ_1 Y_{t−1} + ϵ_t

from statsmodels.tsa.ar_model import AutoReg

# Assuming a lag order p is chosen based on ACF
chosen_lags = 1 # Example, adjust based on ACF observation

# Fit an AutoReg model on differenced data
model = AutoReg(training_elevation_gain_diff, lags=chosen_lags)
model_fit = model.fit()

# Check model summary
print(model_fit.summary())

# Forecasting future values (20 steps ahead, on the differenced scale) using the model
forecast = model_fit.predict(start=len(training_elevation_gain_diff), end=len(training_elevation_gain_diff) + 19)

The AutoReg model results show four metrics:

  1. AIC
  2. BIC
  3. HQIC
  4. Log Likelihood

If the model fit is good, the log-likelihood should be as high as possible, and AIC, BIC, and HQIC should be as low as possible.
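These quantities can be read straight off the fitted result object (a quick sketch, reusing model_fit from the cell above; the attribute names follow statsmodels’ AutoRegResults):

# Pull the fit metrics directly from the fitted AutoReg result
print("Log Likelihood:", model_fit.llf)
print("AIC :", model_fit.aic)
print("BIC :", model_fit.bic)
print("HQIC:", model_fit.hqic)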

From initial inspection, it seems like the model has not been fit properly. Let’s plot the results and see.

Remember that we differenced the data, so we will have to reintegrate the forecast back onto the original scale.

forecast
# Output :
array([3.77644964, 2.87244836, 2.82853232, 2.82639889, 2.82629525,
2.82629022, 2.82628997, 2.82628996, 2.82628996, 2.82628996,
2.82628996, 2.82628996, 2.82628996, 2.82628996, 2.82628996,
2.82628996, 2.82628996, 2.82628996, 2.82628996, 2.82628996])
# Reintegrating the forecast:
# Start with the last known value from the original non-differenced data
last_original_value = training_elevation_gain[-1]

# Reintegrate the forecasted differences
forecast_values = [last_original_value + sum(forecast[:i+1]) for i in range(len(forecast))]

# Plotting the original data and the forecasted values
plt.figure(figsize=(10, 5))
plt.plot(training_minutes, training_elevation_gain, label='Training Data')
plt.plot(testing_minutes, testing_elevation_gain, label = "Testing Elevation Gain")
plt.plot(testing_minutes, forecast_values, label='Forecasted Data', color='red')
plt.title('Original Data and Forecasted Data')
plt.xlabel('Minutes')
plt.ylabel('Elevation Gain')
plt.legend()
plt.show()

From the above results, it seems the model has been able to predict only the linear component of the time series and has done extremely poorly on the seasonal component.
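To put a number on that, a simple check is the root-mean-squared error between the reintegrated forecast and the held-out test data (a small sketch, reusing forecast_values and testing_elevation_gain from above):

# RMSE between the AR(1) forecast and the 20 minutes of held-out test data
rmse = np.sqrt(np.mean((np.array(forecast_values) - testing_elevation_gain) ** 2))
print('RMSE of the AR(1) forecast over the test window: %.2f feet' % rmse)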

Let’s now search over a range of lag orders to find the best fit possible.


# Initialize variables to store the best results
best_aic = np.inf
best_lag = None
best_model = None

# Loop over possible lags from 1 to 20
for lag in range(1, 21):
    model = AutoReg(training_elevation_gain_diff, lags=lag)
    model_fit = model.fit()

    # Keep the model with the lowest AIC found so far
    if model_fit.aic < best_aic:
        best_aic = model_fit.aic
        best_lag = lag
        best_model = model_fit

# Print the best model summary
if best_model is not None:
    print(f"Best Lag: {best_lag}")
    print(f"Best AIC: {best_aic}")
    print(best_model.summary())
else:
    print("No suitable model was found.")

The above search over different lags shows that the best lag for our case is 20. Let’s take a few lag orders and see the results

# Fit an AR model at a given lag on the differenced data, forecast 20 steps,
# reintegrate, and plot the result against the test data
def model_fit_plot(training_elevation_gain_diff, lag):
    model = AutoReg(training_elevation_gain_diff, lags=lag)
    model_fit = model.fit()
    forecast = model_fit.predict(start=len(training_elevation_gain_diff), end=len(training_elevation_gain_diff) + 19)

    # Reintegrating the forecast:
    # Start with the last known value from the original non-differenced data
    last_original_value = training_elevation_gain[-1]

    # Reintegrate the forecasted differences
    forecast_values = [last_original_value + sum(forecast[:i+1]) for i in range(len(forecast))]

    # Plotting the original data and the forecasted values
    plt.figure(figsize=(10, 5))
    plt.plot(training_minutes, training_elevation_gain, label='Training Data')
    plt.plot(testing_minutes, testing_elevation_gain, label="Testing Elevation Gain")
    plt.plot(testing_minutes, forecast_values, label='Forecasted Data at Lag Order = ' + str(lag), color='red')
    plt.title('Original Data and Forecasted Data')
    plt.xlabel('Minutes')
    plt.ylabel('Elevation Gain')
    plt.legend()
    plt.show()

model_fit_plot(training_elevation_gain_diff, 5)
model_fit_plot(training_elevation_gain_diff, 10)

You can see that the model is able to forecast with high accuracy at a lag order of 10. Why is that the case?

  1. Strong Autocorrelation: The data likely has strong autocorrelation (around 0.5) at lag 10, meaning that values in the series are highly dependent on their values 10 periods ago. This strong dependency is well captured by the AR model with lag 10 (a quick check follows this list).
  2. Underlying Patterns: The time series has a seasonal pattern (peak to peak in 20 minutes) that repeats every 10 periods or its multiples. The model has captured this pattern thanks to the higher-order lag terms.
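The autocorrelation claim in point 1 can be checked directly on the differenced training series (a quick sketch using the acf function imported earlier; the exact values depend on the random noise in the simulated data):

# Autocorrelation of the differenced training series at the seasonal lags
acf_vals = acf(training_elevation_gain_diff, nlags=20)
print('ACF at lag 10: %.2f' % acf_vals[10])
print('ACF at lag 20: %.2f' % acf_vals[20])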

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

I will be posting more of these on my LinkedIn page: https://www.linkedin.com/in/vipul-joshi-4b249155/

https://colab.research.google.com/drive/1aH-miv8Lg59K0KBdknARK7oN64WCOuGV?usp=sharing#scrollTo=kP0a3LQ-riB5
