Mastering Forecasting: Unveiling the Power of VAR Modeling for Dynamic Time Series Prediction
1. Introduction to VAR Modeling
Vector Autoregression (VAR) is a pivotal statistical model in econometrics and finance, designed for forecasting systems where multiple time series variables are interdependent. Developed by Christopher Sims in 1980, VAR models differ from single-equation models by capturing the dynamic interplay between several variables.
The essence of VAR is its ability to let the data reveal its own story, making it especially valuable for empirical analysis in economics and finance. It’s used for:
- Forecasting Economic Indicators: Like GDP growth and inflation rates.
- Analyzing Monetary Policies: Understanding central bank actions.
- Studying Financial Markets: Examining the interaction between stock prices and interest rates.
For those looking to delve deeper:
- “Time Series Analysis” by James D. Hamilton provides an extensive introduction to VAR.
- “New Introduction to Multiple Time Series Analysis” by Helmut Lütkepohl covers the theory and application of these models.
- The section on “Vector Autoregressive and Vector Error-Correction Models” in “Econometric Analysis” by William H. Greene offers an accessible overview.
In this post, we’ll explore how VAR can be applied to practical scenarios, such as predicting electricity prices using related time series data.
2. Understanding the Data
Successful VAR modeling begins with a deep understanding of the data at hand. This step is crucial because the quality and characteristics of your data directly influence the accuracy and reliability of your model.
- Multivariate Nature: VAR models thrive on the interdependence of multiple time series. For example, when predicting electricity prices, consider how they might relate to weather conditions and gas prices. Each variable should potentially influence or be influenced by others.
- Stationarity Requirement: A fundamental aspect of VAR modeling is stationarity. Simply put, the statistical properties of your series (like mean and variance) should be constant over time. Non-stationary data can lead to misleading results, making it essential to test and, if necessary, transform your data to achieve stationarity.
- Data Frequency and Consistency: Ensure your time series data are consistently measured over time. Whether it’s daily, monthly, or quarterly data, uniformity in frequency is key to maintaining the integrity of your VAR model.
- Length of the Time Series: The amount of data you have matters. Longer time series can provide more insights but also pose challenges like increased complexity and potential for overfitting. Striking a balance is crucial.
- Data Quality: Assess the data for missing values, outliers, or errors. High-quality, clean data forms the backbone of any reliable econometric model.
- Historical Context: Understand the historical context of your data. Economic and financial time series are often influenced by one-off events, policy changes, or economic cycles. Recognizing these can help in interpreting your model’s output more accurately.
- Preliminary Analysis: Before jumping into VAR modeling, perform exploratory data analysis: visualize your series and look for trends, seasonal patterns, and potential structural breaks (a brief sketch follows below). This step provides valuable insights and guides the subsequent modeling process.
In summary, your data isn’t just a series of numbers; it’s a story waiting to be told. The better you understand this story, the more effectively you can use VAR modeling to uncover the intricate relationships within it.
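To ground the preliminary-analysis step, here is a minimal exploratory sketch. It assumes the combined daily DataFrame total_data that we build in the next section; the 30-day rolling window is an arbitrary illustrative choice.

import matplotlib.pyplot as plt

# Quick visual scan: raw values plus a 30-day rolling mean to make
# trends and seasonal swings easier to spot.
for col in ['electricity_price_per_mwh', 'gas_price_per_mwh', 'temperature']:
    plt.figure(figsize=(12, 3))
    plt.plot(total_data[col], alpha=0.5, label=col)
    plt.plot(total_data[col].rolling(30).mean(), label='30-day rolling mean')
    plt.legend()
    plt.show()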
3. Preparing the Data
Data preparation is a critical step in VAR modeling. For our project on predicting electricity prices using gas prices and historical weather data, we undertook a detailed cleaning and transformation process.
- Data Importation and Simplification: We started by importing datasets for electricity prices, gas prices, and historical weather. Each dataset was simplified to include only relevant variables.
- Consistency in Time Series: We converted the indices of all datasets to datetime objects for uniformity and resampled them to daily frequency. This ensures that our time series data have a consistent daily interval, crucial for accurate VAR analysis.
- Combining Data Sources: The datasets were then combined into a single DataFrame. This process involved:
- Averaging the electricity and gas prices to daily values.
- Selecting weather variables with a coefficient of variation greater than 0.5, ensuring we focus on variables with significant variability.
- Dropping highly correlated weather variables (correlation > 0.8) to avoid multicollinearity.
- Final Dataset: The final dataset includes daily values of electricity and gas prices, alongside key weather indicators such as temperature, rain, snowfall, and various cloud cover measures.
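To make these steps concrete, here is a hedged sketch of the preparation pipeline. The input DataFrame names (electricity_raw, gas_raw, weather_raw) are illustrative assumptions; the filtering thresholds (coefficient of variation > 0.5, correlation > 0.8) are the ones described above.

import numpy as np
import pandas as pd

# Illustrative inputs: three DataFrames with datetime indices (assumed names)
electricity_daily = electricity_raw[['electricity_price_per_mwh']].resample('D').mean()
gas_daily = gas_raw[['gas_price_per_mwh']].resample('D').mean()
weather_daily = weather_raw.resample('D').mean()

# Keep only weather variables with meaningful variability:
# coefficient of variation (std / |mean|) above 0.5.
cv = weather_daily.std() / weather_daily.mean().abs()
weather_daily = weather_daily[cv[cv > 0.5].index]

# Drop one variable from each highly correlated pair (|corr| > 0.8)
# to limit multicollinearity among the weather regressors.
corr = weather_daily.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
weather_daily = weather_daily.drop(columns=to_drop)

# Combine everything into the single daily DataFrame used below.
total_data = pd.concat([electricity_daily, gas_daily, weather_daily],
                       axis=1, join='inner')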
The snippet of the processed data looks like this:
            electricity_price_per_mwh  gas_price_per_mwh  temperature  rain  snowfall  cloudcover_total  cloudcover_mid  cloudcover_high  shortwave_radiation
2021-09-01                      110.47              45.78        13.35   0.0       0.0              24.5            12.5              0.0                 60.5
2021-09-02                      112.63              45.96        12.55   0.0       0.0              47.0            42.5              0.0                 29.0
...
2023-05-28                       44.45              31.10        13.70   0.0       0.0              17.0            12.0              1.0                201.0
2023-05-29                       35.76              32.57        10.85   0.0       0.0              46.5            39.0             26.5                225.0
2023-05-30                       55.55              31.50         8.70   0.0       0.0              35.0            56.0              0.0                 24.0
4. Stationarity Check and Processing
4.1 Stationarity Test
In this section, we delve into the critical process of testing for stationarity in our time series data. Stationarity is a foundational assumption in VAR modeling: it implies that the statistical properties of the series, such as its mean and variance, remain constant over time. Detecting and rectifying non-stationarity is crucial, because non-stationary inputs can lead to unreliable model predictions and interpretations.
To test for stationarity, we employed the Augmented Dickey-Fuller (ADF) test. This test helps us determine whether each time series in our dataset (electricity prices, gas prices, temperature, etc.) contains a unit root, indicating non-stationarity.
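In practice the test is one call per series. Below is a minimal sketch, assuming total_data is the prepared daily DataFrame from the previous section; adfuller is the standard statsmodels implementation.

from statsmodels.tsa.stattools import adfuller

# ADF test on every column; the null hypothesis is that the series has a
# unit root, i.e. is non-stationary. Small p-values reject the null.
for col in total_data.columns:
    adf_stat, p_value, *_ = adfuller(total_data[col].dropna())
    verdict = 'stationary' if p_value < 0.05 else 'non-stationary'
    print(f'{col}: ADF = {adf_stat:.2f}, p-value = {p_value:.3f} -> {verdict}')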
The ADF test results reveal that while some series like electricity prices and certain weather variables are stationary, others like gas prices and temperature are not. Non-stationary series exhibit trends or other characteristics that change over time, making them unsuitable for direct use in VAR modeling. For these series, we’ll need to apply techniques such as differencing or decomposition to remove trends and stabilize their statistical properties. Once transformed into a stationary form, these series can then be retested and potentially included in the VAR model.
4.2 Making the Time Series Stationary
In this section, we address the challenge of transforming our non-stationary time series data into a form that’s more suitable for VAR modeling. We focus on making the series for electricity prices, gas prices, temperature, and shortwave radiation stationary. To achieve this, we employ Holt’s Linear Trend method, which is a powerful tool for smoothing time series data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import Holt

# Decompose each non-stationary series into a smoothed trend estimate
# (Holt's linear trend method) and a residual (actual minus trend).
series_names = ['electricity_price_per_mwh', 'gas_price_per_mwh',
                'temperature', 'shortwave_radiation']
trends, levels, residuals = {}, {}, {}
for name in series_names:
    results = Holt(total_data[name]).fit(smoothing_level=0.1)
    trends[name] = results.fittedvalues      # smoothed trend component
    levels[name] = results.level             # estimated level component
    residuals[name] = total_data[name] - trends[name]

# Named series for the rest of the analysis; renaming the residuals makes
# the column labels in the later VAR output readable.
electricity_price_trend = trends['electricity_price_per_mwh']
electricity_price_residual = residuals['electricity_price_per_mwh'].rename('electricity_price_residual')
gas_price_residual = residuals['gas_price_per_mwh'].rename('gas_price_residual')
temperature_residual = residuals['temperature'].rename('temperature_residual')
radiation_residual = residuals['shortwave_radiation'].rename('radiation_residual')
We choose a smoothing_level of 0.1 to ensure that the level component of our series is smoothed out significantly. This choice is deliberate: it allows us to focus on long-term trends rather than short-term fluctuations. By doing so, we mitigate the impact of short-term volatility and emphasize the underlying trend in each time series.
Furthermore, we calculate the residual of each series as the difference between the actual values and their respective estimated trends. The rationale behind this is twofold: firstly, residuals often exhibit stationary characteristics, making them more appropriate for VAR modeling. Secondly, in future steps, we intend to forecast both the trend and residual components separately and then combine these forecasts to construct the prediction for the main time series. This approach enables us to capture both the underlying trend and the random fluctuations inherent in the time series data.
The figure below shows each time series alongside its decomposed trend and residual components.
5. Determining the Order of VAR
In VAR (Vector Autoregression) modeling, determining the correct order (number of lags) is crucial for the model’s accuracy and efficiency. The order of a VAR model indicates the number of previous time points (lags) used to predict the current value in the series. An appropriately chosen lag order captures the essential dynamics of the system without overfitting the data.
To determine the optimal lag order, we often rely on information criteria such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), or the Hannan-Quinn Information Criterion (HQIC). These criteria balance model complexity (number of lags) with goodness of fit, with lower values indicating a better model.
import pandas as pd
from statsmodels.tsa.api import VAR

# Assemble the stationary inputs: the raw weather series that already
# passed the ADF test, plus the residuals of the decomposed series.
data_for_var = pd.concat(
    [total_data[['rain', 'snowfall', 'cloudcover_total', 'cloudcover_mid', 'cloudcover_high']],
     electricity_price_residual, gas_price_residual,
     temperature_residual, radiation_residual],
    axis=1, join='inner')

# Determine the optimal lag order, considering up to 10 lags
model = VAR(data_for_var)
lag_order_results = model.select_order(maxlags=10)
print(lag_order_results.summary())
 VAR Order Selection (* highlights the minimums)
==================================================
       AIC          BIC          FPE         HQIC
--------------------------------------------------
0      30.26        30.32    1.387e+13       30.29
1      26.99        27.63*   5.258e+11       27.24*
2      26.94*       28.16    5.037e+11*      27.42
3      27.03        28.82    5.502e+11       27.73
4      27.12        29.48    5.997e+11       28.03
5      27.18        30.11    6.381e+11       28.32
6      27.24        30.75    6.797e+11       28.60
7      27.32        31.40    7.409e+11       28.91
8      27.38        32.04    7.869e+11       29.19
9      27.50        32.73    8.898e+11       29.53
10     27.59        33.39    9.763e+11       29.84
--------------------------------------------------
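The selected order per criterion can also be read programmatically: lag_order_results.selected_orders is a small dict keyed by criterion (a statsmodels convenience attribute).

# Orders selected by each criterion, matching the table above,
# e.g. {'aic': 2, 'bic': 1, 'fpe': 2, 'hqic': 1}
print(lag_order_results.selected_orders)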
6. Estimating the VAR Model
After determining the optimal lag order for our VAR model, the next step is to estimate the model itself. This involves fitting it to our time series data with the chosen lag order; here we use two lags, the order favored by AIC and FPE in the selection table above (BIC and HQIC, which penalize extra parameters more heavily, favored one lag). Estimating a VAR model provides us with the coefficients that describe the relationships between each pair of time series in the system. These coefficients are essential for understanding how changes in one variable affect others and for forecasting future values of the time series.
# 'data_for_var' contains the stationary series assembled above
model = VAR(data_for_var)

# Fit with up to 2 lags, letting AIC pick the final order, and include a
# constant plus linear time trend ('ct') in each equation
model_fitted = model.fit(2, ic='aic', trend='ct')
print(model_fitted.summary())
Results for equation electricity_price_residual
================================================================================================
                                  coefficient       std. error           t-stat            prob
------------------------------------------------------------------------------------------------
const                                5.566887         6.297416            0.884           0.377
trend                                0.000486         0.011490            0.042           0.966
L1.rain                             14.412011        40.236325            0.358           0.720
L1.snowfall                        168.651593        73.688386            2.289           0.022
L1.cloudcover_total                 -0.110353         0.092959           -1.187           0.235
L1.cloudcover_mid                    0.063661         0.087542            0.727           0.467
L1.cloudcover_high                  -0.014581         0.063907           -0.228           0.820
L1.electricity_price_residual        0.591264         0.041397           14.283           0.000
L1.gas_price_residual                0.815831         0.301679            2.704           0.007
L1.temperature_residual             -2.201449         0.977021           -2.253           0.024
L1.radiation_residual                0.236744         0.076082            3.112           0.002
L2.rain                            -64.005815        40.195755           -1.592           0.111
L2.snowfall                         -8.597689        74.749992           -0.115           0.908
L2.cloudcover_total                 -0.166000         0.092752           -1.790           0.073
L2.cloudcover_mid                    0.092476         0.087292            1.059           0.289
L2.cloudcover_high                   0.091182         0.063286            1.441           0.150
L2.electricity_price_residual       -0.121211         0.041038           -2.954           0.003
L2.gas_price_residual               -0.178676         0.302302           -0.591           0.554
L2.temperature_residual              0.508086         0.949751            0.535           0.593
L2.radiation_residual               -0.204459         0.076367           -2.677           0.007
================================================================================================
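Beyond reading individual coefficients off the summary, a common way to ask whether one series helps predict another is a Granger causality test, which is available directly on the fitted results. A brief sketch, with the pair of variables chosen purely as an example:

# Test whether the gas price residual Granger-causes the electricity
# price residual; the null hypothesis is that it does not.
causality = model_fitted.test_causality(
    caused='electricity_price_residual',
    causing=['gas_price_residual'],
    kind='f')
print(causality.summary())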
7. Diagnostic Checking
In the diagnostic phase of our VAR model, we used the Durbin-Watson statistic to assess autocorrelation in the residuals. Autocorrelation, especially in time series data, can signal model misspecification and can invalidate some of the assumptions underlying standard statistical tests.
The Durbin-Watson statistic ranges from 0 to 4, where:
- A value around 2 suggests no autocorrelation.
- Values substantially less than 2 indicate positive autocorrelation.
- Values much greater than 2 suggest negative autocorrelation.
Our Results:
from statsmodels.stats.stattools import durbin_watson

# Compute the Durbin-Watson statistic for each equation's residuals
dw_stats = durbin_watson(model_fitted.resid)
for col, stat in zip(data_for_var.columns, dw_stats):
    print(f'{col}: {stat:.3f}')
rain: 2.013
snowfall: 2.004
cloudcover_total: 2.033
cloudcover_mid: 2.016
cloudcover_high: 2.000
electricity_price_residual: 2.002
gas_price_residual: 2.007
temperature_residual: 1.975
radiation_residual: 2.007
All variables show Durbin-Watson statistics very close to 2, indicating that there is minimal to no autocorrelation in the residuals. This is a positive sign, suggesting that our VAR model captures the dynamics of the system well without leaving unexplained patterns in the residuals.
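The Durbin-Watson statistic only detects first-order autocorrelation. As a complementary check, the fitted results object offers a Portmanteau (whiteness) test that examines residual autocorrelation jointly up to a chosen lag; a brief sketch:

# Portmanteau test for residual autocorrelation up to lag 10;
# the null hypothesis is that the residuals are white noise.
whiteness = model_fitted.test_whiteness(nlags=10)
print(whiteness.summary())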
8. Forecasting with VAR
After estimating the model and ensuring it passes diagnostic checks, we can use it to predict future values of the included variables.
import pandas as pd
import matplotlib.pyplot as plt

# Forecast 10 steps ahead. The forecast method needs the last k_ar
# (here, lag order) observations as its starting point.
forecast_steps = 10
lag_order = model_fitted.k_ar
forecast = model_fitted.forecast(data_for_var.values[-lag_order:], steps=forecast_steps)
forecast_index = pd.date_range(start=data_for_var.index[-1] + pd.Timedelta(days=1),
                               periods=forecast_steps, freq='D')
forecast_df = pd.DataFrame(forecast, index=forecast_index, columns=data_for_var.columns)

# Plot the last 50 historical observations of the electricity price
# residual together with the 10-step forecast
plt.figure(figsize=(12, 6))
plt.plot(data_for_var.index[-50:], data_for_var['electricity_price_residual'].tail(50),
         label='Historical', color='blue', marker='o')
plt.plot(forecast_df.index, forecast_df['electricity_price_residual'],
         label='Forecast', color='orange', marker='*')
plt.title('Historical and Predicted Electricity Price Residual')
plt.xlabel('Date')
plt.ylabel('Values')
plt.legend()
plt.show()
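Point forecasts alone hide how quickly uncertainty grows with the horizon. The fitted results can also produce interval forecasts via forecast_interval, the statsmodels method for this; a minimal sketch using the same starting values as above:

# Point forecast plus lower/upper bounds at the 95% confidence level
point, lower, upper = model_fitted.forecast_interval(
    data_for_var.values[-lag_order:], steps=forecast_steps, alpha=0.05)
col = list(data_for_var.columns).index('electricity_price_residual')
bounds = pd.DataFrame({'lower': lower[:, col], 'upper': upper[:, col]},
                      index=forecast_index)
print(bounds)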
9. Combining Components to Make the Primary Time Series Prediction
In our analysis, we've decomposed the electricity price time series into trend and residual components, and we've already forecasted the residual component with the VAR model. To construct a complete forecast of the electricity price, we now turn to predicting the trend component. For the trend, we use a simpler approach: extrapolating the fitted Holt model, which projects the smoothed level and linear trend forward. This works well for capturing the underlying trend over a short horizon. By combining the forecasts of these two components, trend and residual, we can reconstruct a more nuanced and potentially more accurate forecast of the original electricity price series.
# Re-fit the Holt decomposition of the electricity price from section 4.2
electricity_price = total_data['electricity_price_per_mwh']
electricity_price_model = Holt(electricity_price)
electricity_price_results = electricity_price_model.fit(smoothing_level=0.1)
electricity_price_trend = electricity_price_results.fittedvalues
electricity_price_residual = electricity_price - electricity_price_trend
# Forecast the trend component 10 steps ahead by extrapolating the Holt model
forecast_steps = 10
electricity_price_trend_forecast = electricity_price_results.forecast(steps=forecast_steps)

# Align the trend forecast on the same daily date range used for the VAR forecast
electricity_price_trend_forecast_index = pd.date_range(
    start=electricity_price_trend.index[-1] + pd.Timedelta(days=1),
    periods=forecast_steps, freq='D')
electricity_price_trend_forecast_df = pd.DataFrame(
    np.asarray(electricity_price_trend_forecast),
    index=electricity_price_trend_forecast_index,
    columns=['electricity_price_trend_forecast'])
# sum the predicted trend and predicted residual values to get the final forecast
forecast_df['electricity_price_residual_final'] = forecast_df['electricity_price_residual'] + electricity_price_trend_forecast_df['electricity_price_trend_forecast']
# Plotting the final forecast of original electricity_price
plt.figure(figsize=(12, 6))
plt.plot(electricity_price[-100:], label='Historical', color='blue')
plt.plot(forecast_df.index, forecast_df['electricity_price_residual_final'], label='Forecast', color='orange', marker='*')
plt.title('Historical and Predicted Electricity Price')
plt.xlabel('Date')
plt.ylabel('Values')
plt.legend()
plt.show()
10. Conclusion and Further Resources
In this blog post, we’ve journeyed through the comprehensive process of building and interpreting a Vector Autoregression (VAR) model, specifically tailored to forecast electricity prices using related time series data.
Suggested Resource for Further Learning:
- “Forecasting: Principles and Practice (3rd ed)” by Rob J Hyndman and George Athanasopoulos, OTexts. This book is an excellent resource for anyone looking to deepen their understanding of forecasting techniques, including VAR modeling. It offers a blend of theoretical knowledge and practical application, making it suitable for both beginners and experienced practitioners in the field.
Access the Complete Analytical Journey
For the full code of this time series analysis using VAR modeling, check out my Jupyter notebook on GitHub: VAR Modeling Repository.