Forecasting Hourly Electricity Consumption with ARIMAX, SARIMAX, and LSTM
This article investigates how well advanced forecasting techniques perform on the electricity consumption of a student household in Aarhus. The techniques covered are ARIMA and SARIMA, their regression variants with exogenous variables (ARIMAX and SARIMAX), and recurrent neural networks (RNNs) such as the well-known Long Short-Term Memory (LSTM).
I will walk through an example of forecasting 168 hourly observations into the future, which corresponds to exactly 7 days. The example uses Python with common packages such as pandas, NumPy, seaborn, statsmodels, TensorFlow, and Keras.
Introduction
Since prices continue to rise and the future of the energy situation in Europe is rather uncertain, one might want to investigate one's own household energy consumption to assess current demand patterns and predict future consumption. Energy prices are fluctuating heavily at the moment, so such forecasts could also help a household prepare for rising costs and for the eventuality of a blackout, both of which depend on consumption itself. If personal consumption can be forecasted with reasonable accuracy, households gain a better understanding of how to change their consumption behaviour and become more conscious about the efficient use of energy. An hourly forecast could be used to adjust the usage of energy-intensive appliances, while daily and monthly forecasts could be used to investigate the current trend of energy consumption and spark efforts to reduce future consumption.
While energy consumption in a household is composed of a variety of sources, the primary source is arguably electricity consumption. Data about electricity consumption is furthermore readily available from most electricity suppliers and can be easily retrieved by the average user, either through APIs or bulk downloads from the supplier’s website. Multiple traditional and advanced time series analysis and forecasting algorithms could be applied to this problem. This article focuses on three families of models: the traditional ARIMA(X), the seasonal SARIMA(X), and the Long Short-Term Memory (LSTM) network.
Dataset
The data was retrieved from a Danish electricity provider and the Danish Meteorological Institute (DMI) and processed with the pandas package in Python. Missing values were imputed using the LOCF (Last Observation Carried Forward) method. I have made the final dataset available on Kaggle and it can be downloaded here.
Removing uncorrelated Features
To assess the usability of the given features in the dataset, we are going to create a correlation heatmap after importing the dataset.
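The correlation heatmap is based on the Pearson correlation coefficient, the most commonly used correlation measure in statistics. It ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear relationship. As a rule of thumb, this article treats any value between -0.8 and 0.8 as too weak for the feature to be useful.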
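The screening step described here can be sketched with pandas alone. This is a minimal illustration on synthetic data, not the original notebook code: all column names except kWh are hypothetical, and the 0.8 cut-off is the rule of thumb used in this article rather than a statistical significance test.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; every column name except "kWh"
# is hypothetical and only serves the illustration.
rng = np.random.default_rng(0)
n = 1000
temperature = rng.normal(10, 5, n)
df = pd.DataFrame({
    "kWh": rng.gamma(2.0, 0.1, n),  # target, unrelated to the weather here
    "temperature": temperature,
    "humidity": rng.uniform(40, 100, n),
    # Nearly a duplicate of temperature, so the two correlate strongly.
    "apparent_temperature": temperature + rng.normal(0, 0.5, n),
})

# Pearson correlation matrix; this is the matrix a heatmap visualises.
corr = df.corr(method="pearson")

# Correlation of every feature with the target variable.
target_corr = corr["kWh"].drop("kWh")

# Rule-of-thumb cut-off (an assumption of this sketch, not a formal test).
THRESHOLD = 0.8
weak_features = target_corr[target_corr.abs() < THRESHOLD].index.tolist()

print(target_corr.round(3))
print("Candidates to drop:", weak_features)
```

To draw the matrix, passing `corr` to seaborn's `sns.heatmap(corr, annot=True)` is one common option.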
After inspecting the correlation heatmap in Figure 1, it becomes apparent that the target variable kWh is largely independent of the other features. By the threshold described above, none of the 18 variables in the dataset are strongly correlated with the electricity consumption of the residents. This could partly be explained by the different utilities used during winter and summer. One might assume that electricity consumption in winter would be higher than in summer, since days are shorter and the lack of natural light needs to be compensated for with lamps and other light sources. However, in summer when temperatures are high, the residents are more inclined to use ventilation systems to cool down the room. These ventilation systems are primarily electricity-based and potentially cancel out the winter increase in electricity consumption due to lack of light. This argument relies on the knowledge that the residents use gas-based heating, a different energy source that is not included in the dataset. The decision was therefore made to drop most of the features that had previously been processed and added.
Outliers
When inspecting the distribution of the target variable kWh in Figure 2, it becomes apparent that some outliers are present in the data. kWh over the entire period has a mean of around 0.20. For the given period, 1878 observations have a kWh value of 0.5 or higher, 370 observations have a value of 1.0 or higher, and 7 observations have a value of 2.0 or higher. The extreme outliers (kWh > 2.0) seem to occur rather arbitrarily, while moderate outliers seem to occur more often in the autumn and winter months. None of the outliers were removed, since kWh is the target variable and the data should be as authentic as possible to model the natural behaviour of electricity consumption in the residential home.
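Counts like the ones above can be reproduced with a few lines of pandas. The sketch below uses a synthetic, right-skewed series as a stand-in for the real kWh column, so the printed numbers will not match the figures reported in this article.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for the hourly kWh series (mean about 0.2).
rng = np.random.default_rng(1)
kwh = pd.Series(rng.gamma(shape=2.0, scale=0.1, size=10_000), name="kWh")

# Count the observations at or above each threshold discussed in the text.
thresholds = [0.5, 1.0, 2.0]
counts = {t: int((kwh >= t).sum()) for t in thresholds}

print(f"mean = {kwh.mean():.2f}, std = {kwh.std():.2f}")
for t, c in counts.items():
    print(f"kWh >= {t}: {c} observations ({c / len(kwh):.2%})")
```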
Algorithms and Forecasting Approaches
The following section briefly describes the algorithms used in the analysis part of this article. It is kept short, since most of these methods are widely known and mathematically well understood.
(S)ARIMA(X)
The AutoRegressive Integrated Moving Average (ARIMA) model combines an autoregressive model of order p and a moving-average model of order q with a differencing component d. This approach is also referred to as the Box-Jenkins methodology. The order of the ARIMA model (p,d,q) can be selected automatically with the auto_arima function in the Python package pmdarima, which is based on the auto.arima equivalent in the programming language R.
The Seasonal ARIMA (SARIMA) extends the ARIMA model described above. It adds a seasonal autoregressive component of order P, a seasonal moving-average component of order Q, and a seasonal differencing component D. The seasonal period m is the number of observations per seasonal cycle, for example m = 24 for a daily cycle in hourly data. The full specification is written as (p,d,q)(P,D,Q,m).
A further extension is the SARIMAX, which adds exogenous variables to the SARIMA model, such as meteorological data or engineered features like the hour feature. These terms are not autoregressed on; they simply enter the model as additional regressors, so the exogenous variable at time t influences our series at time t. By the same logic, the exogenous variables must be available for every future time step we want to predict: to estimate kWh at t50, we need the value of the exogenous variable at t50. This poses a problem for features that carry future uncertainty, such as meteorological data, because their future values would themselves have to be estimated through an auxiliary forecast. Since no strong correlation was found between the meteorological features and our target feature, this is not investigated further in this article.
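Calendar-based exogenous features do not suffer from this problem, because their future values are known in advance. As a small sketch, the hour feature for the next 168 hours can be built directly from the timestamps. The end date of the observed data below is hypothetical.

```python
import pandas as pd

HORIZON = 168  # 7 days of hourly observations

# Hypothetical end of the observed data; any hourly timestamp would work.
last_observed = pd.Timestamp("2022-10-31 23:00")

# Timestamps of the future observations we want to forecast.
future_index = pd.date_range(
    start=last_observed + pd.Timedelta(hours=1),
    periods=HORIZON,
    freq=pd.Timedelta(hours=1),
)

# The hour of day is fully determined by the timestamp, so it can be
# supplied to the model without any auxiliary forecast.
future_exog = pd.DataFrame(index=future_index)
future_exog["hour"] = future_exog.index.hour

print(future_exog.head())
```

A weather feature, by contrast, would need a forecast of its own for each of these 168 rows.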
Long Short-Term Memory
A simplified way to understand the anatomy of an LSTM is to view the upper sequence c in Figure 6, the cell state, as a conveyor belt that carries information across many timesteps. At a given timestep t, an input gate decides how much new information from the input i is put on the conveyor belt, while a forget gate decides how much of the existing information is retained and carried on to the next step. The gates use sigmoid functions to squash values into the range between 0 and 1: a value close to 0 means forget, whereas a value close to 1 means retain.
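To make the conveyor belt picture concrete, here is a single LSTM time step written in plain NumPy. It follows the standard LSTM cell equations. The stacked parameter layout (W, U, b) and the random weights are choices made for this sketch, not how Keras stores its weights internally.

```python
import numpy as np

def sigmoid(x):
    """Squash values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W, U and b hold the stacked parameters of the forget (f), input (i),
    candidate (g) and output (o) blocks, in that order.
    """
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b      # pre-activations, shape (4 * n,)
    f = sigmoid(z[0 * n:1 * n])       # forget gate: how much of c_prev to keep
    i = sigmoid(z[1 * n:2 * n])       # input gate: how much new info to add
    g = np.tanh(z[2 * n:3 * n])       # candidate values for the cell state
    o = sigmoid(z[3 * n:4 * n])       # output gate
    c_t = f * c_prev + i * g          # updated cell state (the conveyor belt)
    h_t = o * np.tanh(c_t)            # new hidden state
    return h_t, c_t, f

# Run a short univariate sequence through a cell with 4 hidden units.
rng = np.random.default_rng(0)
n_hidden, n_features = 4, 1
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_features))
U = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for value in [0.2, 0.3, 0.1, 0.8]:
    h, c, forget_gate = lstm_step(np.array([value]), h, c, W, U, b)

print("forget gate at last step:", forget_gate.round(3))
print("cell state:", c.round(3))
```

A forget-gate entry near 0 wipes the corresponding cell-state entry, while an entry near 1 carries it over almost unchanged.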
Model estimation for ARIMAX and SARIMAX
To begin with, we are going to estimate the best model for predicting hourly electricity consumption, given our data. We will use the previously mentioned Python package pmdarima. By default, its auto_arima function selects the model order using the Akaike Information Criterion (AIC).
Please note that I have reduced the time period for this model estimation to 3 months. The reason is that the auto_arima function does not scale well to large, high-frequency datasets.
Test for stationarity
To test whether our target series is stationary, we use the Augmented Dickey-Fuller (ADF) test.
The ADF test returned a p-value of 1.538983e-25 for the hourly kWh series. We can therefore reject the null hypothesis of a unit root and treat the series as stationary.
Forecasting with ARIMAX, SARIMAX, and LSTM
The orders of the ARIMA, SARIMA, and SARIMAX models in the following section were determined by the auto_arima function of the pmdarima package in Python using the Akaike Information Criterion (AIC). The LSTM models are determined by manual tuning of some hyperparameters, and the tuning process is mentioned and commented on where necessary. All results are evaluated using the Root Mean Squared Error (RMSE) and summarised in Table 2 at the end of this section.
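For reference, the RMSE is the square root of the mean squared difference between the observed and the forecasted values. A small NumPy helper (the function name is my own) makes the definition explicit:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between observed and forecasted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are 0.0, 0.3 and -0.4, so the MSE is (0 + 0.09 + 0.16) / 3.
example = rmse([0.2, 0.4, 0.1], [0.2, 0.1, 0.5])
print(round(example, 4))
```

Because the errors are squared before averaging, a few large misses (for example on consumption spikes) weigh more heavily than many small ones.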
Hourly Forecasts with ARIMAX and SARIMAX
The forecasting horizon was set to 168 observations, which represents exactly 7 days into the future. A horizon of 168 was chosen instead of 24 because we are more interested in long-term forecast performance and in insights into energy consumption further ahead in time.
First, we split the dataset into a training set and a test set and create a second data frame for our exogenous variables.
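A chronological split of this kind could look like the sketch below. The data is synthetic, and using the hour of day as the only exogenous column is an assumption made for illustration. The key point is that the last 168 rows are held out without shuffling.

```python
import numpy as np
import pandas as pd

HORIZON = 168  # the last 7 days are held out as the test set

# Synthetic hourly data as a stand-in for the real dataset.
index = pd.date_range("2022-01-01", periods=24 * 30, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(3)
df = pd.DataFrame({"kWh": rng.gamma(2.0, 0.1, len(index))}, index=index)
df["hour"] = df.index.hour  # example exogenous feature

# Time series must be split chronologically, never at random.
train, test = df.iloc[:-HORIZON], df.iloc[-HORIZON:]

# Separate data frames for the exogenous variables.
exog_train, exog_test = train[["hour"]], test[["hour"]]

print(train.index.min(), "->", train.index.max())
print(test.index.min(), "->", test.index.max())
```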
Second, we run the algorithms in the following order:
1. ARIMA(2,0,0)
2. SARIMA(2,0,0)(2,0,0,24)
3. ARIMAX(2,0,0)
4. SARIMAX(2,0,0)(2,0,0,24)
The following code example shows the entire modeling process for ARIMAX and SARIMAX, from running the algorithms and collecting the results to visualizing the forecasting performance.
We get the following results:
ARIMA(2,0,0) MSE Error: 0.069
ARIMA(2,0,0) RMSE Error: 0.262
SARIMA(2,0,0)(2,0,0,24) MSE Error: 0.059
SARIMA(2,0,0)(2,0,0,24) RMSE Error: 0.243
ARIMAX(2,0,0) MSE Error: 0.044
ARIMAX(2,0,0) RMSE Error: 0.209
SARIMAX(2,0,0)(2,0,0,24) MSE Error: 0.044
SARIMAX(2,0,0)(2,0,0,24) RMSE Error: 0.210
We conclude that the best-performing models on this problem so far are the ARIMAX(2,0,0) and the SARIMAX(2,0,0)(2,0,0,24), with an RMSE of 0.209 and 0.210 respectively. It is worth mentioning that the ARIMAX model is considerably less complex and took less time to execute than the SARIMAX.
Hourly Forecasts with Autoregressive univariate LSTM
There are different ways to use deep learning algorithms for time series forecasting. I am going to focus on autoregressive univariate forecasting of the target feature kWh, where every forecasted observation is appended to the sequence and fed into the neural network as new input. Consequently, on longer forecasting horizons, the model makes predictions based on data it has already predicted.
The following code example shows how a simple autoregressive LSTM can be used on these data. No hyperparameter tuning is carried out.
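The autoregressive loop itself is independent of the model behind it. The sketch below shows the windowing and the feed-back mechanism in plain NumPy. The helper names are my own, and a trivial window-mean function stands in for the trained network so that the example runs without TensorFlow.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (samples, window) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

def autoregressive_forecast(predict_one_step, history, window, horizon):
    """Forecast `horizon` steps, feeding each prediction back in as input."""
    buffer = list(np.asarray(history, dtype=float)[-window:])
    forecasts = []
    for _ in range(horizon):
        next_value = float(predict_one_step(np.array(buffer[-window:])))
        forecasts.append(next_value)
        buffer.append(next_value)  # the prediction becomes part of the input
    return np.array(forecasts)

def window_mean(window_values):
    """Stand-in one-step model; a trained LSTM would take its place."""
    return window_values.mean()

# Synthetic series with a daily cycle as a stand-in for kWh.
history = 0.2 + 0.1 * np.sin(2 * np.pi * np.arange(240) / 24)

X, y = make_windows(history, window=24)  # training pairs for a one-step model
forecast = autoregressive_forecast(window_mean, history, window=24, horizon=168)

print(X.shape, y.shape, forecast.shape)
```

With a trained Keras model, `predict_one_step` could wrap something like `model.predict(w.reshape(1, -1, 1), verbose=0)[0, 0]`, assuming the network expects inputs of shape (batch, window, 1). After the first 24 steps, every input value in the window is itself a prediction, which is why errors tend to accumulate over a 168-step horizon.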
The LSTM achieved an RMSE of 0.2713 and therefore did not beat the best model so far, the ARIMAX(2,0,0). This could be because we have spent no time on hyperparameter tuning. It would be interesting to investigate further how tuning an LSTM changes the model performance.
When investigating the model performances in Figure 3 and Figure 4, we quickly realize that none of our forecasts are very good at predicting the hourly electricity consumption of this student household. There could be a multitude of reasons why this is the case.
Firstly, the series to be forecasted is the electricity consumption of one single household and not an aggregate of multiple households in one specific location. It could be argued that the sample size in this case is only one. The reader should therefore be careful not to conclude that there is no relationship between electricity consumption and the weather for all residential households in Aarhus, Denmark. It could very well be the case that an aggregate of more than 5000 households shows significant relationships with the weather.
Secondly, the series showed a considerably stronger dependency on seasonal patterns, which are directly related to daily, weekly, or yearly seasonality. This makes sense, as the residents of the household are working students and therefore do not regularly use any electricity at home from 8 am until 4 pm. This, however, is not always the case. Since both residents are students, their timetables fluctuate a lot, so a distinct seasonal pattern on an hourly as well as a weekly basis might be distorted. This would presumably change when analyzing similar data from a household of adults with regular working hours. This random component could very well distort the relationship with the weather features.
Thirdly, if the residents are on vacation, ill, or otherwise absent, there is only minimal electricity consumption, e.g. for the fridge, which might distort the potential relationship with the weather. If more households had been sampled, the absence of residents in such cases would not stand out as strongly as it does in the current data.
If you have any questions regarding the code or the data used, feel free to reach out!
Thanks for reading! :)