Forecasting Energy Demand Using a Long Short-Term Memory Network

May 2, 2022

An application of an LSTM network to a real-world time series problem

I previously interned at a company that works on projects across the energy sector, where I developed an interest in energy demand (“load”) forecasting and its significance to various players in the energy market.

For producers and grid operators, load forecasting is used to help operate bulk electric systems and plan for updates. Others use these predictions to decide on investment in new plants or equipment. In addition, load forecasting has become popular in more business-oriented markets such as financial planning and energy trading.

Short-Term Load Forecasting (STLF) refers to energy forecasts within an interval of an hour to a week. I thought it would be an interesting challenge to focus on STLF, specifically to see if I could create a model to predict the next hour’s load given inputs from previous hours. I did so by creating a Long Short-Term Memory (LSTM) network, a type of neural network structure that is known to work well for sequential data.

In this article, I discuss this project from beginning to end. I touch on preprocessing, baseline model results, and the improved performance of my final LSTM network.

Problem and Approach

The datasets I worked with combined publicly available information on weather and load for regions covered by ISO New England, the corporation responsible for distributing energy across the 6 New England states. I used hourly data from October 2018 to present, which at the time of this project constituted 3 years of data.

Since the regions controlled by ISO-NE were likely to have different energy demands due to each area’s specific geographical attributes, I decided to simplify the problem by homing in on only one of the 8 regions. I selected the Connecticut ISO zone.

The challenge at hand was to see if I could accurately forecast one-hour-ahead load for the Connecticut ISO zone given past values of the features I had available. I defined success as outperforming a baseline model, that is, achieving a lower error on the test set.

Difficulties in Energy Forecasting

Short-term load forecasting is a notoriously difficult task.

It is a nonlinear problem. Commonly used features such as temperature and humidity are not linearly related to the next hour’s load. This means many basic models are not equipped to produce accurate predictions.

Average hourly load and temperature show nonlinear relationships with next hour’s load

STLF additionally suffers from seasonality: the next hour’s load is dependent not only on the previous hour’s load, but also on the load at the same hour on the previous day, the same day in the previous week, and so on.

The energy sector even has a name for these regular fluctuations in demand. “On-peak” hours refer to those when demand levels are highest (typically between 7 am and 10 pm on weekdays), whereas demand is lower during “off-peak” times. These patterns are found seasonally, too, as demand in the summer/winter months is generally higher than in fall/spring.

Hourly load fluctuates daily within a range of ~1500 megawatts

Finally, it is difficult to target periods of rapid fluctuation. Most standard statistical models do not adapt well to rapid changes in system load, which is problematic as these are times when accurate forecasts are especially needed.

The graph above shows the hourly load for the Connecticut ISO Zone on a randomly-chosen day. Notice how between the hours of 8:00 AM-12:00 PM the demand for energy follows a rapidly-changing, irregular path within a range of over 100 megawatts.

Neural networks can overcome these challenges due to their highly flexible structure. In particular, Long Short-Term Memory (LSTM) networks are widely known for their application in time series problems. These networks can handle nonlinear data, overcome seasonality effects, and do well in periods of high fluctuation. They also easily allow for the input of multiple features of any type (as opposed to other common time series models).

Preprocessing

After merging the data for average hourly temperature and load, I created my target variable — the load in the next hour.
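As a minimal sketch of how a next-hour target can be built with Pandas: the column names LoadMW, TempF, and LoadMW_Plus1 follow the article, but the values and the tiny frame below are made up for illustration, and this is not necessarily how the original project did it.

```python
import pandas as pd

# Hypothetical hourly frame; values are invented for illustration.
df = pd.DataFrame(
    {"LoadMW": [3000.0, 3100.0, 3250.0, 3200.0], "TempF": [50.0, 51.0, 53.0, 52.0]},
    index=pd.date_range("2021-01-01", periods=4, freq=pd.Timedelta(hours=1)),
)

# The target for each hour is the load observed one hour later.
df["LoadMW_Plus1"] = df["LoadMW"].shift(-1)

# The last hour has no "next hour" yet, so it is dropped.
df = df.dropna(subset=["LoadMW_Plus1"])
print(df)
```

Shifting by -1 pulls each following row's load up by one position, so every remaining row pairs the current hour's features with the next hour's load.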

My next step involved creating new features that I believed might prove helpful in my model. To try to capture the seasonality effects, I generated one-hot-encoded variables for whether the hour fell on a holiday and for which season it was in. I then decided to incorporate more short-term regular trends by cyclically encoding both the hour and day of the week of each sample.

Cyclical encoding involves scaling a numerically encoded feature onto the 0–2π cycle and then taking its sine and cosine. This is often preferable to a single feature ranging from 1 to 24, because a neural network has no way of knowing that the hours ending in 1 and 24 are one hour apart, not 23. Both the sine and cosine representations are necessary: with only one of them, pairs of different hours in the same day share the same value, and the model cannot tell those times apart.
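The idea can be sketched in a few lines of NumPy. The helper names `encode_cyclical` and `circle_distance` are hypothetical and chosen for this example, which assumes hours are labeled 1 through 24 as in the text.

```python
import numpy as np

def encode_cyclical(values, period):
    """Map values in 1..period onto the unit circle; returns (sin, cos)."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours = np.arange(1, 25)  # hour ending 1..24
sin_hour, cos_hour = encode_cyclical(hours, period=24)

def circle_distance(i, j):
    """Distance between two encoded hours (i and j are hour-ending labels)."""
    return np.hypot(sin_hour[i - 1] - sin_hour[j - 1],
                    cos_hour[i - 1] - cos_hour[j - 1])

# Hours 24 and 1 end up exactly as close as hours 1 and 2.
print(circle_distance(24, 1), circle_distance(1, 2))

# Sine alone is ambiguous: hours 1 and 11 share a sine value,
# and only the cosine tells them apart.
print(sin_hour[0], sin_hour[10], cos_hour[0], cos_hour[10])
```

The same function works for the day of the week by passing `period=7`.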

Finally, I created a one-hot-encoded feature for whether or not the hour was determined to be “on-peak.”

My final dataset was as follows:

Feature values for randomly-chosen hours in the proposed dataset

Train-Test Split

To quantify error, I calculated both the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). The MAE gives a sense of the typical size of an error, while the MSE penalizes large errors more heavily. To compute them, I split my data set into training and test sets.

The test set consisted of one week randomly chosen from each month in a calendar year. Weeks were taken from the most recent occurrences of each month in my data set.
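One possible reading of this split is sketched below. It assumes the data ends on a month boundary, that each test week is a contiguous 7-day block lying entirely inside its month, and that everything else is used for training. The function name `sample_test_weeks` and the two-year index are hypothetical, not taken from the original code.

```python
import numpy as np
import pandas as pd

def sample_test_weeks(index, seed=0):
    """Boolean mask selecting one random 7-day block per month
    for the 12 most recent months in a DatetimeIndex."""
    rng = np.random.default_rng(seed)
    months = index.to_period("M").unique().sort_values()[-12:]
    mask = np.zeros(len(index), dtype=bool)
    for month in months:
        # Latest start offset that keeps the whole week inside the month.
        max_offset = month.days_in_month - 7
        offset = int(rng.integers(0, max_offset + 1))
        start = month.start_time + pd.Timedelta(days=offset)
        end = start + pd.Timedelta(days=7)
        mask |= np.asarray((index >= start) & (index < end))
    return mask

# Hypothetical two-year hourly index standing in for the real data.
hours = pd.date_range("2020-01-01", "2021-12-31 23:00", freq=pd.Timedelta(hours=1))
is_test = sample_test_weeks(hours)
train_hours, test_hours = hours[~is_test], hours[is_test]
print(len(train_hours), len(test_hours))
```

Sampling a week from every month means the test set covers all seasons, which matters for a target with strong seasonal patterns.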

Variable Selection

Deciding what features to include in one’s final model can be a complicated task. I used two tools to get a sense of which features would be most influential in determining the next hour’s load.

Tree-based algorithms have a nice quality in that they allow you to generate measures of feature importance. Feature importance assigns a score to each feature in a predictive model based on its relative significance in generating predictions. I calculated feature importances for all considered features using both a random forest and boosted tree model. For the boosted tree I implemented XGBoost, a popular boosting algorithm.

Feature importance calculated by tree-based models
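A small illustration of reading feature importances from a random forest in Scikit-Learn follows. The data is synthetic and deliberately built so that LoadMW dominates. The Noise column and all coefficients are assumptions made for the example, and XGBoost is left out to keep the sketch dependent on Scikit-Learn only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "LoadMW": rng.normal(3000, 400, n),
    "TempF": rng.normal(55, 15, n),
    "Noise": rng.normal(0, 1, n),  # irrelevant feature, for contrast
})
# Synthetic target: mostly the current load plus a small temperature effect.
y = 0.95 * X["LoadMW"] + 2.0 * X["TempF"] + rng.normal(0, 20, n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

When one feature dominates like this, the scores of the others are squeezed toward zero, which is one reason to refit without the dominant feature and compare the rest.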

I ran these models on my training set. Both models indicated that LoadMW was the most influential feature, which is not surprising given the existing literature on energy forecasting. They also indicated slight importance in TempF and the two features that encoded the hour of the day (cos_HourE and sin_HourE). These importances were very small compared to LoadMW, so I decided to rerun the models without this dominant feature.

Feature importance calculated by tree-based models after removing LoadMW

After removing LoadMW, the random forest and boosted tree identified OnPeak to be important. The random forest also found TempF and the hour-encoded features to be influential. The boosted tree attributed more importance to OnPeak than TempF but still found the latter to be second-most important. This model also emphasized the importance of the hour-encoded features but surprisingly gave comparable importance to all of the seasonal features except for Summer.

At this point, I was convinced that it may be worth including the features for load, temperature, hour, and on-peak/off-peak. The encoded features for the day of the week and whether or not it was a holiday showed minimal benefit, so I was doubtful that they would help. I was unsure about humidity and the seasonal variables given their significance in some models but not in others. As an additional check I calculated the correlation of each feature with the target, LoadMW_Plus1.

Correlation of considered features with target
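A correlation check of this kind can be sketched with Pandas as follows. The data is synthetic, with a Holiday flag generated independently of the load, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "LoadMW": rng.normal(3000, 400, n),
    "Holiday": rng.integers(0, 2, n),  # random flag, unrelated to load
})
df["LoadMW_Plus1"] = df["LoadMW"] + rng.normal(0, 50, n)

# Pearson correlation of every feature with the target.
corr = df.drop(columns="LoadMW_Plus1").corrwith(df["LoadMW_Plus1"])

# Rank by absolute value, since strong negative correlations matter too.
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear association, so a low score here does not rule out a nonlinear relationship with the target.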

As expected, cos_DOW, sin_DOW, and Holiday were barely correlated with LoadMW_Plus1. Summer and Spring showed some promise but Fall and Winter had a weak correlation. I dropped the first three features and created a new one-hot-encoded variable to replace the seasons — this was defined as 1 if the hour was in a warm month (spring/summer) and 0 for cool months (fall/winter).

Correlation of considered features with targets after condensing seasonal variables

After I reran the correlation with LoadMW_Plus1, this new variable dropped to the bottom of the list, so I discarded it. I therefore decided to consider the remaining features for modeling: LoadMW, OnPeak, sin_HourE, cos_HourE, TempF, and Humidity.

In the end, my network performed best when using all six of these features.

Baseline Model

I like to create a simple baseline model against which to compare more flexible models when one is not readily available. More complexity does not always lead to better results. Using a baseline model is a good way to tell if you are benefiting from using a neural network, or if all that time spent running epochs was a waste of effort and computational power.

I chose to keep things simple by fitting a multiple linear regression with the same features chosen for my final network.

I trained and tested my model using the scheme described above. The test resulted in an MAE of ~82 MW and an MSE of ~11,364 MW². This means that the multiple linear regression was off by around 82 megawatts per hourly forecast on average, with a mean squared error of around 11,364 megawatts².

Note that this means the RMSE is about 106 megawatts. Since the RMSE gives a relatively high weight to larger errors compared to the MAE, this suggests that the model showed specific instances of very large error, which is unsurprising given the difficulty of modeling times of rapid fluctuation. These results were not ideal. I moved on to the LSTM framework to see if that model could perform better.
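To make the relationship between these metrics concrete, here is a short NumPy sketch. The `error_metrics` helper and the toy forecasts are hypothetical; only the 11,364 figure comes from the results above.

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """Return MAE, MSE, and RMSE for two equal-length sequences."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    mse = np.mean(err ** 2)
    return {"MAE": np.mean(np.abs(err)), "MSE": mse, "RMSE": np.sqrt(mse)}

actual = np.array([3000.0, 3100.0, 3200.0, 3300.0])

# Every forecast off by 50 MW: MAE and RMSE agree.
steady = error_metrics(actual, actual + 50)

# Same MAE, but concentrated in one 200 MW miss: RMSE doubles.
spiky = error_metrics(actual, actual + [0, 0, 0, 200])

print(steady)
print(spiky)

# RMSE implied by the baseline's reported MSE.
baseline_rmse = np.sqrt(11364)
print(round(baseline_rmse, 1))
```

The RMSE can never be smaller than the MAE, so it is the size of the gap between the two, not its existence, that hints at occasional large misses.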

Long Short-Term Memory Network

LSTM models learn from a series of past observations of features to predict the next value of the target. This sequence has to be transformed into multiple input-output samples before the model can use it, which I did by adapting a function from Machine Learning Mastery.

After scaling the necessary features and my target, I fed the training and test sets into said function. Multiple model runs showed that using the 9 most recent hours provided a nice balance of accurate results without incurring too high a computational cost.

Input and output of the first training sample. Each list in the input corresponds to feature values from a previous timestep
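The windowing step can be sketched as below. This is an illustrative version written for this article's setup, not the exact function from the referenced site. The name `make_windows` is hypothetical, and it assumes the target array is already aligned so that row t holds the next hour's load.

```python
import numpy as np

def make_windows(features, target, n_steps):
    """Build LSTM samples from aligned series.

    features: array of shape (n_hours, n_features)
    target:   array of shape (n_hours,), where target[t] is the value
              to predict from data up to and including hour t
    Returns X of shape (n_samples, n_steps, n_features) and y of
    shape (n_samples,).
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    X, y = [], []
    for end in range(n_steps, len(features) + 1):
        X.append(features[end - n_steps:end])  # the n_steps most recent hours
        y.append(target[end - 1])              # target aligned with the last hour
    return np.array(X), np.array(y)

# Toy data: 12 hours, 6 features, windows of 9 hours as in the article.
features = np.arange(72).reshape(12, 6)
target = np.arange(12) + 100
X, y = make_windows(features, target, n_steps=9)
print(X.shape, y.shape)
```

Each sample overlaps the previous one by all but one hour, so a series of n hours yields n - n_steps + 1 samples.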

As with any neural network, there were many hyperparameters to consider when designing the architecture. I created a wide variety of models with different numbers of layers and nodes, activation functions, optimizers, batch sizes, and numbers of epochs, and kept track of the results. I additionally experimented with dropout as a regularization method.

The model that performed the best included two LSTM layers of 100 nodes each, ReLU activation functions, the Adam optimizer, and 50 epochs with a batch size of 32 training samples. Using a regularization method actually worsened the performance of the network, so no dropout was included.
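A network with this shape could be defined in Keras roughly as follows. The single-unit Dense output layer and the MSE loss are assumptions, since the text does not state them, and the training call is left as a comment because it needs the windowed data.

```python
from keras import Input
from keras.layers import LSTM, Dense
from keras.models import Sequential

N_STEPS, N_FEATURES = 9, 6  # nine past hours, six features (from the article)

model = Sequential([
    Input(shape=(N_STEPS, N_FEATURES)),
    # First LSTM layer returns the full sequence so a second one can follow.
    LSTM(100, activation="relu", return_sequences=True),
    # Second LSTM layer returns only its final output.
    LSTM(100, activation="relu"),
    # Assumed output layer: one value, the next hour's (scaled) load.
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")  # loss is an assumption
model.summary()

# Training would then look something like:
# model.fit(X_train, y_train, epochs=50, batch_size=32,
#           validation_data=(X_test, y_test))
```

Setting `return_sequences=True` on the first layer is what allows the two LSTM layers to be stacked, since the second layer expects a sequence as input.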

Epoch versus loss function for the training and test set

The network was able to reach convergence in the training data within 10–20 epochs and required additional epochs for the test set. The overall runtime was around 7 minutes, which is not bad for a rather complex model.

The test resulted in an MAE of ~23 MW and an MSE of ~1,076 MW². This means that the LSTM model was off by around 23 megawatts per hourly forecast on average, with a mean squared error of around 1,076 megawatts². The RMSE is around 33 MW, still noticeably larger than the MAE, which suggests that the LSTM also struggles with those times of rapid fluctuation.

The charts above show the predictions from the baseline Multiple Linear Regression model and those of the LSTM network. It is easy to see the improvement provided by the LSTM model. In fact, this improvement is rather significant: a 72% decrease in MAE and a 90.5% decrease in MSE.
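The percentages follow directly from the reported test errors, as this short check shows. The helper name `percent_decrease` is just for illustration.

```python
def percent_decrease(baseline, improved):
    """Relative reduction from baseline to improved, in percent."""
    return 100 * (baseline - improved) / baseline

mae_drop = percent_decrease(82, 23)        # MAE in MW
mse_drop = percent_decrease(11364, 1076)   # MSE in MW²

print(f"MAE decrease: {mae_drop:.1f}%")
print(f"MSE decrease: {mse_drop:.1f}%")
```

The MSE falls by a larger share than the MAE because squaring the errors amplifies the effect of shrinking them.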

Conclusion

In this article, I discussed my implementation of a Long Short-Term Memory network to forecast one-hour-ahead load for the Connecticut ISO zone. I achieved the goal I had set, which was to outperform a chosen baseline model.

In the future, I would love to try forecasting load over a longer horizon. Forecasting the next hour’s energy demand is useful, but in practice may be difficult to deploy in a timely manner. Something like a 6-hour-ahead forecast, or “load profile” as it’s often called, poses a more difficult challenge but would yield more useful results.

Thank you all for reading! I hope you enjoyed seeing this real-world application of an LSTM model.

The Python libraries used include Pandas, Matplotlib, Scikit-Learn, and Keras. For more information and to see the code, check out my GitHub repository.