Using Machine Learning to predict US electricity demand: a comparison of two approaches

Tehseen Dahya
17 min read · Dec 4, 2023


Comparing XGBoost and Prophet as methods to predict electricity demand in the United States.

Smart grids have the potential to address a major supply-demand imbalance on our energy grid.

Electricity smart grids rely on a robust algorithm for dynamic electricity demand forecasting. However, electricity demand (and energy as a whole) can be difficult to predict on a local basis because needs vary from one locality to the next.

Given the push toward energy-intensive data centers to support the exponential growth of AI hardware, this process of forecasting energy demand will be crucial to allocating energy effectively around our power grids.

“There’s a well-publicized arms race happening in AI, and the major tech companies are expected to invest $1 trillion over the next five years in this area, mostly to data centers.” — Jonathan Gray, Blackstone’s president and chief operating officer

And these new data centers quite literally eat electricity. For example:

Amazon operates, or is in the process of building or planning, 102 data centers in northern Virginia. Together, the facilities, when they are all up and running, will have emergency generators capable of producing more than 4.6 gigawatts of power. That’s almost enough backup electrical capacity to light up all of New York City on an average day.

This anticipated growth makes it clear that we need to invest in

  1. Clean energy production methods
  2. Methods to optimize our energy grid.

Smart grids are one way to optimize the energy allocation of our grid.

With this project, I am going to work with two robust machine learning methods to forecast electricity demand from past usage history. This project will provide me with a fundamental understanding of how we can use emerging technologies to further optimize our power grids.

This project is part of a larger project I am working on to optimize our global Internet Network.

Over 3 BILLION people worldwide lack access to the basic necessity that is the Internet, and it seems the only way to expand global Internet access is to spend more on infrastructure. While expanding our network of fiber-optic or coaxial cables and cell towers is the most impactful way to expand access, it is incredibly expensive and relies on governments spending billions on new infrastructure. In rural areas where there is little ROI for this expensive infrastructure, governments lack an incentive to roll out more cables, which disproportionately impacts less wealthy regions.

I am working on a way to optimize the current infrastructure we have with a new way to reduce Internet network traffic, thus increasing speeds in rural areas.

Through an Internet demand prediction model, I hope to use the estimates of future demand in rural areas to contribute to the network shaping and routing table algorithms used to send data across our network. By anticipating traffic patterns, data packets can avoid traffic and travel to their destination much quicker, lowering network latency. I’d be looking to integrate this demand forecasting model into something similar to Dijkstra’s algorithm.

I essentially want to apply a “smart grid” approach to Internet bandwidth routing.

For my deep dive into the computer networking and machine learning topics behind this project, check out my article here

If you are interested in following my journey as I build this project, check out my socials at the bottom of this article, and feel free to send me a message to chat.

Predicting electricity demand with XGBoost and Prophet

In this project, I built two algorithms with two different machine learning frameworks to compare their accuracy with time series data: Prophet and XGBoost.

Typically, time series data is more complicated to work with than regular tabular data. With tabular data, a framework like scikit-learn can be used for everything from pre-processing to prediction to model selection; time series work requires many more libraries for each of these steps (e.g. pandas for interpolating missing values and Prophet for fitting a model, as we do here). In this case, I decided to compare a proper time series algorithm, Prophet, against a decision-tree method, XGBoost: two completely different approaches to this prediction model.

Prophet is a time series forecasting algorithm developed by Meta

Prophet is a time series forecasting algorithm developed by Facebook's Data Science team in 2017 to predict site visits over time with seasonal trends. As a result, the model works best when fed lots of historical data with seasonal effects. The idea behind the algorithm is to model time series data as a combination of trend, seasonality, and noise components. It is commonly used for predicting e-commerce sales, stock prices, and weather.

Prophet was the precursor to the more advanced NeuralProphet, released in 2020, which leverages autoregressive deep learning for its predictions. While NeuralProphet is a more advanced algorithm, testing has found that Prophet outperforms NeuralProphet when fed much larger datasets. As a result, we will be using Prophet with a very large dataset for our predictive model.

Let’s break down the three components of time series data used by Prophet to understand how it works.

Trend components are used to capture the overall direction of the time series data. Prophet fits the trend with a piecewise linear regression model, meaning the trend is represented as a sequence of linear segments joined at changepoints (points where the trend slope changes). Considering the trend alone, the model can be written as

y(t) = g(t) + e(t)

where y(t) is the value of the time series at time t, g(t) is the trend component, and e(t) is the error term. The trend component is calculated by this formula

g(t) = (k(t)*t) + m(t)

where k(t) is the slope of the piecewise trend at time t and m(t) is the intercept of the trendline at that time. The slopes and intercepts are modeled with something called a Bayesian hierarchical model, which is a method of accounting for uncertainty and for relationships within complex data.

Seasonality components are used to capture periodic patterns in data over weeks, months, and years. Prophet uses a Fourier series, which models periodic data as a sum of cosine and sine waves. Because seasonal patterns come in waves, this is an accurate way of extrapolating them. The seasonality component is given by:

s(t) = Σ from n = 1 to N of [ a(n)*cos(2πnt/P) + b(n)*sin(2πnt/P) ]

where s(t) is the seasonality component, a(n) and b(n) are the Fourier coefficients, N is the number of Fourier terms, and P is the period of the seasonality component.
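To make the Fourier idea concrete, here is a minimal numpy sketch of how such seasonal feature columns can be constructed. The function name and defaults are my own illustration of the formula above, not Prophet's internals:

```python
import numpy as np

def fourier_features(t, period=365.25, n_terms=10):
    """Build cos/sin columns for harmonics n = 1..N of a given period.

    t is time in days; the regression then learns one coefficient
    a(n) or b(n) per column, exactly as in the formula above.
    """
    angles = 2 * np.pi * np.outer(t / period, np.arange(1, n_terms + 1))
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=1)

# Example: yearly-seasonality features for three consecutive days
features = fourier_features(np.array([0.0, 1.0, 2.0]))
print(features.shape)  # (3, 20): 10 cosine + 10 sine columns
```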

Noise covers any random fluctuations in the data that cannot be explained by the trend or seasonality components.

The last thing I am going to discuss with Prophet is that it can accept additional regressors: extra features (like a manually provided holiday list) that make the predictions more accurate. The model uses a T × N regressor matrix, where T is the number of time points and N is the number of regressors. The regressor coefficients are then estimated using a linear regression model that relates the time series to the regressor matrix in this form:

y(t) = g(t) + s(t) + Σ over j of β(j)*X(j, t) + e(t)

where X(j, t) is the jth regressor variable at time t, β(j) is its estimated coefficient, and e(t) is the error term as highlighted above.
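As a quick illustration of the API, here is how an extra regressor is registered in Prophet. The 'temperature' column is hypothetical, not part of our dataset:

```python
from prophet import Prophet

m = Prophet()
# Register an extra regressor; the training frame must then contain a
# 'temperature' column alongside the usual 'ds' and 'y' columns.
m.add_regressor('temperature')
# m.fit(train_df)  # train_df: columns ds, y, temperature
```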

I am not going to continue with the math behind the Bayesian inference used to estimate the posterior distribution, or the Markov Chain Monte Carlo algorithm. If you'd like to understand the math behind posterior distributions, check out this article.

The benefits of Prophet are that it is fast, largely automatic (no need for extensive preprocessing or feature engineering), and can be made more accurate with additional regressors.

Let’s shift gears…

XGBoost is the holy grail of data science

XGBoost (Extreme Gradient Boosting) is an open-source data science library released in 2014 by Tianqi Chen. It is seen as the holy grail of Machine Learning and is often used in Kaggle competitions and hackathons due to its versatility (it suits many different types of ML problems), scalability (it handles large datasets), speed (split finding within each tree is parallelized), and efficient memory usage.

It is most commonly used in classification and regression (e.g. fraud detection or housing price prediction), but it also works well for time series data. The algorithm is, loosely, a more accurate relative of the random forest algorithm, in which multiple decision trees are combined to make predictions on a dataset.

Fundamentally, XGBoost runs off of decision trees, which are a visual representation of if/else statements. Each node is a decision, and as the tree expands and gets more complex, specific features of the input data can be learned.

A simple decision tree

However, these decision trees are very basic and are referred to as weak models.

To overcome the limitations of these weak trees, multiple weak models can be combined into a strong model through a process called ensemble learning. At a high level, an ensemble trains multiple decision trees, each on a different subset of the data, and combines them to form predictions that iteratively drive down the error. This method offers a systematic way to combine the predictive power of multiple base learners (the individual trees). The hard part here is actually "building the tree": finding the best if/else condition to use for each node of each decision tree. For this process, we use complex math you can find in the XGBoost docs.

Bagging and Boosting are two commonly used ensemble methods. Bagging reduces the variability of a learner by training many learners in parallel, each on a different subset of the dataset, and averaging out their final predictions. This is the process used in Random Forest regression.

In Boosting, trees are built sequentially, with each new tree aiming to reduce the errors of the previous one. In this way, each tree learns from its predecessor and updates the residual errors. The base learner in Boosting is pretty weak (not much better than random guessing), but through this process the model gets much better. In contrast to Bagging, Boosting typically needs far fewer trees to reach an accurate prediction, which is faster and can also mean less overfitting.

We can frame this explanation with a golf analogy.

Immaculate follow through

A single decision tree is like teeing off once on a 500-yard hole and hoping you get a hole-in-one. Random forest algorithms (bagging) act as if you and 10 friends tee off, all aiming at the same spot, and then take the average of all those shots as your prediction. Finally, XGBoost (boosting) would be like you teeing off once, then chipping your ball onto the green, and then putting it near the hole. Each shot represents the iterative nature of the algorithm trying to minimize error.

You can see, from this analogy, how the algorithm is optimized for speed and performance.

The goal of XGBoost is to optimize the Objective function, which is the sum of training loss and regularization.

obj = Σ (i = 1 to n) l(y(i), ŷ(i)) + Σ (k = 1 to K) Ω(f(k))

In this function, l is the loss function (similar to mean squared error), K is the total number of trees, n is the number of rows in the data, Ω is the regularization term, and f(k) is the kth tree.

We optimize training loss to get a good prediction and optimize regularization to avoid overfitting.

This gradient-boosting ensemble technique consists of three steps:

  1. An initial model, say F0 for this example, is defined to predict the target variable y. This model leaves a residual error of y − F0
  2. A new model, called h1, is fit to the residual error y − F0 from step 1 to minimize the loss function (typically mean squared error)
  3. Finally, h1 and F0 are added to make F1, which is the boosted version of F0. The mean squared error of F1 is therefore lower than that of F0
Back to Golf

Here is what a fourth step would look like, repeating the process to obtain F2.

h2(x) + F1(x) = F2(x)

However, the mean squared error from the residual of F1 is still likely very high. We can repeat these steps any number m times to minimize the mean squared error. Here is the additive learners’ function:

hm(x) + Fm-1(x) = Fm(x)
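To make this additive recipe concrete, here is a toy sketch of gradient boosting for squared error, using scikit-learn trees as the weak learners. This illustrates the math above; it is not XGBoost's actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, m_rounds=100, learning_rate=0.1):
    """Toy boosting: each tree h_m is fit to the current residuals."""
    F = np.full(len(y), y.mean())             # F0: constant initial model
    trees = []
    for _ in range(m_rounds):
        residual = y - F                      # what F_{m-1} still gets wrong
        h = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        F = F + learning_rate * h.predict(X)  # F_m = F_{m-1} + h_m
        trees.append(h)
    return trees, F
```

The learning rate shrinks each tree's contribution, which is what lets many small "swings" converge on the target instead of overshooting it.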

This process can be repeated to keep improving the ensemble. Training each individual decision tree, however, is a much simpler process that involves finding where to split nodes to minimize error:

  1. Pick a possible split value for a node (usually the average of all the data points attached to that node)
  2. Split data points into a left and right node
  3. Calculate the MSE for the parent node, left node, and right node
  4. If the sum of the MSE for the two child nodes is less than for the parent node, keep the split value as the best one. If not, try another split value
Example of a first split for h1(x)
Two more splits would follow

Repeat this process until the decision tree is fully built.
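Here is a minimal sketch of that split search for a single node, assuming one feature column x and targets y as numpy arrays:

```python
import numpy as np

def best_split(x, y):
    """Try each candidate split and keep the one with the lowest child MSE."""
    def mse(values):
        return ((values - values.mean()) ** 2).mean() if len(values) else 0.0

    best_value, best_score = None, mse(y)     # parent MSE is the bar to beat
    for candidate in np.unique(x):
        left, right = y[x <= candidate], y[x > candidate]
        score = mse(left) + mse(right)        # sum of the child-node MSEs
        if score < best_score:
            best_value, best_score = candidate, score
    return best_value, best_score
```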

To learn how to build a *simple* gradient-boosting model from scratch, check out this article. I put “simple” in asterisks because nothing in ML math is simple.

Code Explanation

A look at the data

Both projects use the same dataset from PJM Interconnection LLC, a regional transmission organization in the US. The geographies contained in this dataset span from the East Coast into the Midwest.

The dataset was taken from Kaggle. It is a simple record of the hourly power consumption of these areas and comes directly from PJM's website.

Two columns for simplicity

Prophet Project

Before beginning, take a look at my GitHub repo for my full Prophet algorithm.

Let’s first take a look at the prediction model using the Prophet library. We are going to go block by block to prioritize understanding the mechanics of the model in code rather than in theory.

We begin with the standard imports: numpy (for numerical computations), pandas (for data manipulation), seaborn (for data visualization), and matplotlib (for plotting). We then import the squared and absolute error functions for the regression metrics we will be using to compare model performance.

I added the line about warnings to ignore any warning messages to keep the console output clean.

We then set matplotlib's plotting style using its built-in 'ggplot' and 'fivethirtyeight' themes (the latter named after the news site) to style our graphs.

Finally, we create our own mean absolute percentage error (MAPE) function to be used later to show the performance of our model. This statistic is useful because it is expressed in percentages, which makes relative comparisons much easier than absolute errors.
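Here is roughly what this setup looks like in code (the exact styling details are my reconstruction):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings

warnings.filterwarnings('ignore')    # keep the console output clean
plt.style.use('fivethirtyeight')     # graph theme named after the news site

def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE in %, which makes models comparable across scales."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```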

We then import the dataset from the PJME website I discussed above.

We then add some plotting customizations and call plt.show() to visualize the raw dataset.
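A sketch of the import and first plot; the file name matches the Kaggle dataset's PJME_hourly.csv, though your local path may differ:

```python
# Parse the timestamp column and use it as the index
df = pd.read_csv('PJME_hourly.csv', index_col='Datetime', parse_dates=True)

df.plot(style='.', figsize=(15, 5), title='PJME Energy Use in MW')
plt.show()
```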

Megawatts of energy used over the years

Data Preprocessing

What we are going to do now is create a matrix of time series features, derived from the datetime index, that the model can interpret.

We start by creating a categorical data type containing the ordered days of the week, for indexing purposes.

We then define a function that takes in the input data frame, makes a copy, and extracts features from the input data frame to be used in the new time series matrix. The existing libraries make this transfer very simple.

We use date_offset and season to calculate a numerical offset based on month and day to account for variability in seasons.

We then create a new data frame, X, which contains a matrix of the selected features above. If a label is provided, the function also returns the corresponding y values alongside X.

Finally, we call this function on our dataset with the label 'PJME_MW'. We create a new data frame that concatenates X and y.
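Here is one plausible implementation of the function described above (the season cut-offs and the date_offset formula are approximate reconstructions):

```python
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(
    categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                'Friday', 'Saturday', 'Sunday'],
    ordered=True)

def create_features(df, label=None):
    """Create a matrix of time series features from the datetime index."""
    df = df.copy()
    df['date'] = df.index
    df['hour'] = df['date'].dt.hour
    df['dayofweek'] = df['date'].dt.dayofweek
    df['weekday'] = df['date'].dt.day_name().astype(cat_type)
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    df['dayofmonth'] = df['date'].dt.day
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
    # Numerical offset built from month and day, then bucketed into seasons
    df['date_offset'] = (df['date'].dt.month * 100 + df['date'].dt.day - 320) % 1300
    df['season'] = pd.cut(df['date_offset'], [0, 300, 602, 900, 1300],
                          labels=['Spring', 'Summer', 'Fall', 'Winter'])
    X = df[['hour', 'dayofweek', 'quarter', 'month', 'year', 'dayofyear',
            'dayofmonth', 'weekofyear', 'weekday', 'season']]
    if label:
        y = df[label]
        return X, y
    return X

X, y = create_features(df, label='PJME_MW')
features_and_target = pd.concat([X, y], axis=1)
```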

We then visualize this data using a boxplot to see discrepancies over days of the week and seasons. To do so we use the seaborn boxplot call.
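The plotting call looks something like this:

```python
fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=features_and_target.dropna(), x='weekday', y='PJME_MW',
            hue='season', ax=ax, linewidth=1)
ax.set_title('Power Use MW by Day of Week')
plt.show()
```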

A much better way to see the data than before

Splitting into test and train data

We are going to split on Jan 1st, 2015. This split is typical of energy demand prediction work: ~85% of the data is used for training and ~15% for testing.

We apply some logic to copy over the train and test data to their respective new datasets. We use a subplot to see the split.
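A sketch of the split and the subplot:

```python
split_date = '2015-01-01'
pjme_train = df.loc[df.index < split_date].copy()
pjme_test = df.loc[df.index >= split_date].copy()

# Visualize the train/test split on one axis
fig, ax = plt.subplots(figsize=(15, 5))
pjme_train['PJME_MW'].plot(ax=ax, label='Training Set')
pjme_test['PJME_MW'].plot(ax=ax, label='Test Set')
ax.axvline(pd.to_datetime(split_date), color='black', ls='--')
ax.legend()
plt.show()
```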

Pretty aesthetic if you ask me

Using the Prophet model

Now the fun part.

Fitting the model

We start by initializing the Prophet model and then fit (train) it on the training data. Prophet expects the DataFrame's timestamp column to be named "ds" and the target variable to be named "y", so we rename them before training.
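In code (the package imports as 'prophet' in recent versions, 'fbprophet' in older ones):

```python
from prophet import Prophet

# Prophet wants a 'ds' timestamp column and a 'y' target column
pjme_train_prophet = pjme_train.reset_index() \
    .rename(columns={'Datetime': 'ds', 'PJME_MW': 'y'})

model = Prophet()
model.fit(pjme_train_prophet)
```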

Predicting on the test set

After renaming the columns to Prophet's expected defaults, I ran predictions on the test set and plotted the results.
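A sketch of that step, using Prophet's built-in forecast plot:

```python
pjme_test_prophet = pjme_test.reset_index() \
    .rename(columns={'Datetime': 'ds', 'PJME_MW': 'y'})

pjme_test_fcst = model.predict(pjme_test_prophet)

fig, ax = plt.subplots(figsize=(10, 5))
fig = model.plot(pjme_test_fcst, ax=ax)   # forecast with uncertainty band
plt.show()
```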

Blue is test data predictions

To put this performance into perspective, I compared this performance with the original red testing data using subplots.

You can see there is a substantial difference between red and blue

We can take an even more detailed look at the difference by essentially "zooming in" on the test set. Here is how I did so.

We start by converting the actual test data into a datetime index so it can be plotted. We then create a subplot (10 wide by 5 tall) and overlay the forecast and the actual data for February 2015.
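Roughly like so (the axis bounds are my reconstruction of the zoom window):

```python
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(pjme_test.index, pjme_test['PJME_MW'], color='r')  # actuals
fig = model.plot(pjme_test_fcst, ax=ax)                       # forecast
ax.set_xbound(lower=pd.to_datetime('2015-02-01'),
              upper=pd.to_datetime('2015-03-01'))
ax.set_ylim(0, 60000)
ax.set_title('February 2015: Forecast vs Actuals')
plt.show()
```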

You can see some low variation

How did the model do?

We will use three metrics to evaluate the performance of the model. Two of them are pre-defined in the sklearn library and one, if you remember, we created ourselves.
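The evaluation looks something like this, comparing Prophet's 'yhat' forecast column against the actuals:

```python
rmse = np.sqrt(mean_squared_error(y_true=pjme_test['PJME_MW'],
                                  y_pred=pjme_test_fcst['yhat']))
mae = mean_absolute_error(y_true=pjme_test['PJME_MW'],
                          y_pred=pjme_test_fcst['yhat'])
mape = mean_absolute_percentage_error(y_true=pjme_test['PJME_MW'],
                                      y_pred=pjme_test_fcst['yhat'])
print(f'RMSE: {rmse:0.2f} | MAE: {mae:0.2f} | MAPE: {mape:0.2f}%')
```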

Here is the performance:

  • Root Mean Squared Error (RMSE): 6616.97
  • Mean Absolute Error (MAE): 5181.91
  • Mean Absolute Percentage Error (MAPE): 16.51%

One of the major upsides of Prophet, as I mentioned, is that you can provide additional regressors (like holidays) to the model before training. I did that here.

Used Pandas’s built-in calendar
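Here is a sketch of building the holiday frame from pandas's built-in US federal holiday calendar and retraining, as described above:

```python
from pandas.tseries.holiday import USFederalHolidayCalendar

# Collect every US federal holiday in the dataset's date range
cal = USFederalHolidayCalendar()
holiday_dates = cal.holidays(start=df.index.min(), end=df.index.max(),
                             return_name=True)
holiday_df = holiday_dates.reset_index()
holiday_df.columns = ['ds', 'holiday']   # the format Prophet expects

model_with_holidays = Prophet(holidays=holiday_df)
model_with_holidays.fit(pjme_train_prophet)
```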

I then re-trained the model with the new datasets. However, the error metrics stayed pretty much the same. Interesting.

Let’s try XGBoost

XGBoost Project

Let’s try the same thing with this gradient-boosted approach. Before beginning, check out my full code for my XGBoost algorithm.

We set up pretty much everything the same and will be using the same 3 key metrics to gauge model accuracy. We again split on January 1st, 2015, and we will again use pandas to create time series features from our DateTime index.

Creating our model

Start with some feature engineering

I began by using the same create_features function to create a time-series-interpretable matrix for the train and test sets. I then created the X_train features from the columns listed in FEATURES, used to predict the TARGET column that becomes the y_train series. I did this for both the train and test datasets.
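A sketch, reusing the create_features helper from the Prophet section:

```python
FEATURES = ['hour', 'dayofweek', 'quarter', 'month', 'year',
            'dayofyear', 'dayofmonth', 'weekofyear']
TARGET = 'PJME_MW'

X_train, y_train = create_features(pjme_train, label=TARGET)
X_test, y_test = create_features(pjme_test, label=TARGET)

# Keep only the numeric columns; the string-valued weekday/season
# features can't be fed to XGBoost directly
X_train, X_test = X_train[FEATURES], X_test[FEATURES]
```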

Hyperparameters for model

I start by initializing the XGBoost regressor with a few specific hyperparameters. This component of the algorithm is one of the most important, as it tunes the model and has a substantial impact on its error. The hyperparameters are explained in the comments.
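Here is a plausible configuration; the specific values are illustrative, and note that passing early_stopping_rounds to the constructor requires a recent xgboost version:

```python
import xgboost as xgb

reg = xgb.XGBRegressor(
    base_score=0.5,                 # starting prediction for every sample
    booster='gbtree',               # tree-based boosting (vs. linear)
    n_estimators=1000,              # upper bound on the number of trees
    early_stopping_rounds=50,       # stop when eval RMSE stalls for 50 rounds
    objective='reg:squarederror',   # minimize squared error
    max_depth=3,                    # shallow trees to limit overfitting
    learning_rate=0.01)             # shrink each tree's contribution
```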

I did not experiment too much with the hyperparameters, but I’m sure if I did I could lower the error.

Model Training

Next, I trained the model with the X_train features and y_train target. The eval_set is a list of datasets used for evaluation during training, so performance can be monitored as trees are added.

This is the idea of getting multiple swings with the golf club, as highlighted earlier.
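The training call looks something like this:

```python
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=100)   # log the eval RMSE every 100 trees
```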

Output of Training

As the model trained, the root mean squared error was printed every 100 trees. As you can see, the RMSE kept shrinking for both evaluation datasets. Early stopping halts training once the validation RMSE stops improving, which guards against overfitting.

One thing I did differently with this experiment was display which feature contributed the most to the training.

The important line here is reg.feature_importances_, which shows exactly which features the trees relied on most. This attribute follows the scikit-learn convention for tree-based models. The output is below. Note that there is some overlap between features like dayofyear and month, as they are correlated.
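A sketch of that plot:

```python
fi = pd.DataFrame(data=reg.feature_importances_,
                  index=X_train.columns,
                  columns=['importance'])
fi.sort_values('importance').plot(kind='barh', title='Feature Importance')
plt.show()
```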

If Month were deleted dayofyear would replace it

Now we can take a look at the subplot of the actual vs predicted data. Using the same code as before we can plot this graph:
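Something along these lines:

```python
# Attach predictions to the test set and overlay on the full history
pjme_test['prediction'] = reg.predict(X_test)

ax = pjme_train['PJME_MW'].plot(figsize=(15, 5))
pjme_test['PJME_MW'].plot(ax=ax)
pjme_test['prediction'].plot(ax=ax, style='.')
ax.legend(['Training Set', 'Test Set', 'Predictions'])
ax.set_title('Actuals vs Predictions')
plt.show()
```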

Already looks better than before

So with that, we can circle back to the three main comparison metrics used for the models:

  • Root Mean Squared Error (RMSE): 3726.8
  • Mean Absolute Error (MAE): 2902.29
  • Mean Absolute Percentage Error (MAPE): 9.16%

What’s Next…

Comparing the two models, XGBoost is much more accurate, with a MAPE of 9.16% compared to Prophet's 16.51%.

There are obvious confounding variables, like the hyperparameters I did not tune much, but this comparison reveals a clear winner for demand prediction with time series data.

I will continue along my journey of working with similar models for my Internet demand prediction project. I will experiment with XGBoost and a few Deep Learning models, like LSTMs, to find an algorithm that optimizes my ability to predict Internet demand in rural areas.

Thanks for giving this a read. I hope you learned something new about these two methods of predicting electricity demand and the implications this procedure can have on global energy allocation. To stay updated with my project progress, follow this account and subscribe to my monthly newsletter where I will share project updates, unique experiences, and my reflections. Finally, feel free to reach out to me on LinkedIn or Twitter (X.com) and check out my full portfolio for other projects I have worked on in the blockchain, climate tech, ML, and computer networking space.
