Machine Learning in a Topsy-Turvy World
What to do when models go wrong
The impact of the COVID-19 virus on life and commerce has been enormous. A second-order impact is that machine learning models trained on data representing behavior that was normal at the time of development have, in many cases, seen their performance deteriorate under ‘new normal’ conditions.
By definition, empirical models learn from historic data. When the relationships embedded in historic data suddenly and dramatically change as a result of exogenous influences, the patterns and relationships an algorithm previously identified are often no longer valid.
In this note we will look at the impact of the COVID-19 virus on electricity consumption (demand), and by extension, the impact on models used to forecast electricity demand. We will then identify generic strategies that can help mitigate the adverse effects of exogenous systemic shocks on data sources employed to develop machine learning models, using electricity demand forecasting for concrete illustrations.
What is Demand Forecasting?
Demand forecasts are used by electric utilities around the world to predict future electricity consumption, both for operational reasons (to identify the mix of electricity sources best suited to meeting the forecast demand) and for revenue projections (demand has a direct relationship with price; demand — or load when actually realized — times price produces revenue for a utility). Demand forecasts typically provide hour-ending demand (though other resolutions are also used). Short-term forecast horizons range from 1 to 2 days (current day and day-ahead) up to a few weeks; long-term forecasts can span months or even years (for capacity planning). Arguably the most important forecasts are the short-term current-day and day-ahead forecasts; these generally have the most direct impact on grid operations and on electricity pricing.
Total electricity demand is essentially the sum of three electricity consumption components — residential, commercial, and industrial. Utilities have years of historical data which show that residential and commercial demand fluctuations are largely driven by weather conditions and the day of the week, while industrial demand is generally more independent from weather, but can change abruptly based on other factors, such as a large manufacturer halting production for maintenance or to change a plant’s product mix. Additional influences, particularly on residential and commercial demand, include seasonality (which influences weather), local or national holidays, and special events such as Super Bowl Sunday in the United States.
These combinations of influences (weather conditions, day of week, time of day, etc.) are, in effect, the inputs for demand forecast models; a machine learning algorithm then identifies the relationships in the inputs which affect demand. The important point is that some level of demand, within relatively stable ranges, historically reflects various combinations of the aforementioned inputs; put simply, those relationships are what a demand forecasting algorithm learns.
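As a minimal, hypothetical sketch of the kind of input record such a model consumes, the snippet below derives calendar and weather features for each hourly period. The column names, placeholder values, and holiday list are illustrative assumptions, not the actual inputs used by any particular utility or forecasting product.

```python
import pandas as pd

# Hypothetical hourly history: timestamps, a weather variable, and metered demand (MW).
history = pd.DataFrame({
    "timestamp":   pd.date_range("2019-03-04", periods=24 * 7, freq="H"),
    "temperature": 40.0,      # placeholder weather values
    "demand_mw":   7000.0,    # placeholder metered demand
})

holidays = {pd.Timestamp("2019-07-04")}  # illustrative holiday calendar

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the calendar and weather inputs a demand model typically learns from."""
    out = df.copy()
    out["hour_ending"] = out["timestamp"].dt.hour + 1        # 1..24 hour-ending convention
    out["day_of_week"] = out["timestamp"].dt.dayofweek       # 0 = Monday
    out["is_weekend"]  = out["day_of_week"] >= 5
    out["is_holiday"]  = out["timestamp"].dt.normalize().isin(holidays)
    out["month"]       = out["timestamp"].dt.month           # crude seasonality proxy
    return out

features = build_features(history)
print(features.head())
```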
Virtually overnight, the COVID-19 virus caused demand to plummet as economic activity slowed and governments began issuing shelter-in-place and similar restrictions on behavior and commerce; yet the most influential measured inputs — weather conditions and the day of the week — did not change radically from previous years. As a result, the relationships upon which demand forecast models were based became increasingly out of sync with current conditions.
Put another way, suppose input values x, y, and z were associated with target value t in the training data, but under current conditions the same x, y, and z values are associated with target value t’, where t’ is only (for example) 50% of t. The algorithm’s output will remain biased towards the relationships learned from the training data, so performance under current conditions will degrade, often significantly.
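To make this concrete, here is a small, contrived sketch (not drawn from the demand data in this note): a linear model fit on the original input–target relationship keeps producing outputs near t, so when the true targets shift to half their former level the model systematically over-forecasts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Training data: inputs x, y, z and target t under 'old normal' conditions.
X_train = rng.uniform(0, 1, size=(500, 3))
t_train = 1000 * X_train.sum(axis=1) + rng.normal(0, 10, size=500)

model = LinearRegression().fit(X_train, t_train)

# 'New normal': the same inputs now correspond to only 50% of the old target.
X_new = rng.uniform(0, 1, size=(100, 3))
t_new = 0.5 * (1000 * X_new.sum(axis=1))

pred = model.predict(X_new)
print(f"mean over-forecast: {np.mean(pred - t_new):.0f} (targets average {t_new.mean():.0f})")
```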
How Bad is the Hit?
The following three images capture the dramatic impact of the COVID-19 virus. While the Methodology and Resources section at the end provides additional information about how these images were generated, suffice it to say that the charts represent real data and (for 2020) model results (albeit not from optimized models) for demand forecasting in the US Midwest — a service area which includes parts of western Pennsylvania and eastern Ohio.
First, as a frame of reference, the image below shows hour-ending actual (measured) demand for the first standard week (Monday — Sunday) in March 2019. Forecasts for the week are not available. Note several hourly periods had peak demand above 9,500 MW (Megawatts — an instantaneous measure of power), and only for a brief time between midnight on March 9 and 8:00 on March 10 (Saturday into Sunday) did hourly demand fall below 6,500 MW.
Next, the following image shows actual demand for the comparable standard week (first Monday — Sunday week of March 2020), along with the day-ahead (Tomorrow) demand forecast produced by a single model executed daily over the course of the week (the black line is actual demand, the green line is forecast day-ahead demand — i.e., the forecast was produced on the previous day). This weekly demand reflects activity before widespread shelter-in-place and similar government restrictions, but after some reductions in activity that began in late February 2020. The lower portion of the image shows the hourly average model forecasting error in MW. Although the model mostly over-forecast demand, there were several periods where demand was under-forecast.
Finally, the image below shows actual demand, day-ahead forecast demand, and hourly forecast error for the final standard week of March 2020 — using the same model employed for the first-week forecasts. Note that on March 16 (the beginning of the third standard week of March) the Governor of Pennsylvania ordered an essentially statewide shutdown, and on March 22 the Governor of Ohio ordered a similar shutdown effective March 23 (the start of the fourth standard week of March 2020).
The consequences are obvious. In the fourth week of March 2020 overall demand drops throughout the week: the highest weekday peak is below 8,000 MW, only a few periods early in the week have demand above 7,500 MW, and the lowest demand falls below 5,500 MW. More important for our illustration purposes, while the maximum hourly forecasting error still occurs in the early evening, the hourly error magnitudes are both larger and consistently positive across every hourly period compared with the errors for the first week of March 2020.
Without a doubt, the emergence of the COVID-19 virus and the attendant government responses resulted in precipitous reductions in hourly and daily demand, which in turn significantly affected existing model performance.
So What can be Done?
When machine learning models developed as part of a standardized production modeling methodology suddenly cease to perform as expected, there are basically two ways to address the problem (assuming that eliminating the forecasting models is not an option) — adjust model output in some principled way, and/or retrain the models.
While the latter is arguably the ‘better’ approach, in that it continues to apply the mathematics inherent in whatever machine learning method underpins the forecasting methodology, the former permits a quicker response to significant exogenous events, at the potential cost of direct reliance on human judgment, the quality of which depends on the domain experience of the person exercising it.
In the remainder of this document we illustrate the potential efficacy of both approaches, and suggest that they apply to a wide variety of machine learning models whenever the operating environment in which the models are deployed shifts to a level very different from the one represented in the training data.
Adjusting Algorithm Outputs
When there is a fundamental shift in the magnitudes of target outputs for a machine learning model, applying principled adjustments is often the quickest short-term solution to deteriorating model performance. In the case of electricity demand forecasts, however, the adjustments must continue to respect the inherent characteristics of outputs that still differ ‘naturally’ (demand at 01:00 still differs from demand at 08:00 or 16:00; weekend demand is still inherently different from weekday demand).
Put simply, accumulating performance information on an hourly basis for at least a week is, if not strictly necessary, certainly prudent (while applying purely ‘human experience-based’ adjustments during the time that actual model performance information is collected). The consistency of model errors across multiple hours then provides a basis for determining the adjustment to apply to each hourly forecast. The process consists of the following steps; actions and results are illustrated in the related images, and a code sketch of the adjustment calculation follows the list.
1. Accumulate Error Information — aggregate error information by hour and by day of week for the most recent week or weeks of actual demand (area outlined in the image)
2. Use Error Information to Determine the Adjustment — the average hourly error magnitude is the basis for selecting an adjustment (an offset that will either be added to or subtracted from model output)
3. Apply Adjustments to Model Outputs when Generating Future Forecasts — typically this would be the day-ahead forecasts produced every day for the next week; review results to confirm the validity of adjustments (the image immediately below shows Base forecasts without adjustments; next is an image which shows the effects of applying forecast adjustments)
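A minimal sketch of this adjustment process, assuming a pandas DataFrame holding the most recent week of hourly results; the column names (actual_mw, forecast_mw, day_of_week, hour_ending) are illustrative, not taken from any particular system.

```python
import pandas as pd

def hourly_offsets(recent: pd.DataFrame) -> pd.DataFrame:
    """Steps 1-2: average forecast error (forecast - actual, MW) per day of week and hour."""
    recent = recent.assign(error_mw=recent["forecast_mw"] - recent["actual_mw"])
    return (recent.groupby(["day_of_week", "hour_ending"], as_index=False)["error_mw"]
                  .mean())

def apply_offsets(new_forecast: pd.DataFrame, offsets: pd.DataFrame) -> pd.DataFrame:
    """Step 3: subtract the average observed error from the raw day-ahead forecast."""
    adjusted = new_forecast.merge(offsets, on=["day_of_week", "hour_ending"], how="left")
    adjusted["adjusted_mw"] = adjusted["forecast_mw"] - adjusted["error_mw"].fillna(0.0)
    return adjusted

# Usage: offsets = hourly_offsets(last_week); next_week = apply_offsets(next_week_raw, offsets)
```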
Note the significant reduction in total forecasting error over the course of the week — from 116,819 MWH to 38,652 MWH (Megawatt-Hours, a measure of energy: power delivered over time). As described, this adjustment process can be repeated on a rolling weekly basis until new models are available that have been trained on data which better reflect the ‘new normal’ operating conditions — the approach discussed in the next section.
Retraining Models when New Data is Limited
The power of machine learning can be a curse when target outputs change rapidly in environments where rapid changes are not the norm (while some machine learning applications, such as high-frequency trading models, are designed for rapidly fluctuating environments, judging by market activity in March they too suffered performance hits, though that would be the subject of another paper).
The gradual adjustment of weights during iterative machine learning is conceptually designed to elicit patterns during learning (generalization), rather than to construct a complex look-up table to retrieve patterns (memorization). Thus, each individual training record has a small influence in the overall training process. When little data representing significantly different operating conditions exists, the sparse data that is available can be replicated so that it effectively comprises a larger proportion of training data.
Although replicating records can introduce other issues (a side effect is that the distributions of input values are artificially skewed), when replication is done with an awareness of those issues the approach will generally yield better-performing models. All things considered, a reasonable rule of thumb is to keep the proportion of replicated data to between 1/3 and 1/2 of the total training data.
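As a sketch of the replication idea (the DataFrame layout, column name, and replication factor are illustrative assumptions), the most recent week of records can simply be appended to the training set multiple times so that it carries more weight during training:

```python
import pandas as pd

def replicate_recent_week(train: pd.DataFrame, copies: int = 9) -> pd.DataFrame:
    """Append extra copies of the most recent week of hourly training records."""
    cutoff = train["timestamp"].max() - pd.Timedelta(days=7)
    recent = train[train["timestamp"] > cutoff]
    replicated = pd.concat([train] + [recent] * copies, ignore_index=True)

    # Report the share of the training set the replicated week now represents,
    # so it can be checked against the 1/3 to 1/2 rule of thumb.
    share = (len(recent) * (copies + 1)) / len(replicated)
    print(f"replicated records are {share:.0%} of the training set")
    return replicated
```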
To demonstrate the efficacy of this approach, one set of models was trained using a ‘baseline’ dataset in which the last week of data was the actual demand (and weather) for the standard week immediately preceding the week for which forecasts would be generated. This approach was then rolled forward for the 4 standard weeks in March 2020. For comparison, a second set of models was trained using the same dataset, except that the last week of data was replicated 9 times, meaning a total of 10 records for each period in each day of the last week, before training each set of weekly models.
In other words, models to forecast Week 1 demand were trained with data that ended on March 1 (the immediately prior Monday — Sunday week); for the second set of models to forecast Week 1 demand, the last week of data was replicated. Then models were trained to forecast demand for Week 2; the training data for that set of forecasting models ended with Week 1 historical data, and Week 1 historical data was replicated for training the second set of models for forecasting Week 2 demand. And so forth. This resulted in replicated data comprising a little over 1/3 of the entire training data for models in the second set for each respective week. Modeling results for both scenarios, comparing hourly MAPE metrics of models trained with and without data replication, are presented below. MAPE (Mean Absolute Percent Error) is a standard metric for reporting the accuracy of demand forecasts; a lower MAPE is better.
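For reference, MAPE can be computed as follows (a straightforward implementation for illustration, not NeuralPower’s internal code); the images that follow report this metric by week.

```python
import numpy as np

def mape(actual_mw: np.ndarray, forecast_mw: np.ndarray) -> float:
    """Mean Absolute Percent Error: average of |actual - forecast| / actual, in percent."""
    actual_mw = np.asarray(actual_mw, dtype=float)
    forecast_mw = np.asarray(forecast_mw, dtype=float)
    return float(np.mean(np.abs(actual_mw - forecast_mw) / actual_mw) * 100.0)

# Example: mape([7000, 6500], [7350, 6200]) -> roughly 4.8%
```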
This image shows significant model performance degradation in the day-ahead forecasts for Week 4, based on no replication of Week 3 data in the training dataset for Week 4 models (recall that Week 3 was the first week in which significant activity restrictions were imposed by governments).
The image below shows the reduction in MAPE that resulted when models were trained using a dataset in which the week of historical data immediately prior to the target week was replicated in the training dataset.
While there is some reduction in forecast error in Weeks 1–3 as a result of replication, significant improvement occurred with the use of replicated Week 3 training data to produce models to forecast Week 4 demand. The use of replicated data yielded a roughly 33% reduction in MAPE for the day-ahead forecasts for Week 4. This effect is also seen when comparing the composite (summary) MAPE results across all weeks in March 2019 and all weeks in March 2020, as shown below.
While the modeling methodology for this work (discussed in more detail in the next section) was not optimized as fully as it would be for production models, we believe the results demonstrate that even when events conspire to adversely impact in-service machine learning models, it is possible to minimize the impact relatively quickly and in principled ways, freeing attention to focus on preparing new training datasets that more closely reflect whatever the ‘new normal’ conditions have become. And while electricity demand forecasts were used as the vehicle for explaining these approaches, they apply equally to virtually any machine learning task domain.
Resources and Methodology
The demonstration models which produced the results reported in this document were trained using historic values for weather variables, some future weather predicted values, and historic demand. The following chart defines the time spans for each of the training datasets. In effect, 3 years of data within the defined Modeling Span periods were used for each set of weekly models. Reported model results were based on running models using the Validation data range corresponding to the week of interest.
The models and the images presented in this document were created by NeuralPower® (with the exception that the final 3 MAPE charts were created in Microsoft Excel using results generated by NeuralPower). NeuralPower is a full-featured professional platform for integrated electricity demand and price forecasting offered by NeuralStudio SEZC.
NeuralPower employs the proprietary NeuralWorks® neural network engine for automated machine learning. For a given forecast horizon (1, 2, or 3 days), NeuralPower explores the model parameter space and the training data space to construct multiple neural network models for each hour-ending period in the selected forecast horizon. For example, a two-day forecast horizon results in 576 individual neural networks (12 for each of the 48 hourly periods across the two days).
NeuralPower uses a wide variety of input data, provided by customers, to train forecasting models. NeuralPower includes facilities for ‘shaping’ training data records to account for values, particularly weather values, having prior and future influence on the demand for the specific hour that is the target demand hour. NeuralPower also accounts for seasonality and day of week influences.
When training concludes, the best models, ranked either by MAPE or R Correlation, are deployed to generate daily forecasts. A companion component, NeuralPower Scheduler, uses deployed models to automatically generate forecasts at specific times for distribution to those who rely on the forecasts, using data stored in specific locations in an enterprise.
For simplicity and consistency in reporting results in this document, for each hourly period models were ranked by MAPE and the best single model was used to generate the forecast for the subject hourly period. In practice, the architectures of top models would be reviewed, and likely multiple models for each hour would be deployed.
For questions about this document, or information about NeuralStudio and NeuralPower, you can contact jack AT neuralstudio.ai.
Acknowledgements
The Confusion Prevails when Models go Wrong image was created by Gerd Altmann.