The Arbitrated Dynamic Ensemble for Time Series Forecasting

Pratham Goel · Published in Analytics Vidhya · 12 min read · Sep 17, 2020


Source: https://unsplash.com/photos/4TNd3hsW3PM

“None of us is as strong as all of us”

Ensemble techniques have become quite popular among the machine learning fraternity for a simple reason: ‘one-size-fits-all’ rarely holds in practice for individual models. There are almost always some models that trade low variance for high bias while others do the opposite. The challenge is exacerbated for time series forecasting problems because even the best-performing models may not perform consistently well throughout the forecast horizon. This is one of the motivations behind the topic of this article: the Arbitrated Dynamic Ensemble (ADE). But more on that in a bit.

First, if you haven't already, I'd recommend checking out this comprehensive guide to ensemble learning. Ensemble models are exactly that: an ensemble, or a collection of various base models (or weak learners, as they are referred to in the literature). Several weak learners combine to train a strong learner, as simple as that. How exactly they combine is what gives rise to the various types of ensemble techniques, ranging from very simple ones like weighted averaging or max voting to more complex ones like bagging, boosting and stacking. This blog post is an excellent starting point to get up to speed with the techniques mentioned.

Building Blocks

Let us now look at two common ensemble learning techniques at a very high level. Weighted averaging and stacking are the building blocks of the ADE, so we will spend some time on these two techniques first.

Weighted Average

Weighted Average is exactly what its name suggests — predictions of individual base models are combined by taking their weighted average.

p = w1*p1 + w2*p2 + … + wn*pn, where p1, …, pn are the predictions of the individual (base) models, w1, …, wn are their corresponding weights, and p is the combined prediction of the ensemble model.

At its core, the ADE is nothing but a weighted average. How these weights w1, w2, …, wn are calculated is the crux of the algorithm, and we will discuss it soon.
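For concreteness, here is a minimal sketch in Python (with hypothetical numbers for the predictions and weights) of how a weighted average of base-model forecasts is computed:

```python
import numpy as np

# Hypothetical point forecasts from three base models for the same timestep
p = np.array([102.0, 98.5, 105.3])

# Corresponding weights; they must be non-negative and sum to 1
w = np.array([0.5, 0.3, 0.2])

# p_combined = w1*p1 + w2*p2 + w3*p3
p_combined = float(w @ p)
print(p_combined)   # 101.61
```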

Stacking

Stacking is another popular ensemble technique where the in-sample predictions of the individual models are engineered as features to build a new regression model, which is then used to predict out-of-sample values. More details on this can be found in this article.
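As a quick illustration (my own, not from the paper), scikit-learn's StackingRegressor wires this up for standard regression problems; for time series, you would want the in-sample predictions to come from a temporally ordered split rather than a random one:

```python
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Base learners whose (out-of-fold) in-sample predictions become
# the features for the final meta model
base_learners = [
    ("ridge", Ridge()),
    ("rf", RandomForestRegressor(n_estimators=100)),
]

# TimeSeriesSplit keeps the out-of-fold predictions temporally honest
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=LinearRegression(),
    cv=TimeSeriesSplit(n_splits=5),
)
# stack.fit(X_train, y_train); stack.predict(X_test)
```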

Motivations behind the ADE

Now that we have the two common types of ensemble techniques out of the way, a good starting point to discuss the main ideas behind the ADE would be to quote the authors of the paper:

“This paper proposes an ensemble method for time series forecasting tasks. Combining different forecasting models is a common approach to tackle these problems.

State-of-the-art methods track the loss of the available models and adapt their weights accordingly.

We propose a meta-learning approach for adaptively combining forecasting models that specializes them across the time series.

Our assumption is that different forecasting models have different areas of expertise and a varying relative performance”

That is a loaded introduction to a novel technique! Let us break it down.

“State-of-the-art methods track the loss of the available models and adapt their weights accordingly.” As mentioned earlier, the ADE is, at its core, a weighted average of the available models. However, instead of being random or static, these weights are dynamically estimated using the loss of the available models!

Sounds reasonable to me: the better an individual model performs (the lower its error), the higher the weight it should be assigned, and the bigger its say in the overall forecast! OK, so how exactly are these errors, and therefore the weights, estimated? Let's look at the next lines of the introduction for that!

“A meta-learning approach for adaptively combining forecasting models that specializes them across the time series. Our assumption is that different forecasting models have different areas of expertise and a varying relative performance.” As discussed, especially with time series problems, it is quite uncommon to find an individual model that performs consistently well throughout the series. There are pockets of time where a given model does well and other pockets where other models do better. The ADE aims to leverage this localized expertise of the individual models and generate a combined forecast for the next pocket of time. (By pocket, I just mean a moving window of fixed length sliding across the time series.) OK, but you might still be wondering how exactly all of this works!

Show me the math!

Let us start with this flow chart by the authors of the paper, which summarizes in a nutshell what we have just described:

source: http://ecmlpkdd2017.ijs.si/papers/paperID453.pdf

The Algorithm:

(i) Offline training of M (the set of base learners used to forecast future values of Y),

(ii) Online training, or updating, of the meta learners Z, which model the expertise of the base learners, and

(iii) Online prediction of yt+1 using M, dynamically weighted according to Z.

It is important to understand that the ADE architecture primarily consists of the following:

i) Base learners, Mi (the individual, available predictor models)

ii) Meta learners, Zi (error-tracking models), one per base learner

Given this framework, we are now tasked with determining the meta learner predictions (ei’s in the flowchart). In essence, meta learners Zi are nothing but regression models that aim to predict the error (ei) in base learner predictions (ŷi). This may seem straightforward and logical, but there is one catch.

With time series, unlike other regression problems, there is the added constraint of temporal dependency, so we cannot use an arbitrary train-test split to train our models. Note that for each pocket (window) of time, we need two outputs: i) the base predictions ŷi and ii) the errors in those predictions, ei. To obtain (i), we follow the usual steps for univariate time series forecasting (train on a fixed or walk-forward training window of historical actuals, and forecast over the forecast-horizon window).

To get (ii), though, we need to ensure that our target variable (the error in the base learner's prediction) and our training feature matrix are engineered correctly. Using some training features, we need to somehow predict the error at the next time steps before the actuals for those time steps are available (obviously, if we already had the actuals for the new time steps, what would we even be forecasting in the first place!). To tackle this problem, the authors propose a simple data transformation/feature engineering exercise on the actuals, called time delay embedding:

“A time series Y is a temporal sequence of values Y = {y_1, y_2, …, y_t}, where y_i is the value of Y at time i. We use time delay embedding to represent Y in an Euclidean space with embedding dimension K. Effectively, we construct a set of observations which are based on the past K lags of the time series. This is accomplished by mapping the time series Y into the embedding vectors

V_{N-K+1} = <v_1, v_2, …, v_{N-K+1}>, where each

v_i = <y_{i-(K-1)}, y_{i-(K-2)}, …, y_i>.”
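Here is a minimal sketch of time delay embedding in Python (my own illustration, not the authors' code): each row of the output holds the past K values ending at time i, and these rows become the feature vectors for the meta learners:

```python
import numpy as np

def time_delay_embedding(y, k):
    """Map a series y = [y_1, ..., y_N] into embedding vectors
    v_i = <y_{i-(K-1)}, ..., y_i>, giving an (N-K+1, K) matrix."""
    y = np.asarray(y)
    n = len(y)
    return np.stack([y[i:i + k] for i in range(n - k + 1)])

y = np.arange(1, 11)            # y_1 .. y_10
V = time_delay_embedding(y, k=3)
print(V.shape)                  # (8, 3): rows are <y1,y2,y3>, <y2,y3,y4>, ...
```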

The ADE Algorithm’s Flow Chart

The animation above just visually summarizes the main steps of the ADE algorithm that we have already described. Note that it shows only one iteration of the walk-forward training process; the same sequence of steps is repeated in the subsequent walk-forward iterations to make predictions for the upcoming timesteps.
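To make the flow concrete, here is a rough Python sketch of a single walk-forward iteration under a few assumptions of my own (pre-trained base models and meta learners exposing a scikit-learn-style .predict method, with the meta learners trained on time-delay embedding vectors to predict each base model's error); it illustrates the idea rather than reproducing the authors' implementation:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def ade_one_step(base_models, meta_models, history, k):
    """One walk-forward step: weight each base model's forecast for t+1 by a
    meta learner's estimate of that model's error on the latest K lags."""
    history = np.asarray(history, dtype=float)

    # (i) base forecasts for the next timestep (assumed .predict() API)
    y_hats = np.array([float(m.predict(history)) for m in base_models])

    # (ii) embedding vector of the last K observed values: features for Z_i
    v_t = history[-k:].reshape(1, -1)

    # (iii) each meta learner Z_i estimates the error e_i of its base model
    e_hats = np.array([float(z.predict(v_t)[0]) for z in meta_models])

    # (iv) softmax of the negative errors gives weights that sum to 1
    weights = softmax(-e_hats)

    # (v) the combined forecast is the weighted average of the base forecasts
    return float(weights @ y_hats)
```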

Hope this gives you a high-level picture of how the ADE model works. There are, of course, a few more key details that I have left out so far, details that help us tune the performance of the ADE a little better.

1) Weighting strategy: So far, we have only talked about estimating the errors in the base learners' predictions. Notice also the presence of a SoftMax layer (applied to the negative of the errors), which takes these error estimates as input and returns a probability distribution (an array of probabilities adding up to 1). This scales the errors to values between 0 and 1 and makes them easier to compare. By definition, weights must add up to 1, so this layer is a natural choice for computing them. Note, however, that this is not the only option: the authors also propose an alternative, linear weighting strategy, one that retains the relative differences between the errors. Both strategies are sketched right below.
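Here is a small sketch of the two weighting strategies (my own rendering; the linear variant uses one possible normalization, not necessarily the authors' exact formulation), where e_hat holds the estimated errors of the base models:

```python
import numpy as np

def softmax_weights(errors):
    """SoftMax of the negative errors: a lower estimated error yields a
    higher weight, and the weights always sum to 1."""
    z = np.exp(-errors - np.max(-errors))
    return z / z.sum()

def linear_weights(errors):
    """A simple linear alternative (my own choice of normalization):
    weight each model by its normalized inverse error."""
    inv = 1.0 / (errors + 1e-9)
    return inv / inv.sum()

e_hat = np.array([0.10, 0.25, 0.40])   # hypothetical estimated errors
print(softmax_weights(e_hat))
print(linear_weights(e_hat))
```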

2) K: We have already seen this parameter; it is the number of lags used to build the time-delay embedding vectors that serve as features for training the meta learners Zi. Its value determines how much recent history is taken into account when estimating the future errors of the base learners.

3) Meta model: So far we have talked about the overall ADE algorithm but not about the choice of the meta learners themselves. Good candidates are regression techniques that are themselves ensembles, such as gradient boosting machines, XGBoost and random forests. The meta model can essentially be any other regression model too, such as a linear regression, but since we are dealing with higher-dimensional feature matrices here (the time-delay embedding vectors), the former tend to be better candidates. The authors experiment with several different regression models as meta learners and present their findings in the paper, in case you are interested.

4) Error metric: This is a crucial factor for training the meta learners, since the errors of the base models are the targets for these meta learners; how exactly the errors are computed, and what scale they map to, are important considerations. A typical choice is MAPE (Mean Absolute Percentage Error), a relative metric that normalizes the errors to percentage values. However, it is prone to bias when the actuals are small: a prediction of 2 for an actual value of 1 results in a 100% MAPE, which can be misleading. A common alternative is RMSE, an absolute metric that retains the scale of the actuals. It, in turn, suffers from scaling issues, making it difficult to compare two time series on different scales. Common solutions involve applying some transformation to all the time series to first normalize them to a comparable scale. Both metrics are sketched below for reference.
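For reference, here are the standard definitions of the two metrics mentioned above, sketched with numpy:

```python
import numpy as np

def mape(actual, pred):
    """Mean Absolute Percentage Error: relative, but unstable when the
    actuals are close to zero (e.g. actual=1, pred=2 gives 100%)."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return np.mean(np.abs((actual - pred) / actual)) * 100

def rmse(actual, pred):
    """Root Mean Squared Error: absolute, keeps the scale of the actuals,
    so series on different scales are hard to compare directly."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return np.sqrt(np.mean((actual - pred) ** 2))

print(mape([1, 10], [2, 11]))   # 55.0
print(rmse([1, 10], [2, 11]))   # 1.0
```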

The parameters discussed so far concern a single iteration of the walk-forward. The last two parameters relate to a key idea that spans all iterations, one we discussed in the introduction of this article: an individual base learner cannot practically be expected to perform consistently well throughout the time series. Hence the notion of leveraging the localized expertise of some models in some regions of time and of other models in other regions. The authors therefore introduce two more parameters that are used to filter out the non-expert base models for a given region of time (a small sketch of this committee filtering follows the list):

5) Omega: The length of recent history over which the average error of each base model is evaluated, so that the models with the worst averages can be discarded from the ‘committee’ of models.

6) Alpha: The number of top base models to keep in the ‘committee’ that then has a say in the combined forecast for the given timestep.
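A rough sketch of how Omega and Alpha could be used to trim the committee (my own illustration; recent_errors is assumed to hold the per-timestep error of each base model):

```python
import numpy as np

def select_committee(recent_errors, omega, alpha):
    """Keep the indices of the alpha base models with the lowest average
    error over the last omega timesteps; only they receive weight."""
    recent_errors = np.asarray(recent_errors)      # shape: (timesteps, models)
    avg_err = recent_errors[-omega:].mean(axis=0)  # average over last omega steps
    return np.argsort(avg_err)[:alpha]             # indices of the top-alpha models

# Hypothetical errors for 4 base models over the last 6 timesteps
errors = np.random.rand(6, 4)
committee = select_committee(errors, omega=4, alpha=2)
print(committee)
```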

Wrapping Up!

Hopefully, this has given you enough of a conceptual and a mathematical intuition around the motivations behind the Arbitrated Dynamic Ensemble Model to try it out yourself! Of course there are several other things one can experiment with like normalization, feature-engineering and what not! The purpose of this article was just to give you a quick primer on a novel ensemble technique that is specialized for time series forecasting problems, but if you’re interested in more of the geeky stuff, I’d highly recommend giving the original paper a read. Trust me, it will be worth your time!

But what is a technique without a word on its applications, right? Let me quickly talk about why and how I have used this model. Unfortunately, due to company data privacy policies, I am unable to show real data here. Nonetheless, the results represent real outcomes, just with dummy labels.

A quick background on the problem statement: I work on a team that helps 40+ operational teams plan their resources so that they are better able to handle incoming demand. My role is to generate demand forecasts for each of these operational teams. We have a suite of models ranging from simple moving averages to ARIMA to exponential smoothing. The challenge, however, lies in the varied nature of these teams: no single model does well for all teams, and even for a given team, no single model does well across the entire length of the time series, which in our case is close to 4 years, because at different points in time many of the teams exhibit dynamic trends and seasonality. That is when I stumbled upon this paper, and thus the adventure began!

Below are sample results from some experiments (with different choices of weighting layer and alpha) run for one of the teams. Notice that in all three experiments, the ADE performs best in terms of MAPE compared to the other top 4 base models. (The ADE's MAPE beats that of the best base learner by about 2 percentage points, which is quite significant given the range of the RMSEs.) The meta model here was fixed to a GBM.

It can be tempting to draw generalized conclusions from such results about the best choice of hyperparameters, like the weighting strategy and alpha, but it is important to realize that what works best for one team (one time series) might not hold for another. In such situations, hyperparameter tuning methods like grid search come in quite handy. Considering also that, beyond the 6 parameters (or degrees of freedom) defined above, we have the meta model's own set of hyperparameters to tune, the number of possible combinations, and thus the model training time, grows multiplicatively. A parallelized version of grid search using Spark RDDs can help here, as described in this Databricks blog.
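As a toy illustration of how quickly the search space grows, here is a hypothetical grid over the ADE parameters (the names and values are mine, purely for illustration):

```python
from itertools import product

# Hypothetical ADE hyperparameter grid (names and values are illustrative)
grid = {
    "weighting": ["softmax", "linear"],
    "K": [5, 10, 15],
    "alpha": [2, 3, 4],
    "omega": [10, 20],
}

# Every combination to evaluate via walk-forward validation
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))   # 2 * 3 * 3 * 2 = 36 candidate configurations
```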

The point is, there are a lot of different cogs in this machine, and the best combination really depends on the data being modelled and the available set of base learner models, so don't be afraid to experiment!

For those of you who are R users, the authors of the paper have released an R package for this model called tsensembler.

We, however, built our code from scratch in Python, since the rest of our base models and the main code base were already in Python. This gave us the flexibility to generalize our version of the ADE to work with any choice of base models, unlike the R package, which limits the available base models to only a few. Unfortunately, I am unable to share any of my code here, but the fact that I could code up a basic version of the ADE from scratch just by reading the paper should be motivation enough for you to put on your solution-designer hats and get going! If you are stuck somewhere, I am more than happy to help out where I can; just drop a comment here!

A word of advice: your ADE can only do about as well as the best of your base learner models (it is their weighted average), so it is still essential to focus your attention on improving the performance of those individual models first! Also, it remains to be seen how well the ADE performs on multivariate time series problems; so far, we have tried it only with univariate base learner models.

The authors experimented with this model on several kinds of datasets and forecasting problems, ranging from energy demand forecasting and solar radiation forecasting to ozone level detection. This should give you an indication of the robustness of the algorithm across domains!

Finally, a philosophical note to wrap things up! The reason I find the ADE so beautiful is the parallels I am able to draw with life, my own in particular. I am a jack of several trades (and, I'd like to think, a master of some!). I enjoy drumming and singing on some days. On others, I turn to creating short animation films. On other days still, I simply like to jot down my thoughts. The point is, I find it really hard to keep doing tomorrow what I enjoyed doing today; that is just who I am. But realizing this fact itself gives me peace: I need to do what is really right for that moment. Leverage localized expertise!

Follow me on my LinkedIn profile or leave a comment here, and I’d be happy to respond!


Pratham Goel (Analytics Vidhya)
A Data Scientist with a background in Electrical Engineering, passionate about harnessing the power of data to make renewable energy more accessible.