DeepFactors for forecasting, architecture review

Alvaro Durán Tovar
Deep Learning made easy
5 min read · Nov 13, 2020

Today I’m going to be reviewing a deep learning model architecture for forecasting.

Photo by Johannes Plenio on Unsplash

Until the M4 competition everyone thought traditional statistical models were superior for forecasting, as happened in many other fields I would say. Today it seems deep learning is the best solution almost always… at least when there is enough data. Traditional forecasting techniques like ARIMA still seem to do better on small datasets.

What about huge datasets where we have many different time series? Like Amazon, for example, or any retail business. It would be amazing to throw all of that into a single model and make predictions for each product/user/… individually.

And it would be even better to be able to use exchangeable time series. For example, if we have the history for a given author, another book by the same author on the same topic will probably evolve in a similar way, and so might a book from another author on the same topic.

And even better if we can solve the cold-start problem for brand new products by using the data we have for similar ones.

Too good to be true… well actually that’s what DeepFactors does!

Paper Deep Factors for Forecasting: https://arxiv.org/abs/1905.12417
Older paper about the same architecture: https://arxiv.org/abs/1812.00098

I have to admit I struggled to understand it (well, that happens to me with all papers...), so to make it easier for you here is a list of the relevant terms, but first an excerpt from the paper:

Our new method is data-driven and scalable via a latent, global, deep component. It also handles uncertainty through a local classical model. We provide both theoretical and empirical evidence for the soundness of our approach through a necessary and sufficient decomposition of exchangeable time series into a global and a local part. Our experiments demonstrate the advantages of our model both in terms of data efficiency, accuracy and computational complexity.
[…]
In this paper, we propose a novel global-local method, Deep Factor Models with Random Effects. It is based on a global DNN backbone and local probabilistic graphical models for computational efficiency. The global-local structure extracts complex non-linear patterns globally while capturing individual random effects for each time series locally.

Fixed? Random? Global? Local? What’s that?

  • Fixed effect model: This just means that the model is deterministic; the output is always the same for the same input.
  • Random effect model: The output is random, drawn from some probability distribution, i.e. non-deterministic.
  • Global model: Here it refers to a model that learns information shared across many time series.
  • Local model: One model per time series. Traditional statistical methods such as ARIMA are of this type.
  • Global-local: Some combination of high-level information shared between time series and specific information for each individual time series. First the global information is computed, then it is used to predict each local, specific, time series.

Architecture

In the following graph we have the theoretical architecture of the model. A note about “g” and “r”:

g: global effect, common to all series (no i subindex).
r: random effect, specific to each series.
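
If I read the paper correctly, the decomposition for series i looks roughly like this (K global factors g_k, per-series weights w_i,k; the notation is slightly simplified by me):

u_i(t) = w_i,1 · g_1(t) + … + w_i,K · g_K(t) + r_i(t)
z_i,t ~ p(z_i,t | u_i(t))

So the fixed effect is a weighted combination of the shared global factors, and r_i adds the per-series randomness on top.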

Ok, not much more to say about it; you have all you need to understand it in the paper. Let’s take a look at the implementation in gluon-ts.

DeepFactors gluon-ts architecture

The following schema is based on the gluon-ts repo code. I always find it useful to see how the esoteric maths from a paper gets implemented.

Global model: The global model only receives the time features, that is, the past data of the time series plus some other useful features like hour of the day, day of the week, etc.
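
Just to make “time features” concrete, here is the kind of input matrix I have in mind (purely illustrative, not the actual gluon-ts feature pipeline):

import numpy as np
import pandas as pd

# hypothetical time-feature matrix for one series: past values + calendar covariates
index = pd.date_range("2020-01-01", periods=24, freq="H")
past_values = np.random.rand(24)
features = np.stack(
    [
        past_values,               # the series' own past data
        index.hour / 23.0,         # hour of the day, scaled to [0, 1]
        index.dayofweek / 6.0,     # day of the week, scaled to [0, 1]
    ],
    axis=-1,
)                                  # shape: (24, 3)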

Deep Factors: The global deep factors come from applying a matrix multiplication to the output of the LSTM, which is almost the same as saying that we project that output into a space of a different dimension; in the paper they use 10 as the rank of this new space.
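
A minimal sketch of how I picture that global path, using plain MXNet Gluon (the layer sizes and names are my assumptions, not the actual gluon-ts code):

from mxnet import nd, gluon

batch, seq_len, num_time_feat, num_factors = 4, 24, 3, 10

global_rnn = gluon.rnn.LSTM(hidden_size=32, layout="NTC")   # hypothetical global LSTM
to_factors = gluon.nn.Dense(num_factors, flatten=False)     # the "matrix multiplication" down to 10 factors
global_rnn.initialize()
to_factors.initialize()

time_feat = nd.random.normal(shape=(batch, seq_len, num_time_feat))
hidden = global_rnn(time_feat)          # (batch, seq_len, 32)
global_factors = to_factors(hidden)     # (batch, seq_len, 10): the global deep factors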

Fixed effects: Now, given those global factors, we want to use them differently per category, so an attention block with the categories as input is used. Passing the category through an embedding layer should help the model focus on different parts of the factors for each exchangeable time series.
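
A toy version of that idea, as I understand it; here the “attention” is just a category embedding turned into softmax weights over the factors (the real code may differ):

from mxnet import nd, gluon

batch, seq_len, num_factors = 4, 24, 10
global_factors = nd.random.normal(shape=(batch, seq_len, num_factors))  # stand-in for the global factors above

cat_embedding = gluon.nn.Embedding(input_dim=50, output_dim=num_factors)  # 50 hypothetical categories
cat_embedding.initialize()

category = nd.array([0, 7, 7, 42])                       # one category id per batch item
weights = nd.softmax(cat_embedding(category), axis=-1)    # (batch, 10): attention-like weights over the factors
fixed_effect = nd.sum(global_factors * weights.expand_dims(axis=1), axis=-1)  # (batch, seq_len)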

And with all of this we have the deterministic part of the model: the average, without any kind of uncertainty attached. The output is 1 number per batch item.

Random effects: This is calculated from the input features concatenated with the embedding of the category. The output is 1 number per batch item.
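
Again a simplified sketch of how I picture the random effects path (my own toy version with a single Dense layer; the shapes and layers are assumptions):

from mxnet import nd, gluon

batch, seq_len, num_feat, emb_dim = 4, 24, 3, 8

embed = gluon.nn.Embedding(input_dim=50, output_dim=emb_dim)  # ideally a separate embedding, see "Improvements"
sigma_net = gluon.nn.Dense(1, flatten=False)                  # 1 number per batch item
embed.initialize()
sigma_net.initialize()

features = nd.random.normal(shape=(batch, seq_len, num_feat))
category = nd.array([0, 7, 7, 42])
cat_emb = embed(category).expand_dims(axis=1).broadcast_axes(axis=1, size=seq_len)  # (batch, seq_len, 8)
raw = sigma_net(nd.concat(features, cat_emb, dim=2)).squeeze(axis=2)                # (batch, seq_len)
sigma = nd.Activation(raw, act_type="softrelu")   # log(exp(x) + 1) keeps the value positive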

And now we have 2 numbers we can use as the parameters of a Gaussian distribution (the code is based on the Gaussian process alternative).
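
With the mean from the fixed effects and the standard deviation from the random effects, training can then, for example, minimise the Gaussian negative log-likelihood. A tiny numpy version of that idea (not the gluon-ts loss code):

import numpy as np

def gaussian_nll(target, mu, sigma):
    # negative log-likelihood of target under N(mu, sigma^2), averaged over all points
    return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2) + (target - mu) ** 2 / (2 * sigma ** 2))

target = np.array([1.0, 2.0, 0.5])   # observed values
mu = np.array([0.9, 2.2, 0.4])       # fixed effects output
sigma = np.array([0.3, 0.5, 0.2])    # random effects output
loss = gaussian_nll(target, mu, sigma)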

Improvements

I’m far from being smarter than the people who coded this beautiful thing, but there are some things I would like to try to see if they help:

  • QRNNs (https://arxiv.org/abs/1611.01576): for improved training speed.
  • Activations: there are no activation layers anywhere, if I understood the code correctly.
  • Different embedding objects for the fixed effects module and the random effects module; it’s shared in gluon-ts.
  • Forcing the standard deviation, the output of the random effects module, to always be greater than 0 (using mx.nd.clip for example, as sketched below). In gluon-ts they use this neat trick:

log(exp(x) + 1)

That function (softplus) has an asymptote at 0, so it should never output exactly 0. But it can get so close to 0 that, for a computer, it is indeed 0, causing NaNs.
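
The tweak I have in mind would be something along these lines (a sketch; the 1e-4 floor and the upper cap are arbitrary values of mine):

from mxnet import nd

raw = nd.array([-40.0, 0.0, 3.0])
sigma = nd.Activation(raw, act_type="softrelu")   # log(exp(x) + 1), i.e. softplus
sigma = nd.clip(sigma, a_min=1e-4, a_max=1e6)     # force a small positive floor so sigma never hits 0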
