A Better Way? Forecasting with Embeddings

Written by Vince F, Emilio Lapiello & Andrew Fowler

There are a wide variety of models and tools designed to tackle time series forecasting problems: ARIMAX, exponential smoothing, Kalman filters, RNNs, and LSTMs, to name just a few. But the time series techniques most data scientists typically use to leverage historical data don’t always yield the desired level of granularity, and their implementation can be complicated or even impossible in certain scenarios.

There might be a better way. We find that a feedforward neural network with embeddings layers constitutes a straightforward and interesting non-recurrent deep learning architecture that provides excellent forecasting results and shows clear benefits, especially when:

1. The historical data available is limited. Traditional time series methods need multiple seasonal cycles to perform optimally. In practice, that means a model must be trained on multiple years of data — which are frequently not available.

2. There are many distinct, but related, time series. When generating multiple related forecasts (for example, forecasting sales at each of a retailer’s stores), traditional time series techniques (e.g., ARIMAX) must be applied to each store independently. That makes it hard for a single model to leverage information shared across all stores, such as system-wide trends or shopper behavior during holidays.

3. There are many high-cardinality categorical variables. Common techniques for handling categorical variables, such as dimensionality reduction or one-hot encoding, can be time-consuming and computationally complex, and may require extensive model regularization.

For instance, in a multi-store setting, the model can learn from a mix of exogenous and autoregressive features. The embedding layers allow the model to learn from distinct stores’ time series at once by embedding the store IDs, and to encode categorical features (e.g., holidays, weather, geography) in a meaningful, lower-dimensional space from which valuable information can be extracted.

An example of forecasting with embeddings

Let’s take an example of a classic sales forecasting problem for a large retailer, where we have access to weekly sales data from thousands of stores. We can easily manipulate and augment the data to obtain the following feature set:

  • Past sales data (e.g., 1, 2, …, n weeks ago)
  • Time-related variables (e.g. month, week of year, day of week, day of month)
  • Store ID and/or department ID
  • Promotion features and sales calendar
  • Holidays (e.g. Christmas, Labor Day, etc.)
  • Weather data (e.g. temperature, pressure, and precipitation forecast)
  • Geographic data (e.g. state, city, ZIP code, store accessibility)

Each data point is a pair (x⁽ⁱ⁾, y⁽ⁱ⁾), where y⁽ⁱ⁾ is the weekly sales and x⁽ⁱ⁾ is the corresponding feature vector.
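As a rough sketch of how such a feature set might be assembled, assuming a hypothetical weekly_sales.csv with store_id, date, holiday, and sales columns (the file name and column names are illustrative, not the actual dataset’s schema):

```python
import pandas as pd

# Hypothetical raw data: one row per store per week.
# Column names (store_id, date, holiday, sales) are illustrative.
df = pd.read_csv("weekly_sales.csv", parse_dates=["date"])
df = df.sort_values(["store_id", "date"])

# Autoregressive features: sales 1, 2, ..., n weeks ago, computed per store.
n_lags = 4
for lag in range(1, n_lags + 1):
    df[f"sales_lag_{lag}"] = df.groupby("store_id")["sales"].shift(lag)

# Time-related features.
df["month"] = df["date"].dt.month
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)

# Categorical features are kept as integer codes so they can later be
# fed to embedding layers instead of being one-hot encoded.
df["store_idx"] = df["store_id"].astype("category").cat.codes
df["holiday_idx"] = df["holiday"].astype("category").cat.codes

# Each data point (x_i, y_i): the feature vector and the weekly sales target.
df = df.dropna()
X = df.drop(columns=["sales", "date", "store_id", "holiday"])
y = df["sales"]
```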

The embeddings

Some of the features listed above are categorical with high cardinality. For instance, x_storeid takes as many values as there are stores in the dataset (e.g., thousands of values). Using a classic encoding method such as one-hot encoding might not be very effective, as it blows up the dimensionality of the input feature vector and greatly increases its sparsity.

One way to deal with such features is to use embeddings. As the TensorFlow team notes, “an embedding…stores categorical data in a lower-dimensional vector than an indicator column.” If the feature lives in ℝᴺ, an embedding is simply a map ℝᴺ → ℝᵐ, where m is much smaller than N. The corresponding vector in ℝᵐ is called the embedded vector, and it is much denser and smaller than its original representation in ℝᴺ.

For those familiar with models such as word2vec, it’s exactly the same concept. Note that the dimension m of the latent space ℝᵐ is one of the model’s hyperparameters. In other words, the choice of m will impact model performance and, as such, needs to be made carefully.
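There is no universal recipe for choosing m. One commonly cited rule of thumb is to start around the fourth root of the number of categories and tune from there; the helper below is purely illustrative and only gives a starting point:

```python
def suggested_embedding_dim(n_categories: int) -> int:
    """Illustrative starting point for the embedding size m (fourth-root rule of thumb).

    m is a hyperparameter: treat this value as an initial guess to be tuned,
    not as a prescription.
    """
    return max(2, round(n_categories ** 0.25))

# e.g., roughly 6 dimensions for ~1,000 stores, 2 for a handful of holiday states
print(suggested_embedding_dim(1000), suggested_embedding_dim(6))  # -> 6 2
```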

Embedding layers’ weights are initialized randomly and then learned with the rest of the neural network’s parameters via backpropagation, just like any other layer.

The optimized weights minimize the error on the task the neural network is trying to perform, which in our case is predicting sales. It could be that if two holidays are similar in terms of how they impact sales, their embedded vector representations will end up very close to one another in the latent space ℝᵐ. In effect, the learning algorithm is given the flexibility to transform certain input features into something more meaningful in order to minimize its prediction error (i.e., its loss).

Let’s take a concrete example. Assume there are only six mutually exclusive holiday states: no_holiday, columbus_day, independence_day, christmas, labor_day, new_year.

We can build and train an embedding layer for this feature mapping ℝ⁶ to ℝ³. The sparse one-hot representation of independence_day is x_holiday = [0, 0, 1, 0, 0, 0], while its embedded representation as a dense vector in ℝ³ might be, for instance, [4.2, -0.8, -1.8].
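As a minimal sketch in Keras (the layer name and the numeric values are purely illustrative), this amounts to an Embedding layer that maps the integer-encoded holiday state into ℝ³:

```python
import tensorflow as tf

holiday_states = ["no_holiday", "columbus_day", "independence_day",
                  "christmas", "labor_day", "new_year"]

# Integer-encode the categorical feature: independence_day -> index 2,
# equivalent to the one-hot vector [0, 0, 1, 0, 0, 0].
holiday_to_idx = {name: i for i, name in enumerate(holiday_states)}

# Embedding layer mapping the 6 holiday states (R^6 one-hot) into R^3.
holiday_embedding = tf.keras.layers.Embedding(
    input_dim=len(holiday_states),  # N = 6 categories
    output_dim=3,                   # m = 3 latent dimensions
    name="holiday_embedding",
)

# Before training these values are random; after training they are learned
# jointly with the rest of the network via backpropagation.
idx = tf.constant([holiday_to_idx["independence_day"]])
print(holiday_embedding(idx).numpy())  # a dense vector of shape (1, 3)
```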

Once trained, we can extract the weights of the embedding layer and visualize the mapping, as shown here:

Embedded representation of the ‘holiday’ feature

As we can see, Columbus Day and Independence Day are closer to one another in the latent space, and therefore more similar, than either is to the no-holiday state.
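For example, assuming the trained network is a Keras model containing the holiday_embedding layer from the sketch above, the learned mapping could be pulled out and plotted along these lines (a hedged sketch, not the exact plotting code behind the figure):

```python
import matplotlib.pyplot as plt

def plot_holiday_embeddings(model, holiday_states):
    """Extract the learned embedding matrix (one row per holiday state) and
    plot its first two latent dimensions as a rough similarity check."""
    weights = model.get_layer("holiday_embedding").get_weights()[0]  # shape (6, 3)
    plt.scatter(weights[:, 0], weights[:, 1])
    for i, name in enumerate(holiday_states):
        plt.annotate(name, (weights[i, 0], weights[i, 1]))
    plt.xlabel("embedding dimension 1")
    plt.ylabel("embedding dimension 2")
    plt.title("Embedded representation of the 'holiday' feature")
    plt.show()
```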

The architecture

In its simplest form, the neural network architecture used in this problem is shown here:

A basic Neural Network architecture with Embeddings

In considering the architecture, it’s important to keep in mind the following:

  • Not all features require embeddings; each one should be treated on a case-by-case basis (e.g., normalization, encoding, binning, etc.)
  • The embedded features are fed in as single integer indices and only implicitly one-hot encoded inside the embedding layer; that’s why they appear as one-dimensional in the input layer.
  • Deep learning architectures tend to perform better when fed relatively raw data without much feature engineering. For example, we can feed in the past 1, 2, …, n weeks’ worth of sales and let the learning algorithm build the autoregressive features it needs (e.g., a rolling mean) from the raw data in order to minimize the loss (see the code sketch after this list).
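To make this concrete, here is a minimal Keras sketch of such an architecture; the layer sizes, feature names, and cardinalities are illustrative assumptions rather than the exact model used in this article:

```python
import tensorflow as tf
from tensorflow.keras import layers

n_stores, n_holidays = 1000, 6   # illustrative cardinalities
n_numeric = 10                   # e.g., lagged sales, week of year, weather, ...

# Categorical features enter as integer indices and go through embeddings.
store_in = layers.Input(shape=(1,), name="store_idx")
holiday_in = layers.Input(shape=(1,), name="holiday_idx")
numeric_in = layers.Input(shape=(n_numeric,), name="numeric_features")

store_emb = layers.Flatten()(
    layers.Embedding(n_stores, 10, name="store_embedding")(store_in))
holiday_emb = layers.Flatten()(
    layers.Embedding(n_holidays, 3, name="holiday_embedding")(holiday_in))

# Concatenate embedded categorical features with the (normalized) numeric ones,
# then pass everything through a small feedforward stack.
x = layers.Concatenate()([store_emb, holiday_emb, numeric_in])
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
output = layers.Dense(1, name="weekly_sales")(x)

model = tf.keras.Model(inputs=[store_in, holiday_in, numeric_in], outputs=output)
model.summary()
```

Only the high-cardinality categorical inputs go through embeddings; the numeric features are concatenated directly, in line with the considerations above.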

The training

The parameters θ of the neural network f are trained via the usual optimization problem, which is aimed at minimizing the cross-entropy between the training data and the model distribution:

J(θ) = − E_{x,y ∼ p̂_data} [ log p_model(y | x) ]

If we let:

p_model(y | x) = 𝒩( y; f(x; θ), I )

we then recover the mean squared error loss:

J(θ) = ½ E_{x,y ∼ p̂_data} ‖ y − f(x; θ) ‖² + const

which yields, summed over the M training examples:

J(θ) = (1 / 2M) Σ_{i=1..M} ( y⁽ⁱ⁾ − f(x⁽ⁱ⁾; θ) )²

The above loss is then minimized via stochastic gradient descent. For more information, see the book “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, from which this formulation is taken.
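In Keras terms, this is simply the model above compiled with a mean squared error loss and a (mini-batch) stochastic gradient descent optimizer; the learning rate, batch size, number of epochs, and training arrays below are placeholders:

```python
# Minimize the MSE loss with mini-batch stochastic gradient descent.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3), loss="mse")

# store_idx_train, holiday_idx_train, numeric_train, and y_train are assumed
# to be the arrays prepared in the feature-engineering sketch earlier.
model.fit(
    {"store_idx": store_idx_train,
     "holiday_idx": holiday_idx_train,
     "numeric_features": numeric_train},
    y_train,
    batch_size=256,
    epochs=20,
    validation_split=0.1,
)
```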

Putting it to the test

Using a publicly available data set of weekly sales from a large retailer, we trained and tested such an architecture and obtained solid results.

As a benchmark, we used simple ARIMA(3,1,0) models fitted to the time series of each individual store/department and run as one-step-ahead forecasts. We retrained those models every time a new data point was observed.
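A hedged sketch of that benchmark for a single store/department series, using statsmodels (the refit-at-every-step loop is what makes this approach expensive at scale):

```python
from statsmodels.tsa.arima.model import ARIMA

def rolling_one_step_arima(series, n_test, order=(3, 1, 0)):
    """Refit an ARIMA model each time a new point is observed and
    predict one step ahead over the last n_test observations."""
    history = list(series[:-n_test])
    predictions = []
    for actual in series[-n_test:]:
        fitted = ARIMA(history, order=order).fit()
        predictions.append(float(fitted.forecast(steps=1)[0]))
        history.append(actual)  # the new observation becomes available
    return predictions
```

This per-series refitting has to be repeated for every store/department, which is exactly the overhead a single embedding-based model avoids.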

We also trained and tested a random forest to assess the performance of a simple off-the-shelf, non-linear model against the same dataset. The metrics below are computed on unseen test data pulled from about 180 stores/departments over a period of two and a half years:

More interestingly, the power of the embeddings is highlighted when we compute the performance metrics exclusively on the holiday data points, which tend to have erratic weekly sales behavior:

A potentially better way

Deep learning has traditionally been seen as an effective tool, but one whose application was mainly limited to specific prediction and classification tasks involving image, sound, and text data. When it is applied to time series and forecasting, recurrent architectures (e.g., RNNs, LSTMs) are usually the preferred choice.

But when there are numerous high-cardinality categorical variables, the available historical data is limited, and/or there are multiple distinct but related time series, a simple feedforward architecture with embeddings can offer an easier, more effective, and more efficient forecasting tool than traditional time series techniques.

Its advantages become especially relevant when generating predictions for uncommon days such as holidays, where traditional methods tend to struggle and may require dedicated models and tuning.

For additional insights, be sure to check out fast.ai, where Jeremy Howard and Rachel Thomas, along with Sylvain Gugger, write extensively about the power of deep learning. Another good resource is Howard’s video series on YouTube.
