Transformer Networks for Demand Forecasting

Davoud Ardali, PhD
The Telegraph Engineering
11 min read · May 12, 2021

Introduction

Continuous innovation has always been at the core of the Telegraph’s data teams, whether to create new products and services or to implement novel methods to optimise, improve and maintain our current products.

Innovation also plays a crucial role in the success of the Data Science team at the Telegraph. The team is responsible for identifying areas of the business that could benefit from predictive modelling and advanced data-driven insights and recommendations.

In line with these expectations, one of the areas of the business that has been benefiting from such predictive modelling on a large scale since 2019 is the Telegraph print circulation. As explained in our previous article, the work on optimising print circulation at the Telegraph started before 2019. Since then, various solutions have been developed and tested to maximise newspaper sales while minimising the number of unsold copies sent to retailers every day across the United Kingdom.

Having unsold copies of the newspaper left at retailers at the end of the day incurs a cost, which varies based on the product and the day of the week. Hence, the above two business KPIs, i.e. maximising newspaper sales while minimising the number of unsold copies, could be unified into a single optimisation goal for our machine learning system: maximising circulation profit. Basically, our system tries to find the right balance between supply and demand (the cost of losing sales vs the cost of ending up with unsold copies, per retailer) to maximise profit, moving away from maximise-sales-at-any-cost models.
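To make this trade-off concrete, below is a minimal, illustrative Python sketch: for a single retailer, pick the supply volume that maximises expected profit over sampled demand scenarios. All names and figures here are hypothetical, not the Telegraph’s actual model or margins.

```python
import numpy as np

def expected_profit(supply, demand_samples, margin_per_sale, cost_per_unsold):
    """Average profit for one retailer over sampled demand scenarios."""
    sales = np.minimum(demand_samples, supply)       # can't sell more than supplied
    unsold = np.maximum(supply - demand_samples, 0)  # leftover copies incur a cost
    return np.mean(margin_per_sale * sales - cost_per_unsold * unsold)

# Toy demand scenarios and candidate supply volumes (all figures hypothetical).
demand = np.random.poisson(lam=40, size=10_000)
candidates = range(0, 81)
best_supply = max(candidates, key=lambda s: expected_profit(s, demand, 1.0, 0.35))
print(best_supply)  # sits between mean and peak demand, balancing both costs
```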

Our Legacy System

In the fast-paced world of big data and machine learning, even a system that is only a few years old can easily be considered legacy if it falls behind the latest efficiency, success and maintenance standards of the most recently designed intelligent systems.

While delivering exceptional performance when it comes to maximising profit, our legacy system had a few shortcomings, some of which are discussed below.

System Complexity

Being able to predict multiple steps ahead is one of the main requirements of the Telegraph circulation optimisation system, as the planning and printing lead times vary across the different editions and variants of the daily and Sunday Telegraph titles. Flexibility in days-ahead forecasting is also required for planning around certain holidays and events.

Our legacy system relied on two separately trained neural networks to achieve multi-step-ahead forecasting: one to predict the most accurate future sales (to move a step ahead) and the other to predict the most profitable future supplies (for the final step).

This approach, also known as “Recursive Multi-step Forecasting”, is presented below for the task of generating a 3-step-ahead forecast.
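As a rough sketch of how such a recursive strategy works, consider the following; the two predictor callables stand in for the two trained networks and are assumptions, not our actual code:

```python
# Hedged sketch of the legacy recursive strategy for a 3-step-ahead forecast.
def recursive_forecast(history, predict_next_sales, predict_profitable_supply, steps=3):
    window = list(history)
    for _ in range(steps - 1):
        next_sales = predict_next_sales(window)  # network 1: most accurate sales
        window = window[1:] + [next_sales]       # slide the lookback window forward
    return predict_profitable_supply(window)     # network 2: profitable supply
                                                 # at the final step
```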

Although performant and flexible, this approach means that maintaining and updating two neural networks requires far more resources, especially when they are re-trained a couple of times a year. Also, before deployment to production, both models need to be back-tested and pass the safety checks, which significantly increases system update/upgrade times. Additionally, the code handling multi-step-ahead forecasting with two networks is computationally costly and has longer runtimes than a single-model approach. Maintaining this extra logic by itself increases the complexity of the system, which brings us to our final point: potential points of failure.

The more complex the system, the higher the chance of failure in any of its sub-systems. Compared with a single-network approach, this design therefore carries an increased risk of failure, longer debugging times, higher costs and additional cloud resource requirements.

Therefore, while modernising the system, we identified the need to reduce its complexity without impacting its performance.

Manual Supervisor Control

Our legacy system was initially designed with fully automated forecast delivery to the Telegraph wholesalers in mind. However, due to the nature of the business, there have been occasions where a manual top-up (a regional or wholesaler-level increase/alteration) was required. Additionally, retailer-level supply overrides are occasionally required, as retailers can request a fixed level of supply on certain days of the week.

Although these additional components were included as add-ons to the system to allow for such functionality, they were never part of the core system design. Hence, the alteration and override functionalities were never as efficient and user-friendly as they could be. This resulted in longer runtimes for alteration and override scripts, and additional technical support was occasionally required for manual alteration re-runs.

Additionally, in this system, high-level summaries of insights regarding the daily forecasts were emailed to the relevant Telegraph teams, which made tracking and monitoring the forecasts internally more complex.

Therefore, our internal teams could significantly benefit from a unified and user-friendly environment for occasional control and monitoring of the forecasts at a regional, wholesaler or retailer level.

Our Modernised System

Simply put, everything has changed in our updated system. Fresh scripts were written for training data generation, data pre-processing, neural network training and architecture, forecast post-processing, applying alterations, monitoring and cloud deployment.

Direct Multi-step-ahead Forecasting

One of the main changes in the modernised circulation optimisation system is switching the forecasting strategy from “recursive” to “direct”. This means that our modern system forecasts multiple days ahead at once. By setting a long enough steps-ahead window (a long enough output sequence length), we were able to ensure that the requirements for advance print planning are met under any conditions. Below is an example of a 3-step-ahead forecast executed by this strategy.
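The direct counterpart, sketched under the same assumptions as the recursive example earlier, needs a single model and a single forward pass:

```python
# Hedged sketch of the direct strategy; `predict_supplies` stands in for our
# network (an assumption) and returns one supply volume per future day.
def direct_forecast(history, predict_supplies, steps=3):
    supplies = predict_supplies(history)  # all steps at once, no feedback loop,
    return supplies[:steps]               # so per-step errors cannot accumulate
```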

This approach results in simpler data pre-processing and eliminates the “accurate sales forecasting” neural network from our legacy two-network system. Additionally, the direct strategy avoids the accumulation of forecast errors in later steps that recursive strategies are prone to.

Sequence-to-Sequence Neural Network

The biggest change happened at the core of the system, the neural network architecture. We designed the new architecture to ingest historical sales sequences alongside their time-variant and time-invariant metadata (retailer types, locations, holidays, etc.) and to output a sequence of optimal (profitable) supplies for multiple days in the future. This could also be described as a multivariate multi-step-ahead forecasting network.

As mentioned earlier, this approach reduces the overall system complexity. It results in an improved and faster training/retraining process, decreases forecast generation run times and requires significantly fewer cloud and technical resources to run and maintain.

This not only eliminates the need to maintain separate logic that controls multiple neural networks to forecast multiple days ahead, but also simplifies deploying the model as a single API with an easily upgradable core network.

By scheduling this API to generate predictions every day, forecasts for multiple days ahead are always available in our forecast tables on Google Cloud BigQuery. Therefore, even if, in an extremely rare situation, a scheduled run fails, days-ahead forecasts (although slightly less accurate) will always be available until the issue with the run is rectified. This also acts as a safety feature, increasing the overall reliability of the system.

Our optimisation problem, maximising distribution profit across all Telegraph retailers in the UK with their varied characteristics, sales volumes, seasonality and trends, is highly non-convex. Designing the neural network architecture was therefore one of the main challenges of this new sequence-to-sequence network.

In recent years, Transformer Networks have proven extremely successful in sequential modelling, especially for Natural Language Processing (NLP) use cases. We decided to design a transformer-based architecture to forecast multivariate time series instead of sequences of words and letters.

Although we will not go into the details of the network and block architecture in this article, the final network is based on the one introduced in the original transformer paper, “Attention Is All You Need” (Vaswani et al., 2017). However, various changes were made to the original network to adapt it for time-series forecasting. These changes include, but are not limited to, the removal of “word embeddings”, replacing the original “Positional Encoding” with our custom “Sequential Encoding” layer and transforming the decoder outputs from probabilities to linear outputs.

Additionally, our input sequence (lookback window) supports up to a year’s worth of historical data and our output sequence has a length of 14, for a maximum of two-weeks-ahead forecasting. The illustration below presents a very simplified version of this architecture.
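To complement the illustration, here is a heavily simplified, hypothetical Keras sketch in the spirit of this design: real-valued features are linearly projected instead of using word embeddings, a learned position embedding stands in for our custom “Sequential Encoding” layer, and the output head is linear with one value per day of the 14-day horizon. All dimensions and layer counts are illustrative assumptions, not the production architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

LOOKBACK, HORIZON, N_FEATURES, D_MODEL = 365, 14, 16, 64  # illustrative sizes

class PositionStandIn(layers.Layer):
    """Learned position embedding, standing in for the custom encoding layer."""
    def __init__(self, length, d_model, **kwargs):
        super().__init__(**kwargs)
        self.pos_emb = layers.Embedding(length, d_model)

    def call(self, x):
        positions = tf.range(tf.shape(x)[1])
        return x + self.pos_emb(positions)

inputs = layers.Input(shape=(LOOKBACK, N_FEATURES))  # sales + metadata features
x = layers.Dense(D_MODEL)(inputs)                    # project real-valued features
x = PositionStandIn(LOOKBACK, D_MODEL)(x)            # (no word embeddings needed)

# One encoder block in the style of the original transformer (residual + norm).
attn = layers.MultiHeadAttention(num_heads=4, key_dim=D_MODEL)(x, x)
x = layers.LayerNormalization()(x + attn)
ff = layers.Dense(4 * D_MODEL, activation="relu")(x)
x = layers.LayerNormalization()(x + layers.Dense(D_MODEL)(ff))

x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(HORIZON)(x)                   # linear outputs: one supply
model = tf.keras.Model(inputs, outputs)              # volume per day ahead
```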

Distributed Training

Prior to training, our data gathering and pre-processing modules prepare and save our training and validation data in serialised TFRecord format (using protobuf) on Google Cloud Storage. We designed the pre-processing in a way that the same module can be used later to prepare the real-time data for daily forecasting, simplifying the overall approach and avoiding errors in the future.
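Serialising a single training window into a TFRecord Example might look like the hedged sketch below; the feature names and the Cloud Storage path are placeholders, not our actual schema:

```python
import tensorflow as tf

def to_example(sales_window, metadata, target_supplies):
    """Pack one training window into a tf.train.Example (hypothetical schema)."""
    feature = {
        "sales": tf.train.Feature(float_list=tf.train.FloatList(value=sales_window)),
        "meta": tf.train.Feature(float_list=tf.train.FloatList(value=metadata)),
        "target": tf.train.Feature(float_list=tf.train.FloatList(value=target_supplies)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# TFRecordWriter writes directly to a Cloud Storage path (placeholder bucket).
with tf.io.TFRecordWriter("gs://your-bucket/train/shard-0001.tfrecord") as writer:
    writer.write(to_example([12.0, 15.0, 9.0], [1.0, 0.0], [14.0]).SerializeToString())
```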

The TFRecord files enable a parallel inflow of data during training via TensorFlow’s tf.data API. This allows us to optimise CPU and GPU utilisation and minimise their idle time, resulting in a significantly faster training process. It also allows us to read the data in batches directly from cloud storage and pass them through to the neural network during training. Therefore, there is no need to load the complete training data into RAM, giving us virtually unlimited headroom on training data size. This is particularly useful for this project, as we have terabytes of historical training data available to us.
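A sketch of such a pipeline, assuming the hypothetical schema from the writer above, could look like this:

```python
import tensorflow as tf

# Shapes must match what the (hypothetical) writer serialised.
FEATURES = {
    "sales": tf.io.FixedLenFeature([3], tf.float32),
    "meta": tf.io.FixedLenFeature([2], tf.float32),
    "target": tf.io.FixedLenFeature([1], tf.float32),
}

def parse(record):
    parsed = tf.io.parse_single_example(record, FEATURES)
    return (parsed["sales"], parsed["meta"]), parsed["target"]

files = tf.data.Dataset.list_files("gs://your-bucket/train/*.tfrecord")
dataset = (files
           .interleave(tf.data.TFRecordDataset,
                       num_parallel_calls=tf.data.AUTOTUNE)  # parallel reads
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(10_000)
           .batch(256)
           .prefetch(tf.data.AUTOTUNE))                      # keep the GPUs fed
```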

The training process for our network takes place on Google AI Platform. Snapshots of our training scripts and the model architecture files are initially uploaded to Google Cloud Storage. Google AI Platform then initiates the training process using our training script.

As our network has several million trainable parameters, training on a CPU or even a single GPU is slow and infeasible. Instead, we utilise TensorFlow’s distributed training strategies, more specifically the “Mirrored Strategy”, to train our network on 8 high-end GPUs at the same time.

The “Mirrored Strategy” simply creates one replica per GPU on the device running the training. Each network variable is then mirrored across all replicas, forming a single conceptual variable called a “Mirrored Variable”. By applying identical updates, these variables are always kept in sync with each other. Efficient all-reduce algorithms communicate the variable updates to all devices, significantly decreasing the synchronisation overhead.
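In TensorFlow this takes only a few lines; the sketch below assumes a `build_model()` helper (standing in for the transformer construction shown earlier) and the `dataset` pipeline from the tf.data sketch:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()      # one replica per visible GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                           # variables created in this scope
    model = build_model()                        # are mirrored across replicas;
    model.compile(optimizer="adam", loss="mse")  # build_model() is an assumed helper

# Keras runs the all-reduce of gradients across replicas at every step.
model.fit(dataset, epochs=10)                    # `dataset` from the tf.data sketch
```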

The training script running on Google AI Platform saves checkpoints during the training on Google Cloud Storage and ultimately saves the fully trained network (a.k.a. trained model) before termination of the training process.

The trained network can then be served as an API under the “models” section of Google AI Platform. This enables us to monitor the requests, response times and overall performance of our forecasting API without coding the API manually.
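Requesting online predictions from a model deployed this way could look like the sketch below; the project name, model name and instance payload are placeholders:

```python
from googleapiclient import discovery

# Build a client for the AI Platform (ml v1) online prediction service.
service = discovery.build("ml", "v1")
name = "projects/your-project/models/circulation_forecaster"  # placeholder names

response = service.projects().predict(
    name=name,
    body={"instances": [{"sales": [12.0, 15.0, 9.0], "meta": [1.0, 0.0]}]},
).execute()

print(response["predictions"])  # e.g. 14 days-ahead supply volumes per instance
```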

Control Sheet

As mentioned earlier, one important aspect of our system is occasional manual supervision over the forecasts being communicated to Telegraph wholesalers. As this process is very time-restricted, we needed to ensure any occasional high-level top-ups could be applied swiftly before providing the final forecasts for next-day distribution.

Therefore, we decided to design a unified Google Sheets-based control environment to supervise the forecasts both at the retailer level and at the area/high level, apply any high-level top-ups by area or retailer type and repeat this process as required before submitting to wholesalers for distribution.

This required us to write our own logic to communicate efficiently with the Google Sheets API and to upload large amounts of data to multiple tabs swiftly and in parallel. Having our own sheet-control functions ensures a smooth run of our daily forecast and wholesaler communication process, and it cannot be impacted by delays in third-party packages adapting to new Google Sheets API updates.
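As an illustration, pushing rows to several tabs in one batched request via the official client could look like this; the spreadsheet ID, tab names and values are placeholders, not our actual sheet:

```python
from googleapiclient import discovery

sheets = discovery.build("sheets", "v4")  # assumes default application credentials

body = {
    "valueInputOption": "RAW",
    "data": [  # one entry per tab; a single batched call updates them all
        {"range": "Forecast Day 1!A2", "values": [["R001", 24], ["R002", 31]]},
        {"range": "Summary!A2", "values": [["Wholesaler A", 5500]]},
    ],
}
sheets.spreadsheets().values().batchUpdate(
    spreadsheetId="your-spreadsheet-id", body=body
).execute()
```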

Our control sheet contains four main categories of tabs that are briefly described below:

  • Alteration Tabs: To enter occasional top-ups by percentage or volume per area or wholesaler and to import retailer-requested volumes.
  • Summary Tab: To show a per-wholesaler and per-day-ahead summary of forecast volumes before and after alterations, plus the predicted impact of the alterations on the profit of that specific region or day.
  • Forecast Tabs: To display retailer-level forecasted supplies for all available days in the future (up to the maximum sequence length of the network, i.e. 14 days ahead).
  • Process Tab: To trigger the function that applies the requested alterations and generates new forecasts, forecast tabs and summary tabs. Also to trigger the “package and send email” function.

The “apply alterations” and “package and send email” functions in the control sheet are in fact Google Apps Script functions that trigger the run of the relevant pod on Google Kubernetes Engine (GKE). Each function also saves a snapshot of the applied alterations for future analysis, including analysis of their profit impact.

Cloud Architecture

Every component and module of our modernised system is either deployed on or runs on the Google Cloud Platform. This includes the raw data, the pre-processing scripts, the training and validation data generators, the trained network’s API, the scripts that receive daily forecasts from the API and post-process them, the post-processed forecasts themselves and ultimately the control sheet and its relevant functions.

The main components of the Google Cloud Platform that are used extensively across the project are the following:

  • Google Cloud Storage: raw data, TFRecord training/validation files, training script snapshots and model checkpoints.
  • Google AI Platform: distributed training and serving of the trained network as an API.
  • Google BigQuery: the forecast tables holding the daily multi-day-ahead forecasts.
  • Google Kubernetes Engine: the pods running the alteration and forecast-packaging functions.
  • Google Sheets and Apps Script: the control sheet and its trigger functions.

A high-level and summarised overview of the cloud architecture for our modernised circulation optimisation system is presented below.

Conclusions

Having identified the shortcomings of our legacy (yet very performant) system for optimising the Telegraph circulation, we aimed to approach the problem from a slightly different perspective. We successfully modernised all components of the system and significantly reduced its complexity, as well as the resources required to run and maintain it.

Our new approach not only utilised the latest deep learning and cloud technologies to achieve great performance more efficiently, but also paved the way for reusing the developed system in other applications across the Telegraph.

This underlines the fact that constant maintenance and modernisation of machine-learning-based systems is not only beneficial but essential in such a fast-paced and innovative environment.

Davoud Ardali is the Lead Data Scientist at The Telegraph. Follow him on LinkedIn & Twitter.
