Context-Aware Fast Food Recommendation at Burger King with RayOnSpark

Jason Dai
Jul 8, 2020

Authors: Luyang Wang, Kai Huang, Jiao Wang, Shengsheng Huang, Jason Dai

Deep learning based recommendation models have been widely used in real-world recommendation systems. A common approach concatenates the user and item embedding vectors and feeds them into an MLP (multilayer perceptron) to generate the final predictions. However, such methods fail to capture real-time user behavior signals and do not take important context features (such as time and location) into consideration; as a result, the final recommendations do not reflect real-time user preferences well. User behavior sequences and context features are even more important for fast food recommendation because:

  1. Users are unlikely to purchase another soft drink when they already have one in their cart.
  2. User purchase preferences can change drastically with location, time, and current weather conditions. For example, people almost never buy kids' meals at midnight and are very unlikely to buy frozen drinks on a cold rainy day.

In this blog post, we present our Transformer Cross Transformer (TxT) model that exploits the sequence of each order as well as the context information to infer a user’s preference at the moment. The key advantage of our model is that we apply Transformer encoders to capture both user order behavior sequence and complicated context features and combine both transformers through latent cross to generate recommendations.

In addition, we have leveraged RayOnSpark in Analytics Zoo to build an end-to-end recommendation system using Ray*, Apache Spark* and Apache MXNet*. It integrates data processing (with Spark) and distributed training (with MXNet and Ray) into a unified analysis and AI pipeline, which runs on the same cluster where our big data is stored and processed. We have successfully deployed the recommendation system at Burger King, and our solution achieves superior results in the production environment.

TxT Model for Recommendation

We propose the Transformer Cross Transformer (TxT) model, which uses a Sequence Transformer to encode guest order behavior, a Context Transformer to encode context features (such as weather, time and location), and an element-wise product to combine the two (the “cross” part) and produce the final output, as shown in Figure 1. We implement our model leveraging the MXNet API.

Figure 1: TxT Model architecture.

Sequence Transformer

We construct a Sequence Transformer, based on the Transformer architecture, to learn the embedding vector of each item in the guest order basket, as shown in the lower left part of Figure 1. To preserve each item's position in the original add-to-cart sequence, we apply positional embedding to the input items in addition to the item feature embedding. The two embedding outputs are then added together and fed into a multi-head self-attention network.

To extract a vector representation of the entire guest order basket from the hidden vectors of the individual items, we apply mean pooling and max pooling separately to the final Sequence Transformer output and concatenate the results. In this way, the pooled output takes all products in the sequence into account while still focusing on a small number of key products and their salient features.

Sequence Transformer can be constructed using the API in Analytics Zoo below:
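The original snippet is not reproduced in this text. As a stand-in, here is a minimal pure-Python sketch of the two steps described above (positional embedding plus mean/max pooling); the function names and toy shapes are ours for illustration, not the Analytics Zoo API:

```python
def embed_with_position(item_embeddings, positional_embeddings):
    """Add a positional embedding to each item embedding (element-wise),
    so the add-to-cart order is visible to the self-attention layers."""
    return [
        [i + p for i, p in zip(item_vec, pos_vec)]
        for item_vec, pos_vec in zip(item_embeddings, positional_embeddings)
    ]

def pool_basket(hidden_states):
    """Concatenate mean pooling and max pooling over the Transformer's
    per-item hidden vectors to summarize the whole order basket."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    mean_vec = [sum(h[d] for h in hidden_states) / n for d in range(dim)]
    max_vec = [max(h[d] for h in hidden_states) for d in range(dim)]
    return mean_vec + max_vec  # concatenation, length 2 * dim
```

The concatenated mean/max vector is what the cross step later consumes as the basket representation.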

Context Transformer

A common way to incorporate context features is to concatenate them directly with the sequential inputs, but simply concatenating non-sequence features with sequence features is of limited value. Some previous solutions instead use an element-wise sum to combine multiple context features. However, a sum can only represent how the context features contribute to the output in aggregate, while most of the time these features do not contribute equally to a user's final decision.

Therefore, we use a Context Transformer to encode the contextual information, as shown in the bottom right part of Figure 1. Using Transformer’s multi-head self-attention, we can capture not only the individual effect of each context feature, but also the internal relationship and complicated interactions across different context features.

Context Transformer can be constructed using the API in Analytics Zoo below:
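The original snippet is not reproduced in this text. To make the mechanism concrete, the following is a simplified pure-Python sketch of single-head scaled dot-product self-attention across context feature embeddings (no learned projections, one head); it illustrates how each context feature attends to every other one, and is not the Analytics Zoo API:

```python
import math

def self_attention(context_embeddings):
    """Single-head scaled dot-product self-attention over a list of
    context feature embeddings (weather, time, location, ...), so each
    feature's representation can depend on all the others."""
    dim = len(context_embeddings[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in context_embeddings:
        # Similarity of this feature to every context feature.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in context_embeddings]
        # Softmax over the scores (numerically stable form).
        m = max(scores)
        exp = [math.exp(s - m) for s in scores]
        total = sum(exp)
        weights = [e / total for e in exp]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, context_embeddings))
                        for d in range(dim)])
    return outputs
```

A real multi-head implementation adds learned query/key/value projections per head; the attention pattern above is the core that lets unequal context contributions emerge.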

Transformer Cross Transformer

To jointly train the Sequence Transformer and the Context Transformer, we perform an element-wise product between the two transformer outputs. Through this cross-transformer training, we can optimize all the parameters, such as the item embeddings, the context feature embeddings and their interactions, at the same time. Finally, we apply a ReLU activation followed by a softmax layer to predict the probability of each candidate item.

TxT, which consists of a Sequence Transformer and a Context Transformer, can be constructed directly using the API in Analytics Zoo below:
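The original snippet is not reproduced in this text. As an illustration of the cross step only, here is a minimal pure-Python sketch of the element-wise product followed by ReLU and softmax (our own toy functions, not the Analytics Zoo API; a real model also has a learned projection before the softmax):

```python
import math

def latent_cross(seq_vec, ctx_vec):
    """Element-wise product of the Sequence Transformer and Context
    Transformer outputs -- the 'cross' in Transformer Cross Transformer."""
    return [s * c for s, c in zip(seq_vec, ctx_vec)]

def predict(seq_vec, ctx_vec):
    """Cross the two transformer outputs, apply ReLU, then softmax to
    turn the result into a probability per candidate item."""
    logits = [max(0.0, x) for x in latent_cross(seq_vec, ctx_vec)]
    m = max(logits)
    exp = [math.exp(x - m) for x in logits]
    total = sum(exp)
    return [e / total for e in exp]
```

Because the product is element-wise, gradients flow into both transformers at once, which is what lets the item embeddings, context embeddings and their interactions be optimized jointly.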

End-to-End System Architecture

A conventional way to build a recommendation pipeline is to set up two separate clusters, one for big data processing and the other dedicated to deep learning (e.g., a GPU cluster). This not only introduces a lot of data transfer overhead, but also requires additional effort to manage separate workflows and systems in production. To address these challenges, we built the recommendation system on top of RayOnSpark in Analytics Zoo, which integrates Spark data processing and distributed MXNet training (using Ray) into a unified pipeline that runs on a single Xeon cluster.

Figure 2 illustrates the overall architecture of our system. In the Spark program, a SparkContext object is created on the driver node and is responsible for launching multiple Spark executors to run Spark tasks. RayOnSpark additionally creates a RayContext object on the Spark driver, which automatically launches Ray processes alongside each Spark executor, as well as a RayManager inside each executor to manage those processes (e.g., automatically shutting them down when the program exits).

Figure 2: Overview of the recommendation system based on RayOnSpark.

In our recommendation system, we first launch Spark tasks to extract the restaurant transaction data stored on distributed file systems, followed by data cleaning, ETL and preprocessing steps using Spark. After the Spark tasks complete, the processed in-memory Spark RDDs are fed directly into the Ray cluster through Plasma for distributed training.

Inspired by the design of RaySGD, we have implemented an MXNet Estimator that provides a lightweight shim layer to automatically deploy distributed MXNet training on Ray. Both the MXNet workers and parameter servers run as Ray actors, and they communicate with each other via the distributed key-value store provided by MXNet; each MXNet worker takes its local data partition from Plasma to train the model. As a result, users can seamlessly scale their MXNet training code from a single node to production clusters through Ray, using a simple scikit-learn style API below:
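The original snippet is not reproduced in this text. Since the exact Estimator signatures are not given here, the following is a toy stand-in that only illustrates the scikit-learn style call shape; the class name, argument names and config keys are assumptions, and the real Analytics Zoo estimator launches MXNet workers and parameter servers as Ray actors rather than counting epochs:

```python
class MXNetEstimator:
    """Toy stand-in for the MXNet Estimator described above: it records
    the configuration and fit() calls to show the interface shape only."""
    def __init__(self, model_builder, loss_builder, config):
        self.model_builder = model_builder
        self.loss_builder = loss_builder
        self.config = config
        self.epochs_run = 0

    def fit(self, train_data, epochs=1):
        # The real implementation shards train_data (from Plasma) across
        # Ray actors and runs distributed MXNet training; here we just
        # record how many epochs were requested.
        self.epochs_run += epochs
        return self

# Hypothetical usage mirroring a scikit-learn style workflow.
estimator = MXNetEstimator(model_builder=lambda: "TxT model",
                           loss_builder=lambda: "softmax cross-entropy",
                           config={"num_workers": 4, "lr": 1e-3})
estimator.fit(train_data=[("order_seq", "context", "label")], epochs=2)
```

The appeal of this shape is that moving from a laptop to a cluster only changes the config (e.g., the number of workers), not the training code itself.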

Such a unified architecture integrates Spark data processing and Ray-based distributed MXNet training into an end-to-end, in-memory pipeline that runs on exactly the same cluster where our big data is stored. Consequently, we only need to maintain a single cluster for the entire AI pipeline, with no extra data transfer across clusters and no extra cluster maintenance effort. This fully utilizes the cluster resources and significantly improves the end-to-end performance of the whole system.

Model Evaluation

We conducted offline experiments using Burger King customer transaction records from the past 12 months. The data of the first 11 months is used for training and the last month for validation. The models are trained on these data to predict the next best product for a guest to purchase. From Table 1, we can see the superiority of TxT over the baseline models (including Association Rule Learning and GRU4Rec). Comparing TxT with GRU4Rec, we can see that incorporating various context features greatly improves the Top1 and Top3 accuracy (by approximately 5.65% and 7.32% respectively).

Table 1: Offline training results of different recommendation models.

To evaluate the effectiveness of our TxT model in a real-world production environment, we ran our recommendation system in Burger King's mobile application side by side with Google Recommendation AI*, a state-of-the-art recommendation service provided by Google Cloud Platform (GCP)*. We evaluated online performance from two aspects: recommendation conversion rate and add-on sales. We ran A/B testing for 4 weeks. For the control group, we randomly selected 20% of users and presented them with the previous rule-based recommendation system. As shown in Table 2, TxT improved recommendation conversion on the checkout page by 264% and add-on sales by 137% compared to the control group. This also represents a +100% conversion gain and a +73% add-on sales gain compared to the test groups running the GCP Recommendation AI service.

Table 2: Online results of different recommendation solutions.


This blog post describes how we built and productionized an end-to-end recommendation pipeline at Burger King. The system successfully captures user order behaviors and complex context features through the Transformer Cross Transformer (TxT) model, and implements a unified data processing (with Spark) and deep learning training (with Ray) pipeline using RayOnSpark. Both the TxT model and RayOnSpark have been open sourced in the Analytics Zoo project.

*Other names and brands may be claimed as the property of others