Visualising Transformer Self-Attention to Explain Customer Recommendations

Steven George
Gousto Engineering & Data
8 min read · Jan 9, 2023

In this volume we describe how we implemented a Transformer model for recommendations at Gousto, and how we visualise the Transformer’s attention mechanism to understand the customer signals the model draws upon when making a prediction.

At Gousto we solved the customer cold-start problem in recommendations by modelling each customer as a sequence of the recipes they have ordered, rather than by learning individual user embeddings. Taking this approach paved the way for the use of a state-of-the-art architecture in Machine Learning, namely the Transformer.

A quick recap of Transformers

The Transformer, first introduced in the seminal paper “Attention is All You Need” by Vaswani et al., is a Deep Learning architecture for processing sequential data. Whilst initially proposed for the task of machine translation, the Transformer architecture has proven to be incredibly versatile and has since been applied to other tasks in natural language processing, as well as to computer vision, reinforcement learning and speech recognition.

One of the core components of the Transformer architecture is the multi-head self-attention mechanism. We’ll give a flavour of how this works by decomposing this mechanism into two parts: ‘multi-head’ and ‘self-attention’. For a detailed description of the maths underpinning multi-head self-attention we refer the reader to the original paper as well as the excellent blog post by Jay Alammar.

Intuitively the self-attention mechanism enables the model to pay varying levels of attention to different items in an input sequence depending on the item it is currently processing. Let’s look at an example with recipes to solidify this concept.

Illustration of self-attention mechanism for recommendations. In this example we visualise the attention scores for the Soy-Glazed Chicken With Japanese-Style Slaw recipe.

The items in the sequence are the historical recipes ordered by a given customer and represent the input for the Transformer-based recommender (more on that later). For each recipe in the sequence the model will construct a representation, formally an embedding, which factors in the other recipes in the sequence. We visualise the attention paid to different recipes in the sequence when the model is constructing a representation of the recipe titled ‘Soy-Glazed Chicken With Japanese-Style Slaw’.

In this example the model pays most attention to the selected recipe itself. It then pays attention to other recipes similar in nature to ‘Soy-Glazed Chicken With Japanese-Style Slaw’, namely ‘Korean-Style Yang-Nyum Fried Chicken’ and ‘Chicken Teriyaki With Rice And Peas’. We observe that these recipes share similar characteristics both descriptively and visually (Asian-style chicken and rice dishes). Conversely the model assigns very little attention to recipes which are markedly different to the selected recipe such as ‘Simply Perfect Lean Beef Spag Bol’.

This ability of the model to form associations between recipes in a sequence results in powerful contextual representations for each recipe in the sequence, where the context is the set of other recipes in the sequence. The contextual representations are then highly predictive features for downstream tasks like recommendations.
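To make the mechanics concrete, here is a minimal sketch of single-head self-attention over a sequence of recipe embeddings, written in PyTorch purely for illustration; the embedding size, sequence length and random inputs are placeholders rather than the values used in our production model.

```python
import torch
import torch.nn.functional as F

d_model = 64                              # embedding size (placeholder)
seq_len = 5                               # e.g. five recipes in a customer's history
recipes = torch.randn(seq_len, d_model)   # stand-in recipe embeddings

# Learned projections to queries, keys and values
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)
Q, K, V = W_q(recipes), W_k(recipes), W_v(recipes)

# Scaled dot-product attention: each recipe scores every other recipe
scores = Q @ K.T / d_model ** 0.5       # (seq_len, seq_len)
attention = F.softmax(scores, dim=-1)   # rows sum to 1 -- these are the heatmap values
contextual = attention @ V              # one contextual embedding per recipe
```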

The multi-head part refers to the fact that the model doesn’t do self-attention once but rather several times. You can picture the model producing several self-attention heatmaps for every sequence it processes, each of which has learned to focus on a different property of the recipes. In the self-attention head above the model formed strong associations between recipes with similar attributes, but another self-attention head may form associations between recipes which are different in nature yet complement each other, such as burgers and chicken wings. Combining both types of associations results in even richer contextual embeddings.
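The multi-head variant can be sketched with PyTorch’s built-in module; the 8 heads match the number quoted later in this post, while the embedding size and batch shape are again illustrative.

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

sequence = torch.randn(1, 51, 64)   # batch of one: 50 historical recipes + the target
contextual, attn_weights = mha(
    sequence, sequence, sequence,   # self-attention: queries, keys and values are the same
    need_weights=True, average_attn_weights=False,
)
# attn_weights has shape (1, 8, 51, 51): one 51x51 attention map per head
```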

Transformers for recommendations

The model architecture used at Gousto is loosely based on the Behaviour Sequence Transformer for E-commerce Recommendation by Alibaba, presented below. It is an encoder-only architecture where the task of recommendation is framed as a binary classification problem. We use implicit feedback in the form of customer orders as our labels.

Alibaba’s Behaviour Sequence Transformer for E-commerce Recommendation
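As a rough illustration of the binary classification framing mentioned above, the sketch below pairs a customer’s recipe history with candidate recipes and 0/1 labels. The negative-sampling step is included purely to make the example self-contained and is not a description of our exact training pipeline.

```python
import random

def build_examples(history, ordered_recipe, all_recipe_ids, n_negatives=1):
    """Turn one customer order into binary classification examples."""
    examples = [(history, ordered_recipe, 1)]               # implicit positive: the recipe was ordered
    candidates = [r for r in all_recipe_ids if r != ordered_recipe]
    for negative in random.sample(candidates, n_negatives):
        examples.append((history, negative, 0))             # sampled negative (illustrative choice)
    return examples
```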

There are some notable differences between our model and the one proposed by Alibaba. Firstly, the sequence passed to the Transformer encoder consists of the historical recipe purchases made by the customer concatenated with the target recipe, the one which the model has to rank. We truncate and pad the historical sequence to a length of 50 and use 8 self-attention heads in our model.
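A simple sketch of this input construction is shown below; the padding token and the choice to left-pad are illustrative details rather than a description of our exact pipeline.

```python
PAD_ID = 0          # hypothetical padding token
MAX_HISTORY = 50

def build_sequence(history_recipe_ids, target_recipe_id):
    history = history_recipe_ids[-MAX_HISTORY:]                   # keep the 50 most recent recipes
    history = [PAD_ID] * (MAX_HISTORY - len(history)) + history   # pad shorter histories
    return history + [target_recipe_id]                           # 51 items: history + target
```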

One of the benefits of the Transformer is that it does not require recurrence and instead encodes sequence order through the use of positional embeddings. In our model we learn these embeddings from scratch as part of model training. We also perform element-wise addition of the recipe and positional embeddings rather than a concatenation.
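A minimal sketch of this encoding, with learned positional embeddings added element-wise to the recipe embeddings (the vocabulary and embedding sizes are placeholders):

```python
import torch

n_recipes, d_model, seq_len = 1000, 64, 51
recipe_embedding = torch.nn.Embedding(n_recipes, d_model, padding_idx=0)
position_embedding = torch.nn.Embedding(seq_len, d_model)   # learned from scratch during training

recipe_ids = torch.randint(1, n_recipes, (1, seq_len))      # a dummy 51-recipe sequence
positions = torch.arange(seq_len).unsqueeze(0)              # positions 0..50
encoder_input = recipe_embedding(recipe_ids) + position_embedding(positions)  # element-wise addition
```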

Finally, in keeping with the Alibaba model, we only use the contextual embedding for the target item, rather than a concatenation of all the contextual embeddings, before passing it on to the multi-layer perceptron.
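In code this final step amounts to slicing out the target position and passing it through a small feed-forward head; the layer sizes and activation below are placeholders.

```python
import torch

encoder_output = torch.randn(1, 51, 64)      # (batch, seq_len, d_model) from the Transformer encoder
target_embedding = encoder_output[:, -1, :]  # contextual embedding of the target recipe only

mlp = torch.nn.Sequential(
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)
score = torch.sigmoid(mlp(target_embedding))  # probability that the customer orders the recipe
```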

Opening the black box — what has the model learned?

The self-attention mechanism inside Transformers provides a unique insight into how the model makes its decisions. We are able to see which parts of the input sequence (in our case the recipes ordered by a customer) the model pays most attention to when deciding whether or not to recommend a recipe. Hence we have a notion of ‘interpretability’ at a per-customer level which is not always possible with other neural models.

With self-attention every recipe in the sequence pays varying levels of attention to every other recipe in the sequence. Hence we have 51x51 attention values per head per recommendation. This can be visualised on a heatmap as shown below:

Each row is the attention map for one item in the sequence (0 is the first recipe in the history, 49 is the last and 50 is the target recipe under consideration), and the columns show how much attention that item pays to every other item in the sequence.
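A heatmap like this takes only a few lines of matplotlib to produce; the attention tensor below is random stand-in data with the same shape as ours (8 heads, 51 by 51), and we plot a single head for simplicity.

```python
import numpy as np
import matplotlib.pyplot as plt

attention = np.random.rand(8, 51, 51)               # stand-in for the model's attention values
attention /= attention.sum(axis=-1, keepdims=True)  # rows sum to 1, as softmax outputs would

plt.imshow(attention[0], cmap="viridis")   # one head's 51x51 attention map
plt.xlabel("Attention paid to item")
plt.ylabel("Item being processed")
plt.colorbar()
plt.show()
```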

We can see a clear diagonal in the plot with high attention values. What this tells us is that when the model processes each item in the sequence it pays most attention to itself. This is a reasonable thing to do — the most important thing to consider when processing a recipe is the recipe itself!

For the remainder of this post we look at individual examples of the attention map for the target item under consideration i.e. the final row in the grid above. This simplifies our analysis but also makes sense because we only pass the contextual embedding for the target item to our final feed-forward layer. Additionally, we average the attention values across our 8 heads to produce a single heatmap.
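In code, producing the single-row plots that follow amounts to taking the final row of the attention tensor and averaging it over the heads (again using random stand-in data):

```python
import numpy as np
import matplotlib.pyplot as plt

attention = np.random.rand(8, 51, 51)          # stand-in attention values: (heads, 51, 51)
target_row = attention[:, -1, :].mean(axis=0)  # final row (target item), averaged over the 8 heads

plt.figure(figsize=(12, 1))
plt.imshow(target_row[np.newaxis, :], cmap="viridis", aspect="auto")
plt.yticks([])
plt.xlabel("Position in customer history (50 = target recipe)")
plt.show()
```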

In the following plots the far right box is the target recipe. The colours highlight the attention paid to the recipes in the given customer’s history as well as the target recipe. We use a viridis colour scale shown below, where attention increases from left (purple) to right (yellow):

Repeat purchases

The heatmap above shows a true positive recommendation for ‘Simply Perfect Lean Beef Spag Bol’, one of our Everyday Favourites recipes which is always available. We can see that for the given customer this is a recipe they order on a regular basis, and the model has successfully identified this by paying more attention to these purchases compared to others. Although the single largest attention value is on the target item itself, the attention summed over all of the previous purchases would exceed it.

Back on the menu

We launch new menus every week and sometimes it can be a few months before your favourite recipe is available to order again. One product feature we discussed internally was the ability to surface recipes previously ordered when they returned to the menu. This could be executed with a naive rule-based approach.

Thankfully the Transformer model has learned to do this itself!

In the example above the target recipe is the ‘Pistachio Pesto With Tomato & Mozzarella Tortelloni’ from our delicious Pasta Pronto range, and it was the top recommendation for this customer. The recipe in question was ordered once before, and during the interim period it was not available on our menus. The model has not only recommended this item based on the customer’s previous purchase (replicating the ‘back on the menu’ behaviour) but has done so in an intelligent way. We note several high attention values on recipes similar in nature to the target recipe, such as other ravioli and tortelloni dishes. The model has noticed that the customer ordered similar recipes in the interim period and would likely appreciate knowing that the Pistachio Pesto recipe had returned. This type of informed recommendation would not be possible with a simple rule-based ‘back on the menu’ feature.

It is also worth noting the model’s ability to leverage recipes early in the customer’s historical sequence (in this case the 4th recipe). This is one of the known advantages of Transformers over recurrent neural networks, which can suffer from vanishing gradients over long sequences.

Previous purchases can sometimes be misleading

Visualising attention heatmaps also proves to be a useful debugging tool. Here we show a false positive example: ‘One-Pot Chicken Rogan Josh With Yoghurt’ was ranked highly but was not ordered. As with the other examples, the model has picked up on the customer’s previous purchase of this recipe and predominantly paid attention to it rather than to the interim orders.

So why was this a poor recommendation? We won’t know for sure without asking the customer, but a logical conclusion is that they did not like the recipe. Since we are using implicit labels we take an order to be a ‘like’. However, as this example shows, this may lead to false positives, and it suggests we should also incorporate some explicit feedback into our model, such as recipe ratings.

Thank you for paying attention!

In this post we demonstrated how the state-of-the-art Transformer architecture can be applied to recommendations and how the attention mechanism can be visualised to explain predictions at a customer level. This proved to be so popular at Gousto that we even created an internal self-serve tool so everyone could explain their own recommendations with this method!

Another benefit of Transformers worth mentioning is that they make it very easy to name model endpoints: our most recent version is called Optimus! Stay tuned to see what Bumblebee and Megatron look like…

Be sure to check out Volume 1 and Volume 2 of this series if you missed them. Whilst this volume focussed on the model, a fundamental part of any successful ML application is the data infrastructure. You can read about our use of Feature Stores and their benefits here.
