Using Deep Learning and Transformers to train Recipe Embeddings

DoYoung Kim
Gousto Engineering & Data
9 min read · Mar 18, 2022

How we used state-of-the-art Natural Language Processing to learn recipe embeddings!

Authors: Sheng Chai, Alexander Marinov, DoYoung Kim

How can computers understand what a recipe is?

At Gousto, we have a very comprehensive dataset on recipes. Not only do we have raw data such as ingredient lists, nutritional information and cooking steps, we also have a lot of human-labelled features such as cuisine type, dish type, and protein type.

Unfortunately many of these columns are either unstructured or have very high cardinality. The cuisine column alone has about 43 (and still growing) distinct values! Representing such high-cardinality columns in learning tasks is hard and requires a lot of feature engineering. The most common ways of doing this are:

  1. One-hot encoding — this involves representing each attribute as 1s and 0s, e.g. ratatouille would have an is_french column = 1 and an is_italian column = 0. This however results in a very sparse dataset, which is memory inefficient and leads to the curse of dimensionality (see the sketch after this list)
  2. Binning/Grouping — this involves grouping categories into coarser categories before one-hot encoding, e.g. combining French and Italian into European. This however throws away a lot of information! Whilst French and Italian are both European cuisines, they are also very different!
  3. Mean encoding — this means encoding each category with a value based on its prevalence in the target labels. This however is very context dependent — and we have to redo this feature engineering for every algorithm that needs the cuisine column.
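To make options 1 and 3 concrete, here is a minimal sketch in pandas. The recipe data, column names and target column are made up for illustration and are not Gousto's actual schema.

```python
import pandas as pd

# Hypothetical recipe data (illustrative only)
recipes = pd.DataFrame({
    "recipe": ["ratatouille", "carbonara", "chicken katsu"],
    "cuisine": ["french", "italian", "japanese"],
    "uptake": [0.12, 0.30, 0.25],  # e.g. share of orders containing each recipe
})

# 1. One-hot encoding: one sparse 0/1 column per cuisine value
one_hot = pd.get_dummies(recipes["cuisine"], prefix="is")

# 3. Mean encoding: replace each cuisine with the mean of a target column
cuisine_means = recipes.groupby("cuisine")["uptake"].mean()
recipes["cuisine_encoded"] = recipes["cuisine"].map(cuisine_means)
```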

What are embeddings and how could they help?

Embeddings are numerical representations of an entity in N dimensions. They are widely used in deep learning and natural language processing tasks, e.g. for representing English words as numbers. There are many pre-trained embeddings available open-source online, e.g. GloVe or Google’s Word2Vec. You can think of these as ‘coordinates’, and the idea is that words with similar meanings are located close together in this coordinate space.

With embeddings:

  1. We could reuse embeddings across the different algorithms we deploy at Gousto, avoiding reinventing the wheel and laborious feature engineering
  2. It is easy for backend systems to get the data needed to serve predictions — all recipe information can be compressed into a vector of floats from a single source
  3. Less feature engineering = less code to maintain = easier and faster prototyping/deploying of algorithms into production

How Gousto today uses embeddings

We currently train 5 embeddings within our recommendation system. These embeddings however are trained on users’ interaction data (past orders). As a result, they are not an objective description of what a recipe is, but rather describe which recipes are often bought together.

For example, many users would buy a burger and curry together but never 2 burgers or 2 curries in a single order. As a result, burger and curry would appear to be ‘similar’ but 2 curry dishes would be ‘different’, which is not necessarily the behaviour we want for other downstream applications!

Therefore we need another way to train recipe embeddings that can capture an objective description of what recipes are, so that these embeddings could be used for many different downstream applications.

Our initial attempts at training an objective recipe embedding

Using GloVe Pre-trained embeddings

Our first method of creating recipe embeddings was using GloVe. However we found that this didn’t produce great results, probably because GloVe is trained on the entire English vocabulary and all food words are too close together in the embedding space.
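One way to see this is to probe pairwise similarities between food words in a publicly available pre-trained GloVe model. The sketch below uses gensim's downloader; the specific GloVe variant is an illustrative choice, not necessarily the one we used.

```python
import gensim.downloader as api

# Load a publicly available pre-trained GloVe model (illustrative choice)
glove = api.load("glove-wiki-gigaword-100")

# In a general-purpose embedding space, food words tend to sit close together,
# which makes it hard to separate e.g. cuisines from one another.
for a, b in [("curry", "burger"), ("cumin", "paprika"), ("pasta", "noodles")]:
    print(a, b, round(float(glove.similarity(a, b)), 3))
```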

Our own Word2Vec embeddings on Gousto’s Recipe Library

So we needed a bespoke model. First we tried training our own Word2Vec model using recipe ingredients. We fed in the list of ingredients as input, and trained the model to learn the context around each word, using a sliding window technique. To help the model learn the contextual information better, we ordered the ingredients in the order that they appear in the cooking instructions.
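A minimal sketch of this step using gensim follows. The ingredient lists below are made up; in practice each 'sentence' is one recipe's ingredients in the order they appear in the cooking instructions.

```python
from gensim.models import Word2Vec

# Each "sentence" is one recipe's ingredient tokens, ordered as used in the cooking steps
ingredient_lists = [
    ["onion", "garlic", "cumin", "paprika", "tortilla"],
    ["onion", "garlic", "cumin", "curry_powder", "basmati_rice"],
    ["spaghetti", "pancetta", "egg", "parmesan"],
]

# Skip-gram Word2Vec with a small sliding window over each ingredient list
w2v = Word2Vec(sentences=ingredient_lists, vector_size=64, window=3,
               min_count=1, sg=1, epochs=50)

cumin_vector = w2v.wv["cumin"]  # 64-dimensional ingredient embedding
```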

Next we needed a way to aggregate the ingredient embeddings into a recipe embedding. This can be done in several ways, such as a simple mean or term frequency–inverse document frequency (TF-IDF). TF-IDF down-weights the contributions of words that occur very frequently, so that the ingredient ‘garlic’ does not contribute as much to the recipe embedding as a niche ingredient such as ‘gochujang’.
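Continuing the sketch above, one simple way to do the aggregation is a TF-IDF weighted mean of the ingredient vectors, with IDF weights computed over the recipe corpus:

```python
import numpy as np
from collections import Counter

# Inverse document frequency over the recipe corpus from the previous sketch
n_recipes = len(ingredient_lists)
doc_freq = Counter(ing for recipe in ingredient_lists for ing in set(recipe))
idf = {ing: np.log(n_recipes / df) for ing, df in doc_freq.items()}

def recipe_embedding(ingredients, wv, idf):
    """TF-IDF weighted mean of ingredient vectors for one recipe."""
    counts = Counter(ingredients)
    vectors = [wv[ing] for ing in counts if ing in wv]
    weights = [counts[ing] * idf.get(ing, 0.0) for ing in counts if ing in wv]
    return np.average(vectors, axis=0, weights=weights)

emb = recipe_embedding(ingredient_lists[0], w2v.wv, idf)  # one vector per recipe
```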

How can we combine the ingredient embeddings into a single vector?

However there are limitations with Word2Vec + TF-IDF:

  • It doesn’t allow the model to learn any non-linearities that exist between ingredients. For example, the word ‘cumin’ can have a very different meaning depending on which words it’s surrounded by; if it appears with paprika, it’s likely to be in a Mexican dish, whereas if it appears with curry powder, it’s likely to be an Indian dish!
  • It is limited to looking at the context around each ingredient, rather than considering the entire list of ingredients together.

So, we needed a way to convert the vectors of individual tokens (ingredients) into a single vector per ‘document’ (recipe) that

  • is permutation invariant — same list of ingredients in a different order gives the same result!
  • captures non-linear interactions between the tokens in the list

Enter the Set Transformer

What are Set Transformers exactly, and how are they relevant to our problem of creating recipe embeddings with context? Before we dive into that we need to briefly go over transformers, and why they are important. First described in Attention is All You Need (Vaswani et al.), transformers became the gold standard for NLP tasks due to their ability to accurately represent contextual information in a scalable way (read the blog post The Illustrated Transformer for more details!).

Attention and Self-Attention Blocks

The new mechanism that transformers introduced and that separated them from previous neural network architectures is known as attention. The name comes from the network’s ability to pay attention to different parts of the data depending on the context. There are two types of attention blocks that are implemented in a conventional transformer architecture:

  1. Encoder-Decoder Attention — this type of attention teaches the network which parts of the input data to look at in order to help predict the desired output of the task. It is typically useful when we have supervised learning tasks, for example a machine translation task to translate a sentence from one language to another — in that case we can think of it as understanding the context within the input sentence to help better predict what the translation should be.
  2. Self-Attention — this type of attention teaches the network the context within the given input data. It is a special case of the attention described above in which the input data and the output data are the same. It is typically used for unsupervised training tasks, e.g. teaching a computer a language for the purpose of text generation later (think GPT-3)

Note that in supervised learning tasks, Self-Attention blocks are still present in both the encoder and decoder of the transformer, in addition to the Encoder-Decoder attention block described above.
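For a feel of what a single self-attention operation actually computes, here is a bare-bones single-head sketch in PyTorch. Real transformers use multiple heads plus learned projections, layer norms and residual connections, which are omitted here.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a set/sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # how strongly each item relates to every other
    weights = F.softmax(scores, dim=-1)      # rows sum to 1: the "attention" paid to each item
    return weights @ v                       # context-aware representation of each item

d = 8
x = torch.randn(5, d)                               # e.g. 5 ingredient embeddings
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)              # shape (5, 8)
```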

Differences between Set Transformer and Transformer

At its core, the Set Transformer (Lee et al.) has two distinguishing features from the transformer architecture:

  1. We don’t have separate starting and target sequences — unlike the translation task, since we’re just trying to learn the relationship between items in the embedding space, the input and target sequences are the same. In other words, all attention blocks are self-attention blocks, both in the encoder and decoder.
  2. The Set Transformer deals with sets instead of sequences — sets are permutation invariant, unlike sequences, meaning the value of a set stays the same even if its items are presented in a different order. This property is represented in the architecture through an additional pooling layer (also an attention block) used in the decoder. We will not burden the reader with the details of how the Pooling by Multi-head Attention (or PMA) block works, and instead urge those who are interested to refer to the original paper (though a rough sketch follows below).
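For readers who would still like a rough feel for it, here is a simplified PMA sketch built on PyTorch's multi-head attention; it omits the feed-forward and layer-norm details of the block in Lee et al.

```python
import torch
import torch.nn as nn

class PMA(nn.Module):
    """Pooling by Multi-head Attention: k learnable 'seed' vectors attend over the set."""
    def __init__(self, dim, num_heads=4, num_seeds=1):
        super().__init__()
        self.seeds = nn.Parameter(torch.randn(num_seeds, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, z):                                  # z: (batch, set_size, dim)
        s = self.seeds.unsqueeze(0).expand(z.shape[0], -1, -1)
        pooled, _ = self.attn(query=s, key=z, value=z)     # seeds query the whole set
        return pooled                                      # (batch, num_seeds, dim)
```

Because the pooled output depends on the set only through attention over all of its elements, permuting the input ingredients leaves the result unchanged.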
Visualisation of what part of a single self-attention block might look like. Darker shades represent words to which “cumin” is paying attention in order to predict the cuisine. The presence of ingredients like ‘tortilla’ suggests Mexican over Indian.

Putting it all together

We have all the elements now to define the final architecture of the Set Transformer that we used for our task. Expressed mathematically:

Encoder(X) = Z = SAB(SAB(X))
Decoder(Z) = rFF(SAB(PMA(Z)))

where SAB is a Self-Attention Block, PMA is the Pooling by Multi-head Attention block, rFF is a feed-forward layer, X is our input (the set of ingredients of a recipe), and Z is the output of the encoder.

At a high-level, the encoder uses self-attention to map the relationships between the set of ingredients of a recipe to a feature space Z. The decoder then uses pooling, more self-attention blocks, and a feed forward layer to aggregate the feature space back into a single vector which we can further use to train the embeddings on a classification task.
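As a rough PyTorch sketch of that composition, reusing the PMA module above and using PyTorch's TransformerEncoderLayer as a stand-in for a Self-Attention Block (a simplification of the blocks in the paper):

```python
import torch
import torch.nn as nn

class SetTransformerPooler(nn.Module):
    """Encoder: Z = SAB(SAB(X)). Decoder: rFF(SAB(PMA(Z))) -> one vector per recipe."""
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        def sab():
            return nn.TransformerEncoderLayer(dim, num_heads,
                                              dim_feedforward=2 * dim, batch_first=True)
        self.encoder = nn.Sequential(sab(), sab())
        self.pma = PMA(dim, num_heads, num_seeds=1)   # from the sketch above
        self.decoder_sab = sab()
        self.rff = nn.Linear(dim, dim)

    def forward(self, x):                   # x: (batch, n_ingredients, dim)
        z = self.encoder(x)                 # relationships between ingredients
        pooled = self.decoder_sab(self.pma(z))
        return self.rff(pooled).squeeze(1)  # (batch, dim): one vector per recipe
```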

Training Recipe Embeddings with Multi-Task Learning

To train the embeddings using the Set Transformer, we use a combination of the Set Transformer architecture and multi-task learning layers, which learn to classify recipes by their attributes, by:

  1. Encoding and padding the ingredient lists
  2. Passing these to an embedding layer, followed by the Set Transformer architecture
  3. Flattening the output and feeding into several fully-connected layers, where the output of the last layer would become the recipe embeddings
  4. Splitting into multiple task-specific layers, each consisting of several fully-connected layers which aim to predict one recipe attribute each, such as protein or cuisine (see the sketch after this list)
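Here is a condensed sketch of steps 1 to 4 plus the weighted multi-task loss. It assumes the SetTransformerPooler module sketched above, and the vocabulary size, layer sizes, task label counts and loss weights are all made-up illustrations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecipeEmbeddingModel(nn.Module):
    def __init__(self, vocab_size, tasks, dim=128, emb_dim=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim, padding_idx=0)      # step 2
        self.set_transformer = SetTransformerPooler(dim)                   # step 2
        self.shared = nn.Sequential(nn.Linear(dim, emb_dim), nn.ReLU(),
                                    nn.Linear(emb_dim, emb_dim))           # step 3: recipe embedding
        self.heads = nn.ModuleDict({                                       # step 4: one head per task
            task: nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(),
                                nn.Linear(emb_dim, n_classes))
            for task, n_classes in tasks.items()})

    def forward(self, ingredient_ids):     # step 1: padded ingredient token ids, (batch, max_len)
        recipe_emb = self.shared(self.set_transformer(self.token_emb(ingredient_ids)))
        return recipe_emb, {task: head(recipe_emb) for task, head in self.heads.items()}

# Weighted sum of per-task cross-entropy losses (weights are illustrative)
loss_weights = {"cuisine": 1.0, "protein": 2.0}

def multi_task_loss(logits, labels):
    return sum(loss_weights[t] * F.cross_entropy(logits[t], labels[t]) for t in logits)
```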

Multi-task learning ensures that the Set Transformer and linear layers learn to capture all the characteristics that make up the recipe; i.e. produce a generalised representation of recipes, which is exactly what we want in recipe embeddings! As the network tries to optimise the multiple loss functions at once, we are able to train the weights in the Set Transformer and linear layers to produce these embeddings.

The advantages of our method are:

  1. Once trained, we can simply extract the Set Transformer and shared linear layers to produce embeddings for any recipe, as long as we know its ingredients. This means we can infer characteristics of recipes before they are out in the real world (see the sketch after this list)!
  2. We could put more emphasis on certain attributes by increasing the weights of their losses, using domain knowledge, if we knew for instance that protein is more important to customers than dish type.
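For the first advantage, producing an embedding for a brand-new recipe is then just a forward pass through the shared trunk. Everything below (the vocabulary size, token ids and task list) is hypothetical and continues the sketch above.

```python
import torch

# Hypothetical: a trained model and a recipe that is not yet on the menu
model = RecipeEmbeddingModel(vocab_size=500, tasks={"cuisine": 43, "protein": 10})
model.eval()

with torch.no_grad():
    new_recipe_ids = torch.tensor([[12, 87, 5, 0, 0, 0]])      # padded ingredient token ids
    new_recipe_embedding, task_logits = model(new_recipe_ids)  # (1, emb_dim) recipe vector
```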

Visualising Recipe Embeddings

So how do we know if these recipe embeddings have actually learned anything? We can project the embeddings onto a 2D plane and colour-code them, for example by protein labels, to see what they are doing; a sketch of this follows below. The idea is that similar recipes should be close together in the embedding space. Play around with the plot to explore our different recipes in the embedding space!
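A minimal sketch of such a projection, assuming t-SNE from scikit-learn (any 2D projection such as UMAP or PCA works similarly); the embeddings and protein labels below are random placeholders for the real data.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.rand(200, 64)   # placeholder for the real (n_recipes, emb_dim) embeddings
proteins = np.random.choice(["chicken", "beef", "fish", "veggie"], size=200)

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for protein in np.unique(proteins):
    mask = proteins == protein
    plt.scatter(coords[mask, 0], coords[mask, 1], label=protein, s=10)
plt.legend()
plt.title("Recipe embeddings projected to 2D, coloured by protein")
plt.show()
```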

Feeding Recipe Embeddings into our Recommender System

To evaluate how good the embeddings are, we fed them into our recommender (a downstream application) and found that:

  1. GloVe pre-trained embeddings performed the worst, as we expected. TF-IDF using Gousto’s Word2Vec wasn’t much better.
  2. The Set Transformer embeddings, whilst not beating the current recommender in production, came very close in performance to our production system.

In addition, using recipe embeddings trained outside the recommender also reduced training time by 30%. For context, our recommender takes about 10 hours to train on full production data.

This is really promising because we could reinvest the training time saved into training for more epochs, developing a more complex network, or feeding in even more data to increase our recommender’s performance!

So what’s next for Recipe Embeddings?

So far we’ve only used ingredients to construct recipe embeddings, and it’s already giving really promising results. We’ve not used recipe images, cooking instructions or ingredient quantities yet! We think that by using these additional data sources, we could teach a computer other things such as convenience, difficulty, time to cook, and food textures (fried vs. soupy), which could make recipe embeddings even more predictive. These data sources however come with other challenges, e.g.:

  • How do we represent ingredient quantities in a meaningful way (1 tsp of cinnamon vs 1 pack of crème fraîche)?
  • How do we combine data sources of different types (images + text) into a single embedding space?

In addition to our recommender, there are many places where we could also use Recipe Embeddings at Gousto:

  1. Forecasting (which we already do today)
  2. Menu Planning — optimising for menu cost, choice and other business constraints based on recipe attributes
  3. Recipe Development — predicting performance of new recipes based on how similar they are to existing ones

So stay tuned for what’s happening in this space!
