5 Minute Paper Explanations: Food AI Part I
Intuitive deep dive of the im2recipe paper “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”
Introduction to the Problem
Welcome to my new series of paper explanations! You may ask what makes this different from what’s already out there. Well, I will not be covering the trendy or famous papers; instead, I will pick sub-(sub)domains of machine learning research and chart their progress through these articles.
Let’s get started. Today, we will learn about the paper (published in 2018) that introduced both a huge dataset for machine learning research on food and the im2recipe retrieval problem.
Problem Statement: Given an image of food (imagine your favourite food here!), retrieve the recipe for making that food. To do so, we make model(s) learn the corresponding image and recipe embeddings, map these embeddings into a shared joint embedding space, and minimize a loss that measures how well the recipe is retrieved given the image. Easy-peasy, right?
An aside for people who do not know what cross-modality is: a modality is a type or source of information that we feed into the model. It can be video, images, text, audio, etc. In im2recipe we have images and text, so we work with two modalities and go from one to the other, hence cross-modal.
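To make the retrieval setup concrete, here is a minimal sketch (my illustration, not the paper’s code) of what retrieval looks like once both modalities live in a shared embedding space: embed the query image, embed every candidate recipe, and rank candidates by cosine similarity. The function names and the 512-dimensional embeddings are arbitrary choices for the example.

```python
import numpy as np

def cosine_similarity(query, candidates):
    """Cosine similarity between one query vector and a matrix of candidate vectors."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

def im2recipe_retrieve(image_embedding, recipe_embeddings, top_k=5):
    """Rank candidate recipes (rows of recipe_embeddings) for one image query."""
    scores = cosine_similarity(image_embedding, recipe_embeddings)
    return np.argsort(-scores)[:top_k]  # indices of the top-k recipes

# Toy usage: 512-d embeddings already mapped into the shared space.
rng = np.random.default_rng(0)
query = rng.normal(size=512)
candidates = rng.normal(size=(1000, 512))
print(im2recipe_retrieve(query, candidates))
```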
Domain Background and Improvements
There has been previous research at the intersection of food and machine learning. A prominent example with a big dataset is Food-101 and the paper that introduced it, which posed a classification problem over food images. The im2recipe paper, by contrast, introduces a retrieval problem with a much richer dataset.
If we look at the classes in the Food-101 dataset, each one only tells us the name of the food. In the im2recipe dataset, called Recipe1M, we not only have a larger number of data points, but those data points include food images, titles (names), the instructions for making each dish, and its ingredients.
One advantage of this rich data is that the instructions and ingredients give us a sort of step-by-step view of the state of the recipe. We can know whether an ingredient is raw, baked, fried, etc. at any stage and, with a good model, could theoretically learn the corresponding images at each of these stages.
Dataset
The huge Recipe1M dataset contains 1 million recipes (ingredient lists and instructions) and 800 thousand corresponding images of those prepared recipes, extracted from cooking websites and organized into JSON files.
In the text modality, we have a title, a list of ingredients, and a sequence of instructions for preparing the dish. In the image modality, we have any images associated with the recipe, in RGB/JPEG format.
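As a rough illustration, here is a minimal sketch of reading such a JSON file. The file name and the field names (‘title’, ‘ingredients’, ‘instructions’) are assumptions about how the Recipe1M JSON is commonly laid out, so check them against your copy of the data.

```python
import json

# Hypothetical path and field names; adjust to your copy of Recipe1M.
with open("recipe1m_layer1.json") as f:
    recipes = json.load(f)  # a list of recipe dictionaries

recipe = recipes[0]
print(recipe["title"])
print([ing["text"] for ing in recipe["ingredients"]])      # ingredient lines
print([step["text"] for step in recipe["instructions"]])   # instruction steps
```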
Some analysis the authors have done on the collected dataset:
- The average recipe in the dataset consists of nine ingredients which are transformed over the course of ten instructions.
- Exact duplicates of recipe-image pairs have been removed.
- 4,000 of the 16,000 ingredients identified account for 95% of ingredient occurrences in the data.
Architecture
As mentioned earlier, there are two modalities and four types of data: title, ingredients, instructions and images. We embed each of them in a different way.
As shown below, ingredient names are first extracted from the raw ingredient text using a bidirectional LSTM, each ingredient is then encoded with a word2vec representation, and the resulting sequence of ingredient vectors is fed to another bidirectional LSTM to produce the ingredient embedding.
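As a rough sketch of the word2vec step (assuming the ingredient names have already been extracted), one could train skip-gram vectors over per-recipe ingredient lists with gensim. This is only an illustration of the idea, not the authors’ pipeline, and the hyperparameters are placeholders.

```python
from gensim.models import Word2Vec

# Each "sentence" is the list of ingredient names for one recipe (already extracted).
ingredient_lists = [
    ["butter", "flour", "sugar", "egg"],
    ["tomato", "basil", "mozzarella", "olive_oil"],
]

# Skip-gram word2vec over ingredient tokens (vector_size is a free choice here).
model = Word2Vec(ingredient_lists, vector_size=300, window=10, min_count=1, sg=1)
print(model.wv["butter"][:5])           # the learnt ingredient vector
print(model.wv.most_similar("butter"))  # nearest ingredients in the embedding space
```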
Each instruction is encoded using a skip-thoughts-style technique (the authors call it skip-instructions), analogous to the skip-gram idea in word2vec: each instruction has the instructions before and after it as targets, and the encoding is optimized to improve the prediction of these adjacent instructions. The modification made here is that start and end instruction tokens are introduced as well. Next, these individual instruction encodings are passed through an LSTM to produce the instruction embedding.
Next, these instruction and ingredient embeddings are concatenated and mapped via an affine transformation (a linear transformation with learnt weights and a bias) into a shared embedding space. The same is done for the image embeddings, which are the features extracted from a pre-trained ResNet-50 or VGG-16.
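Putting the pieces together, here is a minimal PyTorch sketch of the joint embedding idea: a bidirectional LSTM over ingredient vectors, an LSTM over instruction vectors, concatenation plus an affine map on the recipe side, and an affine map on top of pre-extracted CNN features on the image side. The dimensions, module names and final normalization are illustrative assumptions rather than the paper’s exact configuration, and the word2vec/skip-instruction vectors are assumed to be precomputed.

```python
import torch
import torch.nn as nn

class RecipeEncoder(nn.Module):
    """Recipe side: bi-LSTM over ingredient word2vec vectors, LSTM over
    instruction vectors, then an affine map into the shared space."""
    def __init__(self, ingr_dim=300, instr_dim=1024, hidden=300, joint_dim=1024):
        super().__init__()
        self.ingr_rnn = nn.LSTM(ingr_dim, hidden, batch_first=True, bidirectional=True)
        self.instr_rnn = nn.LSTM(instr_dim, hidden, batch_first=True)
        self.fc = nn.Linear(2 * hidden + hidden, joint_dim)  # affine map (weights + bias)

    def forward(self, ingredients, instructions):
        _, (h_ingr, _) = self.ingr_rnn(ingredients)      # h_ingr: (2, B, hidden)
        ingr_emb = torch.cat([h_ingr[0], h_ingr[1]], dim=1)
        _, (h_instr, _) = self.instr_rnn(instructions)   # h_instr: (1, B, hidden)
        recipe = torch.cat([ingr_emb, h_instr[0]], dim=1)
        return nn.functional.normalize(self.fc(recipe), dim=1)

class ImageEncoder(nn.Module):
    """Image side: features from a pre-trained CNN, then an affine map."""
    def __init__(self, feat_dim=2048, joint_dim=1024):
        super().__init__()
        self.fc = nn.Linear(feat_dim, joint_dim)

    def forward(self, features):                         # e.g. pooled ResNet-50 features
        return nn.functional.normalize(self.fc(features), dim=1)

# Toy forward pass: batch of 4 recipes with 8 ingredients and 6 instructions each.
recipe_emb = RecipeEncoder()(torch.randn(4, 8, 300), torch.randn(4, 6, 1024))
image_emb = ImageEncoder()(torch.randn(4, 2048))
print(recipe_emb.shape, image_emb.shape)  # both (4, 1024)
```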
Semantic Regularization
In addition to the primary architecture above, the authors also perform regularization by making the joint embedding model (with shared classifier weights) learn to classify any recipe image or text into one of the Food-101 classes. Intuitively, this ensures that the model already roughly knows some embedding clusters (in terms of these classes) and does not try to make one cluster for each image-recipe pair, and hence overfit. Because the classifier weights are shared between the two modalities, it also ensures that the learnt image and recipe embeddings for the same dish are aligned with each other (we do not get separate clusters for the image and the text of the same dish).
However, the authors found that the Food-101 classes covered only 13% of the Recipe1M dataset, so they augmented them with the 946 most frequent bigrams in recipe titles from the Recipe1M training set (after cleaning). Doing this, they achieved a coverage of 50%.
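Here is a minimal sketch of how one might extract frequent title bigrams to serve as semantic classes; it only illustrates the idea and ignores the authors’ actual cleaning steps.

```python
from collections import Counter

titles = [
    "chicken noodle soup",
    "creamy chicken noodle soup",
    "chocolate chip cookies",
    "chewy chocolate chip cookies",
]

bigram_counts = Counter()
for title in titles:
    words = title.lower().split()
    bigram_counts.update(zip(words, words[1:]))

# Keep the most frequent bigrams as semantic classes (the paper keeps 946 of them).
semantic_classes = [" ".join(bg) for bg, _ in bigram_counts.most_common(946)]
print(semantic_classes[:5])
```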
Loss Function
The loss function used is a cosine similarity loss with margin. Training is done in a pairwise fashion, similar to how a triplet loss is trained, with positive and negative image-recipe pairs. This training scheme and loss ensure that the image and recipe embeddings of positive pairs (y = 1) are brought closer together (their cosine distance pushed towards 0), while negative pairs are pushed farther apart with a fixed margin α in between.
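In PyTorch, this pairwise objective can be sketched with the built-in cosine embedding loss, which takes y = 1 for matching image-recipe pairs and y = -1 for mismatched ones and applies the margin to the negatives. The margin value and embedding sizes below are placeholders, not the paper’s settings.

```python
import torch
import torch.nn as nn

# y = 1 for matching image-recipe pairs, y = -1 for mismatched ones.
cosine_loss = nn.CosineEmbeddingLoss(margin=0.3)  # margin alpha is a placeholder value

image_emb = torch.randn(8, 1024, requires_grad=True)   # image embeddings in joint space
recipe_emb = torch.randn(8, 1024, requires_grad=True)  # recipe embeddings in joint space
y = torch.tensor([1, -1, 1, -1, 1, -1, 1, -1])

loss = cosine_loss(image_emb, recipe_emb, y)
loss.backward()
print(loss.item())
```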
Addition of Semantic Regularization
With the regularization loss added to form the combined objective, we can intuitively see the learning process as bringing the image and recipe embeddings of a positive pair together and pushing those of a negative pair farther apart, while all the time keeping the image and recipe embeddings aligned with each other through the shared weights and classification. Specifically, if cᵣ and cᵥ (the class predictions for the recipe and image embeddings, respectively) are different, the regularization loss will increase; otherwise it will decrease.
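A minimal sketch of the combined objective, assuming both embeddings are already in the joint space: the cosine loss plus a cross-entropy classification loss applied to both modalities through a single shared classifier, weighted by a hyperparameter λ. The class count, the λ value and the simplification of one semantic label per pair are my assumptions for illustration.

```python
import torch
import torch.nn as nn

num_classes = 101 + 946   # Food-101 classes plus frequent bigrams (illustrative count)
shared_classifier = nn.Linear(1024, num_classes)  # same weights for both modalities

cosine_loss = nn.CosineEmbeddingLoss(margin=0.3)
class_loss = nn.CrossEntropyLoss()
lam = 0.02                # regularization weight, a placeholder value

def combined_loss(image_emb, recipe_emb, y, semantic_labels):
    # Retrieval term: pull positive pairs together, push negatives apart.
    l_retrieval = cosine_loss(image_emb, recipe_emb, y)
    # Regularization term: both modalities should predict the same semantic class.
    c_v = shared_classifier(image_emb)   # class scores from the image embedding
    c_r = shared_classifier(recipe_emb)  # class scores from the recipe embedding
    # (Simplification: one label per pair; in practice negatives carry their own labels.)
    l_reg = class_loss(c_v, semantic_labels) + class_loss(c_r, semantic_labels)
    return l_retrieval + lam * l_reg

# Toy usage.
img = torch.randn(8, 1024)
rec = torch.randn(8, 1024)
y = torch.tensor([1, -1] * 4)
labels = torch.randint(0, num_classes, (8,))
print(combined_loss(img, rec, y, labels).item())
```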
Experiments and Results
This is how the results are reported: on a subset of 1,000 randomly selected recipe-image pairs from the test set, the experiments are repeated 10 times and the mean results are reported. Median rank (MedR) and recall rate at top K (R@K) are reported for all the retrieval experiments.
If, for each image query, we rank all the candidate recipes, then the median rank is the median of the positions at which the correct recipe appears, taken over all pairs in an experiment (lower is better). Similarly, R@5 in the im2recipe task is the percentage of image queries for which the corresponding recipe is retrieved within the top 5 results, so higher is better.
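As an illustration of these metrics (my sketch, not the paper’s evaluation code), the following computes MedR and R@K from a similarity matrix between N image queries and N candidate recipes, assuming the correct recipe for query i is candidate i.

```python
import numpy as np

def medr_and_recall(similarity, ks=(1, 5, 10)):
    """similarity[i, j] = score of recipe j for image query i;
    the correct recipe for query i is assumed to be recipe i."""
    n = similarity.shape[0]
    # Rank of the correct recipe for each query (1 = retrieved first).
    order = np.argsort(-similarity, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    medr = np.median(ranks)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return medr, recall

# Toy usage with 1,000 random image-recipe pairs.
rng = np.random.default_rng(0)
sim = rng.normal(size=(1000, 1000))
print(medr_and_recall(sim))  # random embeddings give MedR around 500, R@K near K/1000
```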
The baseline result is obtained using CCA. As an aside, CCA (canonical correlation analysis) is a technique much like PCA but for multiple sets of variables: it determines a set of canonical variates, orthogonal linear combinations of the variables within each set, that best explain the variability both within and between the sets. The detailed results and ablation studies can be seen in the paper and are pretty self-explanatory.
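For completeness, here is a minimal sketch of a CCA-style baseline using scikit-learn, as an illustration of the general idea rather than the paper’s implementation: fit CCA on paired image and recipe features, project both into the canonical space, and retrieve by cosine similarity there. The feature dimensions and component count are arbitrary.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(500, 128))   # stand-in image features
recipe_feats = rng.normal(size=(500, 64))   # stand-in recipe features

cca = CCA(n_components=32)
cca.fit(image_feats, recipe_feats)

# Project both modalities into the shared canonical space.
img_c, rec_c = cca.transform(image_feats, recipe_feats)

# Retrieve recipes for the first image query by cosine similarity in that space.
q = img_c[0] / np.linalg.norm(img_c[0])
cands = rec_c / np.linalg.norm(rec_c, axis=1, keepdims=True)
print(np.argsort(-(cands @ q))[:5])
```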
What I want to focus on here is the analysis of the embeddings. The authors show, using vector arithmetic, that the embeddings the model learns make semantic sense. An example is shown below.
The authors also extract and visualize local unit activations using the top activating images, ingredient lists, and cooking instructions for a given neuron, focusing on the specific image and text regions that contribute the most to the activation of these units. The results are shown below.
This is part of a new series I am starting on intuitive paper explanations. I will be picking a subdomain in the industry and going through papers in that domain. If you like what I write, consider subscribing or following me here, or connecting with me on LinkedIn or Twitter! For code pertaining to my previous article, visit my GitHub.