Paper Review: Learning Cross-modal Embeddings for Cooking Recipes and Food Images

Yunusemre Özköse
Multi-Modal Understanding
Jun 5, 2022
Figure 1 in [1]

In this article, I will review Learning Cross-modal Embeddings for Cooking Recipes and Food Images.

Salvador et al. [1] propose a feature extraction method that combines textual and visual information for recipes. They also introduce a new dataset, Recipe1M, which consists of recipes with three components: images, ingredients, and instructions. The high-level pipeline is (a minimal sketch follows the list):

1. Encode ingredients with LSTM
2. Encode instructions with another LSTM
3. Concatenate them to obtain a single textual embedding
4. Extract visual features of the image with ResNet-50 or VGG-16
5. Feed visual and textual embeddings to different fully connected layers.
6. Calculate losses.
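Below is a minimal PyTorch-style sketch of this dual-encoder pipeline. The module names, dimensions, and fusion details are my own assumptions for illustration, not the authors' actual code [2].

import torch
import torch.nn as nn
import torchvision.models as models

class RecipeImageEmbedder(nn.Module):
    # Rough sketch of the dual-encoder pipeline; all dimensions are illustrative.
    def __init__(self, word_dim=300, hidden_dim=300, sent_dim=1024, joint_dim=1024):
        super().__init__()
        # Step 1: encode ingredient word vectors with a bidirectional LSTM
        self.ingr_lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
        # Step 2: encode instruction-sentence embeddings with another LSTM
        self.instr_lstm = nn.LSTM(sent_dim, hidden_dim, batch_first=True)
        # Step 4: image backbone (ResNet-50 with the classifier removed)
        backbone = models.resnet50(weights=None)  # pretrained weights would be loaded in practice
        backbone.fc = nn.Identity()               # keep the 2048-d pooled features
        self.backbone = backbone
        # Step 5: per-modality fully connected projections into the joint space
        self.text_proj = nn.Linear(2 * hidden_dim + hidden_dim, joint_dim)
        self.img_proj = nn.Linear(2048, joint_dim)

    def forward(self, ingr_vecs, instr_vecs, image):
        # ingr_vecs: (B, num_ingredients, word_dim); instr_vecs: (B, num_sentences, sent_dim)
        _, (ingr_h, _) = self.ingr_lstm(ingr_vecs)             # (2, B, hidden_dim)
        ingr_emb = torch.cat([ingr_h[0], ingr_h[1]], dim=1)    # (B, 2 * hidden_dim)
        _, (instr_h, _) = self.instr_lstm(instr_vecs)          # (1, B, hidden_dim)
        text_emb = torch.cat([ingr_emb, instr_h[-1]], dim=1)   # step 3: one textual embedding
        img_emb = self.backbone(image)                         # step 4: visual features
        # step 5: project both modalities into the shared recipe-image space
        return self.text_proj(text_emb), self.img_proj(img_emb)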

Dataset

The authors introduce a new dataset which is called Recipe1M. The key components of the recipes are

  • Images: food images of the corresponding recipes. They usually show the final state of the recipe, taken from above and at close range. The food is mostly in a saucepan (the state I see most often) or on a dish. Some images are unclear because they are taken too close.
Images from Recipe1M dataset
  • Ingredients: An example for Mac and Cheese is given below (both ingredients and instructions).

6 ounces penne
2 cups Beechers Flagship Cheese Sauce (recipe follows)
1 ounce Cheddar, grated (1/4 cup)
1 ounce Gruyere cheese, grated (1/4 cup)

  • Instructions:

1. Preheat the oven to 350 F. Butter or oil an 8-inch baking dish.
2. Cook the penne 2 minutes less than package directions.
3. (It will finish cooking in the oven.)
4. Rinse the pasta in cold water and set aside.

The images may not show all of the ingredients, and we cannot follow the instructions visually step by step. Hence we have to design the learning pipeline carefully so that the model can learn what a person can do with these ingredients and steps.

Dataset Statistics

Figure 2 and Table 1 in [1]

Approach

Figure 3 in [1]

An overview of the proposed approach is given above. We can divide the pipeline into the following steps:

  1. Ingredient keywords are extracted before training. For example, 2 tbsp of olive oil is converted to olive_oil. A bidirectional LSTM is trained to extract ingredient names. The authors do not discuss this model in much detail, but they mention that its accuracy is 99.5%, and they label a small portion of the data for this task.
  2. After extracting ingredient names, they obtain word2vec representations of them.
  3. The authors argue that a single LSTM cannot learn from the whole recipe sequence because recipes are too long. Hence they use a two-stage LSTM: first, each instruction sentence is encoded with one LSTM, and the outputs are called skip-instructions; these are then fed to a second LSTM (see the sketch after this list).
  4. The outputs of the two LSTMs (the skip-instruction LSTM and the ingredient-name LSTM) are concatenated. In this way, we obtain a single textual embedding.
  5. In parallel, the recipe image is fed to a backbone, which can be ResNet-50 or VGG-16. They remove the softmax layer of the network and take the last-layer outputs as the visual embedding.
  6. At this point, joint learning starts. The textual and visual embeddings are fed to two separate fully connected layers that project them into the joint recipe-image space. In this step, the embeddings of a matching recipe-image pair are pushed as close to each other as possible.
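A rough sketch of the two-stage encoder from step 3 is given below, assuming the first-stage sentence encoder is already trained; the class and variable names are mine, not the paper's.

import torch
import torch.nn as nn

class TwoStageInstructionEncoder(nn.Module):
    # Stage 1 encodes each instruction sentence into a fixed-size vector
    # (the skip-instructions); stage 2 encodes that sequence of sentence
    # vectors into a single recipe-level instruction embedding.
    def __init__(self, word_dim=300, sent_dim=1024, recipe_dim=1024):
        super().__init__()
        self.sentence_lstm = nn.LSTM(word_dim, sent_dim, batch_first=True)
        self.recipe_lstm = nn.LSTM(sent_dim, recipe_dim, batch_first=True)

    def forward(self, instructions):
        # instructions: list of tensors, one (num_words, word_dim) per sentence
        sentence_embs = []
        for sentence in instructions:
            _, (h, _) = self.sentence_lstm(sentence.unsqueeze(0))  # stage 1
            sentence_embs.append(h[-1])                            # (1, sent_dim)
        sentence_seq = torch.stack(sentence_embs, dim=1)           # (1, num_sentences, sent_dim)
        _, (h, _) = self.recipe_lstm(sentence_seq)                 # stage 2
        return h[-1]                                               # (1, recipe_dim)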

Training

In this joint space, a cosine similarity loss is calculated as follows.

Cosine similarity loss
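As I read the paper, for a visual embedding φ^v, a recipe embedding φ^r, a pair label y (1 for a matching pair, -1 for a negative one), and a margin α, the loss is roughly:

$$
L_{\cos}(\phi^{v}, \phi^{r}, y) =
\begin{cases}
1 - \cos(\phi^{v}, \phi^{r}), & \text{if } y = 1 \\
\max\big(0, \cos(\phi^{v}, \phi^{r}) - \alpha\big), & \text{if } y = -1
\end{cases}
$$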

During training, both positive and negative recipe-image pairs are used. In addition to this loss, the authors add a semantic regularization term for better alignment between image-recipe pairs: the embeddings of both modalities are classified into semantic food categories with shared classifier weights. They argue that sharing these weights across modalities makes learning the representations easier.
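A hedged PyTorch sketch of this objective is given below: nn.CosineEmbeddingLoss implements exactly the margin-based cosine loss above, while the shared classifier, the number of classes, and the weight lam stand in for the semantic regularization and are illustrative rather than the authors' exact setup.

import torch
import torch.nn as nn

num_classes = 1000                                # illustrative number of semantic (food) classes
cos_loss = nn.CosineEmbeddingLoss(margin=0.1)     # margin alpha is illustrative
shared_classifier = nn.Linear(1024, num_classes)  # one classifier shared by both modalities
cross_entropy = nn.CrossEntropyLoss()

def joint_loss(img_emb, rec_emb, pair_label, sem_labels, lam=0.02):
    # pair_label is +1 for matching image-recipe pairs and -1 for negatives;
    # sem_labels are the semantic class labels used for regularization.
    alignment = cos_loss(img_emb, rec_emb, pair_label)
    # semantic regularization: classify both embeddings with the shared weights
    regularization = (cross_entropy(shared_classifier(img_emb), sem_labels) +
                      cross_entropy(shared_classifier(rec_emb), sem_labels))
    return alignment + lam * regularization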

Implementation

Code is publicly available on GitHub [2].

References

[1] Salvador, Amaia, et al. “Learning cross-modal embeddings for cooking recipes and food images.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[2] https://github.com/torralba-lab/im2recipe-Pytorch
