Introducing 🥘 RecipeQA — A Challenge Dataset for Multimodal Comprehension of Cooking Recipes

Reading comprehension aims at building machines that can answer questions related to a given natural language text. As one of the most difficult and attractive problems in artificial intelligence, it requires joint understanding of the question and the context passage, and therefore, an effective reading comprehension model should learn to make sense of what a question asks and which parts of the provided text are most related to the question.

Although much progress has been made in the last couple of years, there is still a significant performance gap between humans and computers, and researchers are pushing forward our understanding of the limitations and capabilities of existing approaches by introducing more complex datasets. These benchmarks differ from each other mainly in their question-answer formats (e.g. cloze, span selection or multiple choice), the text sources they use (such as news articles, fictional stories and Wikipedia articles), and the comprehension skills they test (like temporal reasoning, coreference resolution and multi-step inference) (Sugawara et al., 2017).

Despite the surge of new datasets, some researchers have questioned the difficulty of these benchmarks and shown that, from a machine learning perspective, the tasks might seem to be much harder than they actually are (Chen et al., 2016; Kaushik and Lipton, 2018). It has also been shown that state-of-the-art models tend to learn simple surface cues, and hence can be easily fooled by artificially injected text that matches the patterns they expect (Jia and Liang, 2017).

The RecipeQA Dataset

In our recent paper, which will be presented at the EMNLP 2018 conference, we introduce a new multimodal machine comprehension dataset named RecipeQA. Our benchmark focuses primarily on comprehending procedural knowledge in a multimodal setting, where cooking recipes with accompanying images are used as a testbed. It consists of approximately 20K recipes from 22 food categories and over 36K questions.

Collected from an online how-to web site, each recipe that we employ in RecipeQA includes an arbitrary number of steps containing both textual and visual elements. To our benefit, the nature of the cooking recipes (they all consist of well-defined step-by-step instructions) helped us generate a large set of questions in a fully automatic manner without compromising the quality. Just look at the following recipe to see how structured they can get!

A recipe of ‘Creamy Coconut Chickpea Curry’ with 9 steps.

RecipeQA differs from existing reading comprehension datasets in a number of important ways. First, it leverages real natural language data found online. Second, the use of the visual modality makes RecipeQA less gameable, preventing questions from being easily answerable through shallow signals. Lastly, it involves a large number of images taken by ordinary people in unconstrained environments, as opposed to the images in other multimodal benchmarks, for instance, carefully drawn diagrams or textbook images in TQA (Kembhavi et al., 2017), comic strips in COMICS (Iyyer et al., 2017) or edited videos in MovieQA (Tapaswi et al., 2016).

In particular, RecipeQA contains the following three core challenges unique to the cooking recipes and the multimodal aspects of the questions:

  1. RecipeQA requires identifying and linking entities from different modalities in order to reason effectively in a visually grounded way. At present, multi-modality has been explored only to a limited extent, and most existing multimodal models use very simple strategies when integrating different modalities.
  2. To succeed in RecipeQA, a comprehension system needs to exploit common sense knowledge while working towards an answer. Answering the questions requires identifying entities (e.g. tomato), building conceptual associations between parts of images and certain sections of the recipe, and tracking the states of these entities (e.g. roasted) over time.
  3. RecipeQA presents several different tasks specifically designed for cooking recipes. Described in detail below, each of these tasks evaluates a specific comprehension skill, and handling all of them within a single model requires a multitask learning setup.

Comprehension Tasks in RecipeQA

The first comprehension task, which we formulate as a Textual Cloze task, is in the same vein as other existing reading comprehension benchmarks, e.g. (Hermann et al., 2015; Hill et al., 2016), with the only difference being that it is visually grounded. That is, the input passage additionally includes zero or more images aligned with each step of the recipe. Here, we form cloze questions from the titles of the steps, and the candidate answers are chosen adversarially by following a simple heuristic.

Let’s look at a sample textual cloze style question:

A sample textual cloze style question from RecipeQA.
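To make the question format concrete, here is a minimal sketch of how a textual cloze question could be assembled from step titles. It is not the pipeline used to build RecipeQA: the function name `make_textual_cloze`, the `@placeholder` token, the choice of four candidates, and above all the uniform random sampling of distractors are assumptions for illustration, standing in for the adversarial heuristic mentioned above.

```python
import random

# Hypothetical token marking the blanked-out step title.
PLACEHOLDER = "@placeholder"


def make_textual_cloze(step_titles, distractor_pool, blank_index,
                       num_candidates=4, seed=0):
    """Blank out one step title and build a multiple-choice question.

    Assumption: distractors are sampled uniformly at random from titles
    of other recipes. The real dataset picks candidates adversarially.
    """
    rng = random.Random(seed)
    answer = step_titles[blank_index]

    # The question is the ordered list of titles with one of them hidden.
    question = [PLACEHOLDER if i == blank_index else title
                for i, title in enumerate(step_titles)]

    # Never offer a title that already appears in this recipe.
    pool = sorted(set(distractor_pool) - set(step_titles))
    distractors = rng.sample(pool, num_candidates - 1)

    choices = distractors + [answer]
    rng.shuffle(choices)
    return {
        "question": question,
        "choices": choices,
        "answer": choices.index(answer),  # index of the correct choice
    }


if __name__ == "__main__":
    titles = ["Gather Ingredients", "Cook the Rice", "Fry the Bacon", "Roll It Up"]
    other_titles = ["Preheat the Oven", "Whip the Cream", "Knead the Dough",
                    "Chop the Onions", "Serve and Enjoy"]
    print(make_textual_cloze(titles, other_titles, blank_index=2))
```

A model answering such a question sees the recipe context plus the title list with one gap, and has to pick the index of the title that fills it.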

In contrast to the textual cloze task, our other three comprehension tasks, namely visual cloze, visual coherence, and visual ordering, all take the titles and descriptions of the recipe steps as the context passage, and the questions and answers involve only the visual modality. To succeed in these tasks, a comprehension system needs not only to understand the relations between candidate steps, but also to align and relate the different modalities in the context and the answers, or to understand the temporal occurrence of a sequence of recipe steps and infer temporal relations between candidates.

For example, let’s consider the following ‘Bacon Sushi’ recipe as the input context.

A recipe of ‘Bacon Sushi’ with 7 steps.

A visual cloze question tests a skill similar to that of the textual cloze task, with the difference that the missing information in this task resides in the visual domain. Here is an example:

A sample visual cloze style question from RecipeQA.

A visual coherence question, on the other hand, tests the capability to identify an incoherent image in an ordered set of images. Here is a sample visual coherence style question:

A sample visual coherence style question from RecipeQA.

Lastly, a visual ordering question tests the ability of a system to find the correctly ordered sequence given a jumbled set of representative images of a recipe. And here is one such question:

A sample visual ordering style question from RecipeQA.
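As a rough illustration of the visual ordering format, the sketch below builds one multiple-choice question from a list of image identifiers given in their true recipe order. The function name `make_visual_ordering`, the image identifiers, the number of candidates, and the random choice of wrong orderings are all assumptions made for this example and are not taken from the dataset construction.

```python
import itertools
import random


def make_visual_ordering(image_ids, num_candidates=4, seed=0):
    """Build a multiple-choice ordering question.

    `image_ids` must be distinct and listed in the true recipe order.
    Assumption: wrong orderings are drawn at random from all other
    permutations of the same images.
    """
    rng = random.Random(seed)
    correct = tuple(image_ids)

    # Every permutation except the true order is a possible distractor.
    wrong = [p for p in itertools.permutations(image_ids) if p != correct]
    choices = rng.sample(wrong, num_candidates - 1) + [correct]
    rng.shuffle(choices)

    return {"choices": choices, "answer": choices.index(correct)}


if __name__ == "__main__":
    # Hypothetical identifiers for one representative image per chosen step.
    images = ["step1.jpg", "step3.jpg", "step5.jpg", "step7.jpg"]
    question = make_visual_ordering(images)
    for i, ordering in enumerate(question["choices"]):
        print(i, ordering)
    print("correct choice:", question["answer"])
```

With four images there are 23 wrong permutations to draw from, so three distinct distractors are always available.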


So far, we have only implemented some baselines. As a simple baseline, we adapt the Hasty Student model described in (Tapaswi et al., 2016), which ignores the provided context and answers questions by looking only at the similarities or dissimilarities between the elements of the question and the candidate answers. For our neural baselines, we modify the Impatient Reader in (Hermann et al., 2015), a neural model originally developed for cloze-style text comprehension questions.
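The idea behind a Hasty Student style baseline can be sketched in a few lines. This is an illustrative reading of the description above, not the implementation evaluated in the paper: it assumes every question element and candidate has already been turned into a feature vector by some encoder (not shown), uses cosine similarity as the comparison, and the function names `hasty_cloze` and `hasty_coherence` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hasty_cloze(question_vecs, candidate_vecs):
    """Cloze: pick the candidate most similar, on average, to the
    visible question elements. The recipe context is never used."""
    def score(candidate):
        sims = [cosine(candidate, q) for q in question_vecs]
        return sum(sims) / len(sims)
    return max(range(len(candidate_vecs)),
               key=lambda i: score(candidate_vecs[i]))


def hasty_coherence(image_vecs):
    """Coherence: pick the image least similar, on average, to the
    other images in the sequence."""
    def score(i):
        others = [v for j, v in enumerate(image_vecs) if j != i]
        return sum(cosine(image_vecs[i], o) for o in others) / len(others)
    return min(range(len(image_vecs)), key=score)


if __name__ == "__main__":
    # Toy 2-D features; real features would come from an image or text encoder.
    question = [[1.0, 0.0], [0.9, 0.1]]
    candidates = [[0.0, 1.0], [1.0, 0.05], [-1.0, 0.0], [0.5, 0.5]]
    print(hasty_cloze(question, candidates))    # 1

    sequence = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1]]
    print(hasty_coherence(sequence))            # 2
```

Because the decision depends only on how the candidates relate to the question elements, such a baseline is sensitive to how the candidates were picked in the first place.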

The results that we obtained with these methods demonstrate how hard the tasks are. We believe that RecipeQA will serve as a challenging testbed and an ideal benchmark for evaluating procedural knowledge in a multimodal setting. Here we should also note that the Hasty Student scores much better than the neural models because the candidate answers were selected in a similar way when generating the questions.

Results of the baseline methods on the test set of RecipeQA.

Clearly, what is missing from the above table is human performance. We plan to run experiments with human participants to evaluate the difficulty of the questions. Moreover, we expect to extend the baseline results with more powerful models. Right now, we are working on IR-based QA approaches, which employ visual semantic embeddings for cross-modal retrieval. And of course, we encourage researchers to contribute their ideas to RecipeQA.

You can read our paper, explore the dataset, and check out the leaderboard on our project website.


Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NIPS.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The Goldilocks principle: Reading children’s books with explicit memory representations. In ICLR.

Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III, and Larry Davis. 2017. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives. In CVPR.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.

Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In EMNLP.

Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In CVPR.

Saku Sugawara, Hikaru Yokono, and Akiko Aizawa. 2017. Prerequisite skills for reading comprehension: Multi-perspective analysis of MCTest datasets and systems. In AAAI.

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In CVPR.