In the previous article, we described how to preprocess a dataset of recipes. The next step is to enrich the data with extra information. The goal? Building a food-entity tagged dataset that can be exploited in several different Information Retrieval (IR) tasks.
Anyone who works on ML projects should have no doubt about the utility of such a resource. Indeed, the unavailability or scarcity of training data is one of the most serious challenges in ML, and specifically in NLP. The problem gets even harder when the data you need has to be labeled: tagging data is a costly and time-consuming process, and in many cases the cause of AI project slowdowns.
Scaled to our case, suppose we want to perform NER (Named Entity Recognition) to classify food-related entities. We soon realize that this is nearly impossible, since most off-the-shelf NER models have not been trained to recognize food entities. On top of that, there is no dataset labeled with such entities.
How can we fill this gap? Do we have to go through the exciting (allow us a bit of irony) annotation process? Yes, we actually do! But don't be put off: we devised a smart method to simplify the labeling task. The remainder of the article describes it in detail.
Before starting, we set some ‘rules’ to follow during the tagging process. The first sounds like:
Tag what you want, but tag it coherently
which means trying to adopt the same strategies when tagging the same entities. This helps avoid fluctuating model performance and the risk of losing information.
Following this maxim, we consider as an ingredient, for example:
- any food involved in the process of dish creation, as long as it is i) among the ingredients of the recipe and ii) already made.
So pastry is an ingredient if it is used ready-made in the recipe of the roasted vegetable Tarte Tatin, but it is not if the recipe describes how to prepare the pastry.
What about ingredient attributes? Are they part of the ingredient itself? This basically depends on the level of specificity you want to reach. We decided to consider as parts of the ingredient those attributes that characterize it or distinguish it from a very similar one: for example, sweet and hot in sweet sausages and hot sausages. The same does not hold for attributes implying a physical transformation of the ingredient, such as chopped and sliced in chopped tomatoes or sliced bread.
To tag recipes, we chose a variant of the IOB schema, where the B- and I- tags indicate the beginning and intermediate positions of entities, and O is the default tag. The entities to tag are ingredients, quantifiers, and units of measurement; the labels are INGREDIENT, QUANTIFIER, and UNIT.
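To make the schema concrete, here is a small hypothetical fragment (the tokens and tags are our own illustration, not taken from the dataset) showing how a recipe instruction looks once tagged:

```python
# Hypothetical tagged fragment using the IOB variant described above:
# B-/I- mark the beginning and inside of an entity, O is the default tag.
tokens = ["Add", "200", "g", "of", "extra", "virgin", "olive", "oil"]
tags = ["O", "B-QUANTIFIER", "B-UNIT", "O",
        "B-INGREDIENT", "I-INGREDIENT", "I-INGREDIENT", "I-INGREDIENT"]

tagged = list(zip(tokens, tags))
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

Note how the multi-token ingredient gets one B-INGREDIENT followed by I-INGREDIENT tags, so entity boundaries can be recovered unambiguously.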
TagINGR: a semi-automatic tool for tagging ingredients in recipes
TagINGR is the tool we developed to tag ingredients in recipes. The principle of the approach is pretty simple:
- matching items in the recipes with those in a list;
- adding the tag INGREDIENT when the item is both on the list and in the recipe.
The most bothersome part of the approach is the construction of a list of ingredients that is as complete as possible. To build it, we first extracted ingredients from the initial database and then enriched the list by adding some variants (extravirgin olive oil, extra-virgin olive oil, extra virgin olive oil).
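The variant-enrichment step can be partly automated. The helper below is a hypothetical sketch (not the actual TagINGR code) that generates the hyphenated, spaced, and fused forms of a compound ingredient:

```python
import re

def spelling_variants(ingredient):
    """Hypothetical helper: generate simple spelling variants of a
    compound ingredient (hyphenated, spaced, and fused forms)."""
    variants = {ingredient}
    variants.add(ingredient.replace("-", " "))    # extra-virgin -> extra virgin
    variants.add(ingredient.replace("-", ""))     # extra-virgin -> extravirgin
    variants.add(re.sub(r"\s+", "-", ingredient)) # fully hyphenated form
    return sorted(variants)

print(spelling_variants("extra-virgin olive oil"))
```

A sketch like this will not catch irregular spellings (e.g. abbreviations), so a manual pass over the list is still needed.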
Once the list was built, we wrote Python code to match ingredients in recipes and assign them a tag.
The TagINGR code
We first sorted the list of ingredients by descending length. This is a crucial prerequisite for the next steps, since each recipe ingredient is matched against its first occurrence in the list. Hence, if the list is unsorted and, for example, the ingredient ‘chicken’ comes before ‘chicken broth’, any occurrence of ‘chicken broth’ in recipes will be tagged as (a) instead of the correct (b):
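The sorting step itself is a one-liner. A minimal sketch (the list entries are illustrative):

```python
# Longer entries must come before their substrings, otherwise a short
# ingredient like 'chicken' would shadow 'chicken broth' during matching.
ingredients = ["chicken", "olive oil", "chicken broth", "extra virgin olive oil"]
ingredients.sort(key=lambda ingr: len(ingr.split()), reverse=True)
print(ingredients)
```

Sorting by token count (rather than character count) matches the token-level comparison the tagger performs; either works as long as longer entries come first.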
In Part 2, we defined the function (def recipe_tagger) with its parameters: the language, the ingredient list, and the recipe texts. The function:
- tokenizes the ingredients in the ingredient list (see the previous article for tokenization);
We decided to limit the length of an ingredient to eleven tokens, to reduce errors.
- declares the variables for the ingredient (ingr) and the ingredient tag (ingr_tag), depending on whether the ingredient is one token long (1) or contains more than one token (2);
In Part 3, we tagged recipes:
- the code searches the recipe texts for each ingredient in the list and:
- if the ingredient is one-token long and its PoS tag is NOUN (NN) (see the previous article), the string is tagged as in Part 2 — point 1;
- in the case of multi-token ingredients, the string is tagged with Part 2 — point 2.
- finally, the matched ingredient string is cleaned: the PoS substring is replaced with a blank space.
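The steps above can be condensed into a simplified sketch. This is not the actual TagINGR code (which also handles the language parameter, tokenization, and string cleaning); it only illustrates the core matching logic: longest-first matching, the eleven-token limit, the PoS filter on one-token ingredients, and B-/I- tag assignment. The signature and the `pos_tags` argument are our assumptions for the sake of a self-contained example:

```python
def recipe_tagger(ingredient_list, recipe_tokens, pos_tags):
    """Simplified sketch of the tagging logic (hypothetical signature).
    `pos_tags` is a parallel list of PoS labels for `recipe_tokens`."""
    # Keep only ingredients up to eleven tokens long, longest first.
    ingredients = [i.split() for i in ingredient_list if len(i.split()) <= 11]
    ingredients.sort(key=len, reverse=True)

    tags = ["O"] * len(recipe_tokens)
    for ingr in ingredients:
        n = len(ingr)
        for start in range(len(recipe_tokens) - n + 1):
            window = [t.lower() for t in recipe_tokens[start:start + n]]
            if window != ingr:
                continue
            if any(tag != "O" for tag in tags[start:start + n]):
                continue  # span already covered by a longer match
            if n == 1 and pos_tags[start] != "NOUN":
                continue  # one-token ingredients must also be nouns
            tags[start] = "B-INGREDIENT"
            for k in range(start + 1, start + n):
                tags[k] = "I-INGREDIENT"
    return list(zip(recipe_tokens, tags))

tokens = ["Pour", "the", "chicken", "broth", "over", "the", "toast"]
pos = ["VERB", "DET", "NOUN", "NOUN", "ADP", "DET", "NOUN"]
print(recipe_tagger(["chicken", "chicken broth", "toast"], tokens, pos))
```

In the example, ‘chicken broth’ is matched before ‘chicken’ thanks to the longest-first ordering, and ‘toast’ is tagged only because its PoS label is NOUN.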
As you may have noticed, we made use of the grammatical level provided by PoS tagging: it helps disambiguate some words, like flour or toast, which are ingredients only if they are also nouns.
If you are working with a huge amount of data, our advice is to start tagging it a little at a time. This way, you can easily check the results and, in case some ingredients are not on the list, add them to it. When you run the code again, the missing elements will also be tagged.
Tagging quantifiers and units of measurement
Knowing what an ingredient is can facilitate the labeling of quantifiers and units of measurement. To tag these entities, we usually:
- identify a list of chunks (structures) containing the three entities (the quantifier or the unit can be optional, while the ingredient cannot);
- create a list of units of measurement (while quantifiers are usually digits, cardinal nouns, and articles);
- compile a list of ad hoc regex to tag the structures.
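The three steps above can be sketched with a toy example. The unit list and the regex below are hypothetical illustrations (the real ad hoc patterns are richer), showing one chunk structure: a numeric quantifier, an optional unit drawn from the list, and the following word:

```python
import re

# Hypothetical unit list; quantifiers here are digits or simple fractions.
UNITS = ["g", "kg", "ml", "l", "cup", "cups", "tbsp", "tsp"]
unit_pattern = "|".join(UNITS)

# Chunk: QUANTIFIER + optional UNIT (whole word only) + next word.
chunk_re = re.compile(
    rf"(?P<quantifier>\d+(?:[./]\d+)?)\s+"
    rf"(?:(?P<unit>{unit_pattern})\b\s*)?(?P<rest>\w+)"
)

m = chunk_re.search("Add 200 g flour to the bowl")
print(m.group("quantifier"), m.group("unit"), m.group("rest"))

m2 = chunk_re.search("Beat 2 cups sugar until smooth")
print(m2.group("quantifier"), m2.group("unit"), m2.group("rest"))
```

The `\b` after the unit alternation keeps ‘cup’ from matching inside ‘cups’; once a chunk is found, each captured group can be mapped to its IOB label.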
The final output will be:
The manual effort of the approach basically depends on the completeness of the ingredient list. Note that our idea of ingredient is wide enough to include its nationality, particularities of its taste, compliance with diets, and so on. In the list, you end up finding, for example, the ingredients yogurt, Greek yogurt, low-fat yogurt, plain low-fat Greek yogurt, and so on. If you want a shallower classification, you can easily fill in the list by extracting ingredients from an online lexical database.
In this case, the required time and effort drop dramatically.
The Smart Recipe Datasets in numbers
At the end of the day, we have:
The NER models we trained on the dataset already achieved good performance with 6,000 tagged recipes!
The how and the why in the next post.
When Food meets AI: the Smart Recipe Project
a series of 6 amazing articles