Cleaning and manipulating food data

When Food meets AI: the Smart Recipe Project

Condé Nast Italy
5 min read · Jul 2, 2020

Oscar Wilde said “I can’t stand people who do not take food seriously,” and we totally agree with him. Food is one of the essential things we experience every day, and not just because it is our main source of survival. Recipes, cooking videos, and food photos are everywhere on the web, which is today the greatest archive of food-related content.

The new cover of “La Cucina Italiana”

But what happens when this big amount of data meets Artificial Intelligence? The Smart Recipe Project, born from the cooperation between the global media company Condé Nast and the IT company RES, answered this question by developing AI services able to extract information from food recipes.
The leading idea is that such services might fit many different business use cases, like recommendation engines, voice assistants, smart kitchen tools, and much more.

If you are wondering how we built intelligent systems from nothing more than a lot of recipes, this and the following articles will walk you through the most appetizing stages of our work.

Overview of the project

The Smart Recipe Project can be split into three main stages:

  • PART 1: using NLP (Natural Language Processing) techniques, we enriched the data, tagging entities and adding entity-specific information;
  • PART 2: exploiting state-of-the-art algorithms, like BERT and LSTM, we developed services able to automatically extract ingredients, quantities, nutritional values, and other interesting information from recipes;
  • PART 3: adopting the Amazon Neptune technology, we built graph databases to store and navigate relationships among the data.

In what follows, we describe the steps we took to clean the data and obtain recipe datasets in two different languages, American English and Italian.

Let’s clean our data!

In the beginning… there was only a messy mass of unstructured data.

Unappealing but essential, the data-cleaning phase is a crucial step for any AI or machine learning algorithm. Cleaning techniques mostly depend on the nature of the data and on the final dataset you want to obtain.

In our case, the data were stored in .tsv files, each containing key-value columns with information about the recipes:

a messy mass of unstructured data

At this stage, we wanted to obtain .txt files containing the recipe texts. This piece of information can be found in the ‘content’ column of the database. To compensate for empty ‘content’ cells, we assembled recipes by combining the values of the ‘ingredients’ and ‘steps’ cells.
Cells were extracted using Pandas, a Python library specifically devised to manipulate and analyze data:

Manipulate and analyze with Pandas

The above function takes as arguments the three columns we want to extract and the indices of the first and last dataframe rows to process. The if-else condition outputs the value of the ‘content’ cell or, if it is empty, the combination of the respective ‘ingredients’ and ‘steps’ cells.
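The original snippet appeared as an image; a minimal sketch of the same logic, with hypothetical function and file names (get_recipes, recipes.tsv), might look like this:

import pandas as pd

def get_recipes(content_col, ingredients_col, steps_col, start, end):
    # Load the .tsv file into a dataframe (tab-separated values)
    df = pd.read_csv('recipes.tsv', sep='\t')
    recipes = []
    for i in range(start, end + 1):  # inclusive of the last row
        content = df[content_col].iloc[i]
        # Use the 'content' cell when present; otherwise fall back to
        # combining the 'ingredients' and 'steps' cells
        if pd.notna(content) and str(content).strip():
            recipes.append(str(content))
        else:
            recipes.append(str(df[ingredients_col].iloc[i]) + '\n' + str(df[steps_col].iloc[i]))
    return recipes

recipes = get_recipes('content', 'ingredients', 'steps', 0, 99)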
To get rid of tags, HTML entities, non-alphabetic characters, and other material that is not part of the language, we used a set of ad hoc regular expressions. We stored the regexes in a list of tuples, where the first element represents the string to be searched and the second the string to be substituted:

The regex

The blue tuple, for example, adds a space between some special symbols, easing their subsequent elimination. The two purple regexes instead handle some typing errors: they separate alphabetic from numeric characters (and vice versa) in words like ‘water6’ and ‘6water’.
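The full list was likewise shown as an image. An illustrative excerpt of what such a list of (pattern, replacement) tuples can look like (the patterns below are stand-ins, not the project’s exact ones):

regex_list = [
    (r'<[^>]+>', ' '),               # strip HTML tags
    (r'&[a-z]+;', ' '),              # strip HTML entities such as &amp;
    (r'([!?()\[\]])', r' \1 '),      # add spaces around special symbols to ease their elimination
    (r'([a-zA-Z])(\d)', r'\1 \2'),   # separate letters from digits: 'water6' -> 'water 6'
    (r'(\d)([a-zA-Z])', r'\1 \2'),   # separate digits from letters: '6water' -> '6 water'
]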
The list is then used in the following function:
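That function, also published as an image, can be sketched as a simple loop over the tuples with re.sub (the name clean_recipe follows the function referenced later in this article):

import re

def clean_recipe(text):
    # regex_list is the list of (pattern, replacement) tuples defined above
    for pattern, replacement in regex_list:
        text = re.sub(pattern, replacement, text)
    return text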

The list was edited as we carried out the further preprocessing steps: some regexes were added, for example, to cover cases not handled elsewhere in preprocessing, while others were deleted as redundant. That was the case for the last regex in the list, which was first introduced to tokenize strings like ‘4 ð f’ and then eliminated once we modified the tokenizer.

The following steps consist of tokenizing the sentences and words in the recipes. We used NLTK (Natural Language Toolkit), a Python library designed to manipulate textual data and widely used in the NLP field.
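The corresponding snippet was published as an image as well; a minimal reconstruction consistent with the description below (the function name split_sentences is our own) could be:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # the Punkt sentence-boundary models used by sent_tokenize

def split_sentences(recipe_text, language='english'):
    # Return the list of sentences detected in the recipe text
    return sent_tokenize(recipe_text, language=language)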

In the above code, we first imported the library and then created a function that takes a recipe text as an argument. The sent_tokenize function outputs a list of split sentences. Its parameters are the string to split and its language. Under the hood, it uses an unsupervised algorithm that builds a model for abbreviations, collocations, and words that start sentences, and then uses that model to identify sentence boundaries.

Sentences were then tokenized into words with the MWETokenizer module. We used a multi-word-expression tokenizer instead of the more common word tokenizer because we noted that the latter failed to tokenize some strings correctly. Let’s look at how the two tokenizers treat the string ‘1⁄2 cup milk’:

output word_tokenize: ['1', '/', '2', 'cup', 'milk']
output MWETokenizer:  ['1⁄2', 'cup', 'milk']

The first tokenizer splits the fraction, treating the fraction symbol ‘/’ as a token of its own. The second module instead correctly tokenizes the string, considering the whole fraction as a single token. This is because it is designed to identify multi-word expressions (in linguistics, two or more words behaving as a single unit), and in a broad sense the fraction can be considered a multi-word expression.
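For illustration, here is how such a setup can look in NLTK; the multi-word-expression list below is a toy example limited to the fraction above, not the project’s actual list:

from nltk.tokenize import MWETokenizer

# Register the pieces of the fraction as one multi-word expression,
# merged back together with an empty separator
mwe_tokenizer = MWETokenizer([('1', '⁄', '2')], separator='')

# Tokens as produced by a plain word-level tokenizer
raw_tokens = ['1', '⁄', '2', 'cup', 'milk']
print(mwe_tokenizer.tokenize(raw_tokens))  # ['1⁄2', 'cup', 'milk']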
At this stage, the dataset looks like this:

recipe_0     recipe_12
heaping      put
tbsp         the
.            dough
of           on
flour        top
...          of
             them
             and
             flour
             it
             ...

Before moving on to the next stage, we performed some disambiguation. We noted, indeed, that some words, though graphically identical, have different meanings. Take the word ‘flour’ in the above recipes: it can denote i) the ingredient (‘the flour’), which is a noun, or ii) the act of spreading flour (‘to flour’), which is a verb. Only the first meaning is of interest to us and will be tagged as an ingredient in the next stage. Since the two tokens have different grammatical roles (noun vs. verb), we exploited this feature to disambiguate the two cases. We used the NLTK pos_tag module, which assigns each token in the dataset a Part of Speech (its grammatical role):
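The tagging snippet was also shared as an image. A minimal sketch, reusing the clean_recipe function sketched earlier and NLTK’s word_tokenize as a stand-in for the project’s own tokenize function, might be:

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger')  # the model used by pos_tag

def tag_recipe(raw_text):
    # Clean the raw recipe text, tokenize it into words, then attach
    # a Part-of-Speech tag to every token
    tokens = word_tokenize(clean_recipe(raw_text))
    return pos_tag(tokens)  # a list of (token, POS tag) pairs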

The code first calls the clean_recipe and tokenize functions and then tags each token in the resulting list. The output is a list of words with their respective grammatical tags:

recipe_0       recipe_12
heaping JJ     put VB
tbsp NN        the DT
. .            dough NN
of IN          on IN
flour NN       top NN
...            of IN
               them PRP$
               and CONJ
               flour VB
               it PRP
               ...

In the next article, we will show you how we labeled food-related entities in the recipe dataset.

