NER for all tastes: extracting information from cooking recipes

When Food meets AI: the Smart Recipe Project

Conde Nast Italy
Jul 20, 2020 · 8 min read
The great Carbonara

In the previous articles, we constructed two labeled datasets to train machine learning models. The aim is to develop systems able to interpret and extract information from recipes. We categorized these systems into four main categories: extractors, classifiers, regressors, and searchers.

This post explores the extractor services, which pull structured information (ingredients, quantities, preparation time, etc.) out of recipes. To achieve that, we used Named Entity Recognition (NER).

What you will find in the article:

  • NER: the task and its main applications.
  • An overview of NER approaches.
  • NER for the Smart Recipe Project.
  • Results and next steps.

What is NER?

NER is a two-step process consisting of a) identifying entities (a token or a group of tokens) in documents and b) categorizing them into predetermined categories such as Person, City, Company… For our task, we created our own categories: INGREDIENT, QUANTIFIER, and UNIT. For example, in “Add 200 g of flour”, flour is an INGREDIENT, 200 a QUANTIFIER, and g a UNIT.

NER is a very useful NLP technique for grouping and categorizing large amounts of data that share similarities and relevance. Because of this, it can be applied to many business use cases:

Human resources. It speeds up the hiring process by summarizing and skimming applicants’ CVs.

Customer support. It reduces response times by categorizing user requests, complaints, and questions.

Search and recommendation engines. It improves the speed and relevance of search results and recommendations.

Content classification. It classifies documents by content and highlights trends by identifying the subjects and themes of posts and news articles.

Health care. It improves patient care standards and reduces workloads by extracting essential information from lab reports.

Education and Academia. It allows students and researchers to find relevant material faster by summarizing archive material and highlighting key terms, topics, and themes.

Before jumping into our models, we propose a general overview of NER approaches.

NER approaches

NER approaches can be grouped into three main families:

Classical approaches. They are mainly rule-based. These NER systems rely on hand-crafted syntactic, lexical, and semantic rules.

Feature-based approaches. Here, annotated data are used as training examples from which features are extracted. ML algorithms are then trained on these data to learn a model able to recognize similar patterns in unseen data. Common ML algorithms are Hidden Markov Models (HMMs), Decision Trees, Maximum Entropy Models, Support Vector Machines (SVMs), and Conditional Random Fields (CRFs, see below).

Deep learning approaches. In recent years, DL-based NER models have become dominant, achieving state-of-the-art results. In contrast to feature-based approaches, deep learning methods are able to discover hidden features automatically, learning representations from raw or partially processed data.

The strength of DL-based methods lies in a) learning non-linear mappings via activation functions; b) requiring far less manual feature engineering; and c) being trainable end to end via gradient descent.

NER for the Smart Recipe Project

For the Smart Recipe Project, we trained four models: a CRF model, a BiLSTM model, a combination of the previous two (BiLSTM-CRF), and the NER Flair NLP model.

CRF model

Linear-chain Conditional Random Fields (CRFs) are a very popular way to model sequence prediction. They work by modeling the conditional probability of the output label sequence given an input word sequence. CRFs are discriminative models able to overcome some shortcomings of their generative counterpart: while an HMM models the joint probability distribution, a CRF models the conditional probability distribution.

In simple terms, while a generative classifier tries to learn how the data were generated by estimating the assumptions and distributions of the model, a discriminative one models the decision directly from the observed data. It basically makes fewer assumptions about the distributions.

Fig.1 CRF Network

In addition to this, CRFs do not assume that labels are independent of each other. Indeed, they take into account the features of the current and previous labels in a sequence. This increases the amount of information the model can rely on to make a good prediction.

The feature function of a CRF model looks like:

f(X, i, l_{i-1}, l_i)

Where:

  • X is the set of input vectors
  • i is the position of the data point we are predicting
  • l_{i-1} is the label of data point i-1 in X
  • l_i is the label of data point i in X

The feature function expresses some characteristics of the data point and the context.

For instance, if we are using CRF for NER:

f(X, i, l_{i-1}, l_i) = 1 if l_{i-1} is B-ENTITY and l_i is I-ENTITY; 0 otherwise.

And

f(X, i, l_{i-1}, l_i) = 1 if l_{i-1} is O and l_i is B-ENTITY; 0 otherwise.

Each feature function depends on the label of the previous word and that of the current word, and it evaluates to either 0 or 1 (for more maths, take a look at this article).
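
As a minimal sketch in plain Python (the tag names and arguments are illustrative), the two indicator functions above could be written as:

```python
# Two binary feature functions mirroring the examples above (illustrative only).
def f_inside_entity(X, i, prev_label, label):
    # Fires when the current token continues an entity started by the previous token.
    return 1 if prev_label == "B-ENTITY" and label == "I-ENTITY" else 0

def f_entity_start(X, i, prev_label, label):
    # Fires when an entity starts right after a token outside any entity.
    return 1 if prev_label == "O" and label == "B-ENTITY" else 0
```

In a trained CRF, each such function is multiplied by a learned weight, and the weighted sums are normalized into a probability over label sequences.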

For the task, we used the Stanford NER algorithm, which is an implementation of a CRF classifier.

Although CRFs outperform many other systems in terms of accuracy, they cannot exploit the context of forward labels, which instead plays a crucial role in sequential tasks like NER. This, plus the extra feature engineering involved in training, makes them less appealing for industry.
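
For readers who want to experiment quickly in Python, a comparable CRF tagger can be trained with the sklearn-crfsuite library. This is only an illustrative sketch, not the Stanford classifier we actually used; the features, tags, and toy data below are hypothetical:

```python
import sklearn_crfsuite

def token_features(sentence, i):
    # A few simple hand-crafted features for the i-th token (illustrative only).
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev.lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

# X: one feature dict per token per sentence; y: the corresponding BIO labels.
sentences = [["Add", "six", "oz", "white", "flour"]]
labels = [["O", "B-QUANTIFIER", "B-UNIT", "B-INGREDIENT", "I-INGREDIENT"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))
```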

BiLSTM with character embeddings

Going neural, we trained a Long Short-Term Memory (LSTM) model, which represents the state of the art for sequential tasks. LSTM networks are a type of Recurrent Neural Network (RNN) in which the hidden layer updates are replaced by purpose-built memory cells. As a result, they are better at finding and exploiting long-range dependencies in the data.

To benefit from both past and future context, we used a bidirectional LSTM model (BiLSTM), a particular type of LSTM which processes the text in two directions: both forward (left to right) and backward (right to left). This allows the model to uncover more patterns as the amount of input information increases.

Fig.2 BiLSTM network

Fig. 2 sketches the main components of a BiLSTM network:

  • x stands for the input sentence;
  • h represents the output sequence for backward and forward runs;
  • σ concatenates backward and forward elements;
  • y stands for the output.

Instead of only considering word-level representations, we incorporated character-based word representations. Character-level representations exploit explicit sub-word information such as prefixes and suffixes. Since they naturally handle out-of-vocabulary words, they can infer features for unseen words and share information about morpheme-level regularities.
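
The sketch below (PyTorch, hypothetical layer sizes; not the exact architecture we trained) shows how a character-level BiLSTM feature can be concatenated to word embeddings before the word-level BiLSTM:

```python
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    # Word-level BiLSTM fed with word embeddings concatenated to a
    # character-level BiLSTM feature per word (all sizes are hypothetical).
    def __init__(self, vocab_size, char_vocab_size, num_tags,
                 word_dim=100, char_dim=25, char_hidden=25, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, w, -1)
        _, (h, _) = self.char_lstm(chars)                 # final states of both directions
        char_feat = torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)
        words = self.word_emb(word_ids)
        feats, _ = self.word_lstm(torch.cat([words, char_feat], dim=-1))
        return self.out(feats)                            # per-token tag scores

# Toy forward pass: 2 sentences, 10 tokens each, up to 12 characters per token.
model = CharBiLSTMTagger(vocab_size=5000, char_vocab_size=100, num_tags=7)
scores = model(torch.zeros(2, 10, dtype=torch.long), torch.zeros(2, 10, 12, dtype=torch.long))
```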

NER Flair NLP

This model belongs to the Flair NLP library, developed and open-sourced by Zalando Research. The strength of the model lies in a) the use of state-of-the-art character, word, and contextual string embeddings (like GloVe, BERT, ELMo…) and b) the possibility to easily combine these embeddings.

Contextual string embeddings help to contextualize words. These embeddings use the internal states of a trained character language model to capture word meaning in context and therefore produce different embeddings for polysemous words (the same word with different meanings). Moreover, by treating words and their context fundamentally as sequences of characters, they make it easier to handle rare and misspelled words.

To better understand what we are talking about, let’s look at the following sequences:

a) Add six oz white flour.
b) Flour the tray.
c) Use the potato masher, to peel them.
d) Peel one sweet potato.

A model fed with contextual string embeddings should be able to tag as INGREDIENT only the occurrences of flour in a) and potato in d).

Fig.3 Context String Embedding network
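
As an illustrative sketch of how such a tagger can be assembled with the Flair library (API as of Flair 0.x, circa 2020; the corpus paths, column format, and hyperparameters below are hypothetical):

```python
from flair.data import Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Recipe corpus in CoNLL-style column format: token + BIO tag (hypothetical files).
corpus = ColumnCorpus("data/recipes", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

# Stack classic word embeddings with contextual string embeddings.
embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="ner", use_crf=True)
ModelTrainer(tagger, corpus).train("models/recipe-ner",
                                   learning_rate=0.1, mini_batch_size=32, max_epochs=50)

# After training, the tagger can label new sentences.
sentence = Sentence("Peel one sweet potato.")
tagger.predict(sentence)
print(sentence.get_spans("ner"))
```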

BiLSTM-CRF

Last but not least, we tried a hybrid approach: we added a CRF layer on top of a BiLSTM model. The advantage (well explained here) of such a combo is that the model can efficiently use both 1) past and future input features, thanks to the bidirectional LSTM component, and 2) sentence-level tag information, thanks to the CRF layer. The role of this last layer is to impose some additional constraints on the final output. The constraints could be:

  • The label of the first word in a sentence should start with “B-“ or “O”, not “I-“
  • “B-label1 I-label2 I-label3 …” is valid only when label1, label2, and label3 refer to the same entity type.
  • “O I-label” is invalid.

These constraints might help in reducing the number of invalid predicted label sequences.
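
To make the constraints concrete, here is a minimal sketch in plain Python (with a hypothetical tag set) of the transition rules such a layer can enforce:

```python
# Illustrative check of the BIO transition constraints listed above.
def allowed(prev_tag, tag):
    # An I- tag is valid only right after B- or I- of the same entity type.
    if tag.startswith("I-"):
        return prev_tag in (f"B-{tag[2:]}", f"I-{tag[2:]}")
    return True  # O and B- tags can follow anything

def is_valid_sequence(tags):
    # The first tag must be O or B-, and every transition must be allowed.
    if tags and tags[0].startswith("I-"):
        return False
    return all(allowed(p, t) for p, t in zip(tags, tags[1:]))

print(is_valid_sequence(["O", "B-QUANTIFIER", "B-UNIT", "B-INGREDIENT", "I-INGREDIENT"]))  # True
print(is_valid_sequence(["O", "I-INGREDIENT"]))                                            # False
```

In a BiLSTM-CRF these rules are not hard-coded: the CRF layer learns transition scores that make such invalid patterns very unlikely in the decoded output.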

An eye on performance

The performance of the four models is shown in the table below. Here the F1 scores of the initial (B-ENTITY) and intermediate (I-ENTITY) tags are merged:

The final output is a JSON file containing the entities with their labels:
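
The exact schema depends on the service, but the output looks roughly like the following (a hypothetical example built in Python; the field names are illustrative):

```python
import json

# Hypothetical extractor output for "Add six oz white flour."
result = {
    "sentence": "Add six oz white flour.",
    "entities": [
        {"text": "six", "label": "QUANTIFIER"},
        {"text": "oz", "label": "UNIT"},
        {"text": "white flour", "label": "INGREDIENT"},
    ],
}
print(json.dumps(result, indent=2))
```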

We are currently working to further improve these results!

In the next article…

We will present the Smart Ingredient Classifier, a classification system built on the BERT architecture that assigns ingredients to their taxonomic class. Read the article to find out about its implementation and main applications.


Conde Nast Italy

Condé Nast Italia is a multimedia communication company that reaches a profiled audience thanks to its numerous omnichannel properties.