FoodGraph: Loading data and Querying the graph with SPARQL

When Food meets AI: the Smart Recipe Project

Conde Nast Italy
Aug 7, 2020
Have you ever tried a Maritozzo?

In the previous post, we converted the recipe data, stored in JSON files, into RDF triples. In this post, we show you how we loaded those triples into a graph database and how we query the graph with SPARQL.

Why connect recipe data?

We have already talked about the potential of connected data, but in practice, what can FoodGraph be used for?

Today we are overwhelmed by online recipe archives where we can easily find recipes that fit our requirements. However, connecting recipe data under a graph database structure can considerably improve the user experience. Applications that can be built on such a technology include:

  • Question answering systems (QA), able to answer natural language questions like “what recipe can I make with potatoes and ham?”;
  • Recommendation engines, which not only recommend food items and recipes but can also plan personalized menus by combining many recipes into complete meals;
  • A connected net of knowledge, where recipe and cooking data are linked to online archives from other organizations. Imagine, for instance, connecting recipe data with data about health, travel, social trends, and much more.

The SPARQL query

Before diving into the specifics of the graph-building process, we show you the general structure of a SPARQL query. SPARQL is an RDF query language, namely a semantic query language for databases, able to retrieve and manipulate data stored or viewed in the RDF format. We used SPARQL to accomplish two main goals:

  • checking the graph and inserting new data;
  • extracting knowledge from the graph.

A SPARQL query comprises:

  • Prefix declarations, for abbreviating URIs.
  • Dataset definition, stating what RDF graph(s) are being queried.
  • A result clause, identifying what information to return from the query.
  • The query pattern, specifying what to query for in the dataset.
  • Query modifiers, for slicing, ordering, filtering, and otherwise rearranging query results.
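
Putting these parts together, a query of the kind discussed here might look like the following sketch (the foaf prefix is the standard FOAF vocabulary; the DBpedia resource URI is an illustrative assumption):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name ?homepage
WHERE {
  # start from a known resource and follow its foaf:knows links
  <http://dbpedia.org/resource/Tim_Berners-Lee> foaf:knows ?person .
  ?person foaf:name     ?name .
  ?person foaf:homepage ?homepage .
}
```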

In the query above, we ask for the homepage of anyone known by Tim Berners-Lee. The elements preceded by “?” are the unknown variables we want to find (you can learn more about SPARQL and RDF triples in the previous articles).

Loading data on Amazon Neptune

We loaded the RDF triples into the Amazon Neptune service following the procedure described below.

We used Amazon Simple Storage Service (Amazon S3), an object storage service offering industry-leading scalability, data availability, security, and performance.

First we created an S3 bucket; then we uploaded the data. The Amazon Neptune load API requires a specific data format: among the possible ones, we chose the Turtle (turtle) format for RDF / SPARQL.
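
The bulk load itself is started through Neptune's loader endpoint. A sketch of the call follows; the endpoint, bucket name, IAM role ARN, and region are placeholder assumptions to be replaced with your own values:

```shell
# Start a Neptune bulk-load job from S3 (placeholder values throughout).
curl -X POST https://your-neptune-endpoint:8182/loader \
  -H 'Content-Type: application/json' \
  -d '{
        "source" : "s3://your-bucket/recipes/",
        "format" : "turtle",
        "iamRoleArn" : "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
        "region" : "eu-west-1"
      }'
# Neptune answers with a loadId that can be polled at /loader/{loadId} for status.
```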

In this first phase, we loaded the RDF data to build the first level of the graph (see the previous article): the recipe id, the recipe in string format, the recipe language code, and the language in string format.

If we want to add a few recipes at a time, we can alternatively use the SPARQL INSERT DATA statement:
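
A sketch of such a statement, covering the first-level data described above (the fg: prefix and the property names are illustrative assumptions, not the project's actual vocabulary):

```sparql
PREFIX fg: <http://example.org/foodgraph/>

INSERT DATA {
  # one recipe node with its text, language code, and language string
  fg:recipe_001 fg:recipeText   "Maritozzi: mix flour, yeast, milk..." ;
                fg:languageCode "it" ;
                fg:language     "Italian" .
}
```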

Integrating the extractor and classifier services within the graph

Once the recipes were loaded, we checked whether any recipes had not yet been processed by the extractor and classifier services. This means checking which recipes do not have:

i) food entity chunks extracted (the bnode in the graph, see the previous article);

ii) ingredients classified.

The SPARQL query below checks whether bnodes exist in the graph (through the FILTER NOT EXISTS statement), and therefore whether recipes have been processed by the extractor system. This is equivalent to saying “return all the recipes without bnodes”.
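
A sketch of such a check (the fg: prefix and the property names are illustrative assumptions):

```sparql
PREFIX fg: <http://example.org/foodgraph/>

SELECT ?recipe_id
WHERE {
  ?recipe_id fg:recipeText ?recipe_text .
  # keep only recipes that have no extracted food-entity chunks (bnodes)
  FILTER NOT EXISTS { ?recipe_id fg:hasChunk ?bnode . }
}
```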

The output will be the ids of the recipes that have not been processed yet.

Once the NER model completes its work, we obtain JSON files with the extracted entities and the model date. As seen in the previous article, the data in the JSON constitute the nodes of the second level of the graph. These data are then converted into RDF triples and loaded into Amazon Neptune via S3.

The same procedure of load-check-process is also executed for the classifier service.

Extracting knowledge from the graph via SPARQL

Now the graph is on Amazon Neptune. The nodes are connected to other nodes via properties and together form a knowledge base for the cooking and food domain.

Let’s have fun with these connections by extracting knowledge from the graph:
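
A query of this kind might be sketched as follows, navigating from recipe to bnode to ingredient as described in the previous article (the fg: prefix and the property names are illustrative assumptions):

```sparql
PREFIX fg: <http://example.org/foodgraph/>

SELECT ?recipe_id
WHERE {
  ?recipe_id fg:hasChunk      ?bnode .    # recipe -> food-entity chunk
  ?bnode     fg:hasIngredient ?ingr_id .  # chunk  -> ingredient
  ?ingr_id   fg:label         "butter" .  # ingredient with label "butter"
}
```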

With the above query we interrogate the graph to learn 1) whether there are recipes containing the ingredient “butter” and 2) which recipes they are. The WHERE statement navigates the graph following the pattern described in the triples to arrive at the query result. In this case, the output is the ids of the recipes that contain the ingredient “butter”.

We can also query the graph for recipes containing more than one ingredient, or for all the recipes containing some ingredients and not others:
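
A sketch of such a query, again with assumed fg: property names, matching two ingredients and excluding a third:

```sparql
PREFIX fg: <http://example.org/foodgraph/>

SELECT ?recipe_id
WHERE {
  # recipes containing butter...
  ?recipe_id fg:hasChunk      ?bnode1 .
  ?bnode1    fg:hasIngredient ?ingr_id1 .
  ?ingr_id1  fg:label         "butter" .

  # ...and flour...
  ?recipe_id fg:hasChunk      ?bnode2 .
  ?bnode2    fg:hasIngredient ?ingr_id2 .
  ?ingr_id2  fg:label         "flour" .

  # ...but not eggs
  FILTER NOT EXISTS {
    ?recipe_id fg:hasChunk      ?bnode3 .
    ?bnode3    fg:hasIngredient ?ingr_id3 .
    ?ingr_id3  fg:label         "eggs" .
  }
}
GROUP BY ?recipe_id
```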

Note that the unknown variables must have different names (ingr_id1, ingr_id2, bnode1, bnode2). The GROUP BY statement groups the results so that each recipe occurs only once, even when the searched ingredient appears several times in the recipe.

The Smart Recipe Project: what has been done, what can be done

With this last article, we conclude our illustration of the main stages of the Smart Recipe Project, an innovative project involving the global media company Condé Nast on one side and the IT company RES on the other.

We started with raw data consisting of digitized recipes from famous Italian and international online cooking magazines such as Epicurious, La Cucina Italiana, and Bon Appétit.

We then enriched these data, labeling some entities (ingredients, units of measurement, quantifiers) and identifying the taxonomic class of ingredients. We thus obtained two datasets for training ML and DL models to perform two tasks: i) automatically extracting food entities from recipes (the extractor system), and ii) automatically identifying their taxonomic class (the classifier system).

We finally decided to connect the data and the output of the systems under a graph database structure. The goal was the creation of a knowledge base, named FoodGraph, where the different pieces of recipe data are connected to form a deep net of knowledge.

We have in mind some possible interesting applications for the resources we developed under the Smart Recipe Project:

  • personalization of contents, personalized recipe searchers, newsletters;
  • recommendation systems for food items, recipes, and menus, which integrate, where needed, dietary restrictions;
  • virtual assistants, able to guide you in planning and cooking meals;
  • smart cooking devices, and much more.


Conde Nast Italy

Condé Nast Italia is a multimedia communication company that reaches a profiled audience thanks to its numerous omnichannel properties.