FoodGraph: a graph database to connect recipes and food data

When Food meets AI: the Smart Recipe Project

Conde Nast Italy
Aug 3, 2020 · 10 min read
Delicious Cuttlefish

All knowledge is connected to all other knowledge. The fun is making the connections.

In the previous articles we saw the potential of enriched data and how it can feed ML and DL models to develop intelligent systems. We then moved a step further, connecting the data and the output of the extractor and classifier services (see the previous articles) under a graph database architecture. Graph databases are an innovative, powerful approach to the problem of connected data, one that is closer to how humans think about data.

Connected data matters

We live in a world made of connections, where isolated pieces of information are not enough to create and model knowledge. The more connected the data, the more its real value comes from relationships.

From this perspective, graph databases make it possible to store, process, and query connections more efficiently, and to capture complex relationships in vast webs of information. Common use cases include social networks, recommendation systems, business relationships, geospatial applications, fraud detection, and much more.

Graph databases for the food domain

Condé Nast Italy and RES have implemented a graph database for the food domain. The goal is a knowledge base, named FoodGraph, where the different pieces of recipe data are connected to form a deep net of knowledge.

In this two-section post:

(SECTION 1): we give you some insights into the concepts and technologies used in designing a graph database;

(SECTION 2): we show you our method for converting the JSON files containing the recipe data into RDF triples, the data model we chose for building the graph.

Section 1: keys and concepts to build a graph database

What is a graph database? Graph databases are a NoSQL way to store and handle data and the relationships among it, where relationships are as important as the data itself. In contrast to other approaches, which compute relationships at query time, graph databases are designed around relationships from the start, since they store connections alongside the data in the model.

The building blocks of a graph database are:

  • Nodes or vertices → The discrete elements composing the graph. They are the constructs standing for real-world entities participating in relationships. They can hold any number of attributes or properties and can be tagged with labels identifying their role in the domain.
  • Edges or links → They represent the connections and relationships among nodes and express the properties existing between the entities. Edges can be directed or undirected and may carry values such as a weight or a name.
Building blocks of a graph database (RDF graph)

RDF: a data model to build the graph database. RDF stands for Resource Description Framework and is a data model that describes the semantics, or meaning, of information.

RDF is a standard, uniform model for interchanging and connecting data, even when the underlying schemas differ. It represents a foundation for publishing and linking data, not only in the Semantic Web (see below) but wherever high-quality connected data is needed. The advantage of representing data in RDF is that any resource can be identified, disambiguated, and interconnected by machines and systems.

The core structure of an RDF model is a set of triples, each consisting of a subject, a predicate, and an object, which together form an RDF graph or triple store.

Each RDF statement states a single thing about its subject (in purple) by linking it to an object (in red) by means of a predicate (in green), the property.

In the example above, the triple states “The technical report on RDF syntax and grammar has the title RDF/XML Syntax Specification.”
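
As a minimal sketch, that statement can be written as a single Turtle triple and saved to a file from Python. The subject and predicate URIs below follow the classic example from the RDF specifications (the Dublin Core “title” property is assumed here):

# One RDF triple: subject, predicate, object, terminated by a dot (Turtle syntax).
triple = (
    "<http://www.w3.org/TR/rdf-syntax-grammar> "
    "<http://purl.org/dc/elements/1.1/title> "
    '"RDF/XML Syntax Specification" .'
)

# Writing this single line to a .ttl file already produces a tiny RDF graph.
with open("example.ttl", "w") as f:
    f.write(triple + "\n")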

There can be three kinds of nodes in an RDF graph:

  • URI nodes: URI stands for Uniform Resource Identifier and is a string of characters used to identify a resource. The most common type of URI is the URL (Uniform Resource Locator), which is used to identify Web resources. URIs enable us to define a new concept just by defining a URI for it somewhere on the Web. Moreover, they help machines precisely identify the entities referred to and how each piece of data relates to other pieces. This provides meaning and visibility, since it allows anyone to use URIs to declare their own graph and possibly connect it to others.
  • Literal nodes: these nodes are used for holding values such as strings, numbers, and dates. A literal node cannot be the subject of an RDF triple and therefore no connection can start from it.
  • Blank nodes: they represent anonymous resources, i.e. those for which a URI or literal value is not given.
Type of nodes in an RDF graph
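
To make the three node types concrete, here is a small sketch using the Python library rdflib (assumed here only for illustration; the namespace and values are hypothetical): a URI node for a recipe, a blank node for an anonymous resource, and literal nodes for plain values.

from rdflib import Graph, URIRef, BNode, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()
recipe = URIRef("http://example.org/recipe/123")  # URI node: an identified resource
chunk = BNode()                                   # blank node: an anonymous resource
name = Literal("Delicious Cuttlefish")            # literal node: a plain value

g.add((recipe, EX.name, name))        # a literal can only appear as an object
g.add((recipe, EX.hasChunk, chunk))   # a blank node can be subject or object
g.add((chunk, EX.value, Literal(2)))

print(g.serialize(format="turtle"))   # rdflib 6+ returns the Turtle text as a string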

Ontologies and Vocabularies. An ontology is a formal description of a knowledge domain as a set of concepts and the relationships that hold between them. To enable such a description, we need to formally specify components such as individuals (instances of objects), classes, attributes, and relations, as well as restrictions, rules, and axioms.

Ontologies not only provide a sharable and reusable knowledge representation but can also add new knowledge about the domain. They help data integration when the terms used in different datasets are ambiguous, and they organize knowledge, including knowledge coming from datasets owned by different organizations. Though an ontology is a more powerful model for connecting and interchanging data than RDF alone, its use and complexity depend on the application.

Logic and Inferences. Another important component of linked data is the possibility of performing inferences (or reasoning) on data through rules defined alongside the data itself. Inference means that automatic procedures performed by inference engines (or “reasoners”) can generate new relationships based on the data and some additional information in the form of an ontology. Thus the database can be used not only to retrieve information but also to deduce new information from facts in the data.
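
As a toy sketch of this idea in plain Python (the food classes are hypothetical), a single RDFS-style rule, “if X has type C and C is a subclass of D, then X also has type D”, can generate a triple that was never stated explicitly:

# Explicitly stated triples: (subject, predicate, object)
triples = {
    ("example:flour", "rdf:type", "example:BakingIngredient"),
    ("example:BakingIngredient", "rdfs:subClassOf", "example:Ingredient"),
}

# Apply the subclass rule until no new facts are produced.
changed = True
while changed:
    changed = False
    for s, p, c in list(triples):
        if p != "rdf:type":
            continue
        for c2, p2, d in list(triples):
            if p2 == "rdfs:subClassOf" and c2 == c and (s, "rdf:type", d) not in triples:
                triples.add((s, "rdf:type", d))
                changed = True

# The inferred fact ("example:flour", "rdf:type", "example:Ingredient") is now in the set.
print(sorted(triples))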

SPARQL. SPARQL is an RDF query language, namely a semantic query language for databases, able to retrieve and manipulate data stored in RDF format. Though it is not the only existing RDF query language, it is the W3C Recommendation for this purpose. SPARQL can express queries both on data originally stored as RDF and on data viewed as RDF. It contains capabilities for querying required and optional graph patterns (more details in the next article), and it supports aggregation, subqueries, negation, and constraints on query results.

The results of SPARQL queries can be result sets or RDF graphs, returning the resources for all triples that match the specified patterns.
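
For illustration, here is a small sketch of a SPARQL SELECT query run with the Python library rdflib over a tiny in-memory graph; the recipe data and property names are hypothetical, loosely modeled on schema.org:

from rdflib import Graph

turtle_data = """
@prefix schema: <http://schema.org/> .
@prefix ex: <http://example.org/> .

ex:recipe1 schema:name "Cuttlefish stew" ;
           schema:recipeIngredient ex:cuttlefish .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

query = """
PREFIX schema: <http://schema.org/>
SELECT ?recipe ?name ?ingredient
WHERE {
    ?recipe schema:name ?name ;
            schema:recipeIngredient ?ingredient .
}
"""

# Each row of the result set binds the variables of the SELECT clause.
for row in g.query(query):
    print(row.recipe, row.name, row.ingredient)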

The combination of the RDF standard format for data and the SPARQL standard query language enables an extended version of the current Web: the Semantic Web. The Semantic Web is the effort to structure the meaningful content of web pages in order to provide a means of integration over different sources of information.

Amazon Neptune. Amazon Neptune is a graph database service that simplifies building and integrating applications that work with highly connected datasets. Its engine can store billions of relationships, which can be navigated and queried quickly. Amazon Neptune supports the W3C’s RDF standard and its SPARQL query language. This technology has already been used for fraud detection, recommendation engines, and much more. We used it to build FoodGraph (see the next article).
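
As a hedged sketch of how a SPARQL query can be sent to Neptune over its HTTP endpoint (the endpoint URL below is a placeholder, and authentication, IAM request signing, and error handling are omitted):

import requests

# Placeholder: replace with the actual Neptune cluster endpoint.
NEPTUNE_SPARQL_URL = "https://your-neptune-endpoint:8182/sparql"

query = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o . }
LIMIT 10
"""

# Per the SPARQL 1.1 Protocol, the query is sent as a form-encoded 'query' parameter.
response = requests.post(NEPTUNE_SPARQL_URL, data={"query": query})
response.raise_for_status()
print(response.text)

In practice, a Neptune cluster is reachable only from inside its VPC, and requests may also need to be signed with AWS credentials when IAM authentication is enabled.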

Section 2: converting JSON files to RDF

The first step for building the graph database consists of converting the JSON files, containing the recipe data, into RDF triples.

After evaluating other possible alternatives, we chose a custom approach for the conversion: with a few lines of code, we extracted the data from the JSON files (using the Python library json) and converted it into RDF triples (in Turtle format), writing the RDF structure manually. This approach fits our task well since the number of data types to convert is relatively small.

In general, the procedure to build the RDF triples consists of three steps:

  • Prefix declaration → The prefixes identify the ontologies/vocabularies describing the properties, classes, entities, and attributes used to build the graph. These elements can indeed be referenced in a triple via their full URI or via a namespace prefix. In Turtle format, prefix declarations are introduced by “@prefix” and stand at the beginning of the Turtle document.
  • Data extraction → Using the Python library json, we extract the data contained in the JSON array. This data represents the nodes of the RDF graph.
  • Writing RDF triples → Using the extracted data and the declared vocabularies, we manually write the RDF triples to a Turtle file. This file will then be loaded into Amazon Neptune.

This is, for example, the JSON file containing 1) the output of the extractor service (see the previous article) and 2) other technical information about the NER model within the service:

This is the code we used to convert the JSON to RDF triples:

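In outline, the conversion looks like the following minimal sketch. The JSON field names and helper structures (for example lang_dict, mapping a language code to its label and Wikidata identifier) are assumptions made for illustration, not the exact production code, but the three steps above, prefix declaration, data extraction, and triple writing, are the same:

import json

# 1) Prefix declaration (Turtle): the vocabularies used in the triples.
#    The "example" and "recipe" namespaces are fictional placeholders (see below).
PREFIXES = """@prefix recipe: <http://example.org/recipe/> .
@prefix example: <http://example.org/> .
@prefix schema: <http://schema.org/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xs: <http://www.w3.org/2001/XMLSchema#> .
@prefix wd: <http://www.wikidata.org/entity/> .
"""

# Hypothetical mapping: language code -> (label, Wikidata entity in prefixed form).
lang_dict = {"en": ("English", "wd:Q1860")}

# 2) Data extraction: load the JSON produced by the extractor service.
with open("recipe.json") as f:
    data = json.load(f)

id_recipe = str(data["id"])
language = data["language"]
instructions = data["instructions"]
model_date = data["model_date"]
chunks = data["chunks"]  # each chunk: ingredient id and text, optional unit and value

# 3) Writing the RDF triples in Turtle by plain string concatenation
#    (literal escaping is omitted for brevity).
lines = [PREFIXES]
lines.append("recipe:" + id_recipe + " dcterms:language " + lang_dict[language][1] + " .")
lines.append("recipe:" + id_recipe + ' recipe:recipeInstructions "' + instructions + '" .')
lines.append("recipe:" + id_recipe + ' schema:dateModified "' + str(model_date) + '" .')

for i, chunk in enumerate(chunks):
    bnode_name = "chunk" + str(i)  # blank node acting as a placeholder for the chunk
    lines.append("recipe:" + id_recipe + " schema:material _:" + bnode_name + " .")
    lines.append("_:" + bnode_name + " schema:recipeIngredient example:" + str(chunk["ingredient_id"]) + " .")
    if "unit" in chunk:   # chunk without quantifier, or complete chunk
        lines.append("_:" + bnode_name + ' schema:materialExtent "' + str(chunk["unit"]) + '" .')
    if "value" in chunk:  # chunk without unit, or complete chunk
        lines.append("_:" + bnode_name + ' rdf:value "' + str(chunk["value"]) + '" .')

with open("recipes.ttl", "w") as f:
    f.write("\n".join(lines) + "\n")

The resulting Turtle file is what gets loaded into Amazon Neptune in the next step.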

As said above, we first declared the prefixes. For example, the prefix “schema” refers to schema.org, a collection of vocabularies used to describe entities, relations, and actions belonging to various contexts. In our case, we chose the schema “Recipe”, which contains many properties concerning the recipe context, some of which are used in the triples above (recipeIngredient, material…).

The prefix “example” stands for a fictional resource. This will be substituted with a real web page in the final version of the graph.

After declaring the prefixes, we extracted the data from the JSON file. In this piece of the graph, we have URI, literal, and blank nodes. The latter (the bnode_name variable in the script) functions as a placeholder for the food entity chunk (composed of the entities extracted via NER, see the previous article). It also helps keep the relationship between the ingredient and the other two entities (units and quantifiers). We wrote RDF triples for the three possible chunk cases: i) the chunk without a unit, ii) the chunk without a quantifier, and iii) the complete chunk (composed of ingredient, unit, and quantifier).

Into FoodGraph

At the moment, FoodGraph is a three-level-deep graph. In this last part, we provide a graphic visualization of the graph and an explanation of its connections. For clarity, the properties in the figures are shown in their extended form rather than called via prefix.

Level 1 connections:

Recipe ID (URI node) → Recipe language code (URI node) → Recipe language (Literal node)

Recipe ID (URI node) → Recipe content (Literal node)

The RDF triples relative to this piece of the graph are:

  • "recipe:" + id_recipe + " dcterms:language " + lang_dict[language][1].
    The recipe has language code Q1860 on Wikidata (English).
  • lang_dict[language][1] + " xs:string " + lang_dict[language][0].
    Q1860 in string format is "English".
  • "recipe:"+recipe["id"]+" recipe:recipeInstructions "+instructions.
    The recipe has this text in string format.

Level 2 connections:

Recipe ID (URI node) → Extractor model date (Literal node)
Recipe ID (URI node) → Food entity chunk (blank node)
Food entity chunk (blank node) → Ingredient ID (URI node), quantifier and unit (Literal nodes)

The RDF triples relative to this piece of the graph are:

  • "recipe:"+id_recipe+" schema:dateModified "+str(model_date).
    The date and version of the NER model that extracted the entities.
  • "recipe:"+id_recipe+" schema:material _:"+bnode_name.
    The recipe is made of the following chunks (the blank node is a placeholder for the chunk composed of an ingredient and possibly a unit and a quantifier).
  • "_:"+bnode_name+" schema:materialExtent "+str(chunk['unit']).
    The chunk has a unit in string format.
  • "_:"+bnode_name+" rdf:value "+str(chunk['value']).
    The chunk has a quantifier in string format.
  • "_:"+bnode_name+" schema:recipeIngredient example:"+ingr_id.
    The chunk has the ingredient (ingredient_id, numerical).
  • "example:"+ingr_id+" xs:string "+ingr.rstrip().
    The recipe ingredient in string format (alphabetic).

Level 3 connections:

Parent taxonomic categories (URI node) → child taxonomic categories (URI node)

The RDF triples relative to this piece of the graph are:

  • "example:"+child_id+" rdfs:subClassOf example:"+parent_id.
    Class x is a subclass of class y.

Ingredient ID (URI node) → Taxonomic class (URI node)

The RDF triples relative to this piece of the graph are:

  • "example:"+ing_id+" rdf:type example:"+tag.
    The ingredient (ingredient_id) is an instance of class x.
  • "example:"+ing_id+" datetime:dateModified "+str(model_date).
    The date and version of the classifier model that classified the ingredients.

In the next article…

We will show you how we loaded the RDF triples into Amazon Neptune, how we introduced new data into the graph via queries, and how we extracted knowledge from FoodGraph.
