# Modelling Data with Hypergraphs

## A closer look at the Grakn Labs hypergraph data model

In our previous post “Knowledge Graph Representation: GRAKN or OWL?”, we explained why Grakn implements its own knowledge representation formalism, rather than employing the popular W3C standard Web Ontology Language (OWL), underpinning the Semantic Web stack.

In this article, we move one level deeper to take a closer look at our proposed *hypergraph data model* (HDM), which provides the formal foundation for Grakn knowledge graphs. In this sense, HDM plays the same role in Grakn as Codd’s relational model plays in SQL databases, and as directed graphs play on the Semantic Web (via the RDF layer) or in popular graph databases built around the concept of property graphs. In fact, we believe that HDM constitutes the missing link between these two paradigms, offering a novel, promising alternative in the data management space.

In this post we assume a basic knowledge of data modelling and formal foundations of databases, although key terms are explained. You’ll find this article useful if you are considering using Grakn to model complex, highly interconnected data, and want to understand the key differentiators between the data model underlying Grakn and those employed in relational and RDF/property graph stores.

**Hypergraph data model**

*Hypergraphs* generalise the common notion of graphs by relaxing the definition of edges. An edge in a graph is simply a *pair* of vertices. Instead, a *hyperedge* in a hypergraph is a *set* of vertices. Such sets of vertices can be further structured, following some additional restrictions involved in different possible definitions of hypergraphs.

The *hypergraph data model* (HDM) that we have developed and proposed as the formal foundation of Grakn is based on a specific notion of hypergraphs, the structure of which can be derived from three basic premises:

- a hypergraph consists of a non-empty set of vertices and a set of hyperedges;
- a hyperedge is a finite set of vertices (distinguishable by specific roles they play in that hyperedge);
- a hyperedge is also a vertex itself and can be connected by other hyperedges.

For instance, the figure below depicts one such hypergraph consisting of four vertices (*bob*, *alice*, *m-1*, *df-1*), two of which happen to be also hyperedges:

- *m-1*, describing a binary *marriage* relationship between *bob* and *alice*, playing the roles of *husband* and *wife*, respectively;
- *df-1*, describing a ternary divorce filing relationship involving three role-players in the roles of *certified marriage*, *petitioner* and *respondent*.

In other words, a hyperedge can be simply seen as a collection of role-role-player pairs of arbitrary cardinality. The role-players in such hyperedges can be simple vertices, such as *bob* or *alice*, but also other hyperedges, as is the case with *m-1* playing the role of *certified marriage* in *df-1*. In terms of data modelling, hypergraphs of this structure offer a very natural and straightforward data representation formalism, closely aligned with popular conceptual modelling frameworks such as entity-relationship diagrams. The basic modelling proviso in HDM is this: entities are simple vertices, while relationships of any arity are modelled uniformly as hyperedges.
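The three premises and the example can be sketched in a few lines of Python. This is only an illustrative model, not Grakn's implementation: the class names `Vertex` and `Hyperedge` and the helper methods are hypothetical, and the assignment of *alice* as petitioner and *bob* as respondent is an assumption made purely to complete the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Vertex:
    """A simple vertex, e.g. an entity such as bob or alice."""
    label: str


@dataclass(eq=False)
class Hyperedge(Vertex):
    """A hyperedge is itself a vertex, holding (role, role-player) pairs."""
    # A list of pairs rather than a dict, so a role may occur more than once.
    role_players: list = field(default_factory=list)

    def add(self, role, player):
        """Attach a role-player under the given role and return self for chaining."""
        self.role_players.append((role, player))
        return self

    def players(self, role):
        """Return all role-players that play the given role in this hyperedge."""
        return [p for r, p in self.role_players if r == role]


bob, alice = Vertex("bob"), Vertex("alice")

# Binary marriage relationship.
m1 = Hyperedge("m-1").add("husband", bob).add("wife", alice)

# Ternary divorce filing; m-1 (a hyperedge) is one of the role-players.
# Assumption: who petitions and who responds is chosen arbitrarily here.
df1 = (Hyperedge("df-1")
       .add("certified marriage", m1)
       .add("petitioner", alice)
       .add("respondent", bob))

vertices = [bob, alice, m1, df1]
hyperedges = [v for v in vertices if isinstance(v, Hyperedge)]
print([e.label for e in hyperedges])
```

Making `Hyperedge` a subclass of `Vertex` is one simple way to express the third premise: anything that accepts a vertex as a role-player accepts a hyperedge too.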

While several alternative generalisations of hypergraphs are known and have been studied, the definition adopted here has been recognised and received particular attention in the context of knowledge representation *[1]*, general AI *[2]*, and graph- and object-oriented databases *[3,4]*. The appeal of this formulation stems predominantly from the way it naturally generalises over a number of information modelling structures ubiquitous in data engineering and AI, including: first-order relations, graphs, nesting of information patterns, meta-modelling, and others. This characteristic turns out to be invaluable also in the context of representing and reasoning over knowledge graphs.

**Hypergraphs vs. relations vs. directed graphs**

What, then, is the core value of using hypergraphs over other prominent data representation structures, such as the *relational model* (underlying SQL databases) or variants of the *directed graph model* (underlying graph databases and RDF triple stores)? Interestingly, HDM fits naturally within this landscape. While it is essentially as expressive as those models, it endorses a somewhat differently balanced data modelling approach, and thus offers a largely missing link between these existing alternatives. In the following paragraphs, we explain these key differentiating aspects and the benefits of modelling data using HDM.

**Relations**

Let us consider a typical representation of the same data set as used in the previous example in SQL tables (or formally speaking *relations*).

The correspondence to the hypergraph above is evident. Simple vertices in the hypergraph (*bob*, *alice*) correspond to the records of the “entity” tables (*Man*, *Woman*), while the hyperedges (*m-1*, *df-1*) directly reflect the contents of each record from the “relationship” tables (*Marriage*, *Divorce_filing*). Thus, the structuring of information that is characteristic of, and critical to, SQL-style modelling, where relevant pieces of data are grouped together in records, is faithfully preserved in hypergraphs. In general, the hypergraph data model can seamlessly accommodate any relational information.
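To make this correspondence concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Only the table names come from the example; the column names, the key values and the helper `rows_as_hyperedges` are assumptions made for illustration, as is the choice of petitioner and respondent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name

# Assumed schema: column names double as role names.
conn.executescript("""
    CREATE TABLE Man   (id TEXT PRIMARY KEY);
    CREATE TABLE Woman (id TEXT PRIMARY KEY);
    CREATE TABLE Marriage (
        id      TEXT PRIMARY KEY,
        husband TEXT REFERENCES Man(id),
        wife    TEXT REFERENCES Woman(id)
    );
    CREATE TABLE Divorce_filing (
        id                 TEXT PRIMARY KEY,
        certified_marriage TEXT REFERENCES Marriage(id),
        petitioner         TEXT,
        respondent         TEXT
    );
    INSERT INTO Man VALUES ('bob');
    INSERT INTO Woman VALUES ('alice');
    INSERT INTO Marriage VALUES ('m-1', 'bob', 'alice');
    INSERT INTO Divorce_filing VALUES ('df-1', 'm-1', 'alice', 'bob');
""")


def rows_as_hyperedges(table):
    """Read each record of a relationship table as {id: {role: role-player}}."""
    edges = {}
    for row in conn.execute(f"SELECT * FROM {table}"):
        edges[row["id"]] = {col: row[col] for col in row.keys() if col != "id"}
    return edges


hyperedges = {
    **rows_as_hyperedges("Marriage"),
    **rows_as_hyperedges("Divorce_filing"),
}
print(hyperedges)
```

Each record of a relationship table becomes one hyperedge, and each foreign key becomes a role-player, so the grouping of data in records carries over unchanged.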

Yet, the connected nature of this information (i.e., the fact that the presence of foreign keys in the records indicates the existence of actual relationships between entities) is made much more prominent in the hypergraph representation. This is a critical difference not only on the conceptual level, when just exploring the data, but also in a deeper, technical sense, when expressing and executing complex traversal queries, which are well-known bottlenecks of SQL databases.

Furthermore, the strictness of the SQL schema makes it very hard to cater for irregular or incomplete data instances, e.g., an odd instance of polygamous marriage showing up in the data set, or an instance with missing information about one of the spouses. Similarly, it is cumbersome to revise the structuring of data once the SQL database has been defined and populated. As has been often argued by graph data advocates, this shortcoming is naturally alleviated in much more flexible graph-based models, in which aspect hypergraphs are no different. All three hyperedges (*husband*:*bob*, *wife*:*alice*), (*husband*:*mark*) and (*husband*:*jacob*, *wife*:*gloria*, *wife*:*gertrude*) can naturally coexist as instances of the *marriage* relation in the same hypergraph.
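As a rough sketch of this flexibility, the three marriage instances can be written down as plain lists of role/role-player pairs. The identifiers `m-2` and `m-3` and the helper `players` are hypothetical names introduced only for this example.

```python
# Each hyperedge is a list of (role, role-player) pairs.
# No fixed set of columns is imposed on the instances.
marriages = {
    "m-1": [("husband", "bob"), ("wife", "alice")],
    "m-2": [("husband", "mark")],  # information about the spouse is missing
    "m-3": [("husband", "jacob"), ("wife", "gloria"), ("wife", "gertrude")],
}


def players(edge, role):
    """Return every role-player filling the given role in one hyperedge."""
    return [player for r, player in edge if r == role]


wives_per_marriage = {mid: players(edge, "wife") for mid, edge in marriages.items()}
print(wives_per_marriage)
```

A fixed two-column *Marriage* table would need a NULL for the second instance and a schema change (or an extra join table) for the third, whereas here all three are handled by the same code path.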

Finally, the native graph-oriented structure of hypergraphs also makes them an obvious basis for applying advanced graph computing techniques, such as shortest path finding or network analysis, which we will explore in depth in future posts.

**Directed graphs**

Is, then, a hypergraph anything more than just a graph? Arguably, the hypergraph depicted at the beginning of this post could be seen as a plain labelled directed graph. That’s actually a very desirable characteristic. Every hypergraph can simply be mapped to the corresponding directed graph. For instance, in the RDF model, we could represent our marriage example as the following RDF graph, where entity and relation types are also explicitly encoded as RDF resources, in the typical RDF style.
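One way such a mapping could look is sketched below, with triples represented as plain Python tuples. The function `to_triples`, the type names (`man`, `woman`, `marriage`, `divorce_filing`) and the un-prefixed role predicates are illustrative assumptions; only `rdf:type` is standard RDF vocabulary, and the petitioner and respondent are again chosen arbitrarily.

```python
def to_triples(entities, hyperedges):
    """Map a hypergraph to labelled directed edges (subject, predicate, object)."""
    # Every entity gets a type edge.
    triples = [(name, "rdf:type", type_) for name, type_ in entities.items()]
    for name, (type_, role_players) in hyperedges.items():
        # The hyperedge becomes a resource of its own, with a type edge...
        triples.append((name, "rdf:type", type_))
        # ...and one outgoing edge per role, pointing at the role-player.
        for role, player in role_players:
            triples.append((name, role, player))
    return triples


entities = {"bob": "man", "alice": "woman"}
hyperedges = {
    "m-1": ("marriage", [("husband", "bob"), ("wife", "alice")]),
    "df-1": ("divorce_filing", [
        ("certified_marriage", "m-1"),
        ("petitioner", "alice"),
        ("respondent", "bob"),
    ]),
}

triples = to_triples(entities, hyperedges)
for triple in triples:
    print(triple)
```

Because *m-1* is already a resource, pointing the *certified_marriage* edge at it needs no special treatment.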

Virtually the same mapping could be applied to achieve a direct reduction of hypergraphs to the property graph model. Because of this close relationship to directed graphs, HDM can be naturally implemented over any graph-based data storage, such as increasingly popular graph databases or RDF triple stores. In fact, Grakn is mounted on top of TinkerPop — an open-source interoperability layer that exposes a uniform graph data model (property graph) over any TinkerPop-compliant data management system. This allows us to greatly reduce the cost of developing a robust and mature system and grants a good level of vendor-agnosticity in terms of the choice of the underlying storage platform.

The central differentiator between hypergraphs and directed graphs, however, is the introduction of hyperedges as first-class modelling constructs in HDM. Firstly, hyperedges have a significant impact on data modelling practice and possibilities, compared to directed graph models. In principle, one could arrive immediately at the RDF graph depicted above when modelling our example dataset, thus achieving the same clean and uniform representation without explicitly employing HDM. In a real-life scenario, however, when the complete conceptual model is not fully foreseen at the outset, and the data is added to the graph gradually over time, the actual outcome may be considerably different. In fact, RDF practitioners should not be surprised to see the RDF graph below as the result of this exercise.

To start with, the binary *marriage* relation would typically be represented using a directed edge involving the *married_to* predicate, following the standard “good practice” recommended by popular RDF tutorials. Later on, however, once the married couple files for divorce, and the graph must account for this new fact, the RDF data modeller faces two problems:

- Modelling the ternary *divorce_filing* relationship has to follow a different route than that employed in the case of binary relations; instead of asserting a link between two entities, one needs to use a dedicated ontology pattern for n-ary relations, which would typically mimic the hyperedge structure: one resource representing the relation object, connected by a set of outgoing links to the respective role-players.
- There’s no simple way to connect the original binary marriage relationship with other domain elements. In this respect, RDF offers a notorious triple reification mechanism, which allows for introducing a dedicated resource (of type *rdf:Statement*) representing the entire triple and linked to all its constitutive components via the predicates *rdf:subject*, *rdf:predicate* and *rdf:object*. RDF reification has attracted a lot of criticism in the past, essentially for being a crude tool that immediately damages the transparency of the model and its usability on the query level.
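The two workarounds can be sketched with triples as plain Python tuples. The vocabulary `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` is standard RDF; the helper `reify`, the identifier `stmt-1`, the role predicates and the choice of petitioner and respondent are assumptions made for illustration.

```python
def reify(subject, predicate, obj, statement_id):
    """Return the four triples that describe one triple as an rdf:Statement."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]


# Step 1: the marriage is modelled as a single directed edge.
graph = [("bob", "married_to", "alice")]

# Step 2: to refer to the marriage itself, the edge has to be reified.
graph += reify("bob", "married_to", "alice", "stmt-1")

# Step 3: the ternary divorce filing uses an n-ary relation pattern,
# with the reified statement standing in for the marriage.
graph += [
    ("df-1", "rdf:type", "divorce_filing"),
    ("df-1", "certified_marriage", "stmt-1"),
    ("df-1", "petitioner", "alice"),
    ("df-1", "respondent", "bob"),
]

print(len(graph))
```

The same graph now mixes three modelling styles: a plain edge, a reified statement and an n-ary pattern. Any query touching the marriage has to know which of them applies.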

The second benefit that hyperedges can bring to the data management table, as compared to binary directed edges, is the potential improvement of query planning and query optimisation mechanisms. Arguably, data grouped together in the same structural “containers”, such as hyperedges or SQL records, is also often retrieved in similar groupings by users and applications. By acknowledging the structure of these collections in advance of querying, the information retrieval process can be planned and executed more efficiently.

**Summary**

The hypergraph data model underpinning the knowledge representation system implemented in Grakn presents a novel alternative in the data modelling space, providing a viable middle ground between the relational and directed graph-based models, i.e., “the best of both worlds”. The key points supporting this view, which we have touched upon in this post, can be summarised as follows:

**Benefits over the relational model**

- native graph-oriented structure:
  - relationships (connections) in data are first-class citizens;
  - graph-oriented data modelling frameworks (e.g., ER diagrams) can be more easily applied and linked to the actual data model;
  - graph-oriented computation techniques can be efficiently applied (e.g., path queries, network analysis);

- the flexibility of data modelling is not impeded by restrictive relational schemas (e.g., multiple values of the same role can be involved in the same hyperedge, without an a priori modelling decision).

**Benefits over the directed graph model**

- the natural mechanism of grouping relevant pieces of information in a relational style, which is to a large extent lost in directed graphs;
- uniform handling of all n-ary relationships, as opposed to directed graphs where n-ary relations for n > 2 require a radical change in the modelling approach (so-called *n-ary relation patterns*) compared to the case of n = 2;
- a natural way of expressing higher-order information (relations between relations, nesting of information, etc.), which in directed graphs requires dedicated modelling techniques (so-called *reification*).

Grakn is a distributed knowledge base with a reasoning query language that enables you to query for explicitly stored data and implicitly derived information. Learn more at grakn.ai, and take a look at our documentation.

If you have any questions, we are always happy to help: a good way to ask is via our Slack channel. We also have a discussion forum. For news, sign up for our community newsletter and — if you’d like to meet us in person — we run regular meetups.

*With thanks to my fellow contributors **Nicholas D**, **Jo Stichbury**, **Haikal Pribadi**, **Borislav Iordanov** and **Precy Kwan** for their input.*

[1] Harold Boley. *Directed recursive labelnode hypergraphs: A new representation-language*. Artificial Intelligence, 9(1), 1977.

[2] Ben Goertzel. *Patterns, hypergraphs & embodied general intelligence*. In IEEE World Congress on Computational Intelligence, 2006.

[3] Borislav Iordanov. *HyperGraphDB: A generalized graph database*. In Proceedings of the Web-Age Information Management Workshop (WAIM2010), 2010.

[4] Mark Levene and Alexandra Poulovassilis. *An object-oriented data model formalised through hypergraphs*. Data & Knowledge Engineering, 6(3), 1991.