Modelling Data with a Hypergraph Database

A closer look at the TypeDB hypergraph data model

Szymon Klarman
Vaticle
9 min readApr 19, 2017

--

M. C. Escher, Day and Night

In our previous post “Knowledge Graph Representation: TypeDB or OWL?”, we explained why TypeDB implements its own knowledge representation formalism, rather than employing the popular W3C standard Web Ontology Language (OWL), underpinning the Semantic Web stack.

In this article, we move one level deeper to take a closer look at our proposed hypergraph data model that provides the formal foundation for TypeDB knowledge graphs. The hypergraph data model, in this sense, plays the same role in TypeDB as Codd’s relational model in SQL databases and directed graphs — via the RDF layer — on the Semantic Web, or in popular graph databases, built around the concept of property graphs. In fact, we believe that hypergraph data model, in the form of a hypergraph database, constitutes the missing link between these two paradigms, offering a novel, promising alternative in the data management space.

In this post we assume a basic knowledge of data modelling and formal foundations of databases, although key terms are explained. You’ll find this article useful if you are considering using TypeDB (as the hypergraph database) to model complex, highly interconnected data, and want to understand the key differentiators between the data model underlying TypeDB and those employed in relational and RDF/property graph stores.

Hypergraph data model

Hypergraphs generalise the common notion of graphs by relaxing the definition of edges. An edge in a graph is simply a pair of vertices. Instead, a hyperedge in a hypergraph is a set of vertices. Such sets of vertices can be further structured, following some additional restrictions involved in different possible definitions of hypergraphs.

The hypergraph data model that we have developed and proposed as the formal foundation of TypeDB, is based on a specific notion of hypergraphs, the structure of which can be derived from three basic premises:

  1. a hypergraph consists of a non-empty set of vertices and a set of hyperedges;
  2. a hyperedge is a finite set of vertices (distinguishable by specific roles they play in that hyperedge);
  3. a hyperedge is also a vertex itself and can be connected by other hyperedges.

For instance, the figure below depicts one such hypergraph consisting of four vertices (bob, alice, m-1, df-1), two of which happen to be also hyperedges:

  • m-1, describing a binary marriage relationship between bob and alice playing the roles of husband and wife, respectively;
  • df-1, describing a ternary divorce filing relationship involving three role-players in the roles of certified marriage, petitioner and respondent.

In other words, a hyperedge can be simply seen as a collection of role-role-player pairs of arbitrary cardinality. The role-players in such hyperedges can be simple vertices, such as bob or alice, but also other hyperedges, as is the case with m-1 playing the role of certified marriage in df-1. In terms of data modelling, hypergraphs of this structure offer a very natural and straightforward data representation formalism, closely aligned with popular conceptual modelling frameworks such as entity-relationship diagrams. The basic modelling proviso in the hypergraph data model is this: entities are simple vertices, while relationships of any arity are modelled uniformly as hyperedges.

While several alternative generalisations of hypergraphs are known and have been studied, the definition adopted here has been recognised and received particular attention in the context of knowledge representation [1], general AI [2], and graph- and object-oriented databases [3,4]. The appeal of this formulation stems predominantly from the way it naturally generalises over a number of information modelling structures ubiquitous in data engineering and AI, including: first-order relations, graphs, nesting of information patterns, meta-modelling, and others. This characteristic turns out to be invaluable also in the context of representing and reasoning over knowledge graphs.

Hypergraphs vs. relations vs. directed graphs

What then is the core value of using hypergraphs (and hypergraph databases) over other prominent data representation structures, such as the relational model (underlying SQL databases) or variants of the directed graph model (underlying graph databases and RDF triple stores)? Interestingly, the hypergraph data model fits naturally within this landscape. While being essentially equally expressive as those, it endorses a somewhat differently balanced data modelling approach, and thus offers a largely missing link between these existing alternatives. In the following paragraphs, we explain these key differentiating aspects and the benefits of modelling data using the hypergraph data model.

Relations

Let us consider a typical representation of the same data set as used in the previous example in SQL tables (or formally speaking relations).

The correspondence to the hypergraph above is evident. Simple vertices in the hypergraph (bob, alice) correspond to the records of the “entity” tables (Man, Woman), while the hyperedges (m-1, df-1) directly reflect the contents of each record from the “relationship” tables (Marriage, Divorce_filing). Thus, the structuring of information characteristic and critical to the SQL-style modelling, where relevant pieces of data are grouped together in records, is faithfully preserved in hypergraphs. In general, the hypergraph data model can seamlessly accommodate any relational information.

Yet, the connected nature of this information (i.e., the fact that the presence of foreign keys in the records indicates the existence of actual relationships between entities) is made much more prominent in the hypergraph representation. This is a critical difference not only on the conceptual level, when just exploring the data, but also in a deeper, technical sense, when expressing and executing complex traversal queries, which are well-known bottlenecks of SQL databases.

Furthermore, the strictness of the SQL schema makes it very hard to cater for irregular or incomplete data instances, e.g., an odd instance of polygamous marriage showing up in the data set, or an instance with missing information about one of the spouses. Similarly, it is cumbersome to revise the structuring of data once the SQL database has been defined and populated. As has been often argued by graph data advocates, this shortcoming is naturally alleviated in much more flexible graph-based models, in which aspect hypergraphs are no different. All three hyperedges (husband:bob, wife:alice), (husband:mark) and (husband:jacob, wife:gloria, wife:gertrude) can naturally coexist as instances of the marriage relation in the same hypergraph.

Finally, the native graph-oriented structure of hypergraphs make them also an obvious ground for applying advanced graph computing techniques, such as shortest path finding or network analysis, which we will explore in depth in future posts.

Directed Graphs

Is, then, a hypergraph anything more than just a graph? Arguably, the hypergraph depicted in the beginning of this post could be seen as a plain labelled directed graph. That’s actually a very desirable characteristic. Every hypergraph can simply be mapped to the corresponding directed graph. For instance, in the RDF model, we could represent our marriage example as the following RDF graph, where entity and relation types are also explicitly encoded as RDF resources, in the typical RDF style.

Virtually the same mapping could be applied to achieve a direct reduction of hypergraphs to the property graph model. Because of this close relationship to directed graphs, the hypergraph data model can be naturally implemented over any graph-based data storage, such as increasingly popular graph databases or RDF triple stores. In fact, TypeDB (the hypergraph database) is built on top of our own graph engine.

The central differentiator between hypergraphs and directed graphs, however, is the introduction of hyperedges as first-class modelling constructs in the hypergraph data model. Firstly, hyperedges have a significant impact on the data modelling practice and possibilities, when compared to directed graph models. In principle, it is possible one could arrive immediately at the RDF graph depicted above when modelling our example dataset, thus achieving the same clean and uniform representation without explicitly employing the hypergraph data model. In a real-life scenario, however, when the complete conceptual model is not fully foreseen at the outset of the process, and the data is injected to the graph gradually over time, the actual outcome may be considerably different. In fact, RDF practitioners should not be surprised to see, as the result of this exercise, the RDF graph below instead.

To start with, the binary marriage relation would typically be represented using a directed edge involving the married_to predicate, following the standard “good practice” recommended by popular RDF tutorials. Later on, however, once the married couple files for divorce, and the graph must account for this new fact, the RDF data modeller faces two problems:

  1. Modelling ternary divorce_filing relationship has to follow a different route than that employed in case of binary relations; instead of asserting a link between two entities, one needs to use a dedicated ontology pattern for n-ary relations, which would typically mimic the hyperedge structure: one resource representing the relation object connected with a set of outgoing links towards respective role-players.
  2. There’s no simple way to connect the original binary marriage relationship with other domain elements. In this respect, RDF offers a notorious triple reification mechanism, which allows for introducing a dedicated resource (of type rdf:Statement) representing the entire triple and linked to all its constitutive components via predicates rdf:subject, rdf:predicate and rdf:object. RDF reification has attracted a lot of criticism in the past, essentially by being a crude tool immediately damaging the transparency of the model and its usability on the query level.

The second benefit that hyperedges can bring to the data management table, as compared to the binary directed edges, is the potential improvements to query planning and query optimisation mechanisms. Arguably, the data grouped together in the same structural “containers”, precisely such as hyperedges or SQL records, is also often retrieved in similar groupings by users and applications. By acknowledging the structure of these collections in advance of querying, the information retrieval process can be more optimally planned and executed.

Summary

The hypergraph data model underpinning the knowledge representation system implemented in TypeDB (the hypergraph database), presents a novel alternative in the data modelling space, providing a viable middle-ground between the relational and directed graph-based models — i.e., “the best of both worlds”. The key points supporting this view, which we have touched upon in this post, can be summarised as follows:

Benefits over the relational model

  • native graph-oriented structure:

—relationships (connections) in data are first class-citizens;

— graph-oriented data modelling frameworks (e.g., ER diagrams) can be more easily applied and linked to the actual data model;

— graph-oriented computation techniques can be efficiently applied (e.g., path queries, network analysis);

  • the flexibility of data modelling is not impeded by restrictive relational schemas (e.g., multiple values of the same role can be involved in the same hyperedge, without an a priori modelling decision).

Benefits over the directed graph model

  • the natural mechanism of grouping relevant pieces of information in a relational style, which is to a large extent lost in directed graphs;
  • uniform handling of all n-ary relationships, as opposed to directed graphs where n-ary relations for n >2 require a radical change in the modelling approach (so-called n-ary relation patterns) compared to the case of n=2;
  • a natural way of expressing higher-order information (relations between relations, nesting of information, etc.), which in directed graphs requires dedicated modelling techniques (so-called reification).

TypeDB is a strongly typed database that enables you to query for explicitly stored data and implicitly derived information. Learn more at vaticle.com, and take a look at our documentation.

If you have any questions, we are always happy to help: a good way to ask is via our Discord channel. We also have a discussion forum. For news, sign up for our newsletter and — if you’d like to meet us in person — we run regular meetups.

With thanks to my fellow contributors Nicholas D, Jo Stichbury, Haikal Pribadi, Borislav Iordanov and Precy Kwan for their input.

--

--

Szymon Klarman
Vaticle

knowledge representation and reasoning | knowledge graphs | linked open data | AI | logic | https://klarman.me