Graphs in data modeling — is the emperor naked?

This text is about graphs in data modeling, possibilities for their implementation, about different data stores using different data models and query languages. The fundamental question I would like to answer is:

Are graphs and graph databases useful in data modeling, and if so, for what and under which circumstances?

The purpose of this document is to sort out some things in my brain. If others like the ideas, find them enlightening or disgusting, or do not care, then I do not really care myself.

Theoretical approach

Mathematically, a graph (directed, unlabelled, without multiple edges) is nothing but a relation. It consists of a set V of vertices and a subset E (the edges) of the Cartesian product V x V. There is an edge from v to w, if and only if the pair (v,w) is contained in E. Similarly, a bipartite graph is just a subset of a Cartesian product A x B for two disjoint sets A and B.

It is only when we go to labelled graphs (in which every edge carries a label) or multiple edges that we get a richer structure. Note that an undirected graph can just be seen as a symmetric directed one.

I find these observations interesting, because for databases, the relational model is nowadays often viewed as old-fashioned and graph databases are very fashionable indeed and many seem to believe that they are the silver bullet of data stores. The same people tend to see graphs everywhere they look and consequently try to store all data in pure graph databases.

Thus I found the insight that an unlabelled graph without multiple edges is nothing but a relation remarkable, in particular since relational databases should be particularly good at storing relations…

How to store a graph

Storing a graph in a relational database

In a relational database, we would probably store the vertices of a graph in one table and the edges in a second one. Each edge would have a foreign key for its starting vertex and one for its ending vertex. Note that for this model it is not a problem at all to keep a label along with each edge and thus loops and multiple edges as well as labelled edges are not a problem at all.

In the case of a bipartite graph, we can simply use two tables A and B for the two vertex sets, and the edge table simply contains one foreign key for A and one for B. Note that this data model is also known as “link table” or “junction table”, which is the standard solution for an m:n relation.

The fundamental query operation on a graph is to find all neighbours of a vertex. This operation can be performed easily in the above setup, but it involves a join between the vertex table with itself, using the link table (the edges). Thus, finding the neighbours of a vertex will usually involve at least some index lookup. However, when implemented correctly, this can be done with complexity O(k) where k is the number of neighbouring vertices, which cannot be beaten for obvious reasons.

We remember for the record: Storing a graph in a relational database is simple and the fundamental operation of finding all neighbours of a vertex can be implemented in perfect complexity!

Storing a graph in a pure document store

Let us now consider a “pure document store”, by which I mean a data store that keeps collections of arbitrary JSON documents. Traditionally such data stores scale very well horizontally, but do not offer joins for querying, following the rationale “joins do not scale”.

How would we store a graph in such a database? We could, for example, simply store the vertices as documents in a collection and store the keys of all neighbours of a vertex as one list attribute of the vertex document. The fact that every document in the collection can have a different layout helps us here, and at first glance this looks rather appealing, since together with the retrieval of a vertex document, we would get the list of the keys of its neighbours at the same time. This also gives us perfect complexity for the most fundamental operation for a graph.

However, the whole setup is a bit unwieldy. For example, we can no longer retrieve a vertex document without its neighbour information. Retrieving a list of the vertex documents of the neighbours involves fetching the keys of all their neighbours as well. Furthermore, adding or removing an edge in a large graph becomes a performance problem, because we would have to write a new revision of at least the starting vertex if not both end points. Finally, vertices with many connections become extremely expensive to even look at.

Note that storing the edges in a separate collection is not really an option, simply because then we would need a join to find all neighbouring vertices, which assumed to be impossible.

We remember for the record: Storing a graph in a document store without joins in the query language is not really efficient.

Storing a graph in a graph database

A graph database is made for this, so how does it work here?

In mathematics, a graph is often described by its adjacency matrix, which has rows and columns indexed by the vertices and which keeps in every cell the number of edges from the row vertex to the column vertex. This is nice theoretically but totally infeasible as a graph database implementation, because it would need too much memory (n^2 cells for a graph with n vertices) and even worse it would not allow fast access to all the neighbours of a vertex (complexity O(n), because a full row or column would have to be scanned).

So, any graph database will probably store the vertices and their data in some sort of table or collection and then store the edge or adjacency information in some more compact form.

Interestingly, the Wikipedia definition of a graph database talks about “index free adjacency”, which means that “each vertex is directly linked to its neighbouring vertices”. By this I can only understand that a graph database keeps together with each vertex some information that allows to find all the neighbouring vertices without a lookup step. This definition does not prescribe how that information is to be kept.

I do not like this at all, since in my opinion the only thing that matters is the performance of finding the neighbouring vertices — and that of other operations that change the graph. If we would for example keep with each vertex an array of direct pointers to all neighbouring vertices, then deleting an edge has larger than constant complexity. Even worse, deleting a vertex together with all its incoming edges would be a nightmare! On the other hand, if we use some kind of indexing structure with constant time lookups like a hash table, then a constant time lookup in comparison to following a direct link is negligible in the grand scheme of things.

In any case, without actually finishing the discussion as to how a graph database stores a graph, we remember that a graph database is able to find all k neighbours of a vertex with time complexity O(k). I furthermore assume that operations like adding or removing an edge are constant time operations and removing a vertex together with all incident edges has complexity O(k) where k is the number of outgoing and incoming edges. I think these complexity guarantees would be a better definition for a graph database, probably together with some assumptions about querying capabilities (see below).

Interestingly, all these complexity guarantees can be given for an implementation in a relational database as described above, using the right set of indexes!

Storing a graph in a document store with joins

We briefly note that in a document store that does offer efficient joins in the query language one can actually use a vertex collection and an edge collection and achieve all of the above complexity guarantees, and can additionally store arbitrary labelling information for both vertices and edges along with their corresponding JSON documents. To get the O(k) neighbour lookup one uses a special edge index that is a hash table tolerating repeated keys and keeping elements with equal keys together in a linked list. The joins are simply necessary to combine the edge documents with their corresponding vertices. Whoever is interested in details can contact me personally.

Querying graphs

Storing (and retrieving) a graph is one thing, but the actual problems only begin when we want to query information about a graph.

Finding the neighbours of a vertex is one crucial question one might have about a graph (or relation, which is the same thing). However, when we deal with graphs (or relations) in practice, we usually have a lot more questions, here we just mention a few that come to mind:

1. Find all neighbours of a vertex only using edges with a given property or label.
2. Find all neighbours of a vertex with a given property or label.
3. Find all paths with a fixed length L in the graph starting at some given vertex.
4. Find the shortest (or lightest when working with weights) path from vertex V to vertex W.
5. Find the distances between any two vertices in the graph.
6. Perform a depth first search for some vertex starting at a given vertex.
7. Perform a breadth first search for some vertex starting at a given vertex.
8. Find a minimal spanning tree for the graph.
9. Perform any map-reduce like computation as is possible in the Pregel framework by Google, for example “Pagerank” or “Find connected components”.
10. Solve the traveling salesman problem in the graph.

There is a lot more, but these shall suffice for a short discussion. Keeping the above ways to store a graph in mind, there seems to be a striking difference between problems 1. to 3. and the rest. For example, looking at the above model for a relational database, 1. to 3. can relatively easily be done with a simple SQL statement, since they only involve one or at most a fixed number of joins. SQL is good at joining, and when the path length grows and the number of paths explodes, any method will probably grind to a halt, simply because the result is too large. So, in a sense, they are still “good”.

However, problems 4. to 10. are very hard if not impossible to do with a single or a few simple SQL statements. The reason for this is that they involve an a priori unknown number of joins, since they involve paths of varying length. SQL does not like this, it was not invented to do this, and it is even particularly bad at this.

Let us call queries of this flavour “graphy”, without even trying to come up with a concrete definition of this word. For all the problems 4. to 10. (maybe with the notable exception of 10.), there are relatively good algorithms available, and quite often the resulting data is not too large. These algorithms can generally perform well, when the fundamental problem of finding the neighbours of a vertex can be solved quickly. However, most of these algorithms involve a lot of such neighbour accesses. Therefore it is not sensible to run these algorithms outside of the data store and send queries whenever the neighbours of a vertex are needed.

This leads us to a further hard requirement for a graph database: It must be able to perform graphy queries and questions in the database server efficiently, and one must be able to issue these queries conveniently in the native query language.

We remember for the record: What really makes a graph database is not a nonsensical buzzword like index-free adjacency but hard complexity guarantees for the fundamental operations paired with graphy querying capabilities.

Scalability

I want to keep this section brief, however, what I find striking, is that both joins and graphy queries are generally considered to be hard to scale well. The above arguments show that this is no coincidence, rather, for most sensible ways to store a graph, the two are essentially the same problem and therefore also pose the same challenges with respect to scalability.

Thinking beyond graphs in data modeling

My conclusion at this stage is that despite the fact that graphs are just fancy relations, the emperor “graphs in data modeling” is not actually naked. Rather, graphs as a concept for data modeling are indeed useful and one would like to have data stores that can not only efficiently store them but also conveniently query them with graphy questions. A fundamental tool for this are joins.

However, even when equipped with a graph database, one should not believe that all data is graphs. Even with the best hammer in your hand you should not hit screws with it.

Rather, what is desirable in data modeling is to have a large arsenal of data models available and backed by a data store that not only allows to use these models for storage, but is also capable of querying all these models with its native query language. This becomes really interesting when one can mix the different models in a single query, which is why multi-model databases that can perform joins are seeing a surge of interest currently.

Conclusions

As I wrote in the beginning, this document is written to sort my personal thoughts. Therefore, its conclusions are also my personal conclusions, which can obviously be controversial for others.

1. The relational data model is OK for graphs, but SQL as query language seriously limits capabilities for “graphy” queries.
2. A pure document store without joins is too limited to sensibly work with graphs.
3. The Wikipedia definition of a graph database is not very helpful, concentrates on the wrong features (index-free adjacency) and misses important points (graphy query capabilities).
4. A pure graph database is good if your data is graph (not all is!) and you need only “graphy” queries.
5. The real quantum leap in data modeling flexibility is only achieved with a database that allows to mix SQL like index-supported queries with joins and graphy queries, in the same query language and database engine.
6. Multi-model databases with documents, graphs and key/value pairs and a query language that supports index-queries, joins and graphy aqueries re the future.

Written by