Surveying the Landscape of Graph Data Management Systems
I wrote this as a companion post to a short talk I gave at a panel recently, where I discussed how to make sense of the graph data management and analytics space, and what options and systems are available today to someone interested in exploring graphs. I am not aware of a high-level overview of graph databases and graph analytics engines, that places them in context, and discusses the applicability and the limitations of different systems. This post will hopefully serve that purpose to some extent.
This is not intended as a comprehensive survey and I likely have missed important updates to some of the systems in this fast-moving area; I would be happy to receive corrections.
This discussion is targeted at readers who already have some idea about why they may want to do graph analytics or graph querying. Briefly, a graph (synonymously, a network) allows one to model and reason about interconnections (edges) between entities or objects (nodes). For example, in a social network, the nodes typically represent users, groups, and/or posts, whereas the edges denote some association between those. A graph is called homogeneous if all the nodes are of the same type, and heterogeneous otherwise.
Modeling data as graphs enables one to reason about the entities and the interconnections through use of graph algorithms. The most famous example of this is PageRank, originally used by Google to find “importance” of a Web page based on the link structure. Since then, there has been a tremendous amount of work on extracting insights from graphs, often under the umbrella of graph analytics, or network science. See the excellent Wikipedia article on Network Science for more examples. The application domains where graph analytics are regularly applied include social media, finance, communication networks, biological networks, and many others.
There are also more basic reasons for considering a graph database. Graphs are often a more natural way to think about some types of data, given they require explicitly identifying entities and relationships (somewhat similarly to Entity-Relationship Models). The more important consideration should be the query workload though. It may be the case that the typical queries or tasks on the data require exploring the interconnections or traversals through the graph. Maps are a natural example of this, where most queries are about shortest paths between nodes (e.g., two addresses). Although transitive closure can handle this (in theory) on relational databases, specialized implementations are a must in practice. Semantic Web data is also more naturally represented as graphs, and the typical search queries require traversing the graph or exploring the structure around specific nodes. Semi-structured nature of the Semantic Web data also makes a graph data model more attractive here.
Graph Data Management is a Messy Topic
There is no clear definition of what graph data management means, which has (in my opinion) resulted in a very fragmented world, with a large number of systems and papers that are hard to reconcile. There are several native graph databases like Neo4j, Titan, OrientDB, and several others. There are quite a few RDF databases (often called “triplestores”), which often support graph querying and analytics. There is a lot of work on large-scale distributed graph analytics systems, with some examples being Pregel, Apache Giraph, GraphLab, GraphX, Flink Gelly, etc. (I co-presented a tutorial at SIGMOD 2016 on this sub-topic, and we are finishing up a survey which should come out soon).
Quite a few of these distributed systems, however, primarily focus on machine learning algorithms (e.g., GraphLab) and not on general-purpose graph analytics. Several big relational database vendors provide in-built support for limited graph analysis tasks, including Aster SQL-Gr, SAP HANA Graph, and Oracle Spatial and Graph. There is also a lot of work on specific types of graph analysis or mining tasks and specific graph queries, but most of those works don’t attempt to build general-purpose systems. There is also GraphQL, which is unfortunately named in my opinion, since it is not really a graph query language (see the end of the post for more on that).
How should you handle graphs?
Given that there is no one system that can handle the spectrum of reasons for which one may need to use graphs, what one should do really depends on answers to several questions.
- Are you starting from scratch, or need to work with an existing deployed DBMS? This is likely to be the case for many users, especially since graph analytics tend to come in play after pipelines for non-graph analytics have been established. A complete shift is rarely an option in such cases.
- What fraction of the overall workload is expected to be graph-oriented? Do you need to do updates through graph interfaces? How frequently do you need to do graph analytics? Once-in-a-while or continuously? Some users may only want to run batch graph analysis tasks once-in-a-while and may have established data paths for updates (in which case, you have many options), whereas others may need to run graph queries and analytics
continuously, on fresh data (a customized solution is probably the only option right now).
- What types of graph queries/analytics are needed? PageRank is a lot easier than Community Detection or most other analysis tasks described in the Network Science link above. Heterogeneous graphs are trickier to handle than homogeneous graphs.
- Do you see a need to write your own graph algorithms? Some of the libraries out there are quite extensive, but I have observed that graph processing tasks show minor-but-crucial variations in most applications. Although simple to state, such variations can often require new algorithms altogether. For example, shortest paths on weighted graphs is pretty straightforward, but you may need a shortest path with some constraints on the edge weights. This may be doable as pre- or post-processing step in some cases, but in others, you may need to implement a new algorithm. Direct access to the graph structure (that allows arbitrary operations) would be needed in such cases.
- How large would be the graph? Complex graph algorithms are hard to parallelize, which makes this a crucial question. Even if the original data is very large, it may be the case that the graph itself is sufficiently small and fits in the memory of a large machine. Ligra was able to handle graphs with 13 billion edges in less than 256GB of memory in 2013. McSherry et al., consider this issue more systematically and show that even single-threaded implementations can compete with large clusters because of the overheads of parallelization.
At a high level, there are three architectures that we commonly see for graph data management, as seen in the figure below. The first one is discussed in further detail below as Option 1, and the other two are discussed below as Option 2.
Option 1: Bolt-on Graph Analytics over Existing Databases
This is perhaps the appropriate solution for most people, especially if: (1) you need to work with an existing DBMS (e.g., a relational database), (2) if a large fraction of the overall workload is expected to be non-graph (including, importantly, updates), and (3) graph analytics only need to be done once-in-a-while and it is okay to do it on stale data.
Figure below shows what kind of solution one may consider in such cases. In essence, this involves setting up a pipeline to extract a serialized version of the graph from the underlying database, e.g., this can be done by writing SQL queries to generate the nodes and the edges, and then converting it into some standard graph format, like GML or just plain CSV files.
Downsides: Setting up such pipelines can be tedious and time-consuming; in many such cases, you may not even know what graphs are of interest (there are typically many graphs one may extract from a database). Our ongoing work on GraphGen is motivated by this observation, and aims to simplify graph analysis over existing database deployments. You would also need to use two different systems for this, although several relational database vendors support limited in-built graph analytics (e.g., Aster SQL-Gr, SAP HANA, Oracle) — e.g., Aster supports a library of functions (e.g., PageRank) and also allows writing new algorithms using “think-like-a-vertex” paradigm discussed below.
Now the question boils down to which specific system to use for the actual graph analysis itself. The figure below shows some of the options.
- NetworkX (Python Library): Highly functional, but single-machine and presumably not very efficient (see here for some comparisons — I am not familiar with the other tools mentioned there). You can write your own graph algorithms against the raw API provided by NetworkX.
- Ringo (Stanford): This is also a highly functional library, written in C++, with over 200 graph algorithms in it.
- Giraph, GraphX, GraphLab, Pregel: All of these are distributed, cluster-computing frameworks, that parallelize graph computation using some variant of “think-like-a-vertex” programming model. There is an immense amount of work on this style of system. The main issues with these systems is that, the “think-like-a-vertex” programming model is quite limited and sacrifices performance for scaling out. See McSherry et al., or our work on NScale, for a more detailed discussion of this. As the former paper notes, using this programming model often forces you into using sub-optimal algorithms (e.g., label propagation for connected components) since the better algorithms are often hard to fit into the think-like-a-vertex paradigm.
- Ligra, Galois, GreenMarl: These systems target a multi-core, shared-memory setting, and support more expressive programming models, that allow implementing complex graph algorithms easily. As far as I know, although implementations are available in some cases, none of these are widely used as yet.
- GraphChi, X-Stream: Both of these support out-of-core execution on a single machine, by swapping data in and out of memory. The programming model is still a variation on think-like-a-vertex.
Option 2: Graphs All the Way
At the other extreme, if you have the flexibility to migrate all your data to a new database system, and if most of the workload is graph-oriented, then it may be appropriate to use a database system that uses a graph data model and exposes a graph query language. The term “graph database” is perhaps best reserved for such systems. The main disadvantages of going with graph databases today are that: (1) you need to buy into those relatively-underdeveloped ecosystems, and (2) there are few options, none of which are well-established yet. Also, large-scale batch analytics may require use of one of the systems from above.
A graph database may be designed and built from scratch to support a graph data model (e.g., Neo4j), or it may be implemented as a layer on top of an existing data management system (e.g., Titan, SQLGraph, Vertexica). In theory, the former option should allow better performance; however, it is often the case that the second option is at least competitive, if not better, for many types of queries, primarily because it benefits from the research that has gone into developing general-purpose database systems.
Although usually associated with Semantic Web, the base RDF data model maps naturally to directed graphs, and thus RDF databases are a viable option for storing graph data. The standard query language for RDF is SPARQL, which is similar to SQL in many ways, and revolves around a subgraph pattern matching paradigm.
There are numerous implementations of RDF data model, which are often called “triplestores”. Some of those are built as a layer on top of a relational database system (e.g., SW-Store), whereas others use specialized indexes and query processing techniques more suited to RDF and SPARQL (e.g., AllegroGraph, RDF-3X, Blazegraph (which uses GPUs)). This research paper from a few years ago is a good summary of the different ways to implement RDF databases, although the discussion there may be outdated by now.
Property Graph Data Model
RDF is somewhat impoverished as a graph data model since the nodes and the edges can only have one label each, and one cannot associate “properties” with nodes, which is natural in most graphs or networks. Any property to be associated with a node or an edge requires creation of another node, which makes the graphs unnecessarily large to explore and query. The property graph data model allows nodes and edges to have arbitrary properties, encoded as key-value pairs.
The most commonly used query languages for property graphs are Cypher and Gremlin. Cypher is pretty similar to SPARQL in many ways, but has better support for graph traversals. Gremlin is much more expressive and allows traversing the graph in arbitrary ways and performing arbitrary computations along the way; for me, it is more akin to a scripting language to operate on graphs rather than a high-level declarative language like SQL or Cypher.
As with RDF databases, there are different implementation strategies. Neo4j is a true, native graph databases and stores the nodes and edges directly. Titan, on the other hand, uses a key-value store (e.g., Cassandra or HBase) as the backend, which allows it to store the graph in a distributed fashion. However, query execution in Titan is still centralized (last I checked).
SQLGraph is a research prototype that shows how to support a property graph database (including a subset of Gremlin) on top of a relational database system, through many clever optimizations to handle flexible schemas and large numbers of properties. Something like this would perhaps the best overall option, in my opinion, if it were to be developed more fully. Vertexica and GRAIL are two other similar prototypes, that are focused more on analytics than queries.
CayleyDB also appears to take a similar approach as SQLGraph, but I haven’t looked closely at it. On the other hand, Dgraph, another recent project, aims to build a new distributed graph data management system from scratch.
Many property graph databases provide direct access to the graph using a standard API, so you can write new graph algorithms easily — as far as I know, the RDF databases do not support such functionality.
Isn’t GraphQL another graph database?
No: GraphQL is somewhat unfortunately named in my opinion (doesn’t help that it was developed by Facebook, which uses graph algorithms heavily). GraphQL is an alternative to REST, and allows retrieving resources in a uniform manner. In particular, it allows retrieving the properties of one resource along with any resources that are connected to them, in a single query; something like that would require multiple requests using REST APIs. As a query language, GraphQL could be seen as a small subset of SPARQL or Cypher (Dgraph, mentioned above, plans to use GraphQL as its query language).
What about tree-structured data? Aren’t trees a special case of graphs?
Yes, but trees have a lot more structure, with the ancestor/descendant relationship taking a central role, and the queries of interest are often quite different than graphs. XML or JSON databases that support “path” queries are usually a better option for this, and even relational databases are sufficient in many cases. See here for a discussion on this topic.
Hopefully this overview helps in understanding the landscape of graph data management systems of today. If you have questions or suggestions, feel free to contact me at email@example.com.