Graph Database Superpowers

Unraveling the back-story of your favorite graph databases

Steve Sarsfield
Mar 18

The graph database market is an exciting one, and the list of vendors keeps growing. What you may not realize is that the dozens of graph databases on the market today have very different origin stories, and it is that origin story that shapes the superpowers and weaknesses of each offering.

While Superman is great at flying and stopping locomotives, you shouldn’t rely on him around strange glowing metal. Batman is great at catching small-time hoods and maniacal characters, but deep down, he has no superpowers other than a lot of funding for special vehicles and handy tool belts. Superman’s origin story is very different from Batman’s, and therefore the impact each has on the criminal world is very different.

The same is true of graph databases. The origin story absolutely makes a difference when it comes to strengths and weaknesses, so let’s look at how the origin stories of various graph databases shape the use cases each solution fits best.

Graph Superhero: RDF and SPARQL databases

Examples: Ontotext GraphDB, AllegroGraph, Virtuoso, and many others

Superpower: Semantic modeling. These databases offer a basic understanding of concepts and the relationships between those concepts, enhanced context through the use of ontologies, and the ability to share data and concepts on the web. They often support OWL and SHACL, which help describe what the data should look like and let data be shared much the way we share web pages.
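
To make the semantic-modeling superpower concrete, here is a minimal sketch using the open-source rdflib library in Python. The example.org identifiers and the friendship facts are invented for illustration and are not tied to any particular vendor’s product:

```python
# Minimal sketch: concepts and relationships as RDF triples, queried with SPARQL.
# The example.org URIs and the data are illustrative only.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.Sue, RDF.type, FOAF.Person))
g.add((EX.Mary, RDF.type, FOAF.Person))
g.add((EX.Sue, FOAF.knows, EX.Mary))   # the relationship between two concepts
g.add((EX.Sue, FOAF.name, Literal("Sue")))

# SPARQL: who does Sue know?
results = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX ex:   <http://example.org/>
    SELECT ?friend WHERE { ex:Sue foaf:knows ?friend }
""")
for row in results:
    print(row.friend)
```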

Kryptonite: The original RDF specification did not account for properties on predicates very well. For example, if I want to record WHEN Sue became a friend of Mary, or the fact that Sue is a friend of Mary according to Facebook, handling that provenance and time can be cumbersome. Many RDF databases added quad-store options so users could handle provenance or time, and several are adding the new RDF* specification to overcome these shortcomings. More on this in a minute.
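
As a rough sketch of the quad-store workaround described above, provenance and time can be attached to a named graph rather than to the triple itself. Again this uses rdflib in Python, and the graph IRI, source, and date are assumptions made up for illustration:

```python
# Sketch: a quad (named graph) carries the provenance that a bare triple cannot.
# The graph IRI, the "Facebook" source, and the date are illustrative assumptions.
from rdflib import Dataset, Namespace, Literal, URIRef
from rdflib.namespace import FOAF, DCTERMS, XSD

EX = Namespace("http://example.org/")
fb_graph = URIRef("http://example.org/graphs/facebook")

ds = Dataset()
g = ds.graph(fb_graph)                 # named graph for Facebook-sourced facts
g.add((EX.Sue, FOAF.knows, EX.Mary))

# Statements about the named graph hold the provenance and the time.
ds.add((fb_graph, DCTERMS.source, Literal("Facebook")))
ds.add((fb_graph, DCTERMS.date, Literal("2015-06-01", datatype=XSD.date)))

# The RDF* approach annotates the triple directly instead, roughly:
#   << ex:Sue foaf:knows ex:Mary >> ex:since "2015-06-01" .
```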

Many of the early RDF stores were built on a transactional architecture: they scaled reasonably well for transactions but ran into size limits when performing analytics across large numbers of triples.

The vendors in this category have had time to mature. While their origins may be in the semantic web and data sharing, many have stretched their superpowers with labeled properties and other useful features.

Graph Superhero: Labeled Property graph with Cypher

Example: Neo4j

Superpower: Although the origin story is about website content taxonomies, it turns out that these databases are also pretty good for 360-degree customer view applications and for understanding multiple supply chain systems. Cypher, although not a W3C or ISO standard, has become a de facto standard language as the Cypher community has grown through Neo4j’s efforts. Neo4j has also been an advocate of the upcoming GQL standard, which may result in a more capable Cypher language.
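
As a hedged illustration of the sort of 360-degree customer view query Cypher handles well, here is a small sketch using the official neo4j Python driver. The connection details, labels, and relationship types are hypothetical rather than a real schema:

```python
# Sketch: a 360-degree customer view in Cypher via the neo4j Python driver.
# The URI, credentials, and the Customer/Order/SupportTicket model are made up.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (c:Customer {id: $customer_id})-[:PLACED]->(o:Order)-[:CONTAINS]->(p:Product)
OPTIONAL MATCH (c)-[:RAISED]->(t:SupportTicket)
RETURN c.name AS customer,
       collect(DISTINCT p.name) AS products,
       count(t) AS tickets
"""

with driver.session() as session:
    for record in session.run(query, customer_id="C-1001"):
        print(record["customer"], record["products"], record["tickets"])

driver.close()
```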

Kryptonite: Neo4j built its own system from the ground up on a transactional architecture. Although some scaling features have recently been added in Neo4j version 4, the approach is more about federating queries than about MPP. In version 4, the developers added manual sharding and a new way to query sharded clusters, which requires extra work when you shard and when you write your queries. This is similar to the approach of transactional RDF stores, where SPARQL 1.1 supports federated queries through a SERVICE clause. In other words, you may still encounter limits when trying to scale and perform analytics. Time will tell whether the latest federated approach is scalable.
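
For reference, here is a sketch of the SPARQL 1.1 SERVICE clause mentioned above, submitted with the SPARQLWrapper library. Both endpoint URLs are placeholders, and the local endpoint is assumed to support federated queries:

```python
# Sketch: SPARQL 1.1 federation through the SERVICE clause.
# Both endpoints are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:7200/sparql")  # local endpoint (placeholder)
sparql.setQuery("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?person ?remoteLabel WHERE {
        ?person a foaf:Person .
        SERVICE <http://remote.example.org/sparql> {   # remote endpoint (placeholder)
            ?person rdfs:label ?remoteLabel .
        }
    }
""")
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["person"]["value"], binding["remoteLabel"]["value"])
```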

Ontologies and inferencing are not standard features of a property graph, although add-ons offer some capability here. If you’re expecting to manage semantics in a property graph, it’s probably the wrong choice.

Graph Superhero: Proprietary Graph

Example: TigerGraph

Superpowers: Through TigerGraph, the market came to appreciate that graph databases could run on a cluster. Although certainly not the first to run on a cluster, TigerGraph focused on the real power of running user-supplied graph algorithms on a lot of data.

Kryptonite: The database decidedly went its own way with regard to standards. There are apparent shortcomings in how simply you can leverage ontologies, perform inferencing, and make use of your people who already know SPARQL or Cypher. By far the biggest disadvantage of this proprietary graph is that you have to think more about schema and JOINs before loading data. The schema model is more reminiscent of a traditional database than any of the other solutions on the market. While it may be a solid solution for running graph algorithms, if you’re creating a knowledge graph by integrating multiple sources and you want to run BI-style analytics on that knowledge graph, you may have an easier time with a different solution.

It is interesting to note that although TigerGraph’s initial thinking was to beat Neo4j at proprietary graph algorithms, TigerGraph has teamed up with the Neo4j folks and is in the early stages of making its proprietary language a standard through ISO, alongside SQL. Although TigerGraph releases many benchmarks, I have yet to see it publish results for TPC-H or TPC-DS, the standard BI-style analytics benchmarks. Also, because of the non-standard data model, harmonizing data from multiple sources requires some extra legwork and thought about how the engine will execute analytics.

Graph Superhero: RDF Analytical DB with labeled properties

Example: AnzoGraph DB

Superpowers: Cambridge Semantics designed a triple store that both followed standards and could scale like a data warehouse. In fact, it was the first OLAP MPP platform for graph, capable of analytics on a lot of triples. It turns out that this is the perfect platform for creating a knowledge graph, facilitating analytics built from a collection of structured and unstructured data. The data model helps users load almost any data at any time.

Because of its schemaless nature, the data can be sparsely populated. It supports very fast in-memory transformations, so data can be loaded first and cleansed later (ELT). Because metadata and instance data live together in the same graph, without any special effort, all those ELT queries become much more flexible, iterative, and powerful. With an OLAP graph like AnzoGraph DB, you can add any subject-predicate-object-property at any time without having to plan for it.

In traditional OLAP databases, you can have views. In this new type of database, you can have multi-graphs that can be queried as one graph when needed.
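
Here is a sketch of what querying several named graphs as one graph looks like in standard SPARQL, using SPARQLWrapper. The endpoint URL and graph IRIs are placeholders rather than AnzoGraph-specific settings:

```python
# Sketch: FROM clauses merge several named graphs into one logical graph for a query.
# The endpoint URL and graph IRIs are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8080/sparql")  # hypothetical endpoint
sparql.setQuery("""
    PREFIX ex: <http://example.org/>
    SELECT ?customer ?order
    FROM <http://example.org/graphs/crm>          # structured CRM data
    FROM <http://example.org/graphs/webcontent>   # data extracted from unstructured text
    WHERE { ?customer ex:placed ?order . }
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["customer"]["value"], b["order"]["value"])
```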

Kryptonite: Although AnzoGraph DB is ACID compliant, other solutions on the market might handle transactions faster because of the OLAP nature of this database’s design. Ingesting massive volumes of transactions smoothly in high-transaction environments might require additional technologies, such as Apache Kafka. Of course, like many warehouse-style technologies, bulk data loading is very fast, so batch loads are not a problem.
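
As a sketch of the kind of buffering layer meant here, high-rate updates can be pushed through Kafka and micro-batched into the graph store. This uses the kafka-python client; the broker address, topic name, and payload shape are all assumptions:

```python
# Sketch: buffering high-rate graph updates in Kafka so a downstream consumer
# can batch-load them into an OLAP-style graph store. Broker, topic, and payload
# shape are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each message is one pending graph update.
producer.send("graph-updates", {"s": "ex:Sue", "p": "foaf:knows", "o": "ex:Mary"})
producer.flush()
```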

Other types of “Graph Databases”

You can get graph database capabilities in an Apache Hadoop stack with GraphFrames, which works on top of Apache Spark. Given Spark’s capability for handling big data, scaling is a superpower. However, because this approach layers technologies, tuning the combination of Spark, HDFS, YARN, and GraphFrames can be the challenge.
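
Here is a minimal GraphFrames sketch on PySpark, assuming the graphframes package is available on the cluster; the vertex and edge data are invented for illustration:

```python
# Sketch: graph analytics with GraphFrames on top of Apache Spark.
# Assumes the graphframes package is on the Spark classpath; the data is made up.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-sketch").getOrCreate()

vertices = spark.createDataFrame(
    [("sue", "Sue"), ("mary", "Mary"), ("bob", "Bob")], ["id", "name"]
)
edges = spark.createDataFrame(
    [("sue", "mary", "knows"), ("mary", "bob", "knows")], ["src", "dst", "relationship"]
)

g = GraphFrame(vertices, edges)
ranks = g.pageRank(resetProbability=0.15, maxIter=10)  # a built-in, cluster-scale algorithm
ranks.vertices.select("id", "pagerank").show()
```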

Solutions like these give you a nice taste of graph functionality inside a platform you probably already have. The kryptonite here is usually performance once you scale to billions or trillions of triples and then try to run analytics on them.

The Industry is full of Ironmen

However, even as that evolution happens, remember that graph databases aren’t one thing.

“Graph database” is a generic term, and it simply doesn’t give you the level of detail you need to understand which problem a product solves. The industry has developed various ways of doing the critical tasks that drive value in this category we call graph databases. Whether it’s harmonizing diverse data sets, performing graph analytics, performing inferencing, or leveraging ontologies, you really have to think about what you’d like to get out of the graph before you choose a solution.

Written by Steve Sarsfield, VP Product, AnzoGraph (AnzoGraph.com). Formerly from IBM, Talend and Vertica. Author of the book The Data Governance Imperative.
