Examining Common Myths about RDF, LPG and proprietary graph solutions
There is folklore everywhere, even in the relatively new area of graph databases. Some of the myths you may hear were true months or even years ago, but the graph database market moves fast. Let’s examine a few of yesterday’s myths alongside today’s truths.
Myth 1: There is only one “scalable graph database” in existence
Despite marketing claims, there is NOT only one scalable graph database in this world.
The scale of analytical functions is perhaps one of the biggest differences among graph database vendors. For analytics, a few vendors support deep, multi-hop analysis across large volumes of data. The majority are more transactional, better suited to INSERT/DELETE/UPDATE operations. If you’re scoping out a graph database for more than just graph algorithms, be sure to test it against your mixed workload before making a choice.
Vendors often use benchmarks to prove scalability, but scalability is best measured against your own workload, not a mythical one. Besides, there are things worth comparing that don’t always make it into the benchmarks:
- Data Loading — Graph databases vary a great deal in how fast they load data. Some chug away at loading, using relatively few compute resources and loading data very slowly, while others let you break the task up across a cluster and use all of it for speedy results. Some graph databases load data offline much more quickly than online. Consider your use case and how fast you need to load data.
- RDF data-wrangling — If you use something other than an RDF triple store to load the thousands of standard RDF files available for your use, you’re probably going to spend time wrangling data, and you should account for that time. For example, data stores like Wikipedia offer freely available RDF, but it needs to be translated for a proprietary or LPG database, because those databases don’t follow the standards. Loading a standard data file becomes an extra step, albeit a small one.
- Compiling and Caching Queries — Some graph databases compile queries to speed up analytics. Rarely is this compile step accounted for in benchmark reports. However, in real-world analytics, you may have to compile your queries frequently. Take this into account, especially if your analytical workload isn’t the same thing every day.
- Non-Graph Analytics — Benchmarks typically compare graph queries, but there is much more to analytics than that. See the next myth for more detail.
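As a sketch of the data-wrangling point above: translating standard RDF into the node/edge shape an LPG database expects is an extra step you have to budget for. This toy Python snippet (standard library only, with invented example IRIs, and handling only the simplest `<s> <p> <o> .` N-Triples form) illustrates that translation.

```python
# Minimal sketch: translate simple N-Triples into LPG-style nodes and edges.
# Real RDF parsing should use a proper library; this handles only the
# simplest "<s> <p> <o> ." statement form and is for illustration only.

def parse_ntriple(line):
    """Split one N-Triples statement into (subject, predicate, object)."""
    parts = line.strip().rstrip(" .").split(" ", 2)
    return tuple(p.strip("<>") for p in parts)

def to_lpg(triples):
    """Map triples onto a property-graph shape: a nodes dict plus an edge list."""
    nodes, edges = {}, []
    for s, p, o in triples:
        nodes.setdefault(s, {"id": s})
        if o.startswith('"'):                       # literal -> node property
            nodes[s][p.rsplit("/", 1)[-1]] = o.strip('"')
        else:                                       # IRI -> edge to another node
            nodes.setdefault(o, {"id": o})
            edges.append((s, p.rsplit("/", 1)[-1], o))
    return nodes, edges

lines = [
    '<http://ex.org/alice> <http://ex.org/knows> <http://ex.org/bob> .',
    '<http://ex.org/alice> <http://ex.org/name> "Alice" .',
]
nodes, edges = to_lpg([parse_ntriple(l) for l in lines])
print(edges)   # [('http://ex.org/alice', 'knows', 'http://ex.org/bob')]
```

Even in this trivial case, decisions have to be made (which predicates become edges, which become properties), which is exactly the wrangling time the bullet warns about.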
Your database selection should be based on real-world analytics and the task at hand.
Myth 2: Graph databases are primarily for “graphy” analytics like friend-of-a-friend and PageRank
Graph databases are known for their ability to perform graph algorithms, but many people (and even a few companies in the space) confuse graph algorithms with a graph database.
Sure, if you want to run PageRank or shortest-path analytics, graph databases are the best-performing solution. For those trying to run PageRank and other graph algorithms in an old-school RDBMS, it’s a tedious process that often involves moving data, performing big JOINs and running compute-expensive aggregate functions. Whether you’re using an RDF triple store or a labeled property graph (LPG), you can speed through graph algorithms with just about any graph database.
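For a sense of what those “graphy” queries actually compute, here is a shortest-path traversal sketched as a breadth-first search over a toy in-memory graph (names invented). A graph database runs this kind of traversal natively, without the joins an RDBMS would need.

```python
from collections import deque

# Toy adjacency-list graph with invented names, for illustration only.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns one shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(graph, "alice", "dave"))   # ['alice', 'bob', 'dave']
```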
However, the power of a graph database lies in its capability to do inferencing, handle ontologies, understand linked data and even run data warehouse-style BI analytics. Many RDF databases support logical reasoning at application runtime to answer queries about facts that have not been explicitly stored. Inferencing can create new relationships and insight based on the vocabularies or ontologies in the existing data. There are many ontologies to help you gain insight. A popular example is FIBO, the Financial Industry Business Ontology: a formal model of the legal structures, rights and obligations contained in the contracts and agreements that form the foundation of the financial industry. It can go far in helping you manage data and gain insight in a financial services organization.
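A toy illustration of that kind of inferencing, using only the Python standard library and hypothetical FIBO-flavored names: forward-chaining rdf:type facts through an rdfs:subClassOf hierarchy until no new facts appear. Real reasoners implement far richer rule sets; this shows only the core idea that new relationships can be derived rather than stored.

```python
# Toy RDFS-style inferencing: derive new rdf:type facts from an ontology's
# subClassOf hierarchy by forward chaining to a fixed point.

def infer_types(facts, subclass_of):
    """facts: set of (instance, class); subclass_of: set of (sub, super)."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        # rdfs:subClassOf is transitive: close the hierarchy first.
        for a, b in list(subclass_of):
            for c, d in list(subclass_of):
                if b == c and (a, d) not in subclass_of:
                    subclass_of.add((a, d)); changed = True
        # If x is a Sub and Sub subClassOf Super, then x is also a Super.
        for inst, cls in list(inferred):
            for sub, sup in subclass_of:
                if cls == sub and (inst, sup) not in inferred:
                    inferred.add((inst, sup)); changed = True
    return inferred

facts = {("fibo:AcmeBank", "fibo:Bank")}        # hypothetical instance data
hierarchy = {("fibo:Bank", "fibo:FinancialInstitution"),
             ("fibo:FinancialInstitution", "fibo:LegalEntity")}
inferred = infer_types(facts, set(hierarchy))
print(sorted(inferred))
```

The query “list all legal entities” now returns AcmeBank even though that fact was never loaded, which is the practical payoff of ontology-backed reasoning.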
Some graph databases can also perform standard data warehouse-style analytics. SPARQL, Gremlin and, for the most part, Cypher are all capable of what you might consider standard analytics. When you run analysis, you’ll need functions like aggregates (COUNT, AVG, MIN, MAX), ORDER BY with offsets, and functions on strings, numerics, dates and times. These are available on some graph databases, just as in an RDBMS.
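To make the warehouse-style analytics concrete, here is the plain-Python equivalent of a SPARQL-style GROUP BY with COUNT and AVG over a small table of rows (the data and column names are invented for illustration).

```python
# The equivalent of a SPARQL/SQL query like
#   SELECT ?dept (COUNT(*) AS ?n) (AVG(?salary) AS ?avg)
#   GROUP BY ?dept ORDER BY ?dept
# expressed as plain Python aggregation.
from collections import defaultdict

rows = [  # (employee, department, salary) -- hypothetical data
    ("alice", "eng", 100), ("bob", "eng", 90), ("carol", "sales", 80),
]

groups = defaultdict(list)
for emp, dept, salary in rows:
    groups[dept].append(salary)

# COUNT and AVG per group, ordered by department name.
report = {d: (len(s), sum(s) / len(s)) for d, s in sorted(groups.items())}
print(report)   # {'eng': (2, 95.0), 'sales': (1, 80.0)}
```

The point of the myth-busting is that you don’t have to export to Python or an RDBMS to get this: some graph query engines run the GROUP BY in place.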
Many professional cooks loathe having a device in their kitchen that does just one thing. As with the gadgets in your kitchen drawer, you should think twice about a database that is just about graph algorithms. These days graph databases are far from that niche and can do a lot more.
Myth 3: You have to decide between a property graph and a triple-store. Either way, you have to give up something
So much has been written about property graphs versus triple stores. A good example is a write-up by Jesus Barrasa, originally presented at a conference in 2016 and later published as a DZone article in 2018. It discusses some of the differences in data models and some of the shortcomings of RDF. When first written in 2016, it was most definitely true; it was somewhat true in 2018, but these days it’s outdated.
Most modern RDF triple-store graph databases, including BlazeGraph, Amazon Neptune and AnzoGraph DB, have adopted the SPARQL* (pronounced ‘SPARQL-star’) extension, which accommodates property graphs and provides most of the functions offered by property graph databases. With modern triple stores, you can have multivalued properties on a triple. You can use either quads or properties for named graphs. Yes, you can qualify instances of relationships, because modern RDF supports statement-level properties, and so on.
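To make “qualify instances of relationships” concrete, here is a minimal sketch of the RDF-star idea using plain Python tuples: a triple itself serves as the subject of another statement, which is how edge-level properties are expressed. The names are invented for illustration.

```python
# RDF-star sketch: a triple can itself be the subject of another triple,
# which is how statement-level ("edge") properties are expressed, e.g.
#   << :alice :knows :bob >> :since "2015" .

base = ("ex:alice", "ex:knows", "ex:bob")          # an ordinary triple
annotations = {
    (base, "ex:since"): "2015",                    # annotation on the statement
    (base, "ex:certainty"): "0.9",
}

def edge_property(triple, prop, store):
    """Look up a property attached to a whole statement."""
    return store.get((triple, prop))

print(edge_property(base, "ex:since", annotations))   # 2015
```

This is exactly the capability LPG databases advertise as “edge properties”; RDF-star closes that gap for triple stores.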
RDF triple-stores still shine when it comes to supporting standards, however. Standards were a big motivation years ago when Tim Berners-Lee et al. decided that data should be shared on the web in a standardized format called RDF. Many large corporations and government agencies still use the RDF model and will likely adopt the RDF* model once it becomes official.
RDF triple-stores also remain better at OWL ontologies and inferencing than other solutions. Inferencing is an automatic procedure that generates new relationships based on the data plus additional information in the form of a vocabulary or ontology.
Are there more myths?
So much of the focus on graph databases is on graph algorithms, mostly because the money in the market is pushing that concept. However, graph databases are far more valuable than finding friends of friends on social networks. As my colleague Sean Martin writes, data integration is the killer application for graph databases. There is power in being able to manage multilayer network topologies, leveraging OWL and ontologies for additional intelligence and enriching your data.