What’s advancing the world of graph databases, and what’s slowing us down?
In my relatively short two years in the graph database world, I’ve met people who have spent decades hoping that the world would see value in the analytics, semantics, and data integration that graph databases offer. Many companies, including all thirty or so vendors, are still waiting for the market to pop. What are some of the hits that make graph databases so cool, and what are the misses that are keeping graph databases niche?
Here is my take on the top HITS and MISSES of the graph database world, covering a little bit about investors, technology, standards, and marketing.
Hit: Investor Interest in Data
We can learn a lot about the graph database market’s capability to woo investors by looking at Snowflake. Although clearly NOT a graph database, it makes for a good study. In its IPO, Snowflake raised money at a $12.4 billion valuation. Revenue in the first half of 2020 more than doubled to $242 million. Even with this fantastic IPO, the company was still losing money, to the tune of $171.3 million. Adoption is strong, with over 3,000 customers on the platform.
Snowflake is reasonable, but not necessarily amazing, technology. Sure, the analytical engine underneath Snowflake can automatically scale on the cloud, but it is not as configurable, agile, and fast for certain types of analytics as, say, Vertica, Redshift, or even Teradata. It’s a comparatively young platform, and believe me, it takes years to build out optimizations and peripheral use-case analytics. Backup, workload management, encryption, and security aren’t easy to build, but they are necessary to sell into corporate IT. Frankly, the more mature vendors have had time to build them out.
Is the enthusiasm overdone? Maybe not. Much of Snowflake’s valuation comes from cool branding and marketing, and an outstanding user experience that drives an amazingly fast adoption pace. Compared to Snowflake, the notion of business analysts having to go through an IT process to spin up servers, even cloud ones, and then wait for IT to install the right software is inferior. In the SaaS model it provides, it’s a few clicks, and I’m ready to load my data. It simplifies the entire analytics creation process, from data load to analytics.
Hit: Focus on adoption by a broader audience
Graph databases should be easy to use, no matter what level of expertise you have. The HIT is that vendors are driving toward a Snowflake-like experience. Enter your contact details, give a couple of technical details about your project, and you’re up. There are demo databases for you to use and queries for you to run. Some very encouraging work in the elegance department from Data.World, TigerGraph, and Neo4J Aura will only help the market grow.
On the other hand, if you think your graph database’s target market is ontologists, think again. Aim your development at Excel spreadsheet and internet browser users. Less knowledgeable users will thank you for helping them understand graph. More knowledgeable users will thank you for saving them from the dreariness of following complicated instructions for setting up a database.
Think of it this way. During the deal, your potential buyer will have to justify to someone on the business side why they need this new thing called a graph database. If the solution can’t be elegantly showcased and understood by all, it may fall flat with business users. Technical features are on an equal footing with business-focused ones.
Miss: Standards delays on SPARQL*
For more than a decade, the W3C has had a fantastic plan to address a significant shortcoming in triple-stores so that they can express properties on predicates. This would mean that expressing provenance and start/end dates for relationships, for example, would be as simple in a semantic graph as it is in a property graph. The committee is a mixed bag of semantic database vendors. Some want to build toward unification with the labeled property graph engines, using a structure and syntax similar to those of Cypher-based solutions. Others would like to continue to pay homage to blank nodes and reification in a part of the specification termed separate assertions.
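To make the difference between the two camps concrete, here is a minimal sketch in plain Python, with no graph library involved. It models one annotated relationship twice: once with classic RDF reification, and once as a single labeled-property edge. The names `reify`, `property_edge`, and every `ex:` identifier are hypothetical and exist only for illustration; the `rdf:Statement`, `rdf:subject`, `rdf:predicate`, and `rdf:object` terms are the standard RDF reification vocabulary.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def reify(statement_id: str, subject: str, predicate: str, obj: str,
          annotations: Dict[str, str]) -> List[Triple]:
    """Describe an annotated relationship using classic RDF reification.

    The fact is asserted once, then described again through a statement
    node (often a blank node) that the annotations hang from.
    """
    triples: List[Triple] = [
        (subject, predicate, obj),                    # the asserted fact
        (statement_id, "rdf:type", "rdf:Statement"),  # the statement node
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]
    for key, value in annotations.items():
        triples.append((statement_id, key, value))    # one triple per annotation
    return triples


def property_edge(subject: str, predicate: str, obj: str,
                  annotations: Dict[str, str]) -> dict:
    """Describe the same relationship as one edge that carries properties."""
    return {"from": subject, "type": predicate, "to": obj,
            "properties": dict(annotations)}


# Hypothetical example data: when the relationship started and where it came from.
annotations = {"ex:startDate": "2018-06-01", "ex:source": "ex:hrSystem"}

reified = reify("_:stmt1", "ex:alice", "ex:worksFor", "ex:acme", annotations)
edge = property_edge("ex:alice", "ex:worksFor", "ex:acme", annotations)

print(len(reified), "triples with reification")
print("1 edge with", len(edge["properties"]), "properties")
```

The reified form needs a handful of bookkeeping triples before a single annotation is stored, while the property edge holds everything in one place. As I understand the proposal, SPARQL* aims to let the annotation sit directly on the triple, roughly `<< ex:alice ex:worksFor ex:acme >> ex:startDate "2018-06-01"`, which is much closer to the property-edge shape.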
The divide between the blank nodes/reification guys and the labeled property guys is what stumped the W3C more than a decade ago, and I fear it’s happening again. You can see why it might happen. If your database engine is optimized for one or the other method, or if you belong to the exclusive club that completely understands reification, it’s natural to protect your interests. However, it’s not so good for expanding the market to analysts. As the years tick on, we can’t seem to get it done, and meanwhile the world’s technologies advance.
Some vendors are taking matters into their own hands and building one method or the other, or both. It’s a shame that the tool vendors who offer visualization for, or integration into, graph databases will have to suffer from the lack of a standard. Without universal tools, graph database adoption will suffer.
Hit: Bridge-building with GQL
Although I have not been as actively involved in the GQL standards committee meetings as I have been in the SPARQL* conversation, the concept of GQL as an extension of SQL makes a lot of sense. While we are all familiar with SQL-99, the ISO SQL standard continues to evolve through its current version, SQL-2016, and under this concept GQL would become a further extension of it.
As I previously mentioned, I believe that analysts are potentially the biggest buyers of the analytical, semantic, and data integration prowess of graph. If properly realized, GQL will provide the necessary bridge from SQL to graph and offer a bevy of visualization and data integration partners who can support it. Neo4J and TigerGraph are driving this standard, but it has broad support among many property graph vendors.
Miss: The category called “Graph Database”
I became interested in graph databases shortly before I joined Cambridge Semantics. Google searches at the time gave me a lot of information about graph theory and the bridges of Königsberg problem, descending into MIT professors’ lectures about graph theory. Did I need to be Good Will Hunting to understand graph databases? Eventually, I discovered that graph theory is merely the intellectual’s entry point.
The ability to solve graph theory problems is a feature/benefit of a graph database, but not the only one. Many benefits come first — especially data integration and relationship analytics — and I’ve always thought that “graph” shouldn’t be the headline in this category. It’s akin to calling the local supermarket, stocked with hundreds of items, a mustard store. Yes, they have mustard, but…
Could we gain a quicker understanding by calling our technology a three-column (and sometimes four-column) database, a “story” database, or even a “sentence” database? You’re storing subject-predicate-object triples, which serve as the subject, verb, and object of a sentence, with properties acting as adjectives and adverbs. There are a lot of advantages to storing sentences. As long as the sentence is grammatically correct, you can store anything about anyone at any time without having to worry about which table it belongs in. The sentences form a story about you and your organization that can be easily augmented with other epic stories that have already been written, like FIBO and HL7.
That ship has sailed.
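Whatever we call the category, the “sentence” idea is easy to sketch. The following is a toy illustration in plain Python, not how any real graph database is implemented: the `SentenceStore` class, its methods, and all of the example facts are made up for this post. It shows that new kinds of facts can be added without defining a table first, and that a question is just a sentence with blanks in it.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class SentenceStore:
    """A toy in-memory store of subject-predicate-object 'sentences'."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        """Store one sentence; there is no schema or table to pick."""
        self.triples.add((subject, predicate, obj))

    def match(self, subject: Optional[str] = None,
              predicate: Optional[str] = None,
              obj: Optional[str] = None) -> List[Triple]:
        """Return every sentence that fits the pattern; None is a blank."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = SentenceStore()
# Hypothetical facts of very different kinds, all stored the same way.
store.add("Alice", "worksFor", "Acme")
store.add("Acme", "locatedIn", "Boston")
store.add("Alice", "hasSkill", "SPARQL")
store.add("Bob", "worksFor", "Acme")

# "Who works for Acme?" is the sentence "? worksFor Acme".
employees = [s for s, _, _ in store.match(predicate="worksFor", obj="Acme")]
print(employees)

# "What do we know about Alice?" is the sentence "Alice ? ?".
print(store.match(subject="Alice"))
```

A real engine adds indexing, a query language, and often a fourth column for the named graph, but the shape of the data stays this simple.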
Miss (and Hit): Wide range of graph databases
Vendors in the graph database category are as different as chalk and cheese. I think this drives confusion among the populace. The differences include the following:
· Languages: Graph databases might speak SPARQL, SPARQL*, Cypher, GQL, Gremlin, GraphQL APIs, and others.
· Inferencing/OWL/Semantics: This is a powerful feature set for handling relationship analytics in graph databases, but not all of them are great at it. So, do graph databases do inferencing, or not? It depends on the graph database.
· APIs: Access to a graph database differs wildly between vendors. There are SPARQL endpoints, REST, Bolt, and custom APIs.
· Scaling: The solutions have different scaling capabilities, just as traditional RDBMSs do. MySQL scales differently from Teradata, and the same is true in the graph database space.
The vast difference in graph databases is a problem for the category. If there are many methods to talk to your graph database, many languages, and the graph databases provide different functions, then is it a unified category? If you are a corporation in search of graph-powered analytics, how can you possibly decide? If you’re a visualization partner and want to support graph databases, how exactly would you do it?
If you are an innovator in your company and are looking to explore graph technology, don’t just try Neo4J alone and form an opinion. The market is full of technologies that are almost nothing like one another. For its diversity in technologies, I’ll say it’s a hit, but for its confusing variety, a miss.
Hit: My Time in the Graph Database Market
In recent personal news, I have just departed from Cambridge Semantics, where I was VP of Product. Over the past two-plus years, I’ve been privileged to work with many very talented and intellectually honest people at CSI, from whom I’ve learned a great deal. Thanks very much to CSI and the graph community for accepting a data quality, data integration, and big data guy into the fold.
I’m not sure what the future holds, but I’d like my next role to be one that continues to push forward the use of data for the greater good. There’s so much work we can do with data analytics to further humankind, be it in genome research and healthcare, animal conservation, efficient energy use, hunger, and so on. Let’s use data and analytics to drive humanity forward.