Cambridge Semantics Gets on the LPG Track

Will Labeled Property Graphs (LPGs) make SPARQL go the way of the steam engine?

Disclaimer: It is against my employer’s policy to endorse any specific vendor. This article is only a general observation of trends in the databases industry.

In the early 1900s the Pennsylvania Railroad was one of the greatest engineering companies in the world. It employed tens of thousands of engineers building the most powerful steam engines in existence. However, during the 1940s a new technology started to threaten the steam engine: the diesel-powered locomotive. In a last desperate effort to retain its leadership in steam technology, the Pennsylvania Railroad built the most powerful steam engine in history, the legendary T1 locomotive. The T1 truly was an amazing engineering feat, easily breaking speed records of other steam engines by wide margins. Yet there was a problem: the engine was so complex that it required extensive maintenance. The engines sometimes spent more time in the shop for maintenance than they did in service. Although the T1 was an engineering marvel, in practice it was uneconomical to maintain.

The decline of the once mighty Pennsylvania Railroad is frequently attributed to their reluctance to adopt diesel technologies. Their management underestimated the long-term technology trends and they took too long to adopt the newer low-cost diesel engines.

I think the T1 locomotive story is a metaphor for our times in the database world. Organizations are starting to realize that their older RDF and RDBMS systems are going the way of the steam engine. They will be an interesting footnote in the history of databases, but they won’t play a significant role economically or in providing a foundation for innovation. The new diesel engine is the labeled property graph (LPG). In the end, it will be the lower maintenance costs that show LPG data models are superior to reification-bound RDBMS and RDF database models. Not convinced? Can anyone add a property to a relationship at any time without disrupting the existing query pool? With LPG models this is trivial.
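To make that last claim concrete, here is a minimal sketch in Python of a toy in-memory property graph. It is not any vendor's API: the `Edge` class, the `friends_of` and `friends_since` functions, and the sample data are all hypothetical names made up for illustration. The idea is only that a relationship carries its own bag of properties, so adding one later does not change what existing queries see.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A labeled relationship that carries its own properties (hypothetical toy model)."""
    source: str
    label: str
    target: str
    properties: dict = field(default_factory=dict)


def friends_of(edges, person):
    """An 'existing query': who does this person know? It ignores edge properties."""
    return sorted(e.target for e in edges
                  if e.source == person and e.label == "KNOWS")


def friends_since(edges, person, year):
    """A 'new query' that uses a property added later; edges without it are skipped."""
    return sorted(e.target for e in edges
                  if e.source == person
                  and e.label == "KNOWS"
                  and "since" in e.properties
                  and e.properties["since"] <= year)


edges = [
    Edge("alice", "KNOWS", "bob"),
    Edge("alice", "KNOWS", "carol"),
]

before = friends_of(edges, "alice")

# Add a property to one relationship after the fact. No schema change,
# no extra reification node, and no rewrite of the existing query.
edges[0].properties["since"] = 2015

after = friends_of(edges, "alice")
recent = friends_since(edges, "alice", 2020)
```

The old query returns the same answer before and after the change, while the new query can start using the `since` property right away. A real LPG database adds storage, indexing and a query language on top, but the data-model point is the same.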

Cambridge Semantics was in the news last week as they announced their transition from RDF-only to supporting both RDF and LPGs. I personally feel this will be one of the biggest announcements in the database industry this year. For anyone who is not familiar with Cambridge Semantics, they are one of the few companies that support the concept of a distributed graph on commodity hardware. That means that their systems are not limited to a single node in a cluster. Their graphs can span hundreds of nodes in a cluster, each with terabytes of RAM.

Although many people in the industry were not aware of this, Cambridge Semantics has actually supported the LPG data model for a while. However, you could only access properties of relationships through the SPARQL* language. The announcement last week concerned the fact that they will now also support the OpenCypher query language. This is an important step since Cypher is by far the leading LPG query language and embodies many of the concepts that need to be present in the future GQL standard.

The OpenCypher work at Cambridge Semantics is being headed by Barry Zane. Barry is one of the co-founders of Netezza (sold to IBM), ParAccel (whose technology Amazon licensed for Redshift) and SPARQL City (now part of Cambridge Semantics). You can infer from his lineage that Barry is no stranger to distributed query optimization. Barry and his team are also active participants in the LPG query language standardization efforts. I think it would be wise not to underestimate what Barry and his team can do.

For RDF/SPARQL-only vendors that don’t yet have an LPG/Cypher product, the Cambridge Semantics announcement is significant. It means that you will need to compete with a new generation of companies that leverage a massive community of Cypher developers and a huge and growing library of graph query algorithms written in Cypher, not to mention the training and trade-press books that document these algorithms.

In the next year we expect to see a new generation of companies building add-on products for the LPG-Cypher-GQL stack. Common functions like rules engines, recommendation engines and entity resolution engines will be created and supported by innovative third parties that are all on the LPG-Cypher-GQL track.

Companies that have graph products that are not truly distributed also need to take note and remember the lessons from Intel Saffron. Your graphs can’t just rely on public-domain similarity algorithms. They also need to scale out. There are now multiple vendors that support distributed native graph databases. My suspicion is there will be more announcements in this area in the following year.

In the coming years we will see an entire new generation of FPGA/VLSI engineers who understand that we don’t need 1,500 instructions in our CPU to do simple pointer hopping. Companies like Cray, Data Vortex and Graphcore are well on their way to getting the custom graph hardware to work. That means that graph developers will have yet another three orders of magnitude lead over their RDBMS JOIN competition. These vendors just need to add the Cypher layer to be part of the growing community of embarrassingly parallel graph algorithms written in Cypher.
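For readers who have not seen "pointer hopping" next to a JOIN, here is a toy Python sketch of the difference. It is not a benchmark and says nothing about any particular hardware or product: `edge_rows`, `two_hops_join` and `two_hops_pointer` are hypothetical names, and the join is a deliberately naive one with no indexes. It only shows the shape of the two approaches to a two-hop traversal.

```python
# Edge table the way a relational system might store it: (source, target) rows.
edge_rows = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]


def two_hops_join(rows, start):
    """Naive self-join: every hop rescans the whole edge table."""
    first_hop = [t for s, t in rows if s == start]
    return sorted({t2 for mid in first_hop
                   for s2, t2 in rows if s2 == mid})


# Adjacency lists: each node points directly at its neighbors.
adjacency = {}
for s, t in edge_rows:
    adjacency.setdefault(s, []).append(t)


def two_hops_pointer(adj, start):
    """Pointer hopping: follow neighbor lists, touching only reachable nodes."""
    return sorted({t2 for mid in adj.get(start, [])
                   for t2 in adj.get(mid, [])})


via_join = two_hops_join(edge_rows, "a")
via_pointer = two_hops_pointer(adjacency, "a")
```

Both functions return the same nodes. The traversal version simply never looks at the unrelated `("x", "y")` row, and that is the kind of simple, repetitive work that custom graph hardware aims to speed up.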

The next few years promise to be full of change. I hope these articles can help you lead your company down the right track!