Graph Databases & Graph Processing

Sarthak Banerjee
11 min read · Apr 29, 2020


Graph computing and graph databases are not new concepts; they have been around for 30 years or so. But the last few years have seen exponential growth in interest around the topic, especially after it was established that graphs are indeed the most efficient mechanism to solve some real-life problems of the interconnected world, viz. community detection, anomaly detection, fraud analytics, recommendations etc. Today, more than 7 of the top 10 retailers use graph databases to drive recommendations and promotions and to streamline logistics. Some of the largest insurers use graph databases to fight fraud and manage risk. Companies like Google and Facebook have built their businesses on top of graph algorithms.

Some of the drivers for this renewed interest can be summarized as: i) the price of storage and compute has come down, making production-grade deployment of graphs (which are compute and memory intensive) financially viable; ii) contributions from a number of great open source projects in the areas of graph databases, graph querying & processing, and graph visualization; and iii) a low learning curve.

According to a recent study by Gartner, the application of graph databases and processing will grow at 100 percent annually through 2022.

So what are the use cases for graphs?

By its very nature, a graph maps to real-world domain objects and their rich relationships. While an RDBMS also captures relationships, it implements them as constraints, leading to expensive JOINs and hence performance overheads once the number of relationships grows beyond a threshold. Graphs, on the other hand, allow a very flexible and extensible way of adding new domain objects, relationships and properties, and provide enormous efficiency in storing, maintaining and traversing relationships. No wonder the interconnected world of domain objects makes graphs an excellent choice for both OLTP and OLAP workloads, including machine learning algorithms, especially those built around graphs. In short, graphs enable connectedness across data points from multiple sources, helping us understand influencers, create clusters and predict relationships.
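
To make the JOIN-versus-traversal point concrete, here is a toy Python sketch using the networkx package; the people and the FRIEND relationship type are made up for illustration. A multi-hop query is just repeated neighbour lookups, where the SQL equivalent would need one self-join per hop.

```python
# Toy sketch of relationship traversal, using the networkx package.
# All node names and the FRIEND relationship type are illustrative.
import networkx as nx

g = nx.DiGraph()
g.add_edge("Alice", "Bob", rel="FRIEND")
g.add_edge("Bob", "Carol", rel="FRIEND")
g.add_edge("Carol", "Dave", rel="FRIEND")

# "Friends of friends" is a two-hop traversal: each hop is a direct
# neighbour lookup, with no join against a global table.
fof = {n2 for n1 in g.successors("Alice") for n2 in g.successors(n1)}
print(fof)  # {'Carol'}
```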

You should look at graph databases/processing if your business use case aligns with one of the following:

  • You are trying to detect communities, islands or clusters in connected data, e.g. finding households in a customer base (see the sketch after this list)
  • You are trying to find the shortest paths/routes between two entities, e.g. supply chain route optimization
  • Predicting relationships between nodes, e.g. market basket analysis, next best offers
  • Ranking and influence, i.e. estimating the most important nodes in the graph (the influencers), along with label propagation across the network
  • Graph embedding, for learning continuous feature representations of nodes as vectors, used extensively for various downstream machine learning tasks. A good example is Node2Vec, which has been used for recommendation engines
  • Knowledge graphs, or semantic representations of domain knowledge, which can be queried by conversational bots
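
As a quick illustration of the first two use cases, here is a minimal Python sketch using the networkx package; the graph and node names are invented.

```python
# Minimal sketch of community detection and shortest-path finding.
import networkx as nx
from networkx.algorithms.community import label_propagation_communities

g = nx.Graph()
g.add_edges_from([
    ("A", "B"), ("B", "C"), ("A", "C"),   # one cluster
    ("X", "Y"), ("Y", "Z"), ("X", "Z"),   # another cluster
    ("C", "X"),                           # a weak bridge between them
])

# Community detection (e.g. households in a customer base)
print(list(label_propagation_communities(g)))

# Shortest path between two entities (e.g. route optimization)
print(nx.shortest_path(g, "A", "Z"))
```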

There are two very different types of graphs: Labelled Property Graphs (LPG) and Resource Description Framework (RDF) graphs

While both approaches provide ways to explore and graphically depict connected data, they are fundamentally very different and play to their strengths in different use cases. In an LPG, vertices are also called nodes; each has a unique ID and a set of key-value pairs, or properties, that characterize it. Edges, or connections between nodes, are called relationships, and also have IDs and properties. It is important to note that both nodes and relationships therefore have an internal structure. RDF, on the other hand, has the notion of a 'triple' composed of three elements: subject-predicate-object. The subject is a resource or node in the graph, the predicate is the relationship, and the object is another node or a literal value. Nodes and relationships alike are identified by a URI, which is nothing but a unique label, and hence neither nodes nor edges have an internal structure.
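
A small Python sketch of the same distinction; the LPG side uses plain dictionaries to mimic a property graph's internal structure, the RDF side uses the rdflib package, and all identifiers are made up.

```python
# The same fact modelled both ways; identifiers are illustrative.

# LPG style: nodes and relationships carry internal structure (properties).
alice = {"id": 1, "labels": ["Person"], "props": {"name": "Alice", "age": 34}}
knows = {"id": 7, "type": "KNOWS", "props": {"since": 2015},
         "start": 1, "end": 2}

# RDF style: everything is a URI-identified triple, with no internal structure.
from rdflib import Graph, URIRef, Literal

g = Graph()
ex = "http://example.org/"
g.add((URIRef(ex + "alice"), URIRef(ex + "knows"), URIRef(ex + "bob")))
g.add((URIRef(ex + "alice"), URIRef(ex + "name"), Literal("Alice")))
```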

Most real-life enterprise graph use cases fall under LPG, and databases like Neo4j, TigerGraph, OrientDB, ArangoDB, Azure Cosmos DB, Memgraph, RedisGraph and GraphBase come under this category. RDF stores are mostly used for knowledge graphs, where inference and reasoning over existing facts is required and text processing is a key requirement. Apart from this, RDF is also good for representing complex metadata, reference data and master data. MarkLogic, Ontotext and AnzoGraph are examples of RDF stores.

But before we go further, it is important to understand the difference between native and non-native Graph Databases

It is extremely important to understand how a database stores data at the storage layer. Native graph stores like Neo4j and TigerGraph store both node and relationship data in fixed-size data structures. This is called index-free adjacency, whereby each node maintains direct references to its adjacent nodes, so hopping to a neighbour has a computational cost of O(1). In contrast, non-native stores normally use a NoSQL database or an RDBMS as the underlying datastore, with the adjacency list stored as name-value pairs (e.g. in JSON documents). To speed up graph traversal, global indexes are used to link nodes together, resulting in a greater computational cost of O(log n) per hop. As long as the dataset is small, the underlying graph technology does not matter much, but when the number of nodes is very high (millions to billions), native graph stores will outperform non-native ones in real-time storing, querying and traversal.
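
A toy illustration of the difference, with made-up class and variable names: index-free adjacency keeps direct neighbour pointers on each node, while the index-backed approach pays a logarithmic lookup per hop.

```python
# Index-free adjacency: each node holds direct references to its
# neighbours, so one hop costs O(1).
class NativeNode:
    def __init__(self, key):
        self.key = key
        self.neighbours = []  # direct pointers to adjacent nodes

# Non-native: nodes live in a generic store and neighbours are found
# through a global sorted index, costing O(log n) per hop.
import bisect

edge_index = sorted([("A", "B"), ("A", "C"), ("B", "C")])

def neighbours(key):
    i = bisect.bisect_left(edge_index, (key, ""))  # O(log n) seek
    out = []
    while i < len(edge_index) and edge_index[i][0] == key:
        out.append(edge_index[i][1])
        i += 1
    return out

print(neighbours("A"))  # ['B', 'C']
```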

The other database components that need to be understood are File & Object Caching, Query Optimizer, Transaction Manager and High Availability & Clustering Support

Most graph datastores follow a similar architecture comprising the following components:

  • File Caching — similar to the SGA of an RDBMS, where a fixed number of blocks are cached from the file system
  • Object Caching — caches nodes, relationships and properties for rapid in-memory traversal
  • Optimizer — optimizes a query based on statistics, caching, and vertex and edge compression
  • Transaction Manager — similar to the transaction manager of an RDBMS, responsible for managing locks and writing to transaction logs
  • High Availability — native stores normally use a master-slave architecture with a consensus protocol, while NoSQL-backed non-natives inherit the HA of the underlying data store
  • Scalability — graph partitioning is used, whereby graph data is partitioned across multiple machines

A brief comparison of some of the graph databases/processing engines:

The Query Engine for Graph Databases

Most vendors have proprietary graph query engines. For example, Neo4j has Cypher, DGraph supports GraphQL, and TigerGraph supports GSQL. But Gremlin, which is part of Apache TinkerPop (details later), is emerging as the de-facto standard graph query and traversal language. Many graph databases support Gremlin, viz. Amazon Neptune, Cosmos DB, Titan-Hadoop, JanusGraph, Neo4j, OrientDB etc.

Cypher is a declarative query language: the user specifies what they want and the engine decides how to get the data. However, it is not Turing complete, which means certain algorithms like PageRank and label propagation cannot be implemented in it. In contrast, Gremlin is a functional data-flow language that uses pipes to perform complex traversals, meaning the user can define the traversal pattern itself. Gremlin is Turing complete and has become the de-facto SQL of graph query languages.
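
To see the stylistic difference, here is the same "friends of Alice" query expressed both ways; a sketch assuming a Gremlin Server running at localhost:8182 and the gremlinpython package, with the schema (Person label, KNOWS edge) invented for illustration.

```python
# Cypher (declarative: say *what* you want):
#   MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(friend)
#   RETURN friend.name

# Gremlin (data-flow: say *how* to traverse), via gremlinpython:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

names = g.V().has("Person", "name", "Alice").out("KNOWS").values("name").toList()
print(names)
conn.close()
```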

Graph Processing Frameworks for OLAP workloads

Unlike graph databases, graph processing frameworks generally traverse the whole graph to execute massively parallel graph algorithms. The growing scale and magnitude of graph data has driven the development of new graph-parallel systems, viz. Apache Giraph and Spark GraphX. Most of these frameworks facilitate the creation of an analytics pipeline covering preprocessing of the data (load, transform, filter), graph creation, analysis & running the model, and post-processing tasks, all in memory. The advantage of these frameworks over traditional analytics approaches is their ability to run such tasks much faster over huge amounts of data. Graph processing frameworks generally use a distributed file system like HDFS, a data store built on top of it (NoSQL), or a full-fledged graph database like Neo4j.

Apache Spark's GraphX project combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark framework. At a high level, GraphX is the Spark API for graph-parallel computation. With the ability to partition and distribute graphs, it can execute sophisticated in-memory graph algorithms orders of magnitude faster than data-parallel systems. Because it is based on RDDs, which are immutable, its graphs are immutable, distributable and fault tolerant, which makes GraphX unsuitable for graphs that need to be updated in place. The graph is partitioned across the executors using a range of vertex partitioning heuristics. GraphX also allows us to cache and un-cache graph data to avoid recomputation when a graph is used multiple times. It provides native support for Scala; Java APIs are also available, though they may not be as convenient.
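
GraphX itself is Scala-native, so for a Python-flavoured taste of the same idea, here is a sketch using GraphFrames, a separate DataFrame-based package that offers comparable graph-parallel functionality on Spark; the tiny vertex/edge dataset is made up.

```python
# Sketch assuming Spark and the graphframes package are installed.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
                                 ["id", "name"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")],
                              ["src", "dst"])

g = GraphFrame(vertices, edges)
ranks = g.pageRank(resetProbability=0.15, maxIter=10)  # distributed PageRank
ranks.vertices.select("id", "pagerank").show()
```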

Apache Giraph is an iterative graph processing framework built on top of Hadoop's MapReduce, based on Pregel, the graph processing system Google described in a published paper. It follows the Bulk Synchronous Parallel (BSP) model: a set of processors with fast local memory, interconnected by a communication network, computing in a sequence of 'supersteps'. In each superstep, every processor performs computation independently of the others, exchanges messages with other processors, and then observes barrier synchronization, i.e. when a process reaches that point, it waits for all other processes to reach the same barrier before the next superstep begins. Giraph jobs are essentially map-reduce jobs that can be split across multiple servers to execute in parallel, and Java is the supported programming language.
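
The BSP model is easy to imitate on a single machine. The following toy Python sketch runs PageRank as a sequence of supersteps, with message exchange and an implicit barrier between iterations; it illustrates the model, not Giraph's actual API.

```python
# Toy, single-machine imitation of the Pregel/BSP model.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {v: 1.0 / len(graph) for v in graph}

for superstep in range(10):
    inbox = {v: [] for v in graph}
    # "compute" phase: each vertex sends rank shares along its edges
    for v, out in graph.items():
        for w in out:
            inbox[w].append(rank[v] / len(out))
    # implicit barrier: all messages delivered before ranks update
    rank = {v: 0.15 / len(graph) + 0.85 * sum(msgs)
            for v, msgs in inbox.items()}

print(rank)
```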

In 2013, Facebook was able to use Giraph on top of its Hive infrastructure, whereby 200 commodity servers ran an iterative PageRank on a social network of 1 trillion edges in under 4 minutes. Also, a monthly-active-user dataset of 1 billion input vectors with 100 features was clustered into 10,000 centroids using k-means in less than 10 minutes.

However, Giraph was tuned to a considerable extent to achieve these numbers, which included i) using HiveIO to read the graph from Hive, ii) adding multithreading to the map-reduce tasks for loading, computation and storing of results, iii) optimizing JVM memory by serializing vertices and edges as byte arrays rather than Java objects, and iv) moving from a centralized aggregator architecture to a sharded aggregator model using Netty.

In 2016, Facebook also carried out a comparison between Giraph and Spark GraphX, which established that Giraph can process graphs 50x larger than GraphX, being more memory efficient and needing fewer machine-hours to process a graph of the same size. However, from a usability angle, GraphX was found to offer a SQL-like way of reading data from Hive, which allowed arbitrary column transformations, something that needs extra programming effort when using Giraph.

Graph Visualization

Most graph databases do not ship with a visualization engine; visualization is left to JavaScript libraries or specific tools. There are several JavaScript libraries, viz. D3.js, Popoto.js (built on top of D3), vis.js, sigma.js, vivagraph.js, cytoscape.js etc. Most of these libraries need to be embedded within applications and require development effort to create the right visualization. Apart from these, there are standalone tools, aimed at business users, that can connect to a graph database and interact with the stored data without any code. Some examples are GraphXR by Kineviz, yFiles by yWorks, Linkurious, Graphistry, Perspectives by Tom Sawyer and KeyLines by Cambridge Intelligence.

However, for large graphs (more than 10k vertices or edges), tools like Graphviz, Gephi, igraph, LargeVis, Graphistry etc. are used.
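
For code-driven exploration from Python, a small graph can be rendered to an interactive HTML page with the pyvis package, a wrapper around the vis.js library mentioned above; a minimal sketch:

```python
# Render a small sample graph to an interactive HTML page with pyvis.
import networkx as nx
from pyvis.network import Network

g = nx.karate_club_graph()          # a small built-in sample graph
net = Network(height="600px", width="100%")
net.from_nx(g)                      # copy nodes/edges from networkx
net.show("graph.html")              # writes an interactive HTML page
```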

The TinkerPop Framework

Apache TinkerPop™ is an open source, vendor-agnostic graph computing framework distributed under the commercial-friendly Apache 2 license. The project primarily comprises the following key components:

i) Gremlin, which is a graph traversal (query) language

ii) A set of programmable interfaces which enable providers of graph databases, processing frameworks and visualization engines to build TinkerPop-enabled systems, and allow application programmers to talk to those systems using Gremlin queries. TinkerPop enables those systems to expose their implicit graph structure within the lexicon of vertices and edges.

iii) The Gremlin Server — a front-end server that allows users to submit traversals for remote execution over a WebSocket-based binary protocol or an HTTP-based API (see the sketch after this list). It also provides support for Ganglia/Graphite/JMX monitoring metrics, and a traversal routing framework for intelligent data/traversal co-location within a distributed graph database's machine cluster.

iv) The Gremlin traversal machine — every Gremlin language variant compiles to a language-agnostic bytecode representation. That bytecode is ultimately translated to a machine-specific traversal, and it is the responsibility of the Gremlin traversal machine to execute that traversal as a real-time OLTP query, an analytic OLAP query, or both. Note that the Gremlin traversal machine is not bound to the Gremlin language: any language can take advantage of the Gremlin traversal machine by simply translating itself to Gremlin bytecode.
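
As a quick illustration of (iii), submitting a raw traversal to a Gremlin Server for remote execution might look like this from Python; a sketch assuming a server at localhost:8182.

```python
# Submit a traversal string to a Gremlin Server over its WebSocket protocol.
from gremlin_python.driver import client

c = client.Client("ws://localhost:8182/gremlin", "g")
result = c.submit("g.V().count()").all().result()
print(result)  # e.g. [42]
c.close()
```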

Most of Apache TinkerPop has been developed in Java 8, but bindings are also available for many other programming languages such as Groovy and Python.

Graph databases and visualization engines that support TinkerPop Gremlin include Amazon Neptune, Azure Cosmos DB, DataStax, JanusGraph, Neo4j, OrientDB, Linkurious, Tom Sawyer Perspectives etc. Among the processing frameworks, Spark has implemented TinkerPop's SparkGraphComputer, whereby Gremlin traversals execute directly on top of Spark.

Graph Implementation Reference Architecture

The graph database can be used as a processed data store. Data from various source systems will land in the Data Lake through the ingestion/integration layer, or be fed into a stream processing layer; data landing in the Data Lake will be fed into batch processing pipelines. Either way, after processing, the data will be added to the graph or non-graph processed datastores, depending on the data element and the target schema. The processed data stores can be viewed as a polyglot storage layer, which can loosely be termed the next-generation data warehouse.

The machine learning layer will use graph processing frameworks to fetch data from the graph or non-graph data stores and create/update models (e.g. next best recommendations), which can be deployed as APIs and consumed by applications. Offline model outputs (e.g. clustering) can be fed to downstream processing applications (e.g. campaign management).

The applications can consume machine learning models or carry out OLTP transactions using the API layer.
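
As a sketch of how that API layer might serve graph-backed recommendations, here is a minimal Flask endpoint querying Neo4j; the endpoint path, Cypher query, schema and credentials are all illustrative, not from the article.

```python
# Minimal API-layer sketch: serve recommendations straight from the graph.
from flask import Flask, jsonify
from neo4j import GraphDatabase

app = Flask(__name__)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

@app.route("/recommendations/<customer_id>")
def recommendations(customer_id):
    # "Customers who bought what you bought also bought..." (illustrative schema)
    query = (
        "MATCH (c:Customer {id: $cid})-[:BOUGHT]->(:Product)"
        "<-[:BOUGHT]-(:Customer)-[:BOUGHT]->(rec:Product) "
        "RETURN rec.name AS name LIMIT 5"
    )
    with driver.session() as session:
        names = [r["name"] for r in session.run(query, cid=customer_id)]
    return jsonify(names)
```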

The choice of graph database will depend upon the type of workload (OLTP or OLAP), the size of the graph (number of vertices & edges), concurrency, throughput and latency requirements, and other considerations like high availability & replication. It is advisable to carry out a benchmarking exercise before finalizing a data store.
