The Art of Compromise

A Deep Dive into Knowledge Graph Technologies

Finding an optimal knowledge graph solution

Kasper Piskorski
Beamery Hacking Talent

--

Photo by fabio on Unsplash

At Beamery, we are pushing the boundaries of how modern recruitment is being done. Our domain of interest can be characterised as:

  • a social domain of people, their skills, work experience and education history.
  • the domain is highly interconnected. Accessing non-local information requires multi-join patterns, which are naturally and intuitively expressed as multiple hops in a graph.
  • we have disparate data sources with varying data formats and quality. These need to be accessible as a single source of truth.
  • the domain contains billions of entities. These need to be accessible in real-time.

Therefore, we have recognised that building a “Talent Knowledge Graph” will allow us to effectively operate in our domain as well as to fully unlock the predictive power of artificial intelligence and provide unique insights from our heterogeneous data sources.

Since Google revealed its Knowledge Graph in 2012, the term has been growing rapidly in popularity. Following the adoption of knowledge graphs by other prominent tech companies, the technology is now receiving widespread attention from industry as well as from the research community. At the core of knowledge graphs lies the idea of providing a unified and intuitive abstraction of a domain by applying knowledge representation techniques to data structured in graph form. The roots of the idea can be traced back to knowledge-based and query answering systems from the 80s. The attractiveness of the approach is most often appreciated in use cases requiring integration, management and information extraction across heterogeneous data sources at scale.

For highly-interconnected and complex domains, a graph representation provides a natural and intuitive description. The connection density present in social, biological or chemical data can be effectively abstracted and accessed using graph-based data models. NoSQL graph databases, beyond the attractiveness of the graph representation, offer a model flexibility absent in traditional relational databases. Such flexibility is potentially required in scenarios where the data model is not strictly defined and evolves rapidly over time. The flip side of the extra flexibility is reduced control over the data — data integrity and validity are enforced only to a very narrow extent. Finally, by adopting the schemaless approach, the user merely postpones the definition of a database schema and pushes the data control aspect from the database to the application layer.

Four comparison pillars

The decision to build a knowledge graph opens up a range of possible techniques and tools to explore. Well-known knowledge representation approaches, such as the definition of schemas and ontologies, can be used to provide explicit semantics for arbitrary domains of interest. Different graph query languages go beyond the operators of relational algebra to include navigational operators offering the ability to search recursively for entities through paths of arbitrary length. Various frameworks for graph analytics are available and have proven to scale for standard graph operations. Finally, a range of machine learning techniques has been studied and devised to provide learning capabilities over graphs. With that in mind, we can compare various potential knowledge graph solutions. We grouped our comparison criteria into four comparison pillars:

  • semantics
  • query language
  • data ingestion
  • data retrieval

Each of the pillars may consist of more granular requirements that we will cover when describing each pillar in detail. We will compare four different solutions in view of satisfying these criteria. The solutions under consideration are Neo4j, TigerGraph, Stardog and Virtuoso — two property graphs, a quad store and a multi-model database. We will first concentrate on their different semantic offerings as well as the supported query languages, whereas the latter two pillars will be examined using a benchmarking framework.

Semantics

The rise in popularity of knowledge graphs was matched by the emergence of various modern graph database solutions. The most popular graph data models are the directed edge-labelled and the property graph models. The directed edge-labelled model defines a set of nodes and a set of directed labelled edges between them. The model offers the possibility of describing incomplete information by simply omitting an edge of interest. A standardised directed edge-labelled graph model is the Resource Description Framework (RDF) specified by the W3C. The property graph model was designed to offer additional flexibility when modelling data and more complex relations as a graph. This was achieved by generalising the edge-labelled model through the introduction of node labels and property-value pairs attachable to both nodes and edges. As a result, directed edge-labelled graphs can be translated into property graphs in a straightforward manner; with some extra remodelling steps, a lossless inverse translation is also possible.

A casual reader might conclude that graph databases are often treated as synonymous with knowledge graphs. Upon closer inspection, however, we can conclude that the key differentiating factor between a regular graph database and a knowledge graph lies in the semantics. The difference between an off-the-shelf graph database and a knowledge graph typically lies in the difference between how we understand data, information and knowledge.

Let’s say we are given a preexisting collection of nodes and edges. Some of them are annotated with numbers, some with strings, some with other data types. Even with the annotations, we will struggle to extract meaning from it. The annotations lack context, leaving us to wonder whether a node annotated with a label name corresponds to a person’s surname, general name, identifier or something else. Of course we might then look at the data itself and try to infer the meaning — we might try to interpret the data, which may confirm one of our initial assumptions. This process of interpretation is what turns data into information and later into knowledge. It is the process of assigning meaning, and it is typically interleaved with another process: elaboration. Now say we interpret some nodes as describing employee data. Additionally we find that other nodes correspond to companies and that edges between them represent hiring information. The real value and usefulness of this graph now depends strongly on how much control we have over the shape of the data and what kind of data guarantees we can give. Let’s say we expect the contractual information to describe which companies hire which people. There is nothing stopping us from taking some unrelated node, say one corresponding to a salary, and linking it to a company via an edge labelled Hires. The ability to do that is determined by how low-level our semantics are and how well they can be enforced.

The collection of nodes and edges represented using one of the graph data models can be referred to as the data graph. The transition to knowledge graph is made by introducing facilities defining explicit high-level semantics: schemas and ontologies. These can be either natively defined as a part of the data graph or imposed via an extra layer above it. Data graphs benefit from extra flexibility over the schema-first relational models. The definition of schema can be abandoned or postponed depending on requirements and circumstances. The key advantage of the data to knowledge transition is the fact that knowledge can be reasoned about.

Consequently, we have identified the following granular requirements tied to the semantics pillar. These can be grouped into two categories: data modelling, concerned directly with the data model and the available data structures, and schema definition and enforcement, describing high-level semantics and how they are enforced.

Semantics: Data modelling

  • ability to express complex relationships between entities: multi-hop, multi-join, and many-to-many, etc.
  • support for standard SQL/Mongo datatypes
  • support for element and structure annotations
  • identity disambiguation mechanisms
  • data contextualisation mechanisms

Semantics: Schema definition and enforcement

  • definition of a high-level schema with explicit semantics by using high-level abstractions such as aggregation, classification, instantiation, hierarchies and inheritance
  • definition of high-level constraints
  • providing attribute value restrictions (bounds, regexes, etc.) for ingested data
  • providing schema consistency (conformance to schema) guarantees for ingested data
  • consistency checking/validating schema

Virtuoso

Data modelling:

Virtuoso is a cross-platform RDBMS for SQL, XML and RDF data. Virtuoso treats its databases as a single logical unit and provides unified access to data, which reduces the cost of integrating data from disparate data sources. It offers full support for SPARQL and RDF, tightly integrated with its relational engine. Consequently Virtuoso combines RDBMS functionalities with those of an RDF quad store. As part of its SQL/RDF offering it allows relational data to be mapped to RDF using RDF views — virtual RDF graphs can be defined and queried using SPARQL based on relational DB content.

Schema definition and enforcement:

Virtuoso offers standard relational schema definition facilities using SQL’s DDL capabilities. It also supports XML schema and Document Type Definition (DTD).

TigerGraph

Data modelling:

TigerGraph is a native graph database following the property graph data model. It represents data in terms of vertices and edges (referred to as relationships) with attachable properties (referred to as attributes). TigerGraph defines edges between vertex pairs, where the two vertices are referred to as the source vertex and the target vertex. An edge can be either directed or undirected, where a directed edge defines its direction from the source to the target vertex. Relationships are strictly binary. TigerGraph differentiates the following attribute data types:

  • primitive: INT, UINT, FLOAT, DOUBLE, BOOL, STRING
  • advanced: STRING COMPRESS, DATETIME, FIXED_BINARY(n)
  • collections: list, set
  • user-defined n-tuples.

TigerGraph does not support annotations for composite graph elements. The single identity disambiguation mechanism provided by TigerGraph is allocation and assignment of unique UUIDs. The single contextualisation mechanism of TigerGraph is the definition of multiple graphs. These are equivalent to different databases.

Schema definition and enforcement:

TigerGraph strictly requires both vertices and edges to be typed — it follows the schema-first approach. Consequently it requires the user to design a graph schema fitting the source data prior to any data ingestion. The initially designed schema is modifiable — TigerGraph allows the user to make changes to an existing schema. The graph schema is specified by defining node and relationship types. Each node and relationship can have exactly one type. A node type is defined by specifying the type name and a list of one or more node attributes. Additionally, each node type requires the definition of a primary_id attribute whose purpose is to uniquely identify each vertex instance and to offer constant-time lookup. Its data type may be STRING, INT or UINT. The only possible attribute value restriction is that of a default value: every attribute data type has a built-in default value which can be overridden on definition. An example definition describing the Person node type reads:

CREATE VERTEX Person (
  PRIMARY_ID id UINT,
  firstName STRING,
  birthday DATETIME,
  speaks SET<STRING>
)

This definition of a node type is reminiscent of an SQL table: the node type can be interpreted as a table name, with primary_id corresponding to a primary key and attributes mapping to table fields. Relationships are defined by specifying the relationship type name and the vertex types of the source and target vertices. Similarly to node types, relationship types allow the definition of attributes. Consequently, a definition of a KNOWS relationship between different people with a specific start date would read:

CREATE DIRECTED EDGE KNOWS (FROM Person, TO Person, creationDate DATETIME)

The requirement that relationships strictly connect specific pairs of types necessitates creating multiple relationship types when we want to relate more than two node types. For example, if we want to express that people can like both comments and posts, we need two relationship types:

CREATE DIRECTED EDGE Person_LIKES_Comment (FROM Person, TO Comment)
CREATE DIRECTED EDGE Person_LIKES_Post (FROM Person, TO Post)

By design, schema definition and data insertion disallow the creation of untyped nodes. Additionally, TigerGraph strictly disallows the creation of connections between node types that do not conform to the schema. Only vertices and connections allowed by the schema can be successfully inserted into the graph. TigerGraph does not support any high-level constraints or attribute value restrictions.

Stardog

Data modelling:

Stardog adopts the RDF data model — the directed edge-labelled graph model. In this model, edges are defined in terms of subject-predicate-object triples, where subjects and objects are nodes and the predicate defines the relationship. Nodes are not defined independently; instead they are defined via triple definitions, so every node must be a member of a triple. A given node can assume the subject or object position in different triples. Consequently, an RDF graph is defined as a set of triples. RDF differentiates three types of nodes:

  • IRIs — unicode strings used for unambiguous identification of nodes and edges
  • blank nodes — nodes without user-visible identifier
  • literals — constants representing specific values: string, numbers, etc. Literals can only assume the position of an object in triples.

As literals cannot have any outgoing edges, IRIs and blank nodes are jointly referred to as resources. The supported data types for literals include common datatypes covered in XML Schema Datatypes (XSD). Custom datatype IRIs are also supported. The support of RDFS additionally allows definition of collection and container datatypes. The RDF model allows data contextualisation via its native namespacing facilities.
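
To make the triple structure concrete, here is a minimal Turtle sketch mixing an IRI object, a literal object and a blank node. The individual identifiers (:person933, :company42, :worksFor) are made up purely for illustration:

@prefix : <http://beamery.com/example/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# subject :person933 with a literal, an IRI and a blank node as objects
:person933 :firstName "Ada"^^xsd:string ;
           :worksFor  :company42 ;
           :knows     [ :firstName "Grace" ] .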

Schema definition and enforcement:

Stardog follows a fully customisable approach to schema definition and enforcement. This is a result of providing support for RDF Schema (RDFS) and the Web Ontology Language (OWL). Consequently, the user can decide how strict the structural guarantees need to be for a specific use case — for a given RDF graph it is possible to enforce strict explicit schemas in a localised fashion. RDF Schema structures the schema information as an integral part of the RDF graph. As a result, schemas can be queried and interacted with in a similar fashion to data. RDFS revolves around two main concepts:

  • classes representing categorisation of nodes with shared or similar characteristics.
  • properties which represent relations between subjects and objects.

In contrast to object-oriented models, classes are not defined in terms of the properties they have. Instead, properties are defined by specifying their domain and range — the classes allowed to participate as the subject and the object of the relationship respectively. This allows strong typing of relationships. Consequently, to define a Person class that participates in a knows relationship and is characterised by firstName and birthday properties as well as the languages it speaks, we can write:

PREFIX : <http://beamery.com/example/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
:Person a rdfs:Class .
:birthday a rdf:Property ;
rdfs:domain :Person ;
rdfs:range xsd:date .
:knows a rdf:Property ;
rdfs:domain :Person ;
rdfs:range :Person .
:firstName a rdf:Property ;
rdfs:domain :Person ;
rdfs:range xsd:string .
:speaks a rdf:Property ;
rdfs:domain :Person ;
rdfs:range xsd:string .

In terms of consistency and integrity checking, Stardog offers a wide range of options via its Integrity Constraint Validation (ICV) system. The system validates RDF data according to user-prescribed constraints. The constraints can be defined via SPARQL, OWL, the Semantic Web Rule Language (SWRL) or the Shapes Constraint Language (SHACL). The ICV system allows validation of the complete graph as well as of subgraphs defined via named graphs. Additionally it provides transactional constraint checking, causing transactions containing invalid data to fail. Stardog additionally offers an ICV explanation facility to elaborate and explain constraint violations, as well as a violation repair facility allowing for automated integrity restoration.

With OWL and SHACL support, a variety of restrictions on properties and values is supported. These include:

  • class hierarchy (subsumption) restrictions,
  • class participation restrictions — require participation of specific classes in defined properties, e.g. each Person needs to have a birthday:
:Person rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :birthday ;
owl:someValuesFrom xsd:date
] .
  • property domain and range restrictions. Examples of these were included in our small Person schema.
  • value range restrictions, e.g. to express that we are only interested in Persons born in the 2000s, using OWL we can write:
:Person rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :birthday ;
owl:someValuesFrom
[ a rdfs:Datatype ;
owl:onDatatype xsd:date ;
owl:withRestrictions ([xsd:minInclusive "2000-01-01"^^xsd:date] )
]
] .

The same restriction can be expressed in SHACL by:

:Person
a rdfs:Class, sh:NodeShape ;
sh:sparql
[ sh:select """
SELECT $this ?birthday
WHERE {
$this :birthday ?birthday .
FILTER (?birthday < "2000-01-01"^^xsd:date) .
}"""
] .
  • cardinality constraints, e.g., to express the fact that each Person must speak at least one language we can associate the Person class with the following OWL restriction:
:Person rdfs:subClassOf
[ a owl:Restriction ;
owl:onProperty :speaks ;
owl:minCardinality "1"^^xsd:nonNegativeInteger
] .
  • property constraints. If we wanted to express that people who work together know each other, we can write:
:worksWith rdfs:subPropertyOf :knows .

Constraints are composable, making it possible to define complex constraints.

Neo4j

Data modelling:

Neo4j is a schemaless native graph database adhering to the property graph data model. Neo4j allows node labels to be assigned in order to group nodes sharing similar characteristics into node sets. A given node can have zero or more labels. Edges are referred to as relationships. Edge labels are used to differentiate relationship types, and each edge must have exactly one label. Edges always have a direction. Labels can be added or removed at runtime. Consequently, the definition of complex relationships between entities is possible, although limited to binary representations.

Neo4j allows the assignment of properties in the form of name-value pairs to nodes and relationships. The supported property data types are:

  • primitive: Integer, Float, String, Boolean
  • spatial: Point
  • temporal: Date, Time, LocalTime, DateTime, LocalDateTime, Duration

Neo4j does not support storing collection types as properties, beyond homogeneous lists of primitive values. Neo4j does not support annotations for composite graph elements. The single identity disambiguation mechanism provided by Neo4j is the allocation and assignment of unique UUIDs. The single contextualisation mechanism of Neo4j is the definition of multiple graphs. These are equivalent to different databases.
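
As an illustrative sketch (the labels, property names and values below are made up and not part of Neo4j itself), creating two labelled nodes with properties and a relationship between them in Cypher could read:

CREATE (ada:Person {id: 933, firstName: 'Ada', birthday: date('1990-05-31')})
CREATE (acme:Company {name: 'Acme'})
CREATE (ada)-[:WORKS_FOR {since: date('2018-01-01')}]->(acme)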

In addition to the native property graph model, Neo4j offers RDF support via its Neosemantics plugin. The plugin allows lossless imports of RDF data into Neo4j as well as data exports from Neo4j in RDF format. Imports and exports are facilitated by a model mapping tool used to define the translation between the property graph and RDF models.

Schema definition and enforcement:

Neo4j is inherently a schemaless solution and does not offer any high-level abstractions. Graph elements are not typed; their classification is based solely on attached labels. The only consistency checking mechanism offered is that of constraints. Constraints are used to enforce property uniqueness and property existence. Property existence can be constrained for both nodes and relationships. The constraint system uses node and edge labels to refer to the objects the constraints are defined for. Additionally, constraints allow the definition and enforcement of composite (multi-property) node keys. Constraints can be added to preexisting data. Neo4j does not allow any attribute value restrictions.
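
A minimal sketch of the constraint flavours mentioned above, using the Neo4j 4.0 constraint syntax (the Person label and its properties are assumed from the earlier illustration; existence and node key constraints require the Enterprise edition):

// uniqueness constraint
CREATE CONSTRAINT person_id_unique
ON (p:Person) ASSERT p.id IS UNIQUE;

// property existence constraint (Enterprise edition)
CREATE CONSTRAINT person_name_exists
ON (p:Person) ASSERT exists(p.firstName);

// composite node key (Enterprise edition)
CREATE CONSTRAINT person_node_key
ON (p:Person) ASSERT (p.id, p.firstName) IS NODE KEY;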

With the use of the Neosemantics plugin, additional partial support for validating schemas and inference rules is offered. Validating schemas are enabled by the inclusion of partial support for SHACL. The SHACL support allows full graph validation, subgraph validation for node sets, and transactional validation. The last option effectively introduces support for a partial semantic schema — transactions not conforming to the constraints defined via shapes are rolled back. The Neosemantics support for inference rules involves five stored procedures (nodesLabelled, nodesInCategory, getRels, hasLabel, inCategory) intended to mimic label and relationship type hierarchies. The rules define traversal extensions according to prescribed node and edge labels. This requires the definition of extra quasi-type nodes and relationships between them to represent the type relations. Although this approach allows complex label hierarchies to be defined, it still exhibits only basic semantics of node and relationship labels.

Query Language

The four solutions under consideration support different query languages. Consequently these will be reviewed on the basis of the following characteristics:

  • navigational graph patterns (multi-hop, variable path length)
  • query composition and subqueries
  • query control flow
  • attribute functions and aggregations
  • subgraph retrieval
  • reasoning support
  • SQL completeness

Neo4j — Cypher

Cypher is a declarative graph query language designed to offer a convenient transition from SQL. It adopts the same clause-based syntax structure and retains the semantics of many of SQL’s operators; a number of its keywords are also borrowed from SQL. Cypher queries consist of clauses corresponding to paths in the graph. Cypher identifies nodes via their labels and prunes the search space by filtering nodes by their property content. The result of a Cypher query is a collection of tuples which can be manipulated in a similar manner to a relational table.

Navigational graph patterns: Cypher supports multi-hop patterns with variables, with the possibility of returning both endpoints as well as the edges. Cypher supports configurable variable-length path patterns via the Kleene star operator. Both paths of arbitrary length and paths with length bounds are supported.
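
For instance, a bounded variable-length pattern over the benchmark’s Person/KNOWS vocabulary might read (a sketch; the id value is arbitrary):

MATCH path = (p:Person {id: 933})-[:KNOWS*1..3]-(friend:Person)
WHERE friend.id <> 933
RETURN DISTINCT friend.firstName, length(path) AS hops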

Query composition and subqueries: Cypher supports linear query composition — results of subqueries can be used as inputs for other queries. Additionally, it offers the ability to filter query results by existential subqueries using the WHERE clause, as well as result-returning subqueries using the CALL {} syntax. Cypher supports optional matching via the OPTIONAL MATCH keyword. Query results can be combined via the UNION operator.
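
A small sketch of a result-returning CALL {} subquery (available from Neo4j 4.0), reusing the same hypothetical Person/KNOWS vocabulary:

MATCH (p:Person {id: 933})
CALL {
  WITH p
  MATCH (p)-[:KNOWS]-(friend:Person)
  RETURN count(friend) AS friendCount
}
RETURN p.firstName, friendCount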

Control flow: Cypher supports general conditional expressions via the CASE construct. It supports loops over elements of collections with the FOREACH keyword. It is possible to mimic some general iterative behaviour by using the UNWIND keyword over a predefined list as a loop iterator. No generic loop control mechanisms are available.

Attribute functions and aggregations:

Cypher offers a range of composable functions to be used in queries. These include:

  • standard mathematical functions: numeric, logarithmic and trigonometric
  • predicate functions — evaluate to a boolean depending whether a condition holds for specified configuration of elements (all, any, exists, none, single)
  • string, scalar and list functions
  • temporal and spatial functions
  • standard aggregating functions e.g. avg, count, max, min, stDev, sum
  • user-defined functions — these need to be defined in Java and loaded into the database. Afterwards they can be called similarly to the built-in functions. Cypher differentiates scalar and aggregating user-defined functions.

Subgraph Retrieval: Cypher does not have native support for subgraph retrieval. Subgraphs can be defined and retrieved by using a specialised procedure.
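
For example, the APOC procedure library includes path-expansion procedures that can materialise a subgraph around a starting node; a hedged sketch (the exact configuration keys depend on the APOC version installed):

MATCH (p:Person {id: 933})
CALL apoc.path.subgraphAll(p, {relationshipFilter: 'KNOWS', maxLevel: 2})
YIELD nodes, relationships
RETURN nodes, relationships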

Reasoning support: None

SQL completeness: SQL-complete

TigerGraph — GSQL

GSQL is a Turing-complete query language geared towards efficiently expressing graph analytics. It adopts an SQL-like syntax to facilitate transitioning from relational databases. The fact that both nodes and edges are strictly typed allows for optimised storage formats and query execution strategies. Queries in GSQL are approached in a similar fashion to SQL stored procedures: a GSQL query typically involves multiple SELECT statements and imperative instructions such as branches and loops.

Navigational graph patterns: GSQL supports multi-hop patterns with variables, with the possibility of returning both endpoints. Edges can only be retrieved if the source vertex and edge type are specified. GSQL supports configurable variable-length path patterns via the Kleene star operator. Both paths of arbitrary length and paths with length bounds are supported. Variable-length path queries have shortest-path semantics.

Query composition and subqueries: GSQL differentiates interpreted and compiled queries. Query compilation allows performance optimisation. Compiled queries are named and parameterised akin to SQL procedures. Linear query composition is supported. GSQL supports treating previously installed queries as named, parameterised calls and using them as subqueries.

Control flow: GSQL supports general if-then-else style branching (including SQL-like case switches) as well as while loops. Both constructs are subject to boolean conditions involving global accumulators, global variables and query parameters. GSQL allows the user to compose the query flow freely, which is particularly beneficial for defining iterative computations. Full support for control primitives differentiates GSQL from the other query languages and renders it Turing-complete.

Attribute functions and aggregations:

GSQL supports standard mathematical operators as well as boolean and bit operators. It offers standard string manipulation and aggregation functions (count, sum, min, max, avg). Finally, GSQL allows definition of general functions of attribute values within queries.

GSQL introduces the concept of accumulators associated with paths. As a path is traversed, the associated data can be collected and aggregated into accumulators according to diverse grouping criteria. GSQL allows intermediate computation results to be stored on nodes, enabling general iterative processing and straightforward execution of MPP algorithms. The intermediate results can be stored either as simple (scalar, collections of scalars) or composite datatypes — tuples of simple datatypes.
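
As a rough sketch (assuming the Person vertex and KNOWS edge types defined earlier and a graph named social; not taken from the benchmark implementation), a two-hop traversal that counts, per reached vertex, the number of two-hop paths from the seed could read:

CREATE QUERY friends_of_friends(VERTEX<Person> seed) FOR GRAPH social {
  SumAccum<INT> @paths;   // per-vertex accumulator collecting path counts
  Start = {seed};
  Friends = SELECT t FROM Start:s -(KNOWS:e)-> Person:t;
  FoF = SELECT t2
        FROM Friends:f -(KNOWS:e2)-> Person:t2
        WHERE t2 != seed
        ACCUM t2.@paths += 1;   // one increment per traversed edge
  PRINT FoF;
}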

Subgraph Retrieval: Not supported.

Reasoning support: None

SQL completeness: SQL- and Turing-complete

Stardog and Virtuoso — SPARQL

SPARQL is a declarative query language designed for RDF by the W3C. SPARQL allows interaction with both native RDF data and data virtualised as RDF. It supports conjunctive and disjunctive patterns, negation, aggregations and subqueries, as well as the ability to constrain the search space to a specific RDF subgraph. SPARQL queries can return both result sets and RDF graphs.

Navigational graph patterns: multi-hop patterns are straightforwardly expressed as triple chains. Each of the three triple constituents is referable and retrievable. Variable path lengths are supported via property paths, and path occurrence bounds are specifiable. Additionally, path inverses, sequences and alternatives are supported.
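
For example, a two-hop friends-of-friends pattern can be phrased with a property path sequence, reusing the example namespace from the earlier schema snippet (:person933 is a hypothetical IRI):

PREFIX : <http://beamery.com/example/>

SELECT DISTINCT ?friendName
WHERE {
  :person933 :knows/:knows ?friend .       # two-hop path via a path sequence
  ?friend :firstName ?friendName .
  FILTER (?friend != :person933)
}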

Query composition and subqueries: Linear query composition for SELECT queries is supported. SPARQL offers full support for nested SELECT queries. Inner queries are evaluated first and their result variables can be used in the parent SELECT query. To speed up execution, inner queries can have a LIMIT applied to them. Disjunctive queries can be defined via the UNION operator. Defining predicate alternatives or assigning multiple values to a variable via the VALUES operator is also possible. SPARQL supports matching optional patterns with the OPTIONAL keyword. Optional results can have defaults assigned to them using the COALESCE construct.
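
A minimal nested SELECT sketch: the inner query computes a friend count per person, which the outer query then filters and orders (again using the hypothetical example vocabulary):

PREFIX : <http://beamery.com/example/>

SELECT ?person ?friendCount
WHERE {
  {
    # inner query: aggregate per person
    SELECT ?person (COUNT(?friend) AS ?friendCount)
    WHERE { ?person :knows ?friend }
    GROUP BY ?person
  }
  FILTER (?friendCount > 10)
}
ORDER BY DESC(?friendCount)
LIMIT 20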

Control flow: SPARQL supports an IF keyword as a three-argument functional form. It has no support for loops.

Attribute functions and aggregations:

Both Stardog and Virtuoso support all functions from the SPARQL 1.1 specification: string functions, temporal and geospatial functions, scalar functions of variables and numerical functions. Additionally Stardog has added support for some functions from XPath and SWRL. On top of that Stardog’s implementation offers extra support for mathematical (logarithmic, trigonometric and others) and statistical (standard aggregates) functions.

Subgraph Retrieval:

CONSTRUCT queries can be used to retrieve both existing subgraphs and their derivatives in the form of a list of triples.
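
For instance, a CONSTRUCT query deriving a simplified co-worker subgraph could read (the :worksFor and :colleagueOf terms are hypothetical):

PREFIX : <http://beamery.com/example/>

CONSTRUCT { ?a :colleagueOf ?b }
WHERE {
  ?a :worksFor ?company .
  ?b :worksFor ?company .
  FILTER (?a != ?b)
}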

Stardog Reasoning support:

Stardog offers reasoning capabilities with semantics defined by OWL 2. Stardog’s reasoning utilises a query rewriting approach which allows for lazy execution and late binding. The reasoning is optional and can be turned off. The active reasoning axioms are controlled by specifying reasoning type. Stardog offers multiple reasoning types:

  • SL, covers all OWL 2 profiles plus Horn-like rules defined via SWRL,
  • RDFS, axioms allowed in RDF Schema: class and property hierarchies, domains and ranges
  • QL, RL, EL, covers OWL 2 QL, RL, and EL axioms respectively
  • DL, corresponds to the OWL 2 DL profile. Requires more preprocessing than the other profiles. Requires both the schema and the data to be pulled into memory, which limits its use on larger datasets. Additionally it requires the DB to be consistent before performing any reasoning jobs.

Axioms falling outside of the specified reasoning type are ignored. The query rewriting approach involves rewriting the user input query using available axioms and rules into an extended query which is then executed. Stardog does not guarantee answer completeness in the case of recursive rules.

Stardog supports SWRL rule reasoning with Horn-like clauses. It additionally offers an extension to create non-range restricted rules — rules creating fresh instances. Stardog allows definition of rules with multiple head atoms. These are normalised and decomposed into several rules each containing a single head atom. Stardog additionally offers a reasoning explanation facility by providing the user with proof trees.

Virtuoso reasoning support:

Virtuoso offers limited RDFS and OWL reasoning capabilities. The supported RDFS rules are rdfs:subClassOf and rdfs:subPropertyOf. In terms of OWL support, owl:sameAs is interpreted for arbitrary subjects and objects only if explicitly enabled by a pragma in the query. Additionally, owl:sameAs, owl:equivalentClass and owl:equivalentProperty are recognised when determining subclass or subproperty relations. If two classes are equivalent, they share all instances, subclasses and superclasses directly or indirectly stated in the data for either class. Other RDFS or OWL axioms are ignored.

For rule reasoning, Virtuoso supports SPIN. SPIN allows rules to be defined via CONSTRUCT queries: additional triples are inferred based on the content of the WHERE clause. For each variable mapping in the WHERE clause, a triple from the CONSTRUCT clause is instantiated and added as an inferred triple to the underlying model. SPIN support is only available in Virtuoso 8 (Commercial Edition).

SQL completeness: SQL-complete

LDBC Social Network Benchmark

The next two pillars of our solution comparison will be evaluated on the basis of a specific benchmarking framework. The chosen framework is the one provided by the LDBC Social Network Benchmark (LDBC SNB) [1]. The benchmark was specifically designed to evaluate and stress-test graph-based data management solutions. It tries to mimic a social network — a domain that is both a popular use case and one that is easily and naturally understood.

By offering a platform-independent framework, the LDBC benchmark provides quantitative measures to compare different products and technologies and serves as an aid to solution selection. By defining a wide range of scales and query types, it allows coverage and analysis of different choke points. First, as it provides a scalable dataset, it allows the data loading aspect to be stress-tested. Second, the benchmark provides an extensive query specification allowing for the evaluation of expressivity and of the different language constructs of the graph query languages of interest. Finally, with the defined query types and workloads, it allows detailed analysis of query execution choke points and evaluation of the implementation efficiency of specific query traversal engines.

The benchmark defines three workloads: Interactive Short (IS), Interactive Complex (IC) and Business Intelligence (BI). The IS workload covers queries fetching strictly local information — the corresponding traversal paths contain at most 2 hops. The queries in the IC workload are designed to traverse paths going beyond 2 hops; additionally, instead of simply returning tuples, they compute simple aggregates. The IC queries typically contain a strictly defined single origin. The last group of BI queries covers scenarios requiring access to larger parts of the graph. This is achieved by generalising the origins of the IC workload — the resulting traversals have multiple sources in the graph. The BI queries additionally involve more complex aggregates, rendering them the most compute-intensive of the three.

The scale of the benchmark is customisable via a single parameter referred to as the Scale Factor. The Scale Factor quantifies the size of the dataset generated to run the benchmark, measured in gigabytes of CSV data. The schema of the LDBC SNB benchmark can be readily visualised; the full schema is presented in the diagram below (adopted directly from the original specification). The schema describes a social network of people, their activities and their interactions via messages on forums, over a period of time. It models the domain via different entities and the relationships between them.

LDBC SNB schema — adopted from the official specification.

While structurally the schema corresponds to a graph, the benchmark does not impose any particular physical representations of the data.

For the purposes of this comparison, we will focus on the Interactive Complex workload — it offers a sweet spot between the locality of the Interactive Short queries and the complexity and globality of the Business Intelligence queries. The IC workload also allows typical graph access patterns to be tested and offers a query complexity level that is easy to follow and reason about. For testing purposes we used the following software versions:

  • Neo4j 4.0.4 Community Edition
  • TigerGraph 2.5.2 Developer Edition
  • Stardog 7.3.0
  • Virtuoso 7.2.5 Open Source

Data ingestion

Single machine performance

We will begin comparing data ingestion capabilities by considering single machine performance when loading LDBC SNB datasets of different sizes. The single machine loading was performed on Google Cloud machines with the following configuration:

The datasets were generated in CSV format for Neo4j and TigerGraph, and in TTL format for Stardog and Virtuoso. The TTL datasets have an increased storage size, by about a factor of 3. The loading was performed using the following methods:

  • Neo4j — using the batch import tool. The tool is designed to provide a convenient way of importing bulk data in different formats in an offline mode. It allows elements with the same label to be loaded in a single command. Data fields are interpreted by defining suitable file headers following a strict format. For efficient retrieval of data, an additional post-processing step of index assignment is required. The created indexes were chosen to be Person(id), Message(id), Post(id), Comment(id), Forum(id), Organisation(id), Place(name), Tag(name), and TagClass(name).
  • TigerGraph — using a loading job. Loading jobs are defined as declarative statements in GSQL and allow parallel loading pipelines to be defined by simply mapping input columns to element types and their properties. Loading jobs do not require any extra pre- or post-processing steps — indexing and id recognition are performed automatically as part of the loading process.
  • Stardog — using bulk loading with compressed and partitioned input data. The load was performed using the stardog-admin db create command which in contrast to the stardog db add command is not transactional. The command creates a new database, populates and indexes the data and only then brings the database online.
  • Virtuoso — using the RDF bulk loader. The bulk loader is non-transactional and by default disables transaction logging, checkpointing and scheduling. The input TTL files were partitioned and gzipped. Multiple loaders were used, following the rule of thumb of number of loaders = cores / 2.5. The memory settings were tuned according to the Virtuoso guidelines.

It is worth mentioning that bulk loading offers top performance when loading considerable volumes of data; however, it is not necessarily the best loading approach for all use cases. The bulk loading approach is suitable when the sources are well-defined and available — it is perfect for initial data imports. For cases where the data is streamed or rapidly changing, bulk loading is inadequate. The alternative transactional approaches offer worse performance while being suitable for performing more fine-grained and incremental data updates.

The single machine loading times are depicted in the figure below. The Neo4j loading times include the post-processing indexing time. The Stardog loading times include the time taken for post-processing jobs: indexing and statistics counting. We can clearly see that for all tested datasets Neo4j offers the fastest loading times. The smallest datasets, SF1 and SF10, are loaded within minutes by Neo4j and TigerGraph: they require 36 and 215 seconds for Neo4j and 66 and 292 seconds for TigerGraph respectively. Stardog loads the SF1 and SF10 datasets in 2.5 minutes and 40 minutes 44 seconds respectively, whereas for the same datasets Virtuoso requires 4 and 20 minutes. To load the SF100 dataset Neo4j needs 38 minutes, TigerGraph 1 h 20 min, Stardog 6 h 41 min and Virtuoso 3 h 29 min. Finally, Neo4j loads the SF300 dataset in under 2 hours (1 h 50 min), whereas TigerGraph needs almost 4 hours (3 h 55 min), Stardog almost 19 hours (18 h 43 min) and Virtuoso just over 9 hours (9 h 4 min). Neo4j displays the fastest loading times even though its loading requires an extra post-processing indexing step. Compared with the results presented by Rusu and Huang [3], the indexing procedure appears to have improved between version 3.5.0 and the tested 4.0.4. Excluding indexing and statistics counting, Stardog was found to require around 35 minutes to load 1 billion triples (@SF300), whereas taking the full loading process into account it averaged 78 minutes. Virtuoso was found to require on average about 38 minutes to fully load a billion triples (@SF300).

Bulk loading times for different LDBC SNB dataset sizes.

Clustering options

Neo4j

Neo4j’s primary clustering offering is Causal Clustering. Causal Clustering provides fault tolerance and read scalability by introducing Core and Replica servers. Core servers are responsible for safeguarding data and replicating it across Replica servers. Transactions are replicated using the Raft protocol and a majority of servers needs to acknowledge a transaction before it is committed. The Replica servers function as graph caches capable of executing read-only queries — they allow read workloads to be scaled out.

In version 4.0, Neo4j introduced its Enterprise Fabric feature. Fabric can be understood as a coordinating system that enables unified access across different graphs. Consequently it enables:

  • data federation — multiple disjoint graphs can be distributed across different machines and accessed as a single logical DB,
  • data sharding — a given graph can be partitioned across separate machines and accessed as a single graph

The task of determining the best way of partitioning the data rests with the user. This is in general a non-trivial task, especially for highly-interconnected data. The most efficient approach is to define partitions as disjoint subgraphs; in practical applications, however, this is not always possible. Once the Fabric configuration is specified, different graphs can be queried simultaneously using Cypher extensions. Fabric can be used in conjunction with a Causal Cluster. Fabric guarantees ACID compliance only within a single graph.

TigerGraph

TigerGraph allows instances to be deployed in a cluster configuration. The cluster configuration enables database partitioning and distribution across multiple server nodes in a local network and is only available in the Enterprise edition. The cluster can be composed of physical machines or provisioned from a cloud service. In terms of cloud hosting, TigerGraph offers solutions on Google Cloud, AWS and Azure, as well as its own TigerGraph Cloud. TigerGraph offers HA deployments where active-active replication is used to provide continuous service when one or more servers are not available.

In terms of database partitioning and distribution, TigerGraph offers automatic as well as application-specific and mixed partitioning strategies. In the automatic mode all out-edges from a given node are stored on the same server and hash indices are used to determine the data location. TigerGraph additionally allows multiple graph engines to be run as an active-active network. In such a network, each graph engine can store the same graph partitioned according to a different partitioning strategy. In that way different application use cases can benefit from appropriately tailored partitionings, with queries being routed to the appropriate graph engines.

Stardog

Stardog offers the possibility of clustered HA deployments. A Stardog cluster consists of Stardog servers and a collection of ZooKeeper instances running together. The Stardog cluster operates in a master-slave fashion where the master node, referred to as the Coordinator, orchestrates transactions and maintains cluster consistency by expelling any nodes that fail an operation. An expelled node must sync with a current member to rejoin the cluster. Stardog does not support data partitioning; instead it offers master-master HA replication with immediate consistency. The orchestration of the cluster is left to the user, with the exception of the Stardog Graviton service offered on AWS. Graviton is an open-source tool that allows automated deployments of fully functional Stardog clusters. The tool uses Terraform and Packer to provide configurable deployments of multi-node Stardog clusters where each node is backed by multiple EBS volumes. The tool spawns configurable ZooKeeper ensembles as well.

Virtuoso

Virtuoso offers two cluster modes that can be used in conjunction with each other: the Elastic Cluster and the Replication Cluster. Both are available only in the Commercial editions. Elastic clustering serves as the main mechanism for scaling out by building Virtuoso clusters. It allows the data to be sharded across a large number of partitions which can be distributed and migrated between servers. New partitions or shards can be relocated or added dynamically. The Replication Cluster offers data redundancy and disaster recovery with multiple possible topologies (star, chain, bi-directional). Additionally it provides some degree of load balancing and high availability, limited by the dependence of subscriber nodes on publisher nodes. Virtuoso clusters are fully transactional and, similarly to single-server deployments, support four different isolation levels. Transactions can be committed using single- or two-phase commits.

Data retrieval

The data retrieval performance was assessed on the basis of performance of the Interactive Complex queries defined in the LDBC SNB benchmark. Interactive queries represent read-only multi-hop queries that occur naturally in many workflows considering social interactions. They are localised by design — each of them contains a well-defined starting point from which a multistep traversal proceeds. Additionally they compute simple aggregates. The queries reflect typical social network functionalities and were designed to stress-test different query optimisation choke points. The choke points can be grouped into multiple categories:

  • aggregation performance — covers different configurations of aggregations including low- and high-cardinality aggregations, top-k result aggregations, as well as the ability to exploit particular orderings.
  • join performance — even with a single starting point, query patterns can in general be traversed in multiple different ways. This happens especially when the query pattern departs from a simple chain structure and introduces branching. The difference in execution times between different orderings of traversal steps can span orders of magnitude. An optimal query plan minimises join sizes by effectively exploiting result pruning and appropriately applying hash and indexed joins.
  • data access locality — covers scattered index access, ability to detect data correlations and storage-implemented dimensional splitting.
  • expression evaluation — includes exploiting techniques for expression simplification and optimisation including string and temporal operations.
  • sub-query performance — tests the performance of composed queries, including result reuse and query flattening.
  • complex path handling — designed to test handling of variable and transitive paths.

The queries together with their specific choke points are described in the query table below.

LDBC SNB query specification.

The query implementations were adopted from the official LDBC implementation repository [2] as well as from the work of Rusu and Huang [3], and updated to the latest syntax versions where necessary. Query 14 does not have a native SPARQL implementation. Query 13 is implementable in SPARQL using Stardog’s path extensions. The queries were executed in numerical sequence. In all cases, for each query, 5 substitution parameters were applied. As a result, each query type was executed 5 times with different parameters. The substitution parameters were generated using the benchmark facilities. The resulting query execution times were averaged per query type. No extra pre-processing procedures such as initial runs or database warm-ups were applied. The query execution results for the SF1, SF10, SF100 and SF300 datasets for all the vendors are summarised in the plots below. For the SF1 and SF10 datasets query timeouts were set to 30 minutes. For the SF100 dataset the query timeout was set to 60 minutes. Finally, the timeout for the SF300 dataset was set to 90 minutes.

Query execution times for different LDBC SNB dataset sizes: SF1, SF10, SF100, SF300.

In terms of execution times, TigerGraph was found to offer superior results across all dataset sizes. The only queries for which TigerGraph didn’t offer the best performance were IC1 and IC14 @SF100 and IC1 @SF300. This performance superiority comes with a caveat though. When querying, TigerGraph tries to load the entire dataset into memory. As the queries were run sequentially and without warm-ups, we can see that in TigerGraph’s case the first query always takes a long time to compute; we can assume that this corresponds to the graph being loaded into memory. After the graph is loaded, the queries perform with second and sub-second latencies, with the exception of query 14. TigerGraph offers the possibility of restricting the memory used, however this option is only available in the Enterprise Edition. Please note that the machine specifications used for the benchmark allowed the entire graph to fit into memory.

Next in line in terms of query performance come Neo4j and Virtuoso. Both were found to time out on two queries among all tested configurations: query 14 @SF10 and query 6 @SF300 for Neo4j, and query 4 and query 10 @SF100 for Virtuoso. Neo4j proved faster at the lower scale factors SF1 and SF10 — it was found to offer better performance in 8 and 9 query configurations respectively. This tendency appears to be reversed at the higher scale factors of SF100 and SF300 — Virtuoso’s query execution was found to be consistently faster for queries {2, 3, 5, 6, 9, 11, 12} whereas Neo4j was significantly faster for queries {4, 7, 10}.

Stardog was found to offer the slowest execution times among the compared solutions. It was found to time out in 9 different query configurations: queries {5, 9} @SF10, queries {1, 3, 9} @SF100 and queries {1, 3, 9, 10} @SF300. Additionally, queries {2, 4, 5} were found to exhibit lacklustre performance at higher scale factors. The low performance of queries {1, 2, 4, 5, 9} can be attributed to join performance. Specifically, queries {1, 9} suggest problems with join order optimisation, whereas the performance problems for queries {2, 4, 5} could be indicative of issues with join type selection.

Summary

We have analysed and compared four different solutions in view of their applicability and usability for developing knowledge graphs. We compared the solutions according to their offerings in terms of data model, schema support, available query languages, and data ingestion and retrieval. The solutions under consideration displayed different strong points with respect to our requirements. In terms of semantics, Stardog offers the widest spectrum of facilities, including schema definition and enforcement via its RDFS and OWL support, as well as data validation options offered by supporting the definition and validation of shapes using SHACL. Additionally, it offers the ability to define general rules in the form of Horn clauses and query-time reasoning. When comparing query languages, we ascertained that SPARQL, supported by Stardog and Virtuoso, and Cypher, supported by Neo4j, exhibit comparable degrees of expressivity — both are SQL-complete, with the differentiating factor being the reasoning support offered by Stardog and, to an extent, Virtuoso. In comparison, TigerGraph’s query language GSQL was found to offer extended expressivity by enabling general query control flow constructs (making it Turing-complete) and allowing native implementation of complex MPP-based analytics algorithms in queries.

We assessed data ingestion and retrieval capabilities based on an established framework — the LDBC Social Network Benchmark. The benchmark is customisable in size and describes a common social domain. In terms of data ingestion performance, we examined bulk loading throughput and found that Neo4j offers the best non-transactional loading performance — it required just under 2 hours to load the biggest dataset tested (300GB). With respect to loading performance, TigerGraph came second, with Virtuoso and Stardog following, requiring approximately 4, 9 and 19 hours respectively to fully load the biggest dataset. To compare query retrieval, we used the Interactive Complex query set from the LDBC benchmark, covering a typical query workload for a social domain. In terms of query execution, TigerGraph was found to offer the best performance. This comes with the caveat that, in order to shine, it needs to load the full dataset into memory, which might be problematic for bigger datasets. Neo4j and Virtuoso were found to offer comparable query performance, with Neo4j being better at smaller dataset sizes and Virtuoso at bigger ones. In comparison, Stardog typically took the longest to execute the tested queries, with the largest number of queries timing out. We attributed this to problems with join optimisation.

In summary, based on our comparison, the choice of solution can be seen as a tradeoff between semantics and performance. The specific choice is then determined by the specific use case and factors such as the required data access patterns, data volume and granularity, data dynamics, heterogeneity of sources or required analytics. Consequently, cases not requiring absolute performance (data of limited quality, strongly partitioned, possible to load incrementally, or of limited size and velocity) will favour semantics-focused solutions. On the other hand, high data volumes and velocity, as well as the need for online analytics, will encourage performance-focused options.

Alternative Solutions

Our comparison was scoped to four different graph solutions. The vendors were chosen so that the comparison could provide a good picture of the solution landscape, covering common modelling approaches, popular standards and typical problems faced. A number of interesting alternative solutions exist, however.

Grakn is an open-source solution following the hypergraph model — extending the graph model to include edges connecting more than two entities. Grakn graphs can be accessed using its query language, Graql. Graql is a declarative query language allowing the definition of semantic and validating schemas, where vocabularies are defined in terms of the extended entity-relationship model. The language is equipped with a Datalog-inspired rule-based reasoning engine where inferences are made at query time.

DGraph is an open-source graph database following the property graph model. DGraph was designed to be run in distributed environments and at scale. DGraph uses a derivative of GraphQL as its query language — referred to as GraphQL+-. DGraph offers an optional type system, but is schemaless otherwise.

JanusGraph is an open-source property graph database and a successor of Titan. JanusGraph is a TinkerPop-enabled database that uses Gremlin for querying. Gremlin queries consist of a series of steps through which objects flow and evolve, with information being gathered along the way. Gremlin has operational semantics resembling a functional programming style and is Turing-complete.

Amazon Neptune is a schemaless cloud-native graph solution offered on AWS. Amazon Neptune supports both property graphs and RDF graphs, offering Gremlin and SPARQL APIs to access them respectively.

References

  1. LDBC Social Network Benchmark
  2. Official LDBC SNB implementations
  3. Rusu, F., & Huang, Z. (2019). In-Depth Benchmarking of Graph Database Systems with the Linked Data Benchmark Council (LDBC) Social Network Benchmark (SNB)
