Comparison of Linked Data Triplestores: Developing the Methodology

Inspecting Load and Query Times across DBPedia and Yago

Developers in small to medium scale companies are often asked to test software and decide what’s “best”. I have worked with RDF for a few years now and thought that comparing triplestores would be a relatively trivial task. I was wrong so here is what I have learned so far.

TL;DR - My original comparison had an imperfect methodology so I have developed this based on the community feedback. My queries now bias the results so I will next create data and query generators.

Contents

Introduction
Methodology -
What I am doing differently
Triplestores -
Which triplestores I tested.
Loading -
How fast does each triplestore load the data?
Queries -
Query Times (and how my queries bias these)
Next Steps -
Developing a realistic Benchmark
Conclusion
Appendix -
Versions, loading and query methods, etc…

Introduction

Over the past few months I have created a small RDF dataset and some SPARQL queries to introduce people to linked data. In December I tied these together to compare some of the existing triplestores (you can read that here). I was surprised by the amount of attention this article got and I received some really great feedback and advice from the community.

Based on this feedback, I realised that the dataset I created was simply too small to really compare these systems properly as time differences were often just a few milliseconds. Additionally, I did not run warm-up queries which proved to effect results significantly in some cases.

Methodology

I have therefore developed my methodology and run a second comparison to see how these systems perform on a larger scale (not huge due to current hardware restrictions).

I have increased the number of triples to 245,197,165 which is significantly more than the 1,781,625 triples that the original comparison was run on.

I performed three warm-up runs and then ran ten hot runs and chart the average time of those ten.

The machine I used has 32Gb Memory, 8 logical cores and was running Centos 7. I used each system one at a time so they did not interfere with each other.

I used the CLI to load and query the data in all systems so that there can be no possibility that the UI effects the time.

I split the RDF into many gzipped files containing 100k triples each. This improves loading times as the process can be optimised across cores.

If you would like to recreate this experiment, you can find my queries, results and instructions on how to get the data here.

Triplestores

In this comparison I evaluated five triplestores. These were (in alphabetical order) AnzoGraph, Blazegraph, GraphDB, Stardog and Virtuoso.

I have listed the versions, query and load methods in the appendix of this article.

Loading

The first thing I did when evaluating each triplestore was of course load the data. Three distinct categories emerged: hours, 10’s of minutes and minutes.

In each case I loaded all the data with all of the gzipped .ttl files containing 100k triples each.

It is also important to note that loading time can be optimised in each case so these are not the fastest they can load, just the default. If you are a deciding for a business, the vendors are more than happy to help you optimise for your data structure.

Blazegraph and GraphDB load this dataset in roughly 8 hours. Stardog and Virtuoso load this in the 30 to 45 minute range but AnzoGraph loads the exact same dataset in just 3 minutes!

Why these three buckets though? Blazegraph, GraphDB and Stardog are all Java based so how does Stardog load the data so much faster (with the default settings)? This is likely due to differences in garbage collection, Stardog probably manages this more by default than the other two.

Virtuoso is written in C which doesn’t manage memory and is therefore easier to load faster than systems built in Java. AnzoGraph is developed in C/C++ so why is it so much faster?

The first reason is that it is simply newer and therefore a little more up to date. The second and more important reason is that they optimise highly for very fast loading speed as they are an OLAP database.

Initial loading speed is sometimes extremely important and sometimes relatively insignificant depending on your use case.

If you are setting up a pipeline that requires one initial big loading job to spin up a live system, that one loading time is insignificant in the long run. Basically, a loading time of minutes or hours is of little relevance to kick off a system that will run for weeks or years.

However, if you want to perform deep analysis across all of your data quickly, this loading time becomes very important. Maybe you suspect a security flaw and need to scrutinise huge amounts of your data to find it… Alternatively, you may be running your analysis on AWS as you don’t have the in-house resources to perform such a large scale investigation. In both of these scenarios, time to load your data is crucial and speed saves you money.

Queries

In this section I will analyse the results of each query and discuss why the time differences exist. As I mentioned, this article is more about why there are differences and how to avoid the causes of these differences to create a fair benchmark in the future.

This is not a speed comparison but an analysis of problems to avoid when creating a benchmark (which I am working on).

I briefly go over each query but they can be found here.

Query 1:

This query is very simple but highlights a number of issues. It simply counts the number of triples in the graph.

SELECT (COUNT(*) AS ?triples)
WHERE {
?s ?p ?o .
}

To understand the problems, let’s first take a look at the results:

You can see that we again have significant differences in times (Red bar extends so far that the others were unreadable so cut vertical axis).

The first problem with this query is that it will never be run in production as it provides no valuable information. Linked data is useful to analyse relationships and grab information for interfaces, etc… not to count the number of triples.

GraphDB, likely for this reason, has not optimised for this query at all. An additional reason for this is that they have tried many optimisations to make counting fast; essentially counting based on (specific) indices, without iterating bindings/solutions. Many of those optimisations show great performance on specific queries, but are slow or return incorrect results on real queries.

AnzoGraph equally completes an actual ‘count’ of each triple every time this query is run but the difference is likely a Java vs C difference again (or they have optimised slightly for this query).

Virtuoso is interesting as it is built upon a relational database and therefore keeps a record of the number of triples in the database at all times. It can therefore translate this query to look up that record and not actually ‘count’ like the last two.

Stardog takes another approach which is to run an index to help them avoid counting at all.

Blazegraph perhaps take this further which raises another problem with this query (in fact this is a problem with all of my queries). They possibly cache the result from the warm-up runs and display that on request.

A major problem is that I run the EXACT same queries repeatedly. After the first run, the result can simply be cached and recalled. This mixed with the need for warm-up runs creates an unrealistic test.

In production, queries are usually similar but with different entities within. For example, if you click on a person in an interface to bring up a detailed page about them, the information needed is always the same. The query is therefore the same apart from the person entity (the person you click on).

To combat this, I will make sure to have at least one randomly generated seed in each of my query templates.

Query 2:

This query, grabbed from this paper, returns a list of 1000 settlement names which have airports with identification numbers.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000

This is a little more realistic when compared to query 1 but again has the problem that each run sends the exact same query.

In addition, a new issue becomes clear.

Once again, I have chopped the vertical axis so that the results can be shown clearly (and labelled at the base).

The interesting thing here is the fact that all of the triplestores return exactly the same 1,000 labels apart from one - AnzoGraph. This is almost certainly the cause of the time difference as they return a different 1,000 people each time the query is run.

This is possibly by design so that limits do not skew analytical results. AnzoGraph is the only OLAP database in this comparison so they focus on deep analytics. They therefore would not want limits to return the same results every time, potentially missing something important.

Another important point regarding this query is that we have a LIMIT but no ORDER BY which is extremely unusual in real usage. You don’t tend to want 100 random movies, for example, but the 100 highest rated movies.

On testing this, adding an ORDER BY did increase the response times. This difference then extends into query 3…

Query 3:

This query nests query 2 to grab information about the 1,000 settlements returned above.

PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
{?v2 a dbo:Settlement;
rdfs:label ?v.
?v6 a dbo:Airport.}
{?v6 dbo:city ?v2.}
UNION
{?v6 dbo:location ?v2.}
{?v6 dbp:iata ?v5.}
UNION
{?v6 dbo:iataLocationIdentifier ?v5.}
OPTIONAL {?v6 foaf:homepage ?v7.}
OPTIONAL {?v6 dbp:nativename ?v8.}
{In the contract it might be important to note that my legal first name is John
SELECT DISTINCT ?v WHERE {
{ ?v2 a dbo:Settlement ;
rdfs:label ?v .
?v6 a dbo:Airport . }
{ ?v6 dbo:city ?v2 . }
UNION
{ ?v6 dbo:location ?v2 . }
{ ?v6 dbp:iata ?v5 . }
UNION
{ ?v6 dbo:iataLocationIdentifier ?v5 . }
OPTIONAL { ?v6 foaf:homepage ?v7 . }
OPTIONAL { ?v6 dbp:nativename ?v8 . }
} LIMIT 1000
}}

As you can imagine, there is a very similar pattern between query 2 and query 3 results.

Remember that each run of this query asks for exactly the same information in each system except for AnzoGraph, which is different every time.

As with all of the other queries, returning the exact same results each run is problematic. Not only is it unrealistic but it is impossible to make a distinction between fast querying and smart caching. It is not bad to cache, it is smart to do for fast response times. The problem is the fact that this type of caching is unlikely to be needed in production.

A nice note to make is that, unlike the others, AnzoGraph is retrieving information about a different 1,000 settlements each run and only takes an additional 300ms to do this. Whether this is impressive or not cannot be known from this experiment.

If caching an answer is possible for some systems and not others, the results can not be fairly compared. This is of course a problem if developing a benchmark.

Again however, randomly generated seeds would solve this.

Query 4:

To gauge the speed of each system’s mathematical functionality, I created a nonsensical query that uses many of these (now, sum, avg, ceil, rand, etc…).

The fact that this is nonsensical is not entirely a problem in this case. The fact that the query is exactly the same each run is however (as always).

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT (ROUND(?x/?y) AS ?result) WHERE {
{SELECT (CEIL(?a + ?b) AS ?x) WHERE {
{SELECT (AVG(?abslat) AS ?a) WHERE {
?s1 geo:lat ?lat .
BIND(ABS(?lat) AS ?abslat)
}}
{SELECT (SUM(?rv) AS ?b) WHERE {
?s2 dbo:volume ?volume .
BIND((RAND() * ?volume) AS ?rv)
}}
}}

{SELECT ((FLOOR(?c + ?d)) AS ?y) WHERE {
{SELECT ?c WHERE {
BIND(MINUTES(NOW()) AS ?c)
}}
{SELECT (AVG(?width) AS ?d) WHERE {
?s3 dbo:width ?width .
FILTER(?width > 50)
}}
}}
}

Essentially, this query is built from multiple nested selects that return and process numbers into a final result.

Once again, I have cut the vertical axis and labelled the bar for clarity.

This is a perfect example of query caching. I would be extremely surprised if AnzoGraph could actually run this query in 20ms. As mentioned above, caching is not cheating - just a problem when the exact same query is run repeatedly which is unrealistic.

It is also important to note that when I say caching, I do not necessarily mean result caching. Query structure can be cached for example to optimise any following queries. In fact, result caching could cause truth maintenance issues in a dynamic graph.

Blazegraph, Stardog and Virtuoso take a little longer but it is impossible to tell whether the impressive speed compared to GraphDB is due to calculation performance or some level of caching.

In conjunction with this, we can also not conclude that GraphDB is mathematically slow. It of course looks like that could be a clear conclusion but it is not.

Without knowing what causes the increased performance (likely because the query is exactly the same each run), we cannot conclude what can be deemed poor performance.

Once again (there’s a pattern here) randomly generated seeds within query templates would make this fair as result caching could not take place.

Query 5a (Regex):

This query, like query 4, is nonsensical but aims to evaluate string instead of math queries. It essentially grabs all labels containing the string ‘venus’, all comments containing ‘sleep’ and all abstracts containing ‘gluten’. It then constructs an entity and attaches all of these to it.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .
} WHERE {
{?s1 rdfs:label ?label .
FILTER (REGEX(lcase(?label), 'venus'))
} UNION
{?s2 rdfs:comment ?sab ../stardog-admin db create -n
./stardog-admin db create -o search.enabled=true -n bench /root/virtuoso/dumps/*ttl.gz
FILTER (REGEX(lcase(?sab), 'sleep'))
} UNION
{?s3 dbo:abstract ?lab .
FILTER (REGEX(lcase(?lab), 'gluten'))
}
}

Regex SPARQL queries are very uncommon as the majority of triplestores have a full text search implementation that is much faster!

If however, you wished to send the same string query to multiple triplestores (you want to use an OLTP and an OLAP database together for example) then you may want to use Regex so you don’t have to customise each query.

AnzoGraph is the only triplestore here that does not have a built in full text indexing tool. This can be added by integrating AnzoGraph with Anzo, a data management and analytics tool.

Blazegraph, GraphDB and Virtuoso therefore do not optimise for this type of query as it is so uncommonly used. AnzoGraph however does optimise for this as users may not want to integrate Anzo into their software.

Searching for all of these literals, constructing the graph and returning the result in half a second is incredibly fast. So fast that I believe we run into the caching problem again.

To reiterate, I am not saying caching is bad! It is just a problem to compare results because my queries are the same every run.

Comparing Regex results is unnecessary when there are better ways to write the exact same query. If you were using different triplestores in production, it would be best to add a query modifier to transform string queries into their corresponding full text search representation.

For this reason I will use full text search (where possible) in my benchmark.

Query 5b (Full Text Index):

This query is exactly the same as above but uses each triplestores full text index instead of Regex.

As these are all different, I have the Stardog implementation below (as they were the fastest in this case). The others can be found here.

PREFIX ex: <http://wallscope.co.uk/resource/example/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
ex:notglutenfree rdfs:label ?label ;
rdfs:comment ?sab ;
dbo:abstract ?lab .}
WHERE {
{?s1 rdfs:label ?label .
?label <tag:stardog:api:property:textMatch> 'venus'
} UNION {?s2 rdfs:comment ?sab .
?sab <tag:stardog:api:property:textMatch> 'sleep'
} UNION {?s3 dbo:abstract ?lab .
?lab <tag:stardog:api:property:textMatch> 'gluten'
}
}

I did not integrate AnzoGraph with Anzo so they are not below.

All of these times are significantly faster than their corresponding times in query 5b. Even the slowest time here is less than half the fastest query 5b time!

This really highlights why I will not include regex queries (where possible) in my benchmark.

Once again, due to the fact that the query is exactly the same each run I cannot compare how well these systems would perform in production.

Query 6:

Queries 1, 4 and 5 (2 and 3 also to an extent) are not like real queries that would be used in a real pipeline. To add a couple more sensible queries, I grabbed the two queries listed here.

This query finds all soccer players that are born in a country with more than 10 million inhabitants, who played as goalkeeper for a club that has a stadium with more than 30.000 seats and the club country is different from the birth country.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
{
?soccerplayer a dbo:SoccerPlayer ;
dbo:position|dbp:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
dbo:birthPlace/dbo:country* ?countryOfBirth ;
dbo:team ?team .
?team dbo:capacity ?stadiumcapacity ; dbo:ground ?countryOfTeam .
?countryOfBirth a dbo:Country ; dbo:populationTotal ?population .
?countryOfTeam a dbo:Country .
FILTER (?countryOfTeam != ?countryOfBirth)
FILTER (?stadiumcapacity > 30000)
FILTER (?population > 10000000)
} order by ?soccerplayer

Of course even with a more realistic query, my main problem remains…

Is the difference in time between Virtuoso and AnzoGraph due to performance or the fact that the same query is run thirteen times? It’s impossible to tell but almost certainly the latter.

This is of course equally true for query 7.

One interesting point to think about is how these stores may perform in a clustered environment. As mentioned, AnzoGraph is the only OLAP database in this comparison so in theory should perform significantly better once clustered. This is of course important when analysing big data.

Another problem I have in this comparison is the scalability of the data. How these triplestores perform as they transition from a single node to a clustered environment is often important for large scale or high growth companies.

To tackle this, a data generator alongside my query generators will allow us to scale from 10 triples to billions.

Query 7:

This query (found here) finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person
WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

This is a simple extract and filter query that is extremely common.

With a simple query like this across 245 million triples, the maximum time difference is just over 100ms.

I learned a great deal from the feedback following my last comparison but this experiment has really opened my eyes to how difficult it is to find the “best” solution.

Next Steps

I learned recently that benchmarks require significantly more than three warm up runs. In my benchmark I will run around 1,000.

Of course, this causes problems if my queries do not have random seeds so I think it is clear from this article that I will have at least one random seed in each query template.

Many queries will have multiple random seeds to ensure query caching isn’t storing optimisations that can slow down possible performance. For example, if one query gathers all football players in Peru and this is followed by a search for all la canne players in China - caching optimisation could slow down performance.

I really want to test the scalability of each solution so alongside my query generator I will create a data generator (this allows clustering evaluation).

Knowledge graphs are rarely static so in my benchmark I will have insert, delete and construct queries.

I will use full text search where possible instead of regex.

I will not use order-less limits as these are not used in production.

My queries will be realistic. If the data generated was real, they would return useful insight into the data. This ensures that I am not testing something that is not optimised for good reason.

I will work with vendors to fully optimise each system. Systems are optimised for different structures of data by default which effects the results and therefore needs to change. Full optimisation, for the data and queries I create, by system experts ensures a fair comparison.

Conclusion

Fairly benchmarking RDF systems is more convoluted than it initially seems.

Following my next steps with a similar methodology, I believe a fair benchmark will be developed. The next challenge is evaluation metrics… I will turn to literature and use-case experience for this but suggestions would be very welcome!

AnzoGraph is the fastest if you sum the times (even if you switch regex for fti times where possible).

Stardog is the fastest if you sum all query times (including 5a and 5b) but ignore loading time.

Virtuoso is the fastest if you ignore loading time and switch regex for fti times where possible…

If this was a fair experiment, which of these results would be the “best”?

It of course depends on use case so I will have to come up with a few use cases to assess the results of my future benchmark for multiple purposes.

All feedback and suggestions are welcome, I’ll get to work on my generators.

Appendix

Below I have listed each triplestore (in alphabetical order) alongside which version, query method and load method I used:

AnzoGraph

Version: r201901292057.beta

Queried with:
azgi -silent -timer -csv -f /my/query.rq

Loaded with:
azgi -silent -f -timer /my/load.rq

Blazegraph

Version: 2.1.5

Queried with:
Rest API

Loaded with:
Using the dataloader Rest API by sending a dataloader.txt file.

GraphDB

Version: GraphDB-free 8.8.1

Queried with:
Rest API

Loaded with:
loadrdf -f -i repoName -m parallel /path/to/data/directory

It is important to note that with GraphDB I switched to a Parallel garbage collector while loading which will be default in the next release.

Stardog

Version: 5.3.5

Queried with:
stardog query myDB query.rq

Loaded with:
stardog-admin db create -n repoName /path/to/my/data/*.ttl.gz

Virtuoso

Version: VOS 7.2.4.2

Queried within isql-v:
SPARQL PREFIX ... rest of query ... ;

Loaded within isql-v:
ld_dir ('directory', '*.*', 'http://dbpedia.org') ; 
then I ran a load script that run three loaders in parallel.

It is important to note with Virtuoso that I used:
BufferSize = 1360000
DirtyBufferSize = 1000000

This was a recommended switch in the default virtuoso.ini file.