Just load it into a graph database!

Dean Allemang
Jan 24, 2023

From time to time I make a presentation about data sharing, and quite often the presentation involves technology: what technologies best satisfy the needs around data sharing? When the talk has a technical component like this, it is quite common for someone in the audience to ask, “Why don’t you just load all of your data into a graph database?” More often than not, they mention a particular graph database, and that is most commonly Neo4J.

I have nothing against graph databases — on the contrary, I consider myself a graph data enthusiast and advocate. I don’t know how to write a query in SQL; throughout my career, I have only managed data in graphs, documents, or (showing my age here) lists. But this question always puzzles me, since loading data into any database makes no headway at all on the problems I want to solve.

What problems are those? I’ve outlined them in this blog entry. To review, we are interested in sharing data; this means that we have to publish the data somehow, we have to make it easy to find the data, and we have to make it possible to merge data from multiple sources.

Let’s imagine that we have loaded our data into a graph database. This gives us a number of affordances over our data: we can match patterns in the data, and we can perform complex analytics (like finding shortest paths or dense clusters). But how does it address the three issues of publish, find, and merge?

[Image: robots loading data drums into a truck. Caption: “Loading data into a database, cybernetic style.”]

Let’s start with publish. Having the data in a database doesn’t make the data available to anyone unless they have remote access to your database. In many cases, the suggestion is to load the data into the community edition of Neo4J, which is a lovely piece of software, but designed specifically to run on a local computer (typically a laptop). Running database software on your laptop does not make that data available to anyone else. It provides no capabilities related to publishing your data. How about loading it into a large-scale corporate database? Unless your company is willing to make its database available to the public, this hasn’t helped you publish your data at all. Loading the data into a cloud database doesn’t change this situation; you still need to provide public access to your database in order for this to count as publishing at all.

How about find? If you put your data into a database, even one facing the public, either in the cloud or on a dedicated server, how does this help someone to find the data? You could have the Google bot index the data (assuming it is in a format that Google is likely to read, say, a comma-delimited file), but the content of your data is not usually a good indicator of what the data is really about. As a simple example, I could find that a dataset includes my first and last name, but that doesn’t tell me whether this is contact information (like a business card), social network data (say, this blog entry), or confidential information like Protected Health Information (PHI). You could write a description of your data and put it on a web page along with a link to the server that is hosting the data. That is, you use an already existing solution to search (web pages and SEO) to help someone find your data. Putting it into the database didn’t help at all.

Let’s have a look at merge. Suppose you load your data into a graph database; graph databases are pretty good at merging data (you can link entities together with a simple edge in the graph!), but how does having your data in a database help someone who wants to merge datasets? You probably don’t want to allow them to load more data into your database instance (especially if it is on your laptop!), so how are they going to use the fact that you have your data in a database to help them? They might manage a federated query between your database and one that they set up, but they would still have to solve the basic problems of merging: how do they tell when their database refers to the same entity as yours? How do they know whether your data provides the same or complementary information to theirs? Having your data in a database hasn’t addressed any of the key issues.

I’ve outlined why I am puzzled by the suggestion that we could help data sharing simply by loading data into a database, and I could just stop here. But it seems only fair that I should at least outline what sort of activity would make headway on data sharing. A really good resource for this is the FAIR data principles.

I don’t want to go into great detail about the FAIR data principles in this blog; you can learn more here. But let’s look at how the FAIR principles address the data sharing activities we are talking about: publish, find, and merge.

The “F” in FAIR stands for “Findable”, so FAIR directly addresses one of our tasks, namely, finding data. Toward this end, the FAIR principles recommend publishing machine-readable metadata, providing global identifiers for entities, and maintaining links between metadata and data. These are not trivial tasks, but they provide guidance on how to make your data findable, beyond the methods used for web pages, i.e., keyword engineering and SEO.
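
To make that concrete, here is a minimal sketch of machine-readable metadata, written in Python with the rdflib library and the W3C DCAT vocabulary. The dataset URI, title, and description are all hypothetical, invented for the example:

```python
# A minimal sketch of machine-readable dataset metadata, using rdflib
# and the W3C DCAT vocabulary. All URIs, titles, and descriptions here
# are hypothetical, invented for the example.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

# A global identifier for the dataset itself (hypothetical)
dataset = URIRef("https://example.org/data/customer-contacts")
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Customer contact information")))
g.add((dataset, DCTERMS.description,
       Literal("Names and business addresses of customer contacts.")))

# The link between the metadata and the data: a downloadable distribution
dist = URIRef("https://example.org/data/customer-contacts.ttl")
g.add((dataset, DCAT.distribution, dist))
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.downloadURL, dist))

print(g.serialize(format="turtle"))
```

A data catalog or crawler that understands DCAT can read the global identifier, the human-readable description, and the link to the actual data, all without knowing anything about the software that stores the data itself.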

The “I” in FAIR stands for “Interoperable” — that is, that your data can be merged with other data sets. A key principle here is that reference data itself has to be FAIR: it must have a global identifier, it must itself be findable, and so on.

If you want a more technical approach, the W3C graph data standard RDF provides a lot of help. First, RDF provides a way to write down your data in a standardized form. Several forms, actually, with a simple equivalence between them; it’s like writing the same sentence in block printing vs. cursive. The two formats express exactly the same message, but look very different.
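
To make the block-printing-vs.-cursive point concrete, here is a small sketch using Python’s rdflib (the URIs are invented for the example): the same statement, written once in Turtle and once in N-Triples, parses to exactly the same graph.

```python
# The same statement in two RDF serializations, Turtle and N-Triples.
# The URIs are hypothetical. rdflib.compare.isomorphic checks that the
# two documents parse to the same graph.
from rdflib import Graph
from rdflib.compare import isomorphic

turtle_doc = """
@prefix ex: <https://example.org/> .
ex:DeanAllemang ex:wrote ex:thisBlogEntry .
"""

ntriples_doc = (
    "<https://example.org/DeanAllemang> "
    "<https://example.org/wrote> "
    "<https://example.org/thisBlogEntry> .\n"
)

g1 = Graph().parse(data=turtle_doc, format="turtle")
g2 = Graph().parse(data=ntriples_doc, format="nt")

print(isomorphic(g1, g2))  # True: same message, different handwriting
```

A conformant RDF parser reads either form; the choice between them is purely cosmetic.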

Expressing your data in RDF also requires assigning a global identifier to each entity, so it meshes well with FAIR practice. But it doesn’t just use any old identification system; it uses the most successful global identifier system in the world today: the URI used in the World Wide Web.

Why is it important that a publication format be standardized? In principle, any format will allow you to publish your data. Having a standard form means that someone who finds your data won’t have to use the same database program that you do. Despite the mathematical similarity of all relational database systems, it is still a challenge to transfer data from one to another. In contrast, there are dozens of systems that can read, write, and run queries over RDF data.

The FAIR principles recommend that metadata be machine readable. Again, RDF provides considerable assistance. Not only is each data entity identified globally; so are types of entities and their properties. This means that we can refer, for example, to tables and columns in our metadata descriptions.
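
Here is a sketch of what that looks like, again in Python with rdflib: a table becomes a globally identified class and one of its columns a globally identified property, described with the W3C RDFS vocabulary. The URIs are hypothetical.

```python
# Types of entities and their properties get global identifiers too,
# so metadata can describe them directly. A sketch using the RDFS
# vocabulary; the URIs are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("https://example.org/schema/")

g = Graph()
g.bind("ex", EX)

# The "contacts" table becomes a class with a global identifier ...
g.add((EX.Contact, RDF.type, RDFS.Class))
g.add((EX.Contact, RDFS.comment,
       Literal("A row in the customer contacts table.")))

# ... and its "firstName" column becomes a globally identified property.
g.add((EX.firstName, RDF.type, RDF.Property))
g.add((EX.firstName, RDFS.domain, EX.Contact))
g.add((EX.firstName, RDFS.comment,
       Literal("The firstName column of the contacts table.")))

print(g.serialize(format="turtle"))
```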

RDF really shines when it comes to merge. Two RDF graphs are merged simply by taking the union of the sets of triples from each graph. Two triples that use the same global identifier are linked in the resulting union graph.
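
A minimal sketch of such a merge, using rdflib (hypothetical URIs again): two graphs from different sources both use the same global identifier for the same person, so their union links the two sets of facts automatically.

```python
# Merging two RDF graphs is just a set union of their triples. Because
# both (hypothetical) sources use the same global identifier for the
# same person, the merged graph links their facts automatically.
from rdflib import Graph

contacts = Graph().parse(format="turtle", data="""
@prefix ex: <https://example.org/> .
ex:DeanAllemang ex:email "dean@example.org" .
""")

social = Graph().parse(format="turtle", data="""
@prefix ex: <https://example.org/> .
ex:DeanAllemang ex:wrote ex:thisBlogEntry .
""")

merged = contacts + social  # union of the two triple sets

for triple in merged:
    print(triple)  # both facts, attached to the same node
```

(Strictly speaking, blank nodes, which lack global identifiers, have to be kept apart during a merge; that is one more argument for giving your entities global identifiers.)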

Notice that all of these benefits of representing data in RDF have nothing to do with loading it into an RDF database. The FAIR benefits of RDF come from the process of representing the data in RDF (or RDFS, in the case of the metadata) directly. In fact, you could publish data in RDF without ever using an RDF database at all.

Having said that, of course, you can load your RDF data into an RDF database, and yes, you can make that database available to the public. You can see an example of the data.world RDF store doing just this at workingontologist.org; every dataset mentioned in the book Semantic Web for the Working Ontologist is available from that website, but also hosted in a triple store. Additionally, every query in the book is available as well, and students of the book can change the queries and run them over that data. The benefits of data sharing come from the expression of the data in RDF; the benefits of live access to the data come from hosting it on data.world.
