Adding semantics to graph databases with Grakn. Part 1
Hello. I’m Michelangelo and I’m part of the Early Adopters Program at Grakn Labs. We are developing a software stack for structuring, exploring, and adding functionalities to graph databases. I have been using the platform for no more than a few days and yet have managed to produce something interesting. This is the first part of a series of posts recounting my experience. You can find the second here.
Relation. Relation. Relation.
Call them networks, graphs or any other name, the fact is that they are everywhere. We live in an interconnected world and if we are to be able to analyse the ever increasing amount of data around us, we need tools that help us capture that complex web of information.
It is well known that relational databases, despite their name, are not well suited to managing highly complex networks of relationships among data.
And that’s where graph databases come into play.
In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. [From the Wikipedia page on graph databases]
Of course, graph databases have their drawbacks too. You can have a look here for a more detailed discussion, but for the moment let us focus on a couple of them.
One of the strengths of graph databases is their extreme flexibility, but that can be a double-edged sword. Imagine that you have a list of your friends stored in your database, their favourite meals, and the ingredients needed to prepare them. Maybe you want to know who to invite for dinner given what you have in the fridge.
I don’t know — I’m sure you have your reasons.
If everything is fine, you know that your pal John likes omelettes, which of course require eggs to make.
The problem is that in the database there is nothing stopping you from having relationships that make no sense. If you are not careful, you might end up thinking that omelettes like eggs, which require John to make; you look into the fridge, see no Johns in it and are sad that you can’t invite your omelette friend for dinner.
Jokes aside, the complete lack of a database schema can be a huge problem, especially if the database is really large (and typically, graph databases are at their best with huge amounts of data).
Another issue that you easily encounter is the query language. Queries quickly tend to become unreadable and their performance can be highly dependent on the way you write them. Just for reference: you can have a look at what simple Gremlin queries look like.
Enter the Grakn software stack, which tries to address these (and other) problems.
Let’s see what we can do with it.
In this and a series of upcoming posts I will describe my experience as a very early adopter, with no prior knowledge of (pardon the pun) knowledge networks and zero experience in database development. I’m going to describe what I achieve, and also document the problems I encounter and the solutions I manage to implement.
It will be exciting.
DISCLAIMER: In my years of teaching more or less advanced mathematics, I have learned that it is often better, especially when introducing a new subject, to sacrifice accuracy for the sake of clarity. For this reason (and also because I’m new to the subject) I might say things that are neither precise nor perfectly right and use terms that from a technical point of view are not exactly equivalent. Just bear with me. It is not that important, just be aware that once you get more into the subject you will realise that I have been far from rigorous.
You can find the complete code and data for this project on my GitHub. NOTE: The code below was correct for early versions of Grakn. Since it was published, we have introduced some changes to Graql syntax as the platform has matured, and we have yet to update this blog post.
The schema is not the data
A Grakn graph is made of two layers: the data layer and the database schema.
The sentence above helps us understand how Grakn can solve one of the previously stated problems. We have the data, which is stored in a graph database, and on top of it we add a schema, which is a set of rules and constraints that help us structure the data and avoid having relationships among data that make no sense.
Think common sense rules in linking and structuring the data, if database schemas are not your cup of tea.
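To make the idea of a schema layer a bit more concrete, here is a minimal sketch in plain Python. It is not Graql and it is not how Grakn works internally; the type names, the relation names and the `add_relation` helper are all made up for the dinner example. It only shows the general principle: a small set of rules decides which kinds of things may be linked, and anything else is refused.

```python
# Hypothetical types for the entities in the dinner example.
ENTITY_TYPES = {"John": "person", "omelette": "meal", "eggs": "ingredient"}

# The "schema": which relation may link which type to which type.
# These rules are assumptions made up for this illustration.
ALLOWED = {
    ("person", "likes", "meal"),
    ("meal", "requires", "ingredient"),
}


def add_relation(graph, source, relation, target):
    """Store the edge only if the schema allows it; otherwise raise."""
    triple = (ENTITY_TYPES[source], relation, ENTITY_TYPES[target])
    if triple not in ALLOWED:
        raise ValueError(f"schema violation: {source} {relation} {target}")
    graph.append((source, relation, target))


graph = []
add_relation(graph, "John", "likes", "omelette")     # accepted
add_relation(graph, "omelette", "requires", "eggs")  # accepted

try:
    # Omelettes do not like eggs, so the schema refuses this edge.
    add_relation(graph, "omelette", "likes", "eggs")
    rejected = False
except ValueError:
    rejected = True
```

Without the `ALLOWED` check, the third call would happily store the nonsensical edge, which is exactly the situation a schemaless graph leaves you in.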
I happened to have around (don’t ask) a list of oncologists (i.e., cancer researchers) with their co-authorship connections, that is, pairs of researchers who have written papers together, with how many papers each pair has co-authored. So that seemed like a good starting dataset to put into a Grakn graph.
The graph structure that we will try and model is thus pretty simple: the vertices are oncologists and there is a labelled edge between two vertices if the two researchers corresponding to those vertices have written some papers together. The label on the edge is the weight of the relationship, that is, the number of papers the two researchers have co-authored.
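As a rough illustration of that structure, here is a small Python sketch of a weighted, undirected co-authorship graph. The researcher names and paper counts are placeholders, not rows from the real dataset, and the `build_graph` and `total_edge_weight` helpers are hypothetical names introduced just for this example.

```python
from collections import defaultdict

# Placeholder rows: (researcher, researcher, number of co-authored papers).
# These are invented values, not taken from the actual data file.
ROWS = [
    ("Researcher A", "Researcher B", 3),
    ("Researcher A", "Researcher C", 1),
    ("Researcher B", "Researcher C", 5),
]


def build_graph(rows):
    """Build an adjacency map; each edge is stored in both directions."""
    graph = defaultdict(dict)
    for first, second, papers in rows:
        graph[first][second] = papers
        graph[second][first] = papers
    return graph


def total_edge_weight(graph, researcher):
    """Sum the weights of all edges touching the given researcher."""
    return sum(graph[researcher].values())


graph = build_graph(ROWS)
totals = {name: total_edge_weight(graph, name) for name in graph}
most_connected = max(totals, key=totals.get)
```

Note that the total edge weight is only a crude proxy for productivity: a paper written with two co-authors contributes to two different edges, so it is counted twice.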
This is far from a completely useless example. There are in fact a number of reasons for doing this: such a network might help you understand who’s who in oncology, spot closely knit groups of researchers, find out who the most prolific writers are, decide who to ask (or to avoid) for refereeing papers if you are a scientific editor, and so on.
Just for reference, the graph I’m going to model has a couple of hundred vertices and about twice as many edges.
The data is contained in a simple tab-separated file and has the form
Now that I have briefly introduced the problem and outlined why you might want to use Grakn, I will be able to describe the process I went through to build the data model and actually load the data, just to give an idea of what it looks like. In the next post, we will start actually building the schema.