Adding semantics to graph databases with Grakn. Part 3

Continuing the Schema

Michelangelo Bucci
Vaticle
Aug 25, 2016

Hello. I’m Michelangelo and I’m part of the Early Adopters Program at Grakn Labs. We are developing a software stack for structuring, exploring, and adding functionalities to graph databases. I have used the platform for no more than a few days and yet have managed to produce something interesting. Here’s a recounting of my experience.

This is the third post in a series. If you haven’t read the first and second posts, I strongly advise that you do. Just to recap: we are building a Grakn graph containing a number of oncologists (i.e., cancer researchers) linked by their co-authorship relationships, together with how many papers co-authors have written together.

Last time, we used Graql, Grakn’s own query language, to start building the schema for our map of oncologists, and we added the co-authorship relationship to it. Here’s a picture of where we are at the moment.

We have one last piece of information, albeit a very important one, to add for the schema to be complete.

We need to tell Graql who the main “actors” are.

The nodes of our graph.

The entities.

In other words, we must put the oncology into the schema.

Building the schema: adding entities

Recall that the data file we have is structured like this:

To add the oncologists to the schema, it would be tempting to use their names as unique ids. However, there are a couple of problems with that. First, the names are not guaranteed to be unique (tell that to our “Smith J”, for example). Second, it is a bad idea to have spaces (or non-printable characters) in your ids: even though Graql can handle them just fine, they are often a source of headaches, and since we do not really need them here, it is better to avoid them altogether. The solution is to add names as resources to the oncologist entity.
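The idea of keeping names out of the ids can be sketched in a few lines of Python. This is only an illustration and not part of the Grakn tooling: the `assign_ids` helper, the `onc` prefix, and every sample name other than “Smith J” are assumptions made up for the example, and the sketch assumes the list already holds one entry per oncologist.

```python
def assign_ids(names, prefix="onc"):
    """Pair every name with a generated id that contains no spaces.

    Ids come from the position in the list, so two oncologists who share
    a name still receive different ids. The prefix is a hypothetical choice.
    """
    return [(f"{prefix}_{index}", name) for index, name in enumerate(names, start=1)]


# Hypothetical sample: the name "Smith J" appears twice on purpose.
sample_names = ["Smith J", "Rossi M", "Smith J"]
oncologists = assign_ids(sample_names)

for oncologist_id, name in oncologists:
    print(oncologist_id, "->", name)
```

The name travels alongside the id as plain data, which mirrors the role the name resource plays for the oncologist entity.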

Let’s recap: we want to insert an entity (that’s what we call the main actors in our schema) and to call it “oncologist”. We also want to be able to attach resources to the entity, and we want the oncologist entities to be able to be connected by co-authorship relations (i.e., we want the oncologist entity to be able to play the roles of author_X and author_Y). At this point we should be able to predict how the code looks, and, in fact, there are no surprises.

NOTE: The code below was correct for early versions of Grakn. Since it was published, we have introduced some changes to Graql syntax as the platform has matured, and we have yet to update this blog post.

We just need to add a string resource to attach names to our entities and we are done. Easy peasy.

You have probably noticed that we used the datatype keyword to declare what kind of data our resource contains. Besides string, the other possible options are long for integer numbers (which we’ll use in a couple of paragraphs), double for real numbers, and boolean for True/False values.
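As a rough aid for choosing among these four datatypes when preparing data, here is a small Python sketch. The `graql_datatype` helper is hypothetical and exists only for this example; it simply maps a Python value to one of the datatype names listed above.

```python
def graql_datatype(value):
    """Return the datatype name from the post that fits a Python value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"no matching datatype for {type(value).__name__}")


print(graql_datatype("Smith J"))  # a name
print(graql_datatype(3))          # a paper count
```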

Before exploring how to add data to the graph, let’s have a look at the complete code.

The schema we just built looks like this:

Adding data

At this point, adding the data is really easy. In order to insert the oncologists’ data, we have to create a unique identifier for each oncologist, declare it an instance of the oncologist entity we have defined in the schema, and attach a name to it. It is actually easier to do it than to describe it:

Did you notice the “has” keyword that we used to attach the name resource to the instances of the oncologist entity? That is the magic of resources. Under the hood, we defined a somewhat hidden relation when we declared that the oncologist entity type “has-resource” oncologist_name. Now we are adding instances of that relation with the “has” keyword. We could have done all this by hand, but the “has” syntax makes everything much easier and smoother.

Adding instances of the co-authorship relationship is no more complicated. For each instance, we just need to link two oncologists using their ids and then add the number_of_papers resource. We do not even have to specify ids for the relationship instances themselves.
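The shape of the co-authorship data can be pictured with a small Python sketch. It is a model of the idea, not Graql: the `CoAuthorship` class, the `link` helper, the ids, and the paper counts are all invented for illustration. Each record links two oncologist ids through the roles author_X and author_Y and carries a number_of_papers value, while the record itself has no id field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoAuthorship:
    """One co-authorship instance; note that it carries no id of its own."""
    author_x: str          # id of the oncologist playing author_X
    author_y: str          # id of the oncologist playing author_Y
    number_of_papers: int  # the count attached as the number_of_papers resource


def link(known_ids, author_x, author_y, number_of_papers):
    """Build a relation instance, refusing ids that were never inserted."""
    for oncologist_id in (author_x, author_y):
        if oncologist_id not in known_ids:
            raise KeyError(f"unknown oncologist id: {oncologist_id}")
    return CoAuthorship(author_x, author_y, number_of_papers)


# Hypothetical ids and counts, for illustration only.
known_ids = {"onc_1", "onc_2", "onc_3"}
relations = [
    link(known_ids, "onc_1", "onc_2", 4),
    link(known_ids, "onc_2", "onc_3", 1),
]

for relation in relations:
    print(relation)
```

Checking the ids before linking is a design choice of this sketch: it catches a co-authorship row that mentions an oncologist who was never added.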

Rinse and repeat.

Conclusions

Before leaving, let me show you, once again, how the completed graph with the data looks.

In the first post of this series, I mentioned that the oncologist data I am using is made up of a couple of hundred oncologists linked by about twice as many connections. The picture above looks as if it has many more nodes and edges than you would expect from the data alone because the Grakn graph contains both the schema and the data layer together, so there are a lot of connections between the data and the schema elements. If you could zoom into one of those “flowers” on the right, you would see that at the center there is a role type (author_X, for example) and the “petals” are a few hundred instances of that role (author_X, author_X, author_X… just with a different shape and colour).

Although putting the data and the schema into the same graph makes the above picture quite messy and not particularly useful, it allows us to conveniently have everything stored in one place, so we can explore and modify the graph as we prefer.

We are almost at the end of our short journey. We have finally built the complete schema for the oncologists graph and loaded the data. Hopefully I have convinced you that, even at this early stage, Grakn graphs are a convenient and easy-to-use tool to build and store knowledge graphs.

In my next post, which will also be the last one in this series, I will give you a very brief introduction to writing simple queries that explore the data we have just loaded into the graph.

Stay tuned,

M.

Discrete mathematician/Theoretical computer scientist, learner, curious about stuff.