Graph DB — diving into JanusGraph part 2

Marcelo Coraça de Freitas
Published in FiNC Tech Blog
Apr 17, 2017

This is part 2 of a two-article series in which I give an introduction to JanusGraph and graph database technologies.

In part 1 I went through some basic concepts and tried to explain that you shouldn't consider this a magic solution to all your problems. In this part, however, I will focus on more practical aspects and try to explain how the different parts surrounding JanusGraph are connected.

Backend of the Backend

Your JanusGraph server uses a storage back-end and an index back-end (optional, but highly recommended).

If your reaction right now is "what?" in a mix of despair and confusion, I get ya. I've been there. That is why I think calling this a server is not necessarily the best idea. But now get this: you don't run a JanusGraph server. You run Gremlin Server with JanusGraph as the graph implementation.

It took me a while to sort this out, so I decided to explain it in this article series. Actually, this is one of the main reasons for writing it in the first place. How to write queries and populate your DB is very well documented in the official docs, so there is no need to cover that here.

But understanding some concepts is key for you to have a smooth ride in the JanusGraph world. So, please, read the following as many times as you need and make sure you completely understand the next sentence:

JanusGraph is an implementation of the Gremlin stack that stores and indexes data in distributed back-ends, in a way that lets you handle huge amounts of data efficiently.

This has some implications:

  1. JanusGraph can be embedded into a generic server, like gremlin-server.
  2. You can use JanusGraph as a Java library in your app.
  3. You can switch to different storage backends depending on your needs.
  4. Indexing is optional, and you can choose whichever indexing backend suits you best.
  5. A JanusGraph instance is a program that connects to those backends and runs some graph processing using its libraries.

Also, JanusGraph is aware of other running instances, and you can even kick other instances from one of the nodes in your cluster.

(not so) practical example

I thought a lot about writing a complex example with actual code. But this would take a long time to produce and wouldn't come even close to the quality of the examples contained in the official docs.

Instead, I decided to write a simple example to show where Graph Databases could effectively optimize your app.

Imagine an app like Twitter, where users can follow each other:

In this example data set, the vertices are the users and the edges mean "follows". The arrowhead shows who is following whom, and a double-headed arrow means the two users follow each other.

This is the actual Graph model for this. Easy to understand and explain to non-technical people. The SQL mapping for this, however, would be something like:

  • table users: id and name
  • table follows: source_user_id, target_user_id

Now, imagine you want to list followers of followers (more precisely, the users followed by the people the current user follows) as potential user recommendations. With SQL, your query would require at least two joins (assuming you are only interested in the user ids), and the result wouldn't filter out the users the current user is already following.

On top of that, it is completely ignoring the other side of this relationship.
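To make the SQL side concrete, here is a minimal sketch of what such a two-join query might look like, run against an in-memory SQLite database from Python. The table and column names come from the mapping above; the sample users, their ids and the two_hop_ids helper are made up for illustration, and looking the user up by name is just one possible way to start the query.

```python
import sqlite3

# Hypothetical sample data: names and ids are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE follows (source_user_id INTEGER, target_user_id INTEGER);
    INSERT INTO users VALUES (1, 'ana'), (2, 'bob'), (3, 'carol'), (4, 'dan');
    -- ana follows bob and carol; bob follows carol and dan; carol follows ana
    INSERT INTO follows VALUES (1, 2), (1, 3), (2, 3), (2, 4), (3, 1);
""")

# Two joins: users -> follows (who the user follows) -> follows (who they follow).
TWO_HOP = """
    SELECT DISTINCT f2.target_user_id
    FROM users u
    JOIN follows f1 ON f1.source_user_id = u.id
    JOIN follows f2 ON f2.source_user_id = f1.target_user_id
    WHERE u.name = ?
    ORDER BY f2.target_user_id
"""

def two_hop_ids(name):
    """Return the ids reachable in exactly two 'follows' hops from the named user."""
    return [row[0] for row in conn.execute(TWO_HOP, (name,))]

print(two_hop_ids("ana"))  # [1, 3, 4]
```

Note that id 3 (carol) is in the result even though ana already follows her. Removing her would take yet another join or a NOT IN / NOT EXISTS subquery on follows.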

A faster and more complete solution using a graph query could be:

g.V().has('userId', userId).out('follows').aggregate('f').out('follows').where(without('f'))

Breaking down this query a little bit:

  • g.V(): creates a graph traversal, which is an object that will be used to describe a strategy on how to navigate through the graph.
  • has('userId', userId): looks for a vertex whose userId property is set to the value of the variable userId.
  • out('follows'): this appears twice; it tells the traversal to go through the edges labeled follows in the outgoing direction (looking for the users this particular user follows).
  • aggregate('f'): collects the results of the previous step and saves them under the name f, which works like a variable.
  • where(without('f')): filters out the users saved in f, that is, the ones the current user already follows.

This strategy is then used to go through the vertices. You can also limit how many vertices are processed at any step along the way to limit your results, and you could add weights to the edges based on properties such as attributes the users have in common (gender, age group, country, …). If we used both('follows') instead of out('follows'), the traversal would consider both directions; with in('follows') it would go the opposite way.
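If you don't have a Gremlin console at hand, the logic of the query can be mimicked in a few lines of plain Python. This is only a rough stand-in, not JanusGraph or TinkerPop code: the edge list, the neighbours helper and the recommend function are all hypothetical, and it uses sets, so it removes duplicates where the Gremlin traversal would not.

```python
from collections import defaultdict

# Hypothetical edge list of (follower, followed) pairs, invented for this sketch.
EDGES = [("ana", "bob"), ("ana", "carol"), ("bob", "carol"),
         ("bob", "dan"), ("carol", "ana")]

out_edges = defaultdict(set)  # user -> users they follow
in_edges = defaultdict(set)   # user -> users who follow them
for src, dst in EDGES:
    out_edges[src].add(dst)
    in_edges[dst].add(src)

def neighbours(user, direction):
    """Rough stand-in for out('follows'), in('follows') and both('follows')."""
    if direction == "out":
        return set(out_edges[user])
    if direction == "in":
        return set(in_edges[user])
    if direction == "both":
        return out_edges[user] | in_edges[user]
    raise ValueError(f"unknown direction: {direction}")

def recommend(user, direction="out"):
    first_hop = neighbours(user, direction)   # plays the role of aggregate('f')
    second_hop = set()
    for friend in first_hop:                  # the second hop
        second_hop |= neighbours(friend, direction)
    return sorted(second_hop - first_hop)     # plays the role of where(without('f'))

print(recommend("ana"))        # ['ana', 'dan']
print(recommend("ana", "in"))  # ['ana', 'bob']
```

In this sample data ana shows up in her own recommendations because carol follows her back. The sketch does not exclude the starting user, and neither does the Gremlin query above.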

In a small enough data set this would be enough. But in real-life applications, things are not that simple. For example, you will likely have a user who follows hundreds of thousands of users. In this scenario you do need to limit your query, a subject that deserves its own article. But this extremely simple query already filters out the users being followed, and on its own it is already more powerful than the rather complex two-join SQL query.
