Graph representing the metadata of thousands of archive documents, documenting the social network of hundreds of League of Nations personals. Published in: Grandjean, Martin (2014). “La connaissance est un réseau”. Les Cahiers du Numérique 10 (3): 37–54

OrientDB — Graph databases

Frank Dorssers
Orikami blog
Published in
7 min readJul 6, 2016

--

For a recent project we were looking for a database management system which could easily support densely connected data, with two important aspects:

  1. Finding the shortest paths
  2. Visualizing the network

Think of social media data, a collection of users, connections between friends, and possibly also information about where the users are from.

The initial types of databases one might generally think of are based on SQL (MySQL, PostgreSQL) and NoSQL (MongoDB).

In this blog post I’ll be looking at how we can store a small set of social media data in different formats.

Sample data

Below is a visualization of the network that we want to store in different formats:

There are a total of four users and connections, each user with a little bit of personal information.

SQL

SQL is able to handle relational data with foreign keys and joins. To store this data we would need two tables, one to store the the profiles themselves, and one to store all connections. First we store all the users with their id and name:

+----+---------+
| id | name |
+----+---------+
| 1 | Alice |
| 2 | Bob |
| 3 | Eve |
| 4 | Charlie |
+----+---------+

Next are the connections:

+---------+-------+
| id_from | id_to |
+---------+-------+
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 3 | 4 |
+---------+-------+

A large advantage is that the database is normalized. If one of the users changes his or her name, we only have to change a single field.

However how would you find the shortest path between, for example, Alice and Charlie? You could do this depth-first, breadth-first, or something like Dijkstra’s algorithm. All which would have to be done manually, outside of the database. While in this sample it is trivial to find the shortest route, when you’re working with millions of nodes and edges, you would like something which is robust and efficient.

Visualizing the network, or just a part of the network, is also not trivial with SQL. Either you would have to do it yourself, by going through all relevant relations and visualize it, or see if there is a third party program which can handle it.

NoSQL

NoSQL DBMSes like MongoDB generally do not support relationships, document-oriented databases like these tend to store these documents either nested, or with references to other items.

So how would our data look like in MongoDB. If we were to store the data nested it would look like this:

[
{
“id”: 1,
“name”: “Alice”,
“friends”: [
{
“id”: 2,
“name”: “Bob”
},
{
“id”: 3,
“name”: “Alice”
}
]
},
{
“id”: 2,
“name”: “Bob”,
“friends”: [
{
“id”: 3,
“name”: “Eve”
}
]
},
{
“id”: 3,
“name”: “Eve”,
“friends”: [
{
“id”: 4,
“name”: “Charlie”
}
]
},
{
“id”: 4,
“name”: “Charlie”,
“friends”: []
}
]

Using references to other objects we can shave down quite a bit of duplicate information:

[
{
“id”: 1,
“name”: “Alice”,
“friends”: [2, 3]
},
{
“id”: 2,
“name”: “Bob”,
“friends”: [3]
},
{
“id”: 3,
“name”: “Eve”,
“friends”: [4]
},
{
“id”: 4,
“name”: “Charlie”,
“friends”: []
}
]

In the first instance with the nested information we have immediate access to some information about the friends. However because this is stored denormalized, if a user ever changes his name, this has to be changed in multiple places in the database.

The second instance is normalized as it tries to minimize the amount of duplicated data.

In both instances we need to look up the friends at a certain point. Storing it nested allows us to view the first degree of friends immediately, but second degree and onwards we need to look them up.

This presents the same problems as SQL, no easy way to find the shortest path and no easy way to visualize the connections.

OrientDB

How exactly do we store the data to allow us to easily find the shortest paths and visualize the data if the common SQL and NoSQL solutions aren’t optimal for us? With a graph database.
OrientDB is a NoSQL DBMS, which supports document, object, key/value and graph objects. What exactly does this mean?

Classes and properties

First of all we can see that OrientDB takes more after other NoSQL databases rather than SQL. However there is an overlap between the two. Just like there are tables in SQL, there are classes in OrientDB. At the heart of these classes are edges (`E`) and vertices (`V`), which make up the graph. Any new class that you want to add, for example a class `User`, can extend `V`, while another class, like `Friends`, can extend `E` and can be used to connect users.

On top of a default Record ID property (rid), which allows objects to be retrieved with a performance of O(1), these classes can also contain predefined properties, for example `userId`, which can be defined as being a `long`, mandatory, non null, and needs to have a minimum of `0`. Each time you now try to add a user it checks whether `userId` is given and whether it is valid.
This property can also be indexed to speed up retrieval (just like in most SQL and NoSQL databases).

While it is possible to set properties, it is not required, like in many other NoSQL databases. And you can add objects without defining these properties. However the classes are still required.

Graphs

A graph generally consists of two elements: Vertices (also commonly known as nodes) and edges. As seen in the previous section these are represented by the `V` and `E` classes, which can be used to create nodes and vertices.

In OrientDB you actually have two possibilities to create edges. First is the normal way, which uses an instance of the edge class. So if we were to connect Alice to Bob we have three objects: Alice, Bob and an instance of the edge, which are linked to each other. Below is an example:

+-------------------+   +------------------+   +----------------+
| User Vertex | | Friends Edge | | User Vertex |
| Alice | | | | Bob |
| #1:1 | | #2:1 | | #1:2 |
+-------------------+ +------------------+ +----------------+
|out_friends: [#2:1]|<->|in_friends:[#1:1] | | |
+-------------------+ |out_friends:[#1:2]|<->|in_friends[#2:1]|
+------------------+ +----------------+

When using lightweight edges it leaves the Friends Edge out and Alice and Bob are directly linked. However in both cases moving between Alice and Bob is very efficient, as retrieval of objects based on the RID is always O(1).

Example

So lets look at our given example. After creating the required classes for User and Friends we can add our data:

> CREATE VERTEX SET id=1, name="Alice"
> CREATE VERTEX SET id=2, name="Bob"
> CREATE VERTEX SET id=3, name="Eve"
> CREATE VERTEX SET id=4, name="Charlie"
> CREATE EDGE Friends FROM (SELECT FROM User WHERE id=1) TO (SELECT FROM User WHERE id=2)
> CREATE EDGE Friends FROM (SELECT FROM User WHERE id=1) TO (SELECT FROM User WHERE id=3)
> CREATE EDGE Friends FROM (SELECT FROM User WHERE id=2) TO (SELECT FROM User WHERE id=3)
> CREATE EDGE Friends FROM (SELECT FROM User WHERE id=3) TO (SELECT FROM User WHERE id=4)

So what else is required before we can see our graph and find the shortest path between Alice and Charlie? Nothing!

OrientDB’s web interface allows you to run queries and view the result as a graph. Requesting all nodes and edges results in the following graph:

So how do we now get the shortest path between Alice and Charlie? This is done using the shortestPath function.

> SELECT shortestPath (
(SELECT FROM User WHERE name=”Alice”),
(SELECT FROM User WHERE name=”Charlie”)
)

The query above gives us all the vertices that make up the shortest path, these are Alice, Eve and Charlie. While just getting the nodes is useful, it would also be nice to be able to actually see what this path looks like. This can be done with a slightly more extended query:

> select expand(path) from (
select shortestPath($from, $to, ‘OUT’) as path
let $from = (SELECT FROM User WHERE name=”Alice”),
$to = (SELECT FROM User WHERE name=”Charlie”)
UNWIND path
)

This query shows us exactly what the path looks like.

Conclusion

After going over different ways of storing, and working with, highly connected data, I hope this post has given a somewhat deeper insight into what some possibilities are and how a graph database like OrientDB might solve some problems that can arise.

However this is by no way an exhaustive post. There are still many other possibilities, for example other graph databases like Neo4j and ArangoDB, or variations of SQL like T-SQL which also offers some interesting possibilities.

Lessons learned

  1. Storing social network data can be done in a lot of ways
  2. There are limited ways to traverse this data in an effective way
  3. Graph databases solve the issues above
  4. OrientDB can be used to easily store social network data and traverse the connections to find closely related people, reason about connections, and a lot more

--

--