Optimising graph lookups in MongoDB

Graph databases such as Neo4J specialise in traversing graphs of relationships — such as those you might find in a social network. Many non-graph databases have been incorporating Graph Compute Engines to perform similar tasks. Since MongoDB 3.4 release, we now have the ability to perform simple graph traversal using the $graphLookup aggregation framework function.

$graphLookup is documented in MongoDB Jira SERVER-23725. The basic syntax is shown here:

I started playing with this capability originally using the POKEC dataset which represents data from a real social network in Slovakia. The relationship file soc-pokec-relationships.txt.gz contains the social network for about 1.2 million people. I loaded it into Mongo using this perl script. The following pipeline did the trick:

gzip -dc ~/Downloads/soc-pokec-relationships.gz |perl loadit.pl|mongoimport -d GraphTest -c socialGraph --drop

Now we have a collection with records like this:

We can expand the social network for a single person using an aggregation pipeline like this (I’m using dbKoda for convenience here, but you could do the same thing on the mongo command line):

GraphLookup (in dbKoda)

What we are doing here is starting with person 1476767, then following the elements of the friends array out to two levels — i.e.: to “friends of friends”.

Increasing the maxdepth exponentially increases the amount of data we have to cope with. This is the notorious “seven degrees of separation” effect — most people in a social network are linked by 6–7 hops, so once we get past that we are effectively traversing the entire set. Unfortunately, this meant that traversing more than 3 deep causes us to run out of memory:

The graph lookup can only consume at most 100MB of memory, and currently doesn’t spill to disk, even if the allowDiskUse clause is specified within the aggregation arguments. SERVER-23980 is open to correct this but it doesn’t appear to have been scheduled yet.

So I tried building a “flatter” network so that I wouldn’t run out of memory. This JavaScript builds the network and this Javascript runs some performance tests. I tried expanding the network both with and without a supporting index on the connectToField (person) in this case. Here’s the results (note the logarithmic scale):

For shallow networks, having an index on the connectToField makes an enormous difference. But as the depth increases, the index performance advantage decreases and eventually performance matches that of the unindexed case. In this example data that just happens to be at the “7 degrees of separation” but it will clearly depend on the nature of the data.

The $graphLookup operator is a very powerful addition to the MongoDB aggregation framework and continues the trend of providing richer query capabilities within the server. Mastering the aggregation framework is clearly a high priority for anyone wanting to exploit the full power of MongoDB.

Try out dbKoda — an open source, free IDE now available for MongoDB! You can experiment with indexing options, build queries and aggregates and analyse explain plans from within a modern graphical interface. The $graphLookup aggregation shown in this article was generated using dbKoda’s aggregation builder, which makes building complex aggregations a breeze.