Giving Neo4j a diet, so it runs faster

A year back, Kaushal wrote in his first blog post about how we use Neo4j as our main database. We still use it to store the majority of the data generated by social interactions on Roposo. A graph database is the best-known way to store and query highly connected data.

When we started, we were keeping all the data for every entity in our system in Neo4j: image URLs, profile information, the complete text of comments, even the long text of posts. This led to our graph becoming heavy with information that was not contributing to “building” the graph.
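To give a rough picture of what a node looked like back then (the property names below are hypothetical, not our actual schema), everything about a post sat directly on the node, including long strings that no traversal ever touched:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# A sketch of the "everything in the graph" era: the id and author are the
# only parts any traversal needs, yet the image URL and the full post body
# ride along as node properties and bloat the store.
with driver.session() as session:
    session.run(
        """
        CREATE (p:Post {
            id: $id,
            createdBy: $user_id,
            imageUrl: $image_url,
            body: $body
        })
        """,
        id="post-123",
        user_id="user-42",
        image_url="https://cdn.example.com/some-image.jpg",
        body="a very long post body that is never matched on ...",
    )
```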

Neo4j works best when it can keep all the hot data in memory. So if nodes are heavy with metadata, newer queries can kick other nodes out of the cache, slowing down subsequent queries.

By the time our Neo4j data hit around 32 GB, of which 18 GB was contributed by string objects alone, it was time to give Neo4j a diet plan.
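For the curious, one quick way to gauge how much of the store is strings is to look at the store files on disk. This assumes the record store format Neo4j used at the time, where long string properties live in their own file; the path below is a guess for a default install, not our actual setup:

```python
import os

# Hypothetical store directory for a default install; adjust to your setup.
STORE_DIR = "/var/lib/neo4j/data/databases/graph.db"

def size_gb(filename):
    path = os.path.join(STORE_DIR, filename)
    return os.path.getsize(path) / (1024 ** 3) if os.path.exists(path) else 0.0

# In the record store format, long string properties are kept in a dedicated
# file, so its size is a decent proxy for how much "string weight" the graph carries.
print("string store:   %.1f GB" % size_gb("neostore.propertystore.db.strings"))
print("property store: %.1f GB" % size_gb("neostore.propertystore.db"))
```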

Our goal was to move out of Neo4j all the string data that was not used in any of our queries.

We decided to use MongoDB to store this non-identifying data. We were already using it for other non-connected data. Translation: less developer time spent on setup and testing, and a team that already had a bag full of learnings on using MongoDB optimally.

We then stripped Neo4j of all the non-identifying string data and dumped it in Mongo.
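In spirit, the move looked something like the sketch below: read the heavy properties out of Neo4j in batches, upsert them into a Mongo collection keyed by the same id, and only then remove them from the node. Collection and property names are made up, and batching and error handling are simplified; this is not our production script.

```python
from neo4j import GraphDatabase
from pymongo import MongoClient

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
post_meta = MongoClient("mongodb://localhost:27017")["roposo"]["post_meta"]

def migrate_posts(batch_size=1000):
    """Move non-identifying string properties from Neo4j nodes to MongoDB."""
    with driver.session() as session:
        while True:
            # Grab a batch of posts that still carry the heavy properties.
            records = session.run(
                """
                MATCH (p:Post) WHERE p.body IS NOT NULL
                RETURN p.id AS id, p.body AS body, p.imageUrl AS imageUrl
                LIMIT $limit
                """,
                limit=batch_size,
            ).data()
            if not records:
                break

            # 1. Copy the strings into Mongo, keyed by the same id.
            for r in records:
                post_meta.update_one(
                    {"_id": r["id"]},
                    {"$set": {"body": r["body"], "imageUrl": r["imageUrl"]}},
                    upsert=True,
                )

            # 2. Only then strip the properties off the graph nodes.
            session.run(
                """
                MATCH (p:Post) WHERE p.id IN $ids
                REMOVE p.body, p.imageUrl
                """,
                ids=[r["id"] for r in records],
            )

migrate_posts()
```

Writing to Mongo before removing from Neo4j means a crash mid-batch leaves duplicate data rather than lost data, which is the safer failure mode.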

And our Neo4j found his partner in query :D

This was the first time I had a chance to see and work on the classic problem of migrating databases (sort of!) in a live environment, without a maintenance window. I was so excited that all of my friends, including those who do not work in technology, knew how Roposo data was being ferried around in the background during that week.

For a day we wrote all the information to both databases, which obviously caused some lag, but not enough to cause complications. (We stopped doing this once we were sure that our Mongo data was completely consistent with the Neo4j data and could thus serve as the source for read queries.)
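The transition window was a plain dual-write: every save went to both stores, with Neo4j keeping only the identifying bits and Mongo getting the full text. A simplified sketch, again with made-up names:

```python
from neo4j import GraphDatabase
from pymongo import MongoClient

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
comment_meta = MongoClient("mongodb://localhost:27017")["roposo"]["comment_meta"]

def save_comment(comment_id, author_id, post_id, text):
    """Dual-write used during the transition: the graph gets the edges,
    Mongo gets the heavy text, both keyed by the same comment id."""
    # Graph side: only what traversals need (who commented on which post).
    with driver.session() as session:
        session.run(
            """
            MATCH (u:User {id: $author_id}), (p:Post {id: $post_id})
            MERGE (u)-[:COMMENTED]->(c:Comment {id: $comment_id})-[:ON]->(p)
            """,
            author_id=author_id, post_id=post_id, comment_id=comment_id,
        )

    # Document side: the full text lives in Mongo, not on the node.
    comment_meta.update_one(
        {"_id": comment_id},
        {"$set": {"text": text, "author": author_id}},
        upsert=True,
    )
```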

For most read operations, however, we now have to query both databases. Thanks to multi-threading, this hasn’t increased our response times, only the number of threads doing (very fast) IO.
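A read now fans out to both stores in parallel and stitches the two halves together, roughly like this (function and collection names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from neo4j import GraphDatabase
from pymongo import MongoClient

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
post_meta = MongoClient("mongodb://localhost:27017")["roposo"]["post_meta"]

def fetch_graph_part(post_id):
    # The graph answers the "who is connected to this post" half.
    with driver.session() as session:
        record = session.run(
            "MATCH (u:User)-[:POSTED]->(p:Post {id: $id}) "
            "RETURN p.id AS id, u.id AS postedBy",
            id=post_id,
        ).single()
        return record.data() if record else {}

def fetch_meta_part(post_id):
    # Mongo answers the "what the post actually contains" half.
    return post_meta.find_one({"_id": post_id}, {"_id": 0}) or {}

def fetch_post(post_id):
    # Fan out to both stores concurrently, then merge the halves.
    with ThreadPoolExecutor(max_workers=2) as pool:
        graph_future = pool.submit(fetch_graph_part, post_id)
        meta_future = pool.submit(fetch_meta_part, post_id)
        post = graph_future.result()
        post.update(meta_future.result())
    return post
```

Since both calls are IO-bound, running them on separate threads means the overall latency is roughly that of the slower of the two queries, not their sum.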