Should we move to a Graph Database?

José Moreno
Published in Fever Engineering · 5 min read · Jul 23, 2021

Is this type of database the right choice for the social aspect of our app?

Illustration by BIG MOUTH

Recently we have started analyzing new ways of providing a better user experience in our app when it comes to coordinating attendance to the experiences provided. Questions like who would be up to attend a Candlelight Concert or when would my friends be available to come with me to the Stranger Things Drive into Experience can be better answered within the context of our app and a network of friends within it.

Nowadays, if you enter the realm of implementing a social network, the hype will almost invariably point you in a single direction: the almighty graph database. There is no doubt that the modeling of the problem fits this type of system extraordinarily well. Moreover, the advertised benefits include:

  • No need for rigid SQL schemas for our data, giving developers more flexibility when the data structure changes.
  • Expressive query languages that simplify common queries in which relations play a prominent role.
  • High performance and scalability, querying the graph with millisecond latency.

Right, case closed: graph databases win, there is no contest here. Or is there?

In our current stack, PostgreSQL is our usual database of choice. So before jumping on the graph DB wagon, we should at least let our good old friend from the relational world know why we are dumping it and, just to be fair, let it defend its honor.

So we are going to prepare a sample dataset where we will have a couple of entities and some relations between them. Then we will run some sample queries and compare their performance.

DB systems under test

We will compare one relational database with two graph databases.

  • PostgreSQL (12.4). This is the database we are using in most of our projects. Reliable, tested, and well established.
  • Amazon Neptune (1.0.4.1). For the most part, our current infrastructure is in AWS, so this is a natural candidate. A fully managed option with low-latency read replicas that can execute more than one hundred thousand graph queries per second; it continuously backs up to S3 and is optimized to store billions of relationships and query the graph with millisecond latency.
  • Neo4j (3.5). The reference graph DB, as it is one of the first graph database technologies developed. Open source, with enterprise features requiring a commercial license. It has its own managed hosting platform, Aura.

The hardware used will be hosted in Amazon AWS. These were the configurations:

  • PostgreSQL: db.t3.medium, 2 vCPU, 4 GB RAM. Storage type: General Purpose (SSD)
  • Amazon Neptune: db.t3.medium, 2 vCPU, 4 GB RAM. Storage type: DB cluster
  • Neo4j: EC2 t2.medium, 2 vCPU, 4 GB RAM. Since Neo4j is not offered as a natively managed AWS service, this machine was the closest equivalent. Storage type: EBS volume type gp2

The data set represents a network of users, who follow other users and go to specific events.

  • Users: (id, username, first_name, last_name, email, is_staff, image)
  • Follow: (user_follower, user_following, since_date)
  • Events: (event_id, title, is_sold_out, starts_at)
  • GoingTo: (user_id, event_id, on_date)
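As a rough illustration of the relational side of this setup, the four entities above can be mapped onto tables like the following. This is a minimal sketch using an in-memory SQLite database in place of PostgreSQL (the column names follow the bullets above; the indexes are an assumption about how we would tune the follow and attendance edges for join-heavy queries):

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL schema sketched above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    first_name TEXT,
    last_name TEXT,
    email TEXT,
    is_staff INTEGER DEFAULT 0,
    image TEXT
);
CREATE TABLE events (
    event_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    is_sold_out INTEGER DEFAULT 0,
    starts_at TEXT
);
-- Follow and GoingTo are plain join tables: the "edges" of the graph.
CREATE TABLE follow (
    user_follower INTEGER REFERENCES users(id),
    user_following INTEGER REFERENCES users(id),
    since_date TEXT,
    PRIMARY KEY (user_follower, user_following)
);
CREATE TABLE going_to (
    user_id INTEGER REFERENCES users(id),
    event_id INTEGER REFERENCES events(event_id),
    on_date TEXT,
    PRIMARY KEY (user_id, event_id)
);
-- Indexing the reverse direction of each edge keeps graph-style
-- traversals (joins) fast in the relational engine.
CREATE INDEX idx_follow_following ON follow(user_following);
CREATE INDEX idx_going_to_event ON going_to(event_id);
""")
```

In a graph database, the same model would instead be two node labels (User, Event) and two edge types (FOLLOWS, GOING_TO), with no explicit join tables.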

We will run the queries on two datasets.

Number of vertices and edges in our two data sets

The queries run are the following:

  • Fetch a single user, to establish a baseline time.
  • Follows going to a specific event: a query using relations among all elements of the graph, limited by an event node.
  • Events that users following another user are going to.
  • Events that users following another user (A) are going to, excluding events user A is going to.
  • Which user following user A is going to the most events.
  • Follows of follows of follows of follows of follows going to a specific event (a five-hop traversal).
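To make the fixed-distance queries above concrete, here is one plausible SQL formulation of the third and fourth queries, run against a tiny in-memory SQLite database (a stand-in for PostgreSQL; the table layout and sample names are illustrative, not the benchmark data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE events (event_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE follow (user_follower INTEGER, user_following INTEGER);
CREATE TABLE going_to (user_id INTEGER, event_id INTEGER);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO events VALUES (10, 'Candlelight Concert'), (11, 'Drive Into');
INSERT INTO follow VALUES (2, 1), (3, 1);  -- bob and carol follow alice
INSERT INTO going_to VALUES (2, 10), (3, 10), (3, 11), (1, 11);
""")

# Query 3: events that users following alice (id = 1) are going to.
going = conn.execute("""
    SELECT DISTINCT e.title
    FROM follow f
    JOIN going_to g ON g.user_id = f.user_follower
    JOIN events e   ON e.event_id = g.event_id
    WHERE f.user_following = 1
    ORDER BY e.title
""").fetchall()
print([r[0] for r in going])  # ['Candlelight Concert', 'Drive Into']

# Query 4: the same, excluding events alice herself is going to.
not_shared = conn.execute("""
    SELECT DISTINCT e.title
    FROM follow f
    JOIN going_to g ON g.user_id = f.user_follower
    JOIN events e   ON e.event_id = g.event_id
    WHERE f.user_following = 1
      AND e.event_id NOT IN (SELECT event_id FROM going_to WHERE user_id = 1)
""").fetchall()
print([r[0] for r in not_shared])  # ['Candlelight Concert']
```

Note that every hop in the relationship chain is an explicit join: the "distance" between nodes is fixed in the query text, which is exactly the kind of workload relational engines are tuned for.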

Results

We ran the queries 100 times and averaged the results.

Response time for queries with 50 million follows:

Response time for queries with 100 million follows:

Comparison of time when doubling the number of follows for each of the engines:

The last query (follow of follow of follow of follow of follow) for all engines and sizes:

We can see in the results that PostgreSQL is the best performer for all the queries executed. How can that be? The key is that we are most likely not using the graph databases under conditions where they can actually shine. Graph databases would perform much better on true graph queries, for example the shortest path between two nodes. That type of query, with an undefined number of edge hops, would be difficult to express in a relational database and would also be pretty slow. The queries we are executing here, however, access for the most part the same type of information in all three systems and have a predefined “distance” between the nodes. In this context, as long as the indexes fit into memory in PostgreSQL, the relational database is faster, as it has been optimized for this use case over the years.

Conclusions

What is the conclusion then? Mainly that you should be very aware of the type of information that you want to store and query afterward. Looking only at the topology of the data is not enough to make the final decision on which system to use. In our case, we will stick to PostgreSQL for now, as it offers better performance for our current use cases and will give us better development speed, since everyone is familiar with it. We will keep graph databases on our radar for future applications, but as we have seen, they are not always the best option. They are also a relatively young technology, so they might catch up with relational databases at some point and make all the difference. Just not yet.
