Building a high-performance backend using GraphQL and DataLoader
Whether you’re building your app with React, Vue, or Angular, GraphQL is a natural choice for a modern API. However, anyone familiar with the technology knows the optimization issues that arise when writing a GraphQL backend. For some, response time and database load may not be critical: apps that don’t pull a lot of data may do just fine without any particular optimization. Unfortunately, that isn’t our case. Our product, an equity management platform called Equify, fetches vast quantities of data each time a new screen is loaded, and the database was handling hundreds of queries per request. We had to come up with solutions quickly in order to boost our GraphQL server and reduce response time.
The pitfalls of GraphQL 🤒
The root of the issue is simple enough to understand: GraphQL fields are stand-alone and responsible for fetching their own data. With a naive approach, the same data might end up being fetched multiple times.
Let’s consider the following scenario: you want to show the user their first 5 friends, and the best friend of each one of them:
Here R, T, E, N, and P are all my friends. I am the best friend of R and P, W the best friend of E and N, and so on. The query that fetches the data is pretty straightforward:
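The query itself was shown at this point in the article; a reconstruction based on the fields discussed below could look like this (the first: 5 argument is an assumption about the schema):

```graphql
{
  me {
    id
    friends(first: 5) {
      id
      bestFriend {
        id
      }
    }
  }
}
```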
Now, if the friends and bestFriend fields each need to make a call to the database to resolve, we could end up with 7 calls in total. Initially, this may seem reasonable. However, when you look at it closely, you realize the situation is far from optimal…
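Here is a self-contained simulation of that naive approach, where every resolver makes its own database call (the in-memory db object is a stand-in that counts the calls):

```javascript
// In-memory stand-in for the database; every lookup is counted.
const db = {
  calls: 0,
  users: {
    A: { id: 'A', friendIds: ['R', 'T', 'E', 'N', 'P'] },
    R: { id: 'R', bestFriendId: 'A' },
    T: { id: 'T', bestFriendId: 'R' },
    E: { id: 'E', bestFriendId: 'W' },
    N: { id: 'N', bestFriendId: 'W' },
    P: { id: 'P', bestFriendId: 'A' },
    W: { id: 'W' },
  },
  async findUser(id) { this.calls++; return this.users[id]; },
  async findFriends(id) {
    this.calls++;
    return this.users[id].friendIds.map((f) => this.users[f]);
  },
};

// Naive resolvers: each field fetches its own data, knowing nothing
// about its siblings.
const resolvers = {
  me: () => db.findUser('A'),                           // 1 call
  friends: (user) => db.findFriends(user.id),           // 1 call
  bestFriend: (user) => db.findUser(user.bestFriendId), // 1 call per friend
};

async function run() {
  const me = await resolvers.me();
  const friends = await resolvers.friends(me);
  await Promise.all(friends.map((f) => resolvers.bestFriend(f)));
  return db.calls; // 7 calls for the query above
}
```

Running it confirms the count: 1 call for me, 1 for the friends list, and 5 for the best friends.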
As you can see, the backend ended up fetching the same users multiple times:
- A (me) was fetched 3 times: once in the me resolver and twice in the bestFriend resolver
- R was fetched 2 times: once in the friends resolver and once in the bestFriend resolver
- W was also fetched 2 times, both in the bestFriend resolver
A less than optimal situation quickly arises in which your backend is overworked, making several calls to pull the same data over and over. What’s worse, the strain on the backend contributes to creating a sub-par experience for your users.
The ideal situation would be to fetch each user only once. Essentially, we need to cache database calls in a way that is decoupled from our resolvers logic:
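A minimal sketch of such a cache, assuming a hypothetical fetchUser database helper: the cache stores the promise itself, so even concurrent requests for the same id share a single in-flight database call.

```javascript
// Stand-in for a real database call; counts how often it is hit.
let dbCalls = 0;
async function fetchUser(id) {
  dbCalls++;
  return { id };
}

// Cache the promise, not the resolved value: two calls for the same id
// within the same request share one database round trip.
const userCache = new Map();
function cachedFetchUser(id) {
  if (!userCache.has(id)) userCache.set(id, fetchUser(id));
  return userCache.get(id);
}
```

With this in place, fetching A twice and R once results in only 2 database calls, and the resolvers never need to know a cache exists.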
But caching is not enough. When dealing with an array of resources, like our friends list, each instance will fetch its own data: the bestFriend resolver generates 5 calls when 1 would have been enough. This problem gets worse as the size of your collections increases.
Our gut feeling here is to batch queries and bring similar calls to the database together:
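For instance, instead of seven single-row lookups, a single batched query (table and column names are hypothetical):

```sql
-- One round trip instead of seven:
SELECT * FROM users WHERE id IN ('A', 'R', 'T', 'E', 'N', 'P', 'W');
```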
This insight is not obvious: why would one big query be more efficient than a few small ones? Let’s shine some light on the matter:
- Database systems are highly optimized to handle large amounts of data
- Network overhead only really happens once (it can be quite significant)
- The database engine parses and optimizes the query only once
Ideally, we want the batching mechanism to be completely transparent to us, meaning it must be decoupled from our logic and we should still be fetching single entities:
Now that we’ve identified the problem, let’s jump right to the solution.
The silver bullet: DataLoader ✨
The reason we turned to DataLoader is that it addresses the two issues we needed to resolve in priority to make our solution possible: batching and caching. It isn’t specific to GraphQL, but it fits perfectly.
DataLoader is simply a thin wrapper around a function that takes an array of ids and returns a promise resolving to an array of values, exposing a clean API where you pass a single id and get a single value back. All you have to do is write the batching function and DataLoader does the rest.
Basic implementation:
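With the real library, the basic implementation boils down to const userLoader = new DataLoader(batchGetUsers). Below is a self-contained sketch so the contract and the batching mechanism are visible; the makeLoader function is a simplified stand-in for the npm dataloader package, not the real implementation:

```javascript
// Hypothetical in-memory data; in a real backend the batch function
// would run e.g. SELECT * FROM users WHERE id IN (…ids).
const users = { A: { id: 'A' }, R: { id: 'R' } };
let batchCalls = 0;

// The batch function receives an array of ids and must resolve to an
// array of values in the same order.
async function batchGetUsers(ids) {
  batchCalls++;
  return ids.map((id) => users[id] || null);
}

// Minimal stand-in mimicking DataLoader's batching: queue the ids and
// flush them once at the end of the current tick of the event loop.
function makeLoader(batchFn) {
  let queue = [];
  return {
    load(id) {
      return new Promise((resolve) => {
        if (queue.length === 0) {
          process.nextTick(async () => {
            const batch = queue;
            queue = [];
            const values = await batchFn(batch.map((item) => item.id));
            batch.forEach((item, i) => item.resolve(values[i]));
          });
        }
        queue.push({ id, resolve });
      });
    },
  };
}

const userLoader = makeLoader(batchGetUsers);
```

Calling userLoader.load('A') and userLoader.load('R') in the same tick triggers a single batchGetUsers(['A', 'R']) call, while each caller still receives its own single value.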
The way DataLoader handles batching is pretty clever: it puts all requests in a queue and sends the ids to the batch function at the end of the current tick of the event loop. All calls that occur within one tick are automatically batched.
This approach works particularly well with GraphQL because the resolver tree is called layer by layer. In other words, the profile-picture resolvers of all post authors are called within the same tick of the event loop, so DataLoader is able to batch the calls as expected.
Caching, on the other hand, is much simpler. With DataLoader, the batch function is not called a second time for an id that has already been loaded, ensuring that each id is only loaded once. When you request the same id multiple times, DataLoader returns the same promise every time:
Easy, right? Let’s implement it.
The real deal 🧐
Job number one is creating a per-request cache; this prevents issues with permissions and cache invalidation. It also happens to be the most commonly used pattern out there. To replicate it, you simply need to create new loaders every time a request is handled. Using Express, you can easily add a middleware to your app:
To make the loaders available to each resolver, pass them to the context:
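A sketch of the wiring, with hypothetical names throughout: a factory builds a fresh set of loaders, and therefore fresh caches, for every incoming request.

```javascript
// One fresh set of loaders (and caches) per request, so no data or
// permissions leak between users.
function createLoaders(db) {
  const cache = new Map();
  return {
    user: {
      // Minimal promise cache standing in for `new DataLoader(batchGetUsers)`.
      load(id) {
        if (!cache.has(id)) cache.set(id, db.findUser(id));
        return cache.get(id);
      },
    },
  };
}

// With Express, the middleware and context wiring could look like:
//
//   app.use((req, res, next) => {
//     req.loaders = createLoaders(db); // new loaders on every request
//     next();
//   });
//
//   app.use('/graphql', graphqlExpress((req) => ({
//     schema,
//     context: { loaders: req.loaders }, // resolvers read context.loaders
//   })));
```

Each request gets its own cache, while calls within a single request still deduplicate.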
The beauty of DataLoader is that with a few lines of code you benefit from caching and batching, and significantly improve performance. In production, we reduced the number of database calls by 95% and improved response time fourfold. And that makes us happy!
The tricky part 🙃
The last thing we need to cover is many-to-many and one-to-many relations. It might be tempting to create one loader per relation to batch things up; unfortunately, that would break caching. You could end up pulling the same data multiple times, once per relation it appears in. The recommended approach for this scenario is to create a loader for relations that only fetches ids. The loaders we already created are then used to actually fetch the data from those ids.
To achieve this, we can pass an object to the load method instead of just an id string. We will call these objects requests.
At Equify we decided to shape our request like so:
- modelName: Name of the model to query to find the list of ids
- foreignKey: Name of the column that holds the foreign key
- id: Value of the foreign key
- foreignIdentifier: Name of the column to read the id from
Let’s see an example right away. To query the list of posts of a user, we would do the following:
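A self-contained sketch of the idea, with all model and column names hypothetical: the relation loader resolves a request object to a list of ids, then the regular post loader fetches the actual records.

```javascript
// Hypothetical in-memory Post table.
const postTable = [
  { id: 1, userId: 42 },
  { id: 2, userId: 42 },
  { id: 3, userId: 7 },
];

const loaders = {
  relationIds: {
    // Resolve a request object to a list of ids only.
    // Stand-in for: SELECT <foreignIdentifier> FROM <modelName>
    //               WHERE <foreignKey> = <id>
    async load({ modelName, foreignKey, id, foreignIdentifier }) {
      return postTable
        .filter((row) => row[foreignKey] === id)
        .map((row) => row[foreignIdentifier]);
    },
  },
  post: {
    async load(id) { return postTable.find((row) => row.id === id); },
  },
};

async function postsOfUser(userId) {
  const ids = await loaders.relationIds.load({
    modelName: 'Post',        // model to query for the ids
    foreignKey: 'userId',     // column holding the foreign key
    id: userId,               // value of the foreign key
    foreignIdentifier: 'id',  // column to read the ids from
  });
  return Promise.all(ids.map((id) => loaders.post.load(id)));
}
```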
This should produce 3 queries:
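The three queries could look something like this (table and column names are hypothetical):

```sql
SELECT * FROM user WHERE id IN (42);         -- the user loader
SELECT id FROM post WHERE user_id IN (42);   -- the relation loader (ids only)
SELECT * FROM post WHERE id IN (1, 2, 3);    -- the post loader
```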
Try to match the parameters of the request to the second SQL query; you should have your aha moment now 🤩
Many-to-many relations work the same way. To fetch the list of posts liked by a user, we simply do the following:
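The same request shape works through a hypothetical Like join table; because we only want ids, no SQL JOIN is needed:

```javascript
// Hypothetical join table rows.
const likeTable = [
  { userId: 42, postId: 1 },
  { userId: 42, postId: 3 },
];

// The request now targets the join model instead of the Post model.
const likedPostsRequest = {
  modelName: 'Like',           // query the join table…
  foreignKey: 'userId',        // …where userId…
  id: 42,                      // …equals the user's id…
  foreignIdentifier: 'postId', // …and collect the postId column.
};

// Resolve a request against an in-memory table (stand-in for one query).
function resolveRequest(table, { foreignKey, id, foreignIdentifier }) {
  return table
    .filter((row) => row[foreignKey] === id)
    .map((row) => row[foreignIdentifier]);
}
```

The resulting post ids are then fed to the existing post loader, exactly as in the one-to-many case.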
This would produce pretty much the same queries. As you can see, one-to-many and many-to-many relations are exactly the same in terms of implementation and database calls. Because we only need the ids, we can query the join table directly for many-to-many relations, without a JOIN in the query.
Let’s have a look at the batching function now, remember we need to take an array of requests instead of an array of ids:
To help DataLoader cache your requests you must specify a function that builds the cache key like so:
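A minimal sketch: build a stable string key from the request’s fields, so two identical requests map to the same cache entry. With the real library you would pass it as an option, e.g. new DataLoader(batchFn, { cacheKeyFn }).

```javascript
// Two requests with the same fields produce the same cache key.
function cacheKeyFn({ modelName, foreignKey, id, foreignIdentifier }) {
  return [modelName, foreignKey, id, foreignIdentifier].join(':');
}
```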
I will not go into details on how to implement the findRelationIds batch function, but I still want to give you an idea of how we implemented it:
- Group requests by modelName.
- Make one call to the database per modelName.
- Map each request to a list of ids using the results.
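The three steps above can be sketched as follows; the in-memory tables object stands in for the database, where a real backend would issue one batched query per model:

```javascript
// Hypothetical in-memory tables (one batched SQL query per model in reality).
const tables = {
  Post: [{ id: 1, userId: 42 }, { id: 2, userId: 42 }, { id: 3, userId: 7 }],
  Like: [{ userId: 42, postId: 3 }],
};

async function findRelationIds(requests) {
  // 1. Group request indices by modelName.
  const groups = new Map();
  requests.forEach((request, i) => {
    const indices = groups.get(request.modelName) || [];
    indices.push(i);
    groups.set(request.modelName, indices);
  });

  // 2. Make one call to the database per modelName.
  const results = new Array(requests.length);
  for (const [modelName, indices] of groups) {
    const rows = tables[modelName]; // stand-in for one batched query

    // 3. Map each request to its list of ids using the results.
    for (const i of indices) {
      const { foreignKey, id, foreignIdentifier } = requests[i];
      results[i] = rows
        .filter((row) => row[foreignKey] === id)
        .map((row) => row[foreignIdentifier]);
    }
  }
  return results;
}
```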
You should now have all the knowledge you need to tailor it to your application. Congratulations! 👏
Conclusion
Given the ease of implementation and the enormous benefits that DataLoader can bring, I hope I convinced you to use it to boost your backend. Tell us in the comments how you implemented it and what is the magnitude of your improvements!
Feel free to share this article with anyone who might have an interest in performance on a GraphQL backend, sharing is caring 🙌
If you like this article don’t forget to clap, and if you have any question don’t hesitate to reach out to me 💬