
Demystifying DataLoaders & Solving the N+1 Problem

Joe Staller
Prodigy Engineering


Heads up: I’m going to assume you’re familiar with common GraphQL terms, like schemas and resolvers. If you need a refresher, I recommend checking out GraphQL.org.

If you work on a GraphQL project for long enough, you will eventually run into a kind of issue the technology is prone to: the infamous “N+1” problem. If you’re unfamiliar with it, the N+1 problem is an inefficiency that occurs when we naively fetch the data behind a relationship one record at a time.

For example, let’s say we want to fetch some Movies, and for each Movie, we want to fetch its Review Scores from different websites. If we set up our relationships naively, we could end up making one call for all Movies, and then N separate calls (where N is the number of Movies we’ve fetched), one for each Movie’s Review Scores. Hence the name: the N (for each individual call to get Review Scores) +1 (that initial call to get all Movies) problem!

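This is common in GraphQL projects because resolvers fetch data independently of one another. Take the same example from above, where we want to fetch Movies and their Review Scores: a GraphQL implementation might have a Query->movies resolver that returns all Movies, plus MovieScores->imdb and MovieScores->rottenTomatoes resolvers that each look up their Movie to read a score.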

If we imagine that MoviesService.findAll() makes a call to some database or RESTful API to fetch all Movies, and that MoviesService.findById() calls the same data store but fetches only a single Movie, it’s easy to see how we run into the N+1 problem. For instance, a fetchMovies query that asks for every Movie along with its IMDB and Rotten Tomatoes scores would make one call for all Movies (using the Query->movies resolver), N calls for the IMDB scores (using the MovieScores->imdb resolver), and another N calls for the Rotten Tomatoes scores (using the MovieScores->rottenTomatoes resolver). This is actually a 2N+1 problem! Oh no!
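To make that call count concrete, here is a minimal sketch of the naive setup. Only MoviesService.findAll(), MoviesService.findById(), and the three resolver names come from the discussion above. The in-memory sample data, the `scores` field on Movie, the `apiCalls` counter, and the `runFetchMovies` helper (a hand-rolled stand-in for a GraphQL executor) are assumptions made purely for illustration.

```javascript
// Sample in-memory data standing in for a database or RESTful API (made-up values).
const MOVIES = [
  { id: '1', title: 'Movie A', imdb: 7.1, rottenTomatoes: 80 },
  { id: '2', title: 'Movie B', imdb: 6.4, rottenTomatoes: 55 },
  { id: '3', title: 'Movie C', imdb: 8.0, rottenTomatoes: 91 },
];

let apiCalls = 0; // incremented on every simulated outbound request

const MoviesService = {
  async findAll() {
    apiCalls += 1;
    return MOVIES;
  },
  async findById(id) {
    apiCalls += 1;
    return MOVIES.find((movie) => movie.id === id) || null;
  },
};

// Assumed resolver layout: each score resolver looks its Movie up independently.
const resolvers = {
  Query: {
    movies: () => MoviesService.findAll(),
  },
  Movie: {
    scores: (movie) => ({ movieId: movie.id }),
  },
  MovieScores: {
    imdb: async ({ movieId }) => (await MoviesService.findById(movieId)).imdb,
    rottenTomatoes: async ({ movieId }) =>
      (await MoviesService.findById(movieId)).rottenTomatoes,
  },
};

// Hand-rolled stand-in for a GraphQL executor resolving
// `query fetchMovies { movies { title scores { imdb rottenTomatoes } } }`.
async function runFetchMovies() {
  const movies = await resolvers.Query.movies();
  return Promise.all(
    movies.map(async (movie) => {
      const parent = resolvers.Movie.scores(movie);
      const [imdb, rottenTomatoes] = await Promise.all([
        resolvers.MovieScores.imdb(parent),
        resolvers.MovieScores.rottenTomatoes(parent),
      ]);
      return { title: movie.title, scores: { imdb, rottenTomatoes } };
    })
  );
}
```

With the three sample Movies, awaiting runFetchMovies() once leaves apiCalls at 7: one findAll() call, three findById() calls for IMDB, and three more for Rotten Tomatoes.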

Enter: DataLoaders

A DataLoader is a generic utility that batches queries to data stores (databases, RESTful APIs) and caches the results for the length of a single request/response cycle. DataLoaders differ from caching strategies built on, for instance, Redis or in-memory Least Recently Used (LRU) caches in that they do not persist cached data across user requests.
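To take some of the mystery out of "batch and cache", here is a deliberately tiny sketch of the idea. It is not the dataloader package's implementation: `TinyLoader` is a hypothetical name, it only supports `load()`, and it assumes Node.js (it relies on `process.nextTick`). The point is just to show how loads made in the same tick can be collected into one batch and remembered for the lifetime of one loader instance.

```javascript
// Simplified stand-in for illustration only; not the dataloader package's source.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn; // (keys) => Promise of values, same order and length as keys
    this.cache = new Map(); // key -> Promise, kept for the lifetime of this instance
    this.queue = []; // loads waiting for the next batch
  }

  load(key) {
    // Cache hit: hand back the same Promise instead of queueing another fetch.
    if (this.cache.has(key)) return this.cache.get(key);

    const promise = new Promise((resolve, reject) => {
      // The first load in a tick schedules one dispatch for everything queued in that tick.
      if (this.queue.length === 0) process.nextTick(() => this.dispatch());
      this.queue.push({ key, resolve, reject });
    });
    this.cache.set(key, promise);
    return promise;
  }

  async dispatch() {
    const batch = this.queue;
    this.queue = [];
    try {
      // One outbound call for every key collected during the tick.
      const values = await this.batchFn(batch.map((entry) => entry.key));
      batch.forEach((entry, index) => entry.resolve(values[index]));
    } catch (error) {
      batch.forEach((entry) => entry.reject(error));
    }
  }
}
```

Calling load('1'), load('2'), and load('1') in the same tick reaches the batch function as a single call with the keys ['1', '2']; the repeated key is served from the cache. Creating a new loader for each request is what keeps that cache from leaking across users.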

If we applied DataLoaders to the above example and batched the queries for Movies by ID, we could reduce the number of outbound queries from 2N+1 (recall: 1 for all Movies, N for all IMDB scores, and N for all Rotten Tomatoes scores) to 2 queries: 1 for all Movies, and 1 query that batches all MoviesService.findById() calls into a single query that fetches many Movies by their IDs. We might even be able to get this down to 1 query total if we use DataLoaders to their fullest extent.

From Theory to Practice: Implementing DataLoaders

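Let’s assume our MoviesService from before starts out as a thin wrapper around an API module: findAll() fetches every Movie, and findById() fetches a single Movie by its ID.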

We can ignore the exact details of what the API module does; for our example, we can assume it returns Movie data for us.
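As a starting point, a Service like that might look roughly like the sketch below. The `API` object and its `getMovies()` and `getMovieById()` methods are hypothetical names backed by sample in-memory data; the factory function reflects the assumption that a fresh Service is created for each request.

```javascript
// Hypothetical API module; getMovies() and getMovieById() are illustrative names.
const API = {
  movies: [
    { id: '1', title: 'Movie A' },
    { id: '2', title: 'Movie B' },
  ],
  async getMovies() {
    return this.movies;
  },
  async getMovieById(id) {
    return this.movies.find((movie) => movie.id === id) || null;
  },
};

// Factory (closure) so that a fresh Service can be built for every incoming request.
function createMoviesService(api = API) {
  return {
    // Both methods go straight to the API: one outbound call per invocation.
    findAll: () => api.getMovies(),
    findById: (id) => api.getMovieById(id),
  };
}
```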

To start with, we need to require the dataloader module and create a new DataLoader that will fetch many Movies by their IDs. A DataLoader is constructed with a batch function, which receives an array of keys and returns a Promise for an array of results. We will use the ID of the Movie as the key, so the DataLoader can uniquely identify the Movie data in its cache. Then, we need to map the results back to the keys our DataLoader received. We should keep two points in mind when doing that mapping: the order of the results must match the order of the keys, and the length of the results must match the length of the keys.
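The two mapping rules are easy to get wrong, so here is a minimal sketch of a batch function that respects them. `API.getMoviesByIds()` is a hypothetical batch endpoint that, like many real ones, returns matches in arbitrary order and silently omits unknown IDs. The commented-out lines at the end show how the function would be handed to the dataloader package.

```javascript
// Hypothetical batch endpoint: arbitrary result order, unknown ids omitted.
const API = {
  async getMoviesByIds(ids) {
    const stored = [
      { id: '2', title: 'Movie B' },
      { id: '1', title: 'Movie A' },
    ];
    return stored.filter((movie) => ids.includes(movie.id));
  },
};

// Re-align whatever the API returned with the keys the loader asked for:
// same order as `ids`, same length as `ids`, null where nothing was found.
function mapMoviesToKeys(ids, movies) {
  const byId = new Map(movies.map((movie) => [movie.id, movie]));
  return ids.map((id) => byId.get(id) || null);
}

// Batch function: array of keys in, Promise for a same-length array out.
async function batchLoadMovies(ids) {
  const movies = await API.getMoviesByIds(ids);
  return mapMoviesToKeys(ids, movies);
}

// With the dataloader package installed, the wiring would be:
//   const DataLoader = require('dataloader');
//   const moviesLoader = new DataLoader(batchLoadMovies);
```

Building a Map first keeps the re-alignment linear in the number of keys, and the `|| null` fallback fills the gap for any ID the API did not return.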

This DataLoader is set up to use the Movies’ IDs to batch fetch Movies from our underlying API*. We can use any attribute we want to query for data with a DataLoader (or even multiple attributes!), but in this case we’re using ID as it’s typically what’s used to identify entities in data stores.

I’ll also note that we have opted to create our DataLoader inside the closure we have defined for our Service. This covers our point from above about the lifecycle of a DataLoader typically being one request/response cycle. (For more detail on how this Service initialization works, skip to the bottom of this article, which links to a full working example of the code outlined here in a GitHub repo.)

*Note: not every RESTful API supports querying for multiple entities at once by an attribute. If you’re querying a third-party API that doesn’t support batch fetching, DataLoaders can’t be used to their fullest extent.
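Even then, a loader is not useless: repeated loads of the same key within a request are still served from its cache, so only the batching benefit is lost. One possible fallback, sketched below, is a batch function that issues one request per key. `api.getMovieById()` is a hypothetical single-entity endpoint, and `createPerKeyBatchFn` is an illustrative name.

```javascript
// Fallback batch function for an API with no "fetch many" endpoint.
function createPerKeyBatchFn(api) {
  return function batchLoadMovies(ids) {
    return Promise.all(
      ids.map((id) =>
        // Turn a failed lookup into an Error value at that index, so one bad
        // key does not reject the whole batch. DataLoader rejects only the
        // load() for a key whose slot holds an Error instance.
        api.getMovieById(id).catch((error) => error)
      )
    );
  };
}
```

In the Movies example, this would still bring the 2N score lookups down to N, because the IMDB and Rotten Tomatoes resolvers ask for the same Movie IDs.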

Now that we have our DataLoader up and running, it’s time to use it! The .load() method accepts a key to load data for, and returns a Promise that will resolve to whatever our batch function produced for that key: the entity (if found) or null (if not found).

The only change we’ve had to make is, instead of going straight to the API, MoviesService.findById() now calls our DataLoader to load a Movie by its ID.
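Put together, the Service might now look something like this sketch. The `api.getMovies()` and `api.getMoviesByIds()` methods are hypothetical names. The `DataLoader` constructor is passed in as an argument purely so the sketch has no install-time dependency; in a real project it would simply be `require('dataloader')` at the top of the file.

```javascript
// `DataLoader` is injected for the sake of the sketch; in practice:
//   const DataLoader = require('dataloader');
function createMoviesService({ api, DataLoader }) {
  // Created inside the closure: every Service instance (one per request)
  // gets its own loader, and therefore its own short-lived cache.
  const moviesLoader = new DataLoader(async (ids) => {
    const movies = await api.getMoviesByIds(ids);
    const byId = new Map(movies.map((movie) => [movie.id, movie]));
    // Same order and length as `ids`; null for anything not found.
    return ids.map((id) => byId.get(id) || null);
  });

  return {
    findAll: () => api.getMovies(),
    // Previously a direct single-entity API call; now routed through the loader.
    findById: (id) => moviesLoader.load(id),
  };
}
```

Because findById() still returns a Promise for a single Movie, none of the resolvers that call it need to change.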

This is pretty good! If we make the same fetchMovies query from before, we would now make 2 total calls to our Movies API: one to fetch all Movies, and one batched call to fetch all Movies by their IDs. The MovieScores->imdb resolver and MovieScores->rottenTomatoes resolver both call MoviesService.findById(), which calls our DataLoader, which will batch all the requests into a single call to the Movies API.

Now, when you read that query analysis, you may have noticed that we are effectively making the same request to the Movies API twice: we just want all Movies. Enter: the .prime() method. .prime() allows us to put data into our DataLoader cache ahead of time, “priming” the cache for future use. Continuing our example, that might look like:

In MoviesService.findAll(), we have added a line to .prime() our DataLoader with Movie data. The .prime() method takes a key as its first parameter (in our case, the Movie’s ID) and the value to store for that key as the second parameter (in our case, the Movie data).
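As a rough sketch of that change, findAll() could be written as below. `createFindAll` is an illustrative helper name, `api.getMovies()` is a hypothetical endpoint, and `moviesLoader` is assumed to be any object exposing DataLoader's `.prime(key, value)` method.

```javascript
// Builds a findAll() that seeds the loader's cache with every Movie it fetches.
function createFindAll(api, moviesLoader) {
  return async function findAll() {
    const movies = await api.getMovies();
    for (const movie of movies) {
      // prime() leaves an existing cache entry untouched, so a Movie that
      // was already loaded by ID is not overwritten here.
      moviesLoader.prime(movie.id, movie);
    }
    return movies;
  };
}
```

Note that priming only helps lookups that happen afterwards. That works out here because the score resolvers run after the Query->movies resolver has returned its Movies.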

With this final optimization in place, that fetchMovies query from before would make a single outbound request to the Movies API. From the Query->movies resolver, we would fetch all Movies, prime the DataLoader with Movie data, and then have our calls to MoviesService.findById() load that Movie data out of the DataLoader’s cache. Wonderful!

To recap…

We can use DataLoaders to address performance issues in GraphQL projects that the N+1 problem creates. We can use DataLoaders to batch outbound requests to data stores and cache the results, typically for the length of one request/response cycle. We can use .load() to fetch data out of a DataLoader, and .prime() to put data in ahead of time, further optimizing our query performance.

For a complete working example based on the code discussed in this article, I’ve made a GitHub repo available that demonstrates working with DataLoaders in a GraphQL project. If you want to learn more about DataLoaders, I recommend the README of the main DataLoader project.

