GraphQL in Scala with Caliban — Part 2: Query optimization

Pierre Ricadat
Feb 3 · 7 min read

This is the second part of an article about Caliban, a library for writing GraphQL backends in Scala in a typesafe, boilerplate-free and purely functional manner. If you haven’t read it already, please check Part 1 to understand what this is all about.

In the first part of this series, we’ve seen how to create a simple GraphQL API. However, we didn’t take full advantage of GraphQL capabilities: our schema was so basic that we didn’t need to worry about any kind of optimization.

It is very common to have deeply nested schemas where each inner field might require gathering data from a database. Ideally, we’d like to keep these calls to a minimum. Let’s see how we can deal with this using Caliban.

We are going to use a different API for this post. Let’s say we want a query that returns a list of Orders, and for each order, we want to expose information about the Customer who placed it, the Products the order is made of, and the Brand of each product. Here’s our simple data model:
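As a minimal sketch, the model could look like the following. Only the type names (Order, Customer, Product, Brand) come from the text; every field name and the use of plain Int identifiers are assumptions made for illustration.

```scala
// Hypothetical data model: field names and Int IDs are assumptions.
final case class Brand(id: Int, name: String)

// A product only stores the ID of its brand, not the brand itself.
final case class Product(id: Int, name: String, description: String, brandId: Int)

final case class Customer(id: Int, firstName: String, lastName: String)

// An order only stores IDs: the customer ID and (productId, quantity) pairs.
final case class Order(id: Int, customerId: Int, products: List[(Int, Int)])
```

Note that Order and Product are normalized: they point to other entities by ID, which is why extra lookups are needed to resolve nested fields.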

We also assume that we have a DBService that we can use to query all the data we need from a database. To compare the different approaches we will try in this post, I implemented a simple DBService that returns some fixed data and records how many DB hits are performed (the whole project is available on GitHub).
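To give an idea of what such a service might look like, here is a rough sketch assuming ZIO 1.x. The method names, the fixed data and the AtomicInteger counter are all hypothetical; the DBService in the linked project may be written differently.

```scala
import java.util.concurrent.atomic.AtomicInteger
import zio.{ UIO, ZIO }

final case class Customer(id: Int, firstName: String, lastName: String)
final case class Order(id: Int, customerId: Int, products: List[(Int, Int)])

// Hypothetical stand-in for the article's DBService (names are assumptions).
class DBService {
  private val dbHits = new AtomicInteger(0)

  // Number of simulated DB calls executed so far.
  def hits: Int = dbHits.get

  // Wraps a value in an effect that records one DB hit each time it is run.
  private def hit[A](a: => A): UIO[A] =
    ZIO.effectTotal { dbHits.incrementAndGet(); a }

  // Returns arbitrary fixed data: `count` orders with 2 products each.
  def getLastOrders(count: Int): UIO[List[Order]] =
    hit((1 to count).toList.map(id => Order(id, id % 3, List((id % 5, 1), ((id + 1) % 5, 1)))))

  def getCustomer(id: Int): UIO[Customer] =
    hit(Customer(id, "first name", "last name"))
}
```

Because the counter is incremented inside the effect, a hit is only recorded when the effect is actually run, not when it is created.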

The query we are going to use to measure the efficiency of our backend is the following:

We want to return a list of the last 20 orders, with customer and product information but without the brand information.
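A query of that shape could be written as follows, here stored in a Scala string as it would be when passed to an interpreter. The field names (id, firstName, details and so on) are assumptions; only the orders field, its count argument and the absence of brand come from the text.

```scala
object TestQuery {
  // Hypothetical test query: field names other than `orders` and `count` are assumptions.
  val query: String =
    """{
      |  orders(count: 20) {
      |    id
      |    customer {
      |      id
      |      firstName
      |      lastName
      |    }
      |    products {
      |      id
      |      quantity
      |      details {
      |        name
      |        description
      |      }
      |    }
      |  }
      |}""".stripMargin
}
```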

Naive Approach

As we saw in the first part of this series, Caliban requires us to write case classes to define the GraphQL schema we want to expose. Here, our root query type will be a case class named Query with a single parameter called orders.

What should be the type of this orders parameter?

  • It requires an argument named count, so we will create a case class QueryArgs(count: Int) and make orders a function from QueryArgs to something.
  • To resolve orders, we will need to make a DB call, so the return type should be wrapped in IO. To make it simple, let’s return a UIO (an IO that cannot fail).
  • We can’t return a list of the Order data type, because it only contains IDs of customers and products. We need to denormalize the data by creating an OrderView data type that will contain everything that is queryable.

In conclusion, our orders parameter will have the following shape: QueryArgs => UIO[List[OrderView]].
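Put together, the naive schema might be sketched like this, assuming ZIO 1.x. QueryArgs, Query, OrderView, ProductOrderView and ProductDetailsView are named in the text; CustomerView, BrandView and all field names are assumptions.

```scala
import zio.UIO

// Denormalized view types: everything queryable is embedded as a plain value.
final case class CustomerView(id: Int, firstName: String, lastName: String)
final case class BrandView(id: Int, name: String)
final case class ProductDetailsView(name: String, description: String, brand: BrandView)
final case class ProductOrderView(id: Int, details: ProductDetailsView, quantity: Int)
final case class OrderView(id: Int, customer: CustomerView, products: List[ProductOrderView])

// Arguments of the `orders` field.
final case class QueryArgs(count: Int)

// Root query type: a single field that needs an argument and performs an effect.
final case class Query(orders: QueryArgs => UIO[List[OrderView]])
```

Since every nested field is a plain value, the resolver has to build the complete OrderView, nested data included, before returning anything.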

Now what is the problem with this implementation? It doesn’t actually consider which fields the client requested below orders. To implement the resolver for orders, we first need to get the list of orders and then, for each order, we need to get the products, the customer and the brand data in order to build the OrderView object, even if the client did not request them.

If we run our test query with the dummy DBService, it causes 101 DB Hits:

  • 1 hit for getting the list of orders
  • 20 hits for getting the customer of each order (there are 20 orders)
  • 40 hits for getting the product data (each order has 2 products)
  • 40 hits for getting the brand data of each product
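The breakdown above can be checked with a few lines of arithmetic. The value names below are only for illustration; the numbers are the ones listed above.

```scala
object NaiveHitCount {
  val orders           = 20 // orders returned by the test query
  val productsPerOrder = 2  // each order has 2 products

  val orderListHits = 1                         // one call for the list of orders
  val customerHits  = orders                    // one call per order
  val productHits   = orders * productsPerOrder // one call per ordered product
  val brandHits     = orders * productsPerOrder // one call per product's brand

  // Total hits for the naive resolver: 1 + 20 + 40 + 40 = 101
  val total: Int = orderListHits + customerHits + productHits + brandHits

  // Hits the test query actually needs, since it never asks for brands: 61
  val neededByTestQuery: Int = total - brandHits
}
```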

This is really inefficient. Our query doesn’t even request brand data but because orders returns everything available, we spend a lot of DB hits for nothing. Let’s do better.

Nested Effects

So far we’ve only used effects in the root case classes, but nothing prevents us from using effects in the nested case classes too. Returning an effect basically makes a field lazy: the effect will only be run when needed. As a general rule, we can say that any field at any level that has some cost (e.g. causing a DB query) should be turned into an effect.

Let’s revisit our case classes to apply this simple rule. We will transform OrderView#customer, ProductOrderView#details and ProductDetailsView#brand to effects because those fields require extra DB calls to get customer, product and brand data.
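Here is a sketch of what the revised case classes might look like, assuming ZIO 1.x. As before, field names are assumptions; the three fields turned into effects are the ones named above. The small demo object at the end is hypothetical and only illustrates that an effect costs nothing until it is run.

```scala
import zio.{ UIO, ZIO }

final case class CustomerView(id: Int, firstName: String, lastName: String)
final case class BrandView(id: Int, name: String)

// `brand` needs an extra DB call, so it becomes an effect.
final case class ProductDetailsView(name: String, description: String, brand: UIO[BrandView])

// `details` needs an extra DB call, so it becomes an effect.
final case class ProductOrderView(id: Int, details: UIO[ProductDetailsView], quantity: Int)

// `customer` needs an extra DB call, so it becomes an effect.
final case class OrderView(id: Int, customer: UIO[CustomerView], products: List[ProductOrderView])

object LazinessDemo {
  // Counts simulated brand lookups (not thread-safe, for demonstration only).
  var brandLookups = 0

  // Building the view only describes the lookup; nothing is executed yet.
  val details: ProductDetailsView = ProductDetailsView(
    "name",
    "description",
    brand = ZIO.effectTotal { brandLookups += 1; BrandView(1, "brand") }
  )
}
```

In this sketch, brandLookups stays at 0 after details is built and only increases when the brand effect is run, which is what happens when a query selects that field.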

With this simple change in place, running the test query on our dummy test DBService now results in 61 DB Hits. Why? We got rid of those useless 40 calls for getting brand data which the query didn’t require. If the query didn’t include customer or product data, the gain would be even higher.

We are now only gathering the data we really need. But what if the same customer had several orders, or what if the same product was ordered several times? We’re potentially querying the database multiple times for the same thing.

Introducing ZQuery

Caliban comes with a data type called ZQuery that addresses this particular problem. A ZQuery[R, E, A] is a purely functional description of an effectual query that may contain requests to one or more data sources. The type parameters are very similar to ZIO: it requires an environment R, may fail with an E or succeed with an A.

What makes it interesting for our use case?

  • Requests are parallelized: ZQuery collects requests that don’t depend on each other to run them in parallel.
  • Requests are deduplicated and results are automatically cached: identical requests are run only once within the same ZQuery.
  • If a batching function is provided for a given data source, multiple items can be queried at once.
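To make the deduplication and batching ideas concrete, here is a toy model in plain Scala. It is not how ZQuery is implemented and it does not use its API; the names and the simulated hit counter are hypothetical. It only shows why collecting requests before executing them reduces the number of calls.

```scala
object BatchingSketch {
  final case class Customer(id: Int, name: String)

  // Simulated DB hit counter (not thread-safe, for demonstration only).
  var dbHits = 0

  // Hypothetical batched lookup: one hit, no matter how many IDs are requested.
  def getCustomers(ids: List[Int]): List[Customer] = {
    dbHits += 1
    ids.map(id => Customer(id, s"customer-$id"))
  }

  // Resolves a list of requested IDs that may contain duplicates:
  // deduplicate, fetch everything in one batch, then map results back
  // to the original requests, preserving their order.
  def resolve(requestedIds: List[Int]): List[Customer] = {
    val byId = getCustomers(requestedIds.distinct).map(c => c.id -> c).toMap
    requestedIds.map(byId)
  }
}
```

Resolving five requests for three distinct customers this way costs a single simulated hit instead of five.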

The ZQuery implementation, done by Adam Fraser, is based on the paper There is no Fork: an Abstraction for Efficient, Concurrent, and Concise Data Access, which is the basis for the Haxl (Haskell) and dataloader (JavaScript) libraries. In Scala, Fetch implements the same concept with a few differences.

That means that if we transform our effects from simple ZIO to ZQuery, we will benefit automatically from parallelization and caching. We’ll try batching a bit later.

To use ZQuery, we need to define 2 things: a Request type and a DataSource.

  • A Request[E, A] is a simple data type that represents a request from a data source for a value of type A that may fail with an E. We need an actual value for each request so that we can compare them and cache them: if 2 Request objects are equal, they will be considered the same and executed only once.
  • To create a DataSource we will use DataSource.fromFunctionM that simply takes a function from our Request type to an effect returning the expected result type. A DataSource also needs a unique name.

We then call ZQuery.fromRequest with a Request and a DataSource and we get a ZQuery back. The following snippet shows our new API definition with ZQuery replacing ZIO and how we create the ZQuery for getting Customer data (the same should be done for Product and Brand).
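A sketch of these pieces for Customer could look like the following. It assumes ZIO 1.x and the zquery package as shipped with Caliban at the time of writing (the package name may differ in other versions). StubDBService, the MyQuery alias and all field names are hypothetical.

```scala
import zio.UIO
import zquery.{ DataSource, Request, ZQuery }

final case class Customer(id: Int, firstName: String, lastName: String)

// Hypothetical stand-in for the DBService used in this post.
object StubDBService {
  def getCustomer(id: Int): UIO[Customer] =
    UIO.succeed(Customer(id, "first name", "last name"))
}

object CustomerQueries {
  // Alias for queries that need no environment and cannot fail (an assumption).
  type MyQuery[A] = ZQuery[Any, Nothing, A]

  // A case class gives us structural equality, so two requests
  // for the same ID are considered the same request.
  final case class GetCustomer(id: Int) extends Request[Nothing, Customer]

  // One request in, one effect out; the name identifies the data source.
  val CustomerDataSource: DataSource[Any, GetCustomer] =
    DataSource.fromFunctionM("CustomerDataSource")(
      (req: GetCustomer) => StubDBService.getCustomer(req.id)
    )

  def getCustomer(id: Int): MyQuery[Customer] =
    ZQuery.fromRequest(GetCustomer(id))(CustomerDataSource)

  // In the schema, the field type changes from UIO to a ZQuery.
  final case class OrderView(id: Int, customer: MyQuery[Customer])
}
```

The only change visible in the schema is the type of the expensive fields; the request and data source definitions live next to the resolvers.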

Let’s now run our query: 9 DB Hits! We had a lot of redundant calls, because the same customers and products were referenced in multiple orders. Now each individual customer and product is read only once from the database. That is quite an improvement, but can we do even better?

ZQuery with batching

I mentioned earlier that, given a batching function, ZQuery is able to group requests to the same data source and query all the items at once. Let’s try to do that.

Our schema case classes stay exactly the same; the only difference is how we create our DataSource. Instead of using fromFunctionM, we will use fromFunctionBatchedM, which takes an Iterable[Request] (instead of a single Request) and must return an effect with an Iterable of our result type. The results must have the same length as the input requests and preserve their order. We will then call another function of our DBService that returns data for a list of IDs. As an example, here’s the new DataSource for Customer:
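A batched data source for Customer might be sketched as follows, under the same assumptions as before: ZIO 1.x, the zquery package as shipped with Caliban at the time of writing, and a hypothetical StubDBService exposing a getCustomers function that takes a list of IDs.

```scala
import zio.UIO
import zquery.{ DataSource, Request }

final case class Customer(id: Int, firstName: String, lastName: String)

// Hypothetical stand-in for the DBService used in this post.
object StubDBService {
  // Returns one customer per requested ID, in the same order as the IDs.
  def getCustomers(ids: List[Int]): UIO[List[Customer]] =
    UIO.succeed(ids.map(id => Customer(id, "first name", "last name")))
}

object BatchedCustomerQueries {
  final case class GetCustomer(id: Int) extends Request[Nothing, Customer]

  // All pending GetCustomer requests arrive together and are resolved
  // with a single call; results keep the order of the requests.
  val CustomerDataSource: DataSource[Any, GetCustomer] =
    DataSource.fromFunctionBatchedM("CustomerDataSource")(
      (requests: Iterable[GetCustomer]) => StubDBService.getCustomers(requests.map(_.id).toList)
    )
}
```

The code that builds queries with ZQuery.fromRequest does not change; only the data source does.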

How’s the result now for our test query? Only 3 DB Hits! 1 for getting our orders, 1 for getting all needed customers and 1 for getting all needed products. If our database supported it, we could even go down to 2 DB Hits by using a common DataSource for customers and products (our Request would then have to be a sealed trait with 2 possible case classes) and query them together.

The ZQuery data type doesn’t actually depend on the rest of Caliban and could totally be used without GraphQL. We expect to extract it into its own library at some point in the future.

We’ve seen different approaches to optimize our GraphQL backend and reduce unnecessary calls to our database. Using ZQuery is not always possible or even needed depending on the use case, but it’s usually a good choice when your schema has a lot of nesting.

As mentioned earlier, you can find the 4 different implementations in this repository. If you’d like to know more about ZQuery, you can have a look at the dedicated page on the Caliban website or come discuss with us on the Discord channel.

In the last part of this series, we will explore wrappers, a new feature of Caliban that makes it possible to implement a wide range of custom behaviors during query or field processing.

