Profiling GraphQL

As soon as you put your API in production, people inevitably will come to you at some point and start complaining about slow requests :) At this point it does not matter whether you are exposing REST API or GraphQL or anything else, performance is always an important concern. So starting to think about an API performance early is almost always a good idea.

GraphQL has quite a few advantages when it comes to instrumentation and performance monitoring. Due to it’s type system and fine-grained field resolution mechanism, we can instrument an execution and measure performance of every single field in the GraphQL schema. Sometimes I hear people expressing concerns about N + 1 problem in GraphQL queries which is not present in, let’s say, RESTful approach. I would disagree with this. With REST API we still have the same N + 1 problem, the difference is that with GraphQL client explicitly tells it to a server by asking for different relation fields (e.g. friends of friends). This makes it much easier to measure and optimize fetching of the data. With REST API we just get a series of seemingly unrelated HTTP request, good luck correlating and optimizing these :)

Given this insightful query information, we somehow need to put it into a good use and optimize the performance. In this post I would like to focus on the profiling and identifying the bottlenecks rather then optimizing these (which is the second step). If you would like to read more on the actual optimization, I would recommend you to check this nice post by andrewingram:

When it comes to profiling, I would like to approach it from 2 different perspectives:

  • Ad hoc profiling and debugging, where you just want to quickly investigate specific issue without extra hassle
  • Long term monitoring, where metrics are sent to an external time series DB for more deeper analysis and alerting

Introducing sangria-slowlog

With middleware, Sangria provides a very convenient way to instrument GraphQL query execution and introduce profiling logic. Out-of-the-box Sangria provides a simple mechanism to log slow queries and show profiling information. To use it, your need to add sangria-slowlog dependency:

"org.sangria-graphql" %% "sangria-slowlog" % "0.1.0"

Library provides a middleware (SlowLog) that logs instrumented query information if execution exceeds specific threshold. Here is an example:

import sangria.slowlog.SlowLog
import scala.concurrent.duration._
Executor.execute(schema, query, 
middleware = SlowLog(logger, threshold = 10 seconds) :: Nil)

If query execution takes more than 10 seconds, then you will see similar info in the logs:

As you can see, the profiling information is added in comments which makes it much easier to correlate server GraphQL types and resolve functions with the original query. sangria-slowlog has full support for GraphQL fragments and polymorphic types, so you will always see metrics for concrete types.

As you can see from this profiling information, it is immediately apparent that appearsIn and friends fields on a Human type have some performance issues, even in more non-trivial cases which involve inline and named fragments.

In addition to logging, sangria-slowlog also supports graphql extensions. Extensions will add a profiling info in the response under extensions top-level field. In the most basic form, you can use it like this (this approach also disables the logging):

Executor.execute(schema, query, 
middleware = SlowLog.extension :: Nil)

After middleware is added, you will see following JSON in the response:

metrics field in the extensions contains original query enhanced with profiling information and also different type of metrics: the execution metrics for all types and fields regardless of the execution path.

All SlowLog methods accept addExtentions argument which allows you to include these extensions along the way.

With a small tweaking, you can even include “Profile” button in GraphiQL! On the server you just need to conditionally include SlowLog.extension middleware to make it work:

if (request.queryString contains "profile") SlowLog.extension :: Nil else Nil

Here is an example of how this integration might look like (after about 30 min of GraphiQL tweaking):

Example of interactive GraphQL profiling

If you would like to learn more, I would recommend you to check the docs

GraphQL Monitoring

You can use sangria-slowlog, as the name implies, for logging of slow queries. For a long-term monitoring of a GraphQL API I would recommend you to also send metrics to a time series database of your choice. This makes it much easier to setup alerting and do more interesting analysis and data aggregation over a bigger dataset and longer timeframe.

One of the options is to use Apollo Optics. It is a time series database/service designed for GraphQL. It has nice data visualization and aggregation capabilities (sangria-optics-agent is already in works, though it is still WIP)

For more non-trivial cases (e.g. you are building multi-tenant system, you have dynamic schemas, etc.) or you require a self-hosted/custom setup, I would recommend you to check generic time series databases like InfluxDB or Prometheus.

Thanks to sangria’s middleware, it is quite easy to integrate with these. All it takes is just a few lines of code, but first you’ll need some way to aggregate and send metrics in your JVM process. Something like this (you probably already have something similar lying around in your codebase):

trait MetricsReporter {
def reportExecutionTime(
name: String,
ms: Long,
tags: Map[String, String]): Unit

After you have implemented one (or used one of the available scala libs for it), you need to implement the middleware that will report field execution time:

After everything is up and running, you will see something similar in your InfluxDB:

An example of GraphQL field metrics shown in InfluxDB

With all these metrics in the time series database, you should be able to aggregate and analyze the data with a dashboard like Grafana.