Evolving Content Delivery at USA TODAY NETWORK — Part 3

Ryan Grothouse
Published in USA TODAY NETWORK
Oct 9, 2018 · 4 min read

At USA TODAY NETWORK, we’re aiming for a better consumer experience, and at the forefront of that effort is optimizing our application performance. When we switched our backend Content API from a REST-based architecture to GraphQL, we had to rethink both how we measured application performance and what safeguards we had in place. Let’s explore.

The Problem

Historically, we set up dashboards and monitors based on a target baseline response time for a given HTTP route — a common approach. This works well for a REST service because each resource has its own HTTP route, so we knew the response time to retrieve any given resource. GraphQL, however, runs on a single HTTP endpoint and provides an a la carte method of data retrieval: clients can target exactly the resources they need, potentially many of them, in a single HTTP request. Building a page that might span four, five or more REST calls can be accomplished in a single GraphQL query.

Let’s look at an example:

In our legacy REST API, we knew that an article retrieval via a call to /v1/article/1 had a response time of ~50ms. If we saw a sudden spike against that baseline, we knew we had a problem.

In GraphQL, a client could make a similar request:

{
  article(id: 1) {
    id
    body
  }
}

which would have a response time closely tracking that of the REST call, because the server is doing the same amount of work.

However, as we explored in a previous blog post, one advantage of GraphQL is that a client has the ability to request any number of resources in a single HTTP request:

{
  article(id: 1) {
    id
    body
  }

  image(id: 2) {
    id
    url
  }

  boxScore(id: 3) {
    date
  }
}

This request performs three times the work of the first call, so understandably its response time is higher. In our REST service, clients would have to make three independent HTTP calls to get the same data: one for the article, one for the image and another for the box score. The GraphQL query is far more performant when you adopt the client’s mindset and consider the sum of all response times needed to complete the transaction (i.e., building a webpage). We’ve given the client the ability to request more work in fewer round-trip calls. For the backend, this makes measuring application performance more complex.

The Solution

We’ve established that it’s not enough for us to only monitor HTTP response times, as one client may request more resources than the next and response times between transactions will fluctuate. We developed a way to normalize response times by factoring in the amount of work being asked of the service. We called this metric per-resource response time.

To accomplish this, we started tracking the number of resources returned in a GraphQL transaction and divided the transaction’s response time by that count. This normalizes our response times and gives us our per-resource response time, which becomes our KPI and the metric we use to identify issues.
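
To make the arithmetic concrete, here is a minimal sketch of that normalization in TypeScript; TransactionSample and perResourceResponseTime are illustrative names, not our production code:

interface TransactionSample {
  durationMs: number;    // total response time for the GraphQL transaction
  resourceCount: number; // number of resources resolved while serving it
}

// Normalize total time by the number of resources returned, so a query
// that fetches three resources is comparable to one that fetches one.
function perResourceResponseTime(sample: TransactionSample): number {
  return sample.durationMs / Math.max(sample.resourceCount, 1);
}

// A 150ms transaction returning 3 resources normalizes to 50ms,
// in line with the single-resource REST baseline above.
console.log(perResourceResponseTime({ durationMs: 150, resourceCount: 3 })); // 50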

We track this result over time, and we also built a heat map around the metric to find our hot spots.

Resource Exhaustion and Query Complexity

As it pertains to application performance, measuring response time wasn’t the only consideration in shifting from REST to GraphQL. We had to rethink what safety nets were in place to protect the application from resource exhaustion. In the past, we could limit the number of requests per second per client at our API gateway. This was enough protection to thwart runaway clients or denial-of-service attempts. With the flexibility of GraphQL, a client is inadvertently given the power to craft an overly complex query that could potentially lead to resource exhaustion. To mitigate this risk, we knew we needed a way to calculate query complexity.

We thought we could use the number of resources returned per transaction (as we do when calculating per-resource response time), but that approach has two flaws:

  • The number of resources isn’t known until processing has completed, and by that point the damage is done.
  • Data changes outside our control could suddenly cause queries that previously succeeded to fail.

Instead, we developed a method of gauging query complexity that is derived from nothing but the query itself.

How it Works

Up front, we assign a complexity value to every resource and field in the code. At request time, before any real processing begins, we calculate the query’s complexity. We have defined a maximum query complexity allowed by the system, and if a request calculates to a higher complexity than that maximum, we immediately reject it:

{
  "data": null,
  "errors": [
    {
      "message": "maximum complexity cost 120 exceeded, query cost 132"
    }
  ]
}
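
As a rough illustration of such a gate, here is a sketch built on the graphql-js parse and visit utilities. The cost table and MAX_COMPLEXITY value are invented for the example; this is not our production implementation:

import { parse, visit } from "graphql";

// Hypothetical per-resource and per-field costs, for illustration only.
const FIELD_COSTS: Record<string, number> = {
  article: 10, image: 5, boxScore: 8, // resources
  id: 1, body: 2, url: 1, date: 1,    // fields
};
const MAX_COMPLEXITY = 120;

// Sum the cost of every field in the parsed query. This walks only the
// query document itself; no resolvers run until the request passes the gate.
function queryComplexity(source: string): number {
  let cost = 0;
  visit(parse(source), {
    Field(node) {
      cost += FIELD_COSTS[node.name.value] ?? 1; // unknown fields cost 1
    },
  });
  return cost;
}

const cost = queryComplexity("{ article(id: 1) { id body } }");
if (cost > MAX_COMPLEXITY) {
  // Reject before doing any real work, mirroring the error payload above.
  throw new Error(`maximum complexity cost ${MAX_COMPLEXITY} exceeded, query cost ${cost}`);
}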

Additionally, successful requests include the query complexity value, giving developers visibility into the complexity of their queries:

{
  "data": {
    "article": {
      "id": "1296261002"
    }
  },
  "queryComplexity": 11
}
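
And a companion sketch, under the same assumptions as above, of echoing the cost back on success. It attaches queryComplexity as a top-level field to match the payload shown here, though graphql-js would more conventionally carry it in the extensions map:

import { graphql, buildSchema, parse, visit } from "graphql";

// Hypothetical costs again: article (10) + id (1) = 11, as in the payload above.
const FIELD_COSTS: Record<string, number> = { article: 10, id: 1 };

const schema = buildSchema(`
  type Article { id: ID }
  type Query { article(id: ID): Article }
`);
const rootValue = { article: ({ id }: { id: string }) => ({ id }) };

// Execute the query and attach its computed cost to the response.
async function executeWithComplexity(source: string) {
  let cost = 0;
  visit(parse(source), {
    Field(node) { cost += FIELD_COSTS[node.name.value] ?? 1; },
  });
  const result = await graphql({ schema, source, rootValue });
  return { ...result, queryComplexity: cost };
}

executeWithComplexity('{ article(id: "1296261002") { id } }')
  .then((response) => console.log(JSON.stringify(response)));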

In addition to protecting us from resource exhaustion, query complexity came with an added bonus: it drives better behavior from our clients, since developers now have a means of measuring and optimizing their queries.

GraphQL has driven us to rethink how we design, consume and even monitor our application stack. The flexibility it provides affords us tremendous benefits, but it also introduces new complexities. We’ve explored how we adopted per-resource response time and query complexity within our application performance monitoring arsenal to address two of these newfound complexities.
