When Client Developers Love GraphQL To Death

Scott Malabarba
Dec 11, 2019
Photo by Stephen Radford on Unsplash

One day our devops director said to me, “Hey, I’m testing our AWS scale-out policies — can you make a GraphQL query that stresses the container?” So, knowing every intricacy and weakness of the API at that time, I did my worst.

My worst stressed that container, all right. Then it kicked the container in the groin and jumped up and down on it and rolled it off a cliff into a pool of lava (just came off a Minecraft break, okay?). This was on a staging environment, of course. I think.

So I tracked down every hole in the API's query-handling guardrails and fixed them. In this case, insider knowledge and adversarial intent were involved. Every other time, though, it was a client developer who did it by accident.

Here’s the problem:

GraphQL is awesome! It lets a client developer come up with their own API!

OMG, GraphQL lets a client developer come up with their own API… Against your back end.

All these problems can happen in a REST API too, and I’ve seen most of them. The difference is that REST server developers have to specifically enable them, whereas GraphQL sort of invites abuse.

Let’s take a simple GraphQL schema.

type Author {
  books: [Book!]!
  name: String!
}
type Book {
  isbn: ID!
  # URL to an S3 object containing an 8000-word excerpt
  excerptURL: String
  # An excerpt, up to 8000 words
  excerpt(
    # Language for the excerpt. If not the language of the
    # original text, it will be translated.
    language: String = "en-us"
  ): String
  title: String!
  authors: [Author!]!
}
type Query {
  books(
    # get the book exactly matching the specified ISBN
    isbn: ID
    # get books with a substring match on title
    title: String
  ): [Book!]!
  authors(
    name: String
  ): [Author!]!
}

Sweet! We can do queries like this, to get translated excerpts of a book:

query {
  books(isbn: "123") {
    en: excerpt(language: "en")
    fr: excerpt(language: "fr")
    ja: excerpt(language: "ja")
  }
}

or this, to get all books by a given author:

query {
  authors(name: "me") {
    books {
      isbn
      title
    }
  }
}

But we can also run queries like this:

query {
  authors {
    books {
      authors {
        name
      }
    }
  }

  books {
    title
    authors {
      books {
        isbn
        title
      }
    }
    en: excerpt(language: "en-us")
    fr: excerpt(language: "fr")
    ja: excerpt(language: "ja")
  }
}

Why would you EVER do that?

Well, imagine that we have a book list app. Its landing page shows tiles, one for each book in the system. Each tile needs to have the author’s name and a mouseover showing authors they’ve worked with, and the book title mouseover should show an excerpt in the user’s native language (based on their browser settings).

The UI developer looks at those requirements and thinks, “Awesome! I love GraphQL! I can get all that in one query, so my UI is efficient!”

My example schema is, admittedly, contrived, but I’ve seen weirder queries in real life — and usually for good reasons, based on the UX requirements.

Now let’s think about what the back end is doing.

Typically, the resolver functions for the books and authors queries would make a query against the data store — something like SELECT * FROM books. The resolver for an object-valued field like Author.books would make a query like SELECT * FROM books WHERE author_id = 123. Hopefully there is default paging so a LIMIT 100 gets added.

The top-level books query would be something like SELECT * FROM books.

If the data model is clean, then this is a simple and efficient query that returns quickly to the GraphQL server. So we’ve got some books. Yay! But now the server has to resolve each of these fields on every book row returned:

title
authors {
  books {
    isbn
    title
  }
}
en: excerpt(language: "en-us")
fr: excerpt(language: "fr")
ja: excerpt(language: "ja")

title is easy; it was already in the object we got back from the database layer.

authors, however, requires its own SQL query like the Author.books example above. So now we’re going to hit the database again, for each book row. This is also a pretty lightweight query, but we might be doing many of them depending on the limit/page size and size of the table. If the books table has 100 books and each author has on average 10 books, then we’re now making 1000+ SQL queries for this request.
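To make that concrete, here is roughly what naive resolvers for this schema might look like in a Node.js/TypeScript server. The db.query helper, the book_authors join table, and the column names are illustrative assumptions, not anything the schema prescribes:

// Hypothetical context type: `db.query` stands in for whatever database
// client the server actually uses (for example pg's Pool.query).
type Ctx = { db: { query: (sql: string, params?: unknown[]) => Promise<any[]> } };

const resolvers = {
  Query: {
    // One query for the top-level field; default paging adds the LIMIT.
    books: (_: unknown, _args: unknown, ctx: Ctx) =>
      ctx.db.query("SELECT * FROM books LIMIT 100"),
    authors: (_: unknown, _args: unknown, ctx: Ctx) =>
      ctx.db.query("SELECT * FROM authors LIMIT 100"),
  },
  Author: {
    // Runs once per author row returned above: the classic N+1 pattern.
    books: (author: { id: number }, _args: unknown, ctx: Ctx) =>
      ctx.db.query("SELECT * FROM books WHERE author_id = $1", [author.id]),
  },
  Book: {
    // Likewise, one extra query per book row.
    authors: (book: { isbn: string }, _args: unknown, ctx: Ctx) =>
      ctx.db.query(
        "SELECT a.* FROM authors a JOIN book_authors ba ON ba.author_id = a.id WHERE ba.book_isbn = $1",
        [book.isbn]
      ),
  },
};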

Now let’s take a look at those excerpt fields. The excerpt is stored at an external URL referenced in the database, so in the resolver function the server will have to use an HTTP client library to hit the URL and load the contents into memory. This is, most likely, more time-consuming than the database queries.

Worse, two of them must be translated (assuming that the original is en-us). So the entire excerpt must be streamed out over HTTP to an external translation service, and the (eventual) response streamed back before the resolver can return.
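Here is a sketch of what that excerpt resolver ends up doing. The translation endpoint, its payload shape, and the originalLanguage field on the book row are invented for illustration:

// Fetch the excerpt from S3 via its URL, then (if needed) round-trip the
// whole thing through a translation service before returning.
async function resolveExcerpt(
  book: { excerptURL: string | null; originalLanguage: string },
  args: { language?: string }
): Promise<string | null> {
  if (!book.excerptURL) return null;

  // Pulls up to ~40 KB of text into memory per call.
  const res = await fetch(book.excerptURL);
  const original = await res.text();

  const target = args.language ?? "en-us";
  if (target === book.originalLanguage) return original;

  // Stream the entire excerpt out to an external service and wait for the
  // translated copy to come back: easily 500 ms or more per call.
  const translated = await fetch("https://translation.internal/v1/translate", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ text: original, to: target }),
  });
  const data = (await translated.json()) as { text: string };
  return data.text;
}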

We said in the schema that the excerpt is at most 8000 words (on average, three chapters or a long-ish short story). Assuming we only need to load that much and that words average 5 letters, that’s at least 40k each (probably more, depending on the source file format and encoding).

That probably doesn’t sound like a lot. But we have three of them here, and the memory space is duplicated at least once for the translated versions, so we’re eating up 200k+ of server memory per book. For 100 books that’s now 20 MB.

We’re going to duplicate all that again when the server serializes the data to JSON and streams it out.

Let’s add up the resources required so far for this query.

- 1000 SQL queries

- 300 HTTP calls

- 40+ MB memory

That might not seem like a lot. But let’s drill into each one.

Those 300 HTTP calls should raise a red flag right away — we’re calling out to an external endpoint that does something fairly complex (translation), so assume something like 500 ms per call. The server should be making the calls concurrently (Node.js’s async I/O makes that the natural default), so it’s not like we have to wait 150 seconds for everything to finish. However, the server still needs a total of 150 seconds of HTTP connection time.

Most HTTP libraries implement automatic connection pooling, which means the GraphQL resolver function will grab a connection from the pool instead of making a new one, and will wait if all the connections are in use. Whether a connection pool is used, and its default maximum size, depends on the HTTP library in question, but it’s something to watch out for. In early versions of Node.js the default was just 5 sockets per host; in modern versions it’s infinite. But there might also be a limit on the maximum number of sockets at the operating-system level within the container or its host.
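In Node.js, for example, the knob is the agent's maxSockets. The sketch below pins it explicitly rather than relying on a library default; the shared agent for the translation service is an illustrative choice, not a requirement:

import https from "node:https";

// One shared keep-alive agent with an explicit socket cap. Most Node HTTP
// clients (axios, got, node-fetch) accept a custom agent like this; without
// it you get whatever the library's default pooling behavior happens to be.
export const translationAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 100, // hard cap on concurrent connections from this process
});

// e.g. axios.post(url, body, { httpsAgent: translationAgent })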

Say our connection pool max size is 100. Then, for this request, the server will quickly max out the pool and have to wait about 500ms for the connections to free up before making the second and then third batches of requests. So the overall request takes 1500ms+.

Much worse, however, is that the pool is shared across the server. So any HTTP connection attempted by any resolver for any other request will also block for 500ms+. Connection starvation can effectively break APIs that are unrelated to the initiating request and do not require extensive resources. Multiple repeated expensive requests can effectively DoS the server; what happens if each request needs to call into an authorization microservice over HTTP before proceeding? Oops!

A properly sized production SQL database will return from simple queries in < 1ms. But that same connection starvation we saw for HTTP requests can affect database queries in the same way.

In my examples the queries are simple and don’t stress the database. But what if they’re not? What if the queries involve complex joins or go against tables with 100 million+ rows that are not properly indexed? Or what if the database bogs down because of something completely unrelated? Like, someone accidentally ran CREATE INDEX huge_index ON books(title) without CONCURRENTLY and books has millions of rows (seen it, maybe <ahem> done it…).

Then those 1000 queries are going to take seconds instead of milliseconds, eat up the server’s entire connection pool, and possibly eat up the database’s entire connection pool as well (most databases have a max concurrent connections setting — if any client makes enough connections to exceed the total max, then all clients are affected). And we’re down.

Now let’s talk about memory. 40 MB doesn’t seem like a lot.

But consider how the server is actually deployed. It’s probably using something like AWS’s Elastic Container Service, where we have a big, dynamic pool of relatively small VMs. 2GB RAM is common.

Typically VMs are sized and scaled to be near-fully utilized — if we have 4 GB VMs but average memory usage is only 1GB, then we’re over-paying and chances are someone’s going to scale down the VM size.

Now imagine that 10 of these requests hit a 2 GB container at once, or that a client asked for 1000 books instead of 100. Now we’ve eaten up 400 MB of memory. If the container was at 1500 MB mem used, then the runtime is probably going to start garbage collecting, which slows down processing. If it’s at 1600 or more, then it might start thrashing. If memory was already tight, at 1900 MB or more, then it will likely crash or stop responding.

We’ve now taken down a container in several different ways, with a query that is not even that bad. I’ve seen much worse… query { foo(limit:1000) { bar(limit:1000) { … }}}. Sometimes people basically ask for the entire database back.

Yes, it’s only one container, and we’ve got at least three (right?) and maybe hundreds. But, when a container dies or slows down, then any client that had a request in-flight is going to get an unexpected error back (502, 503, or 504, most likely). That’s bad. We’ve got a sudden spike in errors. Someone’s probably going to notice. If we’re striving for five 9s of reliability then our SLA just got trashed and alerts are going off and if it’s 3 AM then someone might be getting dragged out of bed.

Now imagine that a client is hitting the system with these requests repeatedly. This could be an integration script somewhere that makes paged queries in a tight loop, or a single user pounding “refresh” on a non-responsive web UI.

It can take a minute or more to fully cycle up a replacement container. So… one after the other the containers crash and aren’t replaced quickly enough to keep up. Boom. DoS (seen it). It’s even more fun if repeated expensive queries knock down the database or a critical external service (seen that, too).

Okay, now what do we do about this?

Caching objects we retrieved from the database within the context of a single request is a common optimization and easy to implement, either with open source tools like DataLoader or custom code. That at least prevents the server from fetching the same object (say, a given author) N times in a single request. It doesn’t help if the results don’t include repeated references, though.
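A minimal sketch of the per-request pattern with DataLoader, assuming the same hypothetical db.query helper as before; build a fresh loader for each request context so nothing leaks across requests:

import DataLoader from "dataloader";

// Batches and caches Author lookups within one request: resolving
// Book.authors for 100 books issues one SQL query instead of 100, and a
// given author is fetched at most once per request.
function makeAuthorLoader(db: { query: (sql: string, params?: unknown[]) => Promise<any[]> }) {
  return new DataLoader(async (ids: readonly number[]) => {
    const rows = await db.query("SELECT * FROM authors WHERE id = ANY($1)", [ids]);
    const byId = new Map(rows.map((r) => [r.id, r]));
    return ids.map((id) => byId.get(id) ?? null);
  });
}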

Caching objects in memory across requests increases the rate of cache hits, but invites bugs. I do this only when absolutely required as a critical path performance optimization. And it still doesn’t prevent that first request, before the cache is populated, from taking down the container.

We can limit page size. This one constraint goes a long way, though you still have to be careful of nested paged fields. It’s hard to say what the limit should be. 10? 30? 100? Some people are comfortable just picking a number, but to me it feels arbitrary.

Regardless, page size limits are hard to add reactively because doing so will break clients. Much easier to add it in the beginning, consistently, so that client developers are forced to implement paging properly across the board. Still, expect to get questions and complaints about “not all data being returned”.
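The enforcement side is small. A sketch, assuming resolvers take a limit argument like the later examples do (the cap of 100 is a placeholder, not a recommendation):

const MAX_PAGE_SIZE = 100;

// Applied in every resolver that accepts a `limit` argument, so clients can
// ask for less than the cap but never more.
function clampLimit(requested?: number | null): number {
  if (requested == null) return MAX_PAGE_SIZE;
  if (requested < 1) throw new Error("limit must be at least 1");
  return Math.min(requested, MAX_PAGE_SIZE);
}

// e.g. in the books resolver:
//   const limit = clampLimit(args.limit);
//   return ctx.db.query("SELECT * FROM books LIMIT $1", [limit]);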

We can compute and limit the size of the server response to prevent memory overload. One handy way to do this is with a resolver wrapper function: each resolver computes the size of its data and increments a counter stored in the request context, and the server errors out if the total exceeds a threshold.

This isn’t hard to implement, but it has some performance cost, since the wrapper is constantly re-crunching data (the efficiency of object-size computation depends heavily on the programming language). It’s also hard to make accurate — in Node.js, at least, it’s difficult to pare a query result down to only those fields the user requested, so the object will have fields that came from the data store but aren’t returned to the client. This is actually better for protecting the server, since it’s the actual memory usage that counts. It just might be confusing to users who get an error message.
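A minimal sketch of that wrapper, using JSON.stringify as a crude size estimate and a running counter on the per-request context; the threshold and field names are arbitrary:

const MAX_RESPONSE_BYTES = 10 * 1024 * 1024; // arbitrary threshold

type SizeTracking = { bytesSoFar: number }; // lives on the per-request context

// Wraps a resolver so each result's (rough) serialized size is added to the
// request's running total, failing the request once the total gets too big.
function withSizeLimit<TCtx extends SizeTracking>(
  resolve: (parent: any, args: any, ctx: TCtx) => Promise<any> | any
) {
  return async (parent: any, args: any, ctx: TCtx) => {
    const result = await resolve(parent, args, ctx);
    ctx.bytesSoFar += Buffer.byteLength(JSON.stringify(result ?? null));
    if (ctx.bytesSoFar > MAX_RESPONSE_BYTES) {
      throw new Error("Response too large; narrow the query or page through results");
    }
    return result;
  };
}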

Okay, we did all that, but the server still crashes! Why? Maybe because a resolver function pulls in a response from an HTTP API and that one download was 500 MB by itself and crashed the server before our resolver function wrapper could kick in.

Now we need to write a custom download function that tracks download size as it streams and cuts the request off if it hits a threshold. This is good; users should be discouraged from using GraphQL to stream large objects (that’s not what it’s for).
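A sketch of such a download helper using Node 18+'s global fetch: it counts bytes as they stream in and aborts the transfer rather than buffering an arbitrarily large object into memory.

async function fetchWithCap(url: string, maxBytes: number): Promise<string> {
  const controller = new AbortController();
  const res = await fetch(url, { signal: controller.signal });
  if (!res.ok || !res.body) throw new Error(`Fetch failed: ${res.status}`);

  const chunks: Uint8Array[] = [];
  let received = 0;
  const reader = res.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    received += value.byteLength;
    if (received > maxBytes) {
      controller.abort(); // stop the transfer instead of buffering more
      throw new Error(`Download exceeded ${maxBytes} bytes`);
    }
    chunks.push(value);
  }
  return Buffer.concat(chunks).toString("utf-8");
}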

Now we’ve got some pretty good protection. But in a complex data model, we’re still vulnerable to large queries with nested fields. And developers love these because they’re powerful and convenient.

The most robust and flexible solution is to add query cost computation and limits. Then we might allow a developer to do this:

query {
  books(limit: 100) {
    title
  }
}

or this:

query {
  authors(limit: 10) {
    books(limit: 10) {
      title
    }
  }
}

or this:

query {
  books(limit: 10) {
    title
    excerpt(language: "fr")
  }
}

but not this:

query {
  authors(limit: 100) {
    books(limit: 100) {
      title
    }
  }
}

or this:

query {
  books(limit: 100) {
    excerpt(language: "es")
  }
}

A cost can be computed based on the limit value and some defaults, like assuming that every field has cost=1. Then we can use our knowledge of the resolver implementations to fine-tune the computation. In our example we might do this:

type Book {
  title: String!

  authors: [Author!]!
    @cost(cost: 2) # costs more since it requires a SQL query

  excerpt(language: String!): String!
    @cost(cost: 10) # costs a lot more since it calls out
                    # to an external translation service
}

Say we set max cost to 200 (a somewhat arbitrary number). Then our “good” queries would succeed, but the problematic queries would exceed the cost and be rejected immediately.
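The arithmetic, in a deliberately simplified form: a field's cost plus its children's cost multiplied by the requested limit, with unannotated fields defaulting to cost 1. Real implementations walk the parsed query document, and the exact rule is a design choice rather than anything the spec dictates, but this is enough to see why one query passes and the other doesn't:

// A selection is a field with a per-field cost, an optional list limit, and
// child selections. Unannotated fields default to cost 1.
type Selection = { cost: number; limit?: number; children?: Selection[] };

function queryCost(sel: Selection): number {
  const childCost = (sel.children ?? []).reduce((sum, c) => sum + queryCost(c), 0);
  return sel.cost + (sel.limit ?? 1) * childCost;
}

// authors(limit: 100) { books(limit: 100) { title } }
const tooExpensive: Selection = {
  cost: 1,
  limit: 100,
  children: [{ cost: 2, limit: 100, children: [{ cost: 1 }] }],
};
queryCost(tooExpensive); // 1 + 100 * (2 + 100 * 1) = 10,201 -> rejected at max cost 200

// authors(limit: 10) { books(limit: 10) { title } }
const fine: Selection = {
  cost: 1,
  limit: 10,
  children: [{ cost: 2, limit: 10, children: [{ cost: 1 }] }],
};
queryCost(fine); // 1 + 10 * (2 + 10 * 1) = 121 -> allowed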

The logic for this is pretty straightforward. I rolled it myself a while back and it was a small-ish project. These days there are open-source frameworks, such as graphql-cost-analysis for Node.js (I haven’t used it yet, but it looks well thought-out and does everything I ever needed).

Put that in place and you have some solid protection against expensive queries, and client developers get immediate feedback, while they’re still developing, that their queries need to be pared down.

That was easy, right?

Enforcing cost limits after people have been using your API in production for a year?

Aaaaaaaaahhhhhhhhhhh!

Now we have to:

  1. Figure out what we want the cost limit to be
  2. Find every query in use that exceeds it
  3. Figure out who’s making the queries and get them to fix them (or do it ourselves)
  4. Stop people making new costly queries
  5. Finally, turn on hard errors for cost limit exceeded

Remember all that logging and reporting and monitoring I wrote about for deprecated APIs? Well, we get to do it all here, too. Also a bunch of other stuff like alerts and posting errant queries to Slack and wrestling with teams that are making queries with cost of 1,000,000,000 but don’t have bandwidth to change them. Good luck!

Or I guess you can just turn on the limit and break everyone’s client code and sip coffee while you watch them scramble to fix everything. That would be amusing, but I’ve never had the political capital to try it.

Another option is to just whitelist queries. In this model, client developers can’t just make any query they want, at least not in production. Rather, they need to request or register each query with the API team, who can then review and test each one to make sure it can’t cause any performance problems.

Facebook (where GraphQL originated) does something like this — they pre-compile the queries and let clients reference them by ID (this also saves parsing).
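A bare-bones sketch of the idea: the server keeps a registry of reviewed queries and only executes what it recognizes. The IDs and registry shape here are invented; Facebook's system and Apollo-style persisted queries differ in the details.

// Registered ahead of time by the API team after review; clients send only the ID.
const approvedQueries = new Map<string, string>([
  ["bookTiles.v1", "query { books(limit: 100) { title authors { name } } }"],
  ["authorDetail.v1", "query ($name: String) { authors(name: $name) { name books { title } } }"],
]);

function lookupQuery(id: string): string {
  const query = approvedQueries.get(id);
  if (!query) {
    throw new Error(`Unknown or unapproved query id: ${id}`);
  }
  return query;
}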

For some use cases whitelisting works well. It’s very restrictive, though, so probably best for cases where client and API development are very closely tied, external developers are not using the API, and there’s enough process in place to keep things smooth and reliable. It’s never been an option for me.

There’s a lot more to this. All of the complicated and devious things I’ve done to figure out why servers are crashing and know when they’ve crashed and find out who’s crashing them and stop them from crashing would fill a book. I do write books, but they’re generally about aliens and sarcastic middle-schoolers. For today I’ll leave it with those key techniques.

Punch line: I really wish I’d plugged in cost computation and a hard limit from the start. If you’re just starting with a GraphQL API, do that!
