GraphQL and Resource Limitations

Over the last few months, the Engineering team here at Greenhouse has been developing a general purpose GraphQL API for our Onboarding product. Being a relatively new technology, GraphQL posed some interesting and unique challenges. We learned quite a bit along the way, and we’d now like to offer a bit of a retrospective on our experience. Hopefully you can take some of the lessons we learned and apply them to your own API.

This particular blog post will focus on limiting resources in GraphQL. We’ll briefly discuss how it differs from rate limiting in a typical REST API, how we’ve currently chosen to implement it, and how we’d like to enhance it in the future. All code examples will be provided in Ruby using the fantastic graphql gem, but the topics discussed here will apply to any GraphQL API — regardless of language or framework.

How It’s Different from REST

While researching GraphQL, we quickly realized that our standard rate limiting approach would not work. If you’re familiar with any of the other Greenhouse APIs (e.g. Harvest), you’ll know that we use a fixed-window rate limiting algorithm to prevent surges in requests from any given API key. Put another way, each API key is allocated X requests for every Y second interval. If a consumer were to make X+1 requests within that time window, the last request would fail with a 429 response.
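For illustration, a fixed-window counter can be as simple as the following sketch (an in-memory toy with made-up limits, not our production implementation):

require "time"

# A minimal in-memory sketch of a fixed-window limiter -- illustrative only.
class FixedWindowLimiter
  def initialize(limit:, window_seconds:)
    @limit = limit
    @window_seconds = window_seconds
    @counters = Hash.new(0)
  end

  # Returns true if the request should be serviced; false means the key has
  # used up its allocation for the current window (i.e. respond with a 429).
  def allow?(api_key)
    window = Time.now.to_i / @window_seconds
    key = [api_key, window]
    @counters[key] += 1
    @counters[key] <= @limit
  end
end

limiter = FixedWindowLimiter.new(limit: 50, window_seconds: 10)
limiter.allow?("some-api-key") # => true until the 51st call within a 10-second window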

This approach works well with REST APIs simply because the result of any given request will involve a foreseeable cost to service. For example, in our Harvest API, every request to the GET Employee endpoint will entail a fairly predictable amount of work - the returned object is (for the most part) always the same shape and contains the same fields. This, however, is not true with GraphQL. GraphQL requests are incredibly flexible. They allow consumers to specify exactly what information they want from us. Consequently, a single GraphQL request could generate as much work as multiple REST requests.

Ok, so, what now?

Since request counting alone wouldn’t prevent a consumer from overloading our servers, we considered a hybrid approach: perhaps we could limit the number of requests in a given time period while also limiting the effect of any given request. This would allow us to manage the volume of requests being processed while also ensuring that each request remains within the bounds of what we deem reasonable.

Depth Limits

One of the more obvious ways to craft a malicious GraphQL request is to exploit nested relationships on a given object. For example, the Onboarding API exposes an Employee type. Each Employee can belong to a Department, and each Department can have an Employee in the form of a departmentLead. Recognizing this, one could form a query like this:

{
  employee(id: 1) {
    # all the fields on employee...
    department {
      # all the fields on department...
      departmentLead {
        # all the fields on employee...
        department {
          # all the fields on department...
          departmentLead {
            # all the fields on employee...
            department {
              # all the fields on department...
            }
          }
        }
      }
    }
  }
}

A cyclical relationship could very well exist between an Employee and a Department (e.g. Artemis is the Engineering Department Lead, and she also belongs to the Engineering Department). A query that exploits this circular reference could place significant load on our API servers.

The aforementioned GraphQL gem provides a mechanism to recognize and reject a request like this: depth-limiting. Depth-limiting places a ceiling on the amount of nesting that can be present in any given request. Before the request is ever processed, we first calculate its deepest nesting level (there are 6 levels of nesting in the query above). If the nesting depth exceeds our limit, the request is rejected outright. The gem allows this to be specified with one simple line of code:

AppSchema = GraphQL::Schema.define do
  max_depth 5
  # more app configuration...
end

With this configuration, our API will only service requests with 5 levels of nesting or fewer, and the example request above would be rejected before we ever begin to work on it.
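As a quick sketch of what a consumer would see (the exact error wording depends on the version of the gem), the rejected request comes back with no data and a descriptive error:

# Assuming `query_string` holds the cyclical query from above, the gem halts
# the request during analysis -- none of our resolvers ever run.
result = AppSchema.execute(query_string)

result["data"]   # => nil
result["errors"] # => a single error explaining that the configured max depth was
                 #    exceeded (the exact message varies by gem version)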

This approach has a few advantages. First, it takes care of a large category of dangerous queries. Second, it is easy for consumers to reason about and understand: simply count the levels of nesting in your query and, if it’s under the limit, you’re good to go.

Complexity Limits

Unfortunately, depth-limiting isn’t able to prevent all possible troublesome queries. To get around the depth limit, we can simply request many fields on the root query object. Take, for example, this request:

{
  employee_1: employee(id: 1) {
    # every possible Employee field nested as far as the depth limit allows
  }

  employee_2: employee(id: 2) {
    # every possible Employee field nested as far as the depth limit allows
  }

  employee_3: employee(id: 3) {
    # every possible Employee field nested as far as the depth limit allows
  }

  ...

  employee_n: employee(id: n) {
    # every possible Employee field nested as far as the depth limit allows
  }
}

This request consists of multiple Employee queries on the root query object (each one aliased with employee_n), and each one generates a non-trivial amount of backend work to service. Each of them adheres to our depth limit, so it's a request that our servers would attempt to process. Taken together, these queries could generate a good deal of stress on our infrastructure.

Since depth-limiting wouldn’t protect us from expensive requests of this form, we decided to switch strategies: imposing complexity limits on all requests. Similar to our hybrid depth-limiting approach, we utilize a fixed-window algorithm to limit the number of requests over a period of time. However, instead of using depth as the measure of a request’s cost, we now use a score that reflects the totality of fields requested.

Once again, the GraphQL gem provides built-in support for this kind of complexity analysis. It allows you to assign an arbitrary number of “points” to any field. Prior to executing a request, the gem will use these configured point values to calculate the total “complexity” of the request. If the total complexity score exceeds our configured threshold, it immediately halts execution and returns the appropriate error message.

To keep things straightforward, we configured every one of our fields to have a score of 1. Thus, to calculate a request's complexity score, you simply add up every requested field. In the case of Connections, we add up every field requested within the Connection and multiply it by the number of records requested (either the first or last argument).
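As a sketch of how those points get attached to the schema (the type definitions, field names, and default page size below are illustrative, and we show a plain list field rather than a full Relay connection for brevity), the gem's define-style DSL lets a field declare a complexity lambda that scales with the page size:

require "graphql"

# Hypothetical type definitions -- not our actual schema.
EmployeeType = GraphQL::ObjectType.define do
  name "Employee"
  field :email, types.String      # 1 point
  field :firstName, types.String  # 1 point
  field :lastName, types.String   # 1 point
end

DepartmentType = GraphQL::ObjectType.define do
  name "Department"
  field :id, !types.ID       # 1 point
  field :name, types.String  # 1 point

  # A list of employees whose cost scales with the number of records requested.
  field :employees, types[EmployeeType] do
    argument :first, types.Int, default_value: 25
    # child_complexity is the summed cost of the fields selected on each employee.
    complexity ->(ctx, args, child_complexity) { (args[:first] || 25) * child_complexity }
  end
end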

For example, the following request would have a complexity score of 18.

{
  employee_1: employee(id: 1) {
    email        # 1 point
    firstName    # 1 point
    lastName     # 1 point
    department { # 1 point
      id         # 1 point
      name       # 1 point
    }
  }

  employee_2: employee(id: 2) {
    email        # 1 point
    firstName    # 1 point
    lastName     # 1 point
    department { # 1 point
      id         # 1 point
      name       # 1 point
    }
  }

  employee_3: employee(id: 3) {
    email        # 1 point
    firstName    # 1 point
    lastName     # 1 point
    department { # 1 point
      id         # 1 point
      name       # 1 point
    }
  }
}

If we deemed such a request too burdensome, we could set the max complexity to 17 (this is a contrived example — in reality this request is very lightweight). Again, this is easily done with the GraphQL gem:

AppSchema = GraphQL::Schema.define do
  max_complexity 17
  # more app configuration...
end

The complexity-limiting strategy allows us to account for both excessively-deep and excessively-broad requests. Additionally, it allows us to easily change the cost of any given field if we find that it is more costly to evaluate.
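For instance, if a field such as an Employee's signatureRequests proved expensive to resolve, a hypothetical cost bump might look like this (the field, the placeholder type, and the value 5 are all illustrative):

require "graphql"

SignatureRequestType = GraphQL::ObjectType.define do
  name "SignatureRequest"
  field :id, !types.ID
end

EmployeeType = GraphQL::ObjectType.define do
  name "Employee"
  field :email, types.String  # still 1 point
  # Hypothetical bump: this field now consumes 5 points of a request's budget.
  field :signatureRequests, types[SignatureRequestType], complexity: 5
end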

A Completely Point-Based System

Though our hybrid complexity-limiting methodology is an improvement over the depth-limiting variation, there are still some changes we’d eventually like to make.

The beauty of GraphQL lies in its flexibility — it allows consumers to describe precisely the data they need. If they don’t actually need an Employee's SignatureRequests, they can simply omit that field. Our current approach, however, does not encourage this behavior. We still rely on counting requests, but we limit the amount of data that can be returned in any single response. Because of this, a consumer might say "Hey, I get X requests per time period, why not request everything I can in each one?" It's tough to blame a consumer who does this - they won't be rewarded for requesting the bare minimum they need.

The real solution here is to completely forget about request counting and focus solely on “point” counting. This strategy would allocate a certain number of complexity “points” to each API key that can be used in a given time period. Consumers could use these points over many requests, or they could use all of them in one giant request. With a system like this in place, consumers would be incentivized to only request the information they truly need.
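We haven't built this yet, but a rough sketch of the idea might pair the gem's complexity analysis with a per-key point budget. Everything below (the PointBudget class, the budget numbers, the api_key context key, and the query_analyzers hook, which existed in older versions of the gem) is an assumption for the sake of the sketch:

# A per-key allowance of complexity "points" per time window -- a toy, in-memory version.
class PointBudget
  def initialize(points_per_window:, window_seconds:)
    @points_per_window = points_per_window
    @window_seconds = window_seconds
    @spent = Hash.new(0)
  end

  # Deducts `points` from the key's allowance for the current window.
  # Returns false when the budget would be exceeded.
  def spend(api_key, points)
    window = Time.now.to_i / @window_seconds
    key = [api_key, window]
    return false if @spent[key] + points > @points_per_window
    @spent[key] += points
    true
  end
end

BUDGET = PointBudget.new(points_per_window: 10_000, window_seconds: 3_600)

# The gem can report each query's computed complexity to an analyzer; instead of
# counting requests, we would deduct that score from the caller's budget and
# halt the query once the budget runs out.
complexity_analyzer = GraphQL::Analysis::QueryComplexity.new do |query, complexity|
  unless BUDGET.spend(query.context[:api_key], complexity)
    GraphQL::AnalysisError.new("Not enough complexity points remaining in this window")
  end
end

AppSchema.query_analyzers << complexity_analyzer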

Going Forward

Though our current resource-limiting methodology fails to properly incentivize consumers, it fits our current needs. Going forward, this may not always be the case. We plan on monitoring the usage of the API and adapting our strategy accordingly. We may find that certain fields are more expensive, so we’ll increase their complexity costs. We may find that certain relationships require excessive JOINs in our backend — so perhaps we’ll consider adding a stronger point multiplier for them. Finally, we may get to the point where our hybrid approach fails to scale any further, and at that point we’d make the move to a completely point-based strategy.