
The tension between data & use-case driven GraphQL APIs

If you’ve been following my posts over the past months and even years, you know how important I think it is to design well-crafted GraphQL schemas that express real use cases. When in doubt, designing a GraphQL schema for behaviors instead of for data has been my go-to rule. An older post on “Anemic Mutations” explains my rationale behind that principle.

While this usually leads to a great experience when consuming an API, there are usually two pretty distinct ways to use an API. The first one is the more use-case-driven one. Imagine an application using the GitHub GraphQL API: the schema our API currently exposes is perfect for an application looking to interact with the GitHub domain and consume common use cases: think opening an issue, commenting, merging a pull request, listing branches, paginating pull requests, etc.

However, certain clients, when seeing GraphQL’s syntax and features, see a great potential for getting exactly the data they require out of GitHub. After all, the “one GraphQL query to fetch exactly what you need” thing we hear so often may be interpreted that way. Take for example a comment analysis application that needs to sync all comments for all issues every 5 minutes. GraphQL sounds like a great fit to do this: craft the query for the data you need, send one query, and profit. However, we hit certain problems pretty quickly:

  • Pagination: the GitHub API was designed with snappier, use-case-driven clients in mind. This design assumes these kinds of clients probably don’t want to fetch a very large amount of comments in a single request, but rather to recreate something a bit like GitHub’s own UI.
  • Timeouts: We’re pretty aggressive with GraphQL query timeouts with our API at GitHub; we don’t want to let a gigantic GraphQL query run for too long. However, purely data-driven clients might need to make pretty large queries (one query only, yay) to achieve their goals. Even though that’s a valid use case and not an abuse scenario, there is quite a high chance queries will time out if they touch hundreds, even thousands of records.
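
To make the pagination point concrete, here's a minimal Python sketch of what a data-driven client has to do against a use-case-driven schema: walk a Relay-style connection cursor by cursor, one round trip per page. The `execute` transport and the in-memory fake server are assumptions for illustration; the `pageInfo` / `endCursor` shape follows the Relay connection pattern GitHub's schema uses.

```python
# A data-driven client draining a paginated, use-case-driven API:
# one HTTP round trip per page of 100 comments.

COMMENTS_QUERY = """
query($cursor: String) {
  comments(first: 100, after: $cursor) {
    pageInfo { hasNextPage endCursor }
    nodes { id body }
  }
}
"""

def fetch_all_comments(execute):
    """Walk the connection cursor by cursor until every page is drained."""
    comments, cursor = [], None
    while True:
        data = execute(COMMENTS_QUERY, {"cursor": cursor})["data"]["comments"]
        comments.extend(data["nodes"])
        if not data["pageInfo"]["hasNextPage"]:
            return comments
        cursor = data["pageInfo"]["endCursor"]

# Fake in-memory transport serving 250 comments in pages of 100
# (assumption: the server caps page size, like GitHub does).
ALL = [{"id": str(i), "body": f"comment {i}"} for i in range(250)]

def fake_execute(query, variables):
    start = int(variables["cursor"] or 0)
    return {"data": {"comments": {
        "pageInfo": {"hasNextPage": start + 100 < len(ALL),
                     "endCursor": str(start + 100)},
        "nodes": ALL[start:start + 100],
    }}}
```

Three round trips for 250 comments is fine; for a client syncing millions of records every 5 minutes, that loop becomes the bottleneck.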

Pretty hard to deal with, right? On one hand, clients that have a purely data-driven use case may have a legitimate reason for it, but on the other hand, our GraphQL API is not designed (and nor should it be) for this purpose. In fact, it’s not a special problem. Most APIs out there today are mostly designed for building use-case-driven clients, and would be hard to deal with when wanting to sync a large amount of data (large GraphQL requests, batched HTTP requests, or tons of HTTP requests).

So what can we do? Let’s explore a few options.


Ship a new data driven schema

One option could be to expose a totally new GraphQL endpoint/schema for more data-driven use cases: get all issues and their comments without pagination, with batch-loaded types and resolvers. This could possibly even live in the same schema as different fields, but since they’re such different use cases, I can see them being in a completely different schema. The timeout problem might still be hard to solve, however, because these kinds of queries often aren’t simple to serve synchronously. So what if… we could run queries asynchronously instead?
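
As a rough illustration of what a batch-loaded resolver could look like in such a data-driven schema, here's a Python sketch with hypothetical loader functions: all comments for all issues are fetched in one batched lookup, rather than one query per issue (the classic N+1 problem).

```python
# Batch-loaded resolver sketch: one batched lookup for every issue's
# comments instead of a per-issue round trip. Loader names are hypothetical.

def resolve_all_issues_with_comments(load_issues, load_comments_by_issue_ids):
    issues = load_issues()
    # Single batched fetch keyed by all issue ids at once.
    comments = load_comments_by_issue_ids([i["id"] for i in issues])
    return [dict(issue, comments=comments.get(issue["id"], []))
            for issue in issues]

# Fake loaders standing in for batched database access.
def fake_load_issues():
    return [{"id": 1, "title": "Bug"}, {"id": 2, "title": "Feature"}]

def fake_load_comments(issue_ids):
    stored = {1: [{"body": "ping"}, {"body": "pong"}]}
    return {i: stored.get(i, []) for i in issue_ids}
```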

Asynchronous GraphQL Jobs

POST /async_graphql

{
  allTheThings {
    andEvenMore {
      things
    }
  }
}

202 ACCEPTED
Location: /async_graphql/HS3HlKN76EI5es7qSTHNmA

Then, clients can poll on that link until the result is ready:

GET /async_graphql/HS3HlKN76EI5es7qSTHNmA

202 ACCEPTED
Location: /async_graphql/HS3HlKN76EI5es7qSTHNmA

GET /async_graphql/HS3HlKN76EI5es7qSTHNmA

200 OK
{ "data": { ... } }
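
A client for this hypothetical endpoint could be a simple polling loop. Here's a Python sketch; the `/async_graphql` routes and the fake in-memory server are assumptions for illustration, not a real API.

```python
import time

def run_async_query(post, get, query, poll_interval=0.0):
    """Submit a query to a hypothetical /async_graphql endpoint, then poll
    the Location header until the job finishes (202 = still running)."""
    status, headers, _ = post("/async_graphql", query)
    assert status == 202
    location = headers["Location"]
    while True:
        status, headers, body = get(location)
        if status == 200:
            return body  # the familiar {"data": ...} payload
        time.sleep(poll_interval)

# Fake server: the job "finishes" on the third poll.
class FakeServer:
    def __init__(self):
        self.polls = 0
    def post(self, path, query):
        return 202, {"Location": "/async_graphql/HS3HlKN76EI5es7qSTHNmA"}, None
    def get(self, path):
        self.polls += 1
        if self.polls < 3:
            return 202, {"Location": path}, None
        return 200, {}, {"data": {"allTheThings": {"andEvenMore": {"things": []}}}}
```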

The cool thing is that these are actually very similar to the concept of persisted queries. Applications could register GraphQL queries, and run them on a schedule, or asynchronously whenever they want. Again, this is not necessarily specific to GraphQL. Something I found out about recently is Stripe’s Sigma, with scheduled queries: a SQL-powered tool to create reports / extract data out of Stripe for customers/integrators.
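
A persisted-query registry can be as small as a hash-to-query map. Here's a minimal Python sketch (names hypothetical): a query is registered once, then can be triggered later by id, on a schedule or on demand.

```python
import hashlib

class PersistedQueryRegistry:
    """Store registered GraphQL queries keyed by a hash of their text,
    so clients can later trigger them by id instead of resending the body."""
    def __init__(self):
        self._queries = {}

    def register(self, query: str) -> str:
        # Content-addressed id: the same query always maps to the same id.
        query_id = hashlib.sha256(query.encode()).hexdigest()[:12]
        self._queries[query_id] = query
        return query_id

    def lookup(self, query_id: str) -> str:
        return self._queries[query_id]
```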

Streaming

Darrel Miller and I met for ☕️ recently and we talked a bit about this problem. One thing he mentioned is that an event stream would actually be great for clients who only care about data. Integrators can then keep things in sync / analyze data however they want. This really resonated with me. If an API client really only cares about raw data and not so much about a business/use-case-oriented API, then they might as well connect to some kind of data firehose. The Twitter PowerTrack API is a good example of this, allowing (privileged/enterprise) clients to consume 100% of Twitter’s tweet data.

Best of both worlds?

Maybe a mix of both is what we’re looking for? Register a GraphQL query to filter a firehose of data events, use subscriptions, and a separate more data-oriented schema:

subscription Comments {
  comments(pullRequests: [...]) {
    comment {
      id
      bodyHTML
      author {
        name
      }
    }
  }
}

Here the comment field is populated with batches of comments as they come in. Clients connect to a long-lived HTTP connection and get a stream of events with the data matched by our GraphQL subscription. This way we push data at our own pace, but clients can still filter payloads using GraphQL. Of course, this requires clients to be smarter and to build more resilient client-side applications: retries, re-connects, dropped connections, consuming with multiple connections, etc. The server also needs to build resiliency features (last_message_received, replay_since, etc.).
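
Here's a rough Python sketch of that client-side resiliency, with a hypothetical `connect(replay_since=...)` transport: the consumer checkpoints the last event id it received and replays from it after a dropped connection, so no event is lost or processed twice.

```python
def consume(connect, handle, max_retries=5):
    """Consume a stream, checkpointing after each event and replaying
    from the last checkpoint whenever the connection drops."""
    last_id = None
    retries = 0
    while retries <= max_retries:
        try:
            for event in connect(replay_since=last_id):
                handle(event)
                last_id = event["id"]  # checkpoint after each event
            return last_id             # stream ended cleanly
        except ConnectionError:
            retries += 1               # reconnect and replay from last_id
    raise RuntimeError("stream kept dropping")

# Fake flaky transport: drops after event 3 on the first connection,
# then replays cleanly from the checkpoint on reconnect.
EVENTS = [{"id": i, "comment": {"id": str(i)}} for i in range(1, 6)]

def flaky_connect(replay_since=None):
    start = 0 if replay_since is None else replay_since
    for event in EVENTS[start:]:
        yield event
        if event["id"] == 3 and replay_since is None:
            raise ConnectionError("connection dropped")
```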


Food for thought 🌮💭! Let me know what you think about the tension between these two pretty distinct use cases. I’m pretty excited to see how we can solve these issues; there are a lot of things to try out there!

Thanks for reading 🚀
- xuorig