GraphQL: Instrumenting your API and unlocking superpowers
This post is the second part of a series of best practices and observations we have made while building GraphQL APIs at PayPal. In upcoming posts, we’ll share our thoughts on: schema design, error handling, auth, optimizing client-side integrations and tooling for teams.
You might have seen our previous posts “GraphQL Resolvers: Best Practices” or “GraphQL: A success story for PayPal Checkout” about PayPal’s journey from REST to GraphQL. This post dives into instrumentation: why it’s so important, how it unlocks superpowers, and a couple of strategies for instrumenting your API.
Instrumentation is adding the proper amount of logging to your application’s code so that you know exactly how it’s being used.
With GraphQL, clients have to explicitly ask for the data they need, at the field level. They receive only those fields, no more and no less. For example, clients can’t simply ask for all of the fields of a User type. Because of this design constraint, instrumentation is much more useful in GraphQL than in traditional REST APIs.
Evolving your API with confidence
In REST, clients request resources or objects. You might know how many times your clients invoked your GET /projects endpoint, but you don’t know what fields of that resource they really need. You don’t have insight into what parts of that resource are problematic or slow. Even worse, you don’t have insight into how your clients are using your APIs, so you’re more likely to keep extending and growing your API because you’re too afraid to make a breaking change.
In GraphQL, changes can be made incrementally over time. You don’t need to create a new major version (Ex: /v3) and maintain multiple versions of an API. You don’t need to kick off a large program to migrate all of your clients to a new major version.
Want to rename or remove a field, and your instrumentation says nobody uses it? Great! You can make the change with confidence.
If you want to completely restructure your schema, you can migrate incrementally too! Create and deploy your new structure, then mark the old one with a @deprecated directive so that it’s hidden from new clients, inform the clients that are still requesting the old field, and safely remove it once they have migrated.
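As a rough illustration of the deprecation step, the snippet below holds a small schema in a TypeScript string. The User type and the name and fullName fields are made up for this example; only the @deprecated directive and its reason argument come from the GraphQL specification.

```typescript
// Hypothetical schema: `fullName` is the new field, `name` is the old one.
// Both are served side by side until clients have moved over.
const typeDefs: string = `
  type User {
    id: ID!
    name: String @deprecated(reason: "Use fullName instead.")
    fullName: String
  }

  type Query {
    user(id: ID!): User
  }
`;
```

Existing clients that still select name keep working, and the reason text tells them where to go next.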
Instrumentation opens the door for powerful tooling!
Now you know exactly which fields are too slow or cause too many errors. You also know exactly how many times each field is requested, and by whom.
For starters, you can pipe that data over to tools like Grafana.
Next, Marc-André Giroux’s talk on Continuous Evolution of Schemas shows a pull request bot that GitHub uses internally. If you want to remove a field and you have clients querying for that field, your change is rejected. If nobody is using that field, great! Go ahead and remove it!
Finally, if you want to remove a field and you know your clients are using it, you should deprecate it using the @deprecated directive, rather than removing it.
Because you know which clients request which fields, you can proactively inform them when you deprecate something they use. At PayPal, we found that by being proactive and delivering changes in small bites, product teams are more likely to migrate sooner. Large programs and large migrations with complex planning processes can be daunting and disruptive to a developer’s workflow.
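The removal check and the “who do we need to tell?” question above can both be answered from the same usage data. Here is a minimal sketch, assuming the instrumentation has already been aggregated into rows of field, client and request count. The FieldUsage shape, the checkFieldRemoval function and the client names are hypothetical, not part of any real bot.

```typescript
// One row of (hypothetical) aggregated usage data.
interface FieldUsage {
  field: string;  // schema coordinate, e.g. "User.name"
  client: string; // client identifier, e.g. "checkout-web"
  count: number;  // requests seen in the reporting window
}

interface RemovalCheck {
  safeToRemove: string[];            // fields nobody requested
  blocked: Record<string, string[]>; // field -> clients still requesting it
}

// Decide, for each field a change wants to remove, whether it is safe
// or which clients must be contacted first.
function checkFieldRemoval(removedFields: string[], usage: FieldUsage[]): RemovalCheck {
  const safeToRemove: string[] = [];
  const blocked: Record<string, string[]> = {};
  for (const field of removedFields) {
    const clients: string[] = [];
    for (const row of usage) {
      const isNewClient = clients.indexOf(row.client) === -1;
      if (row.field === field && row.count > 0 && isNewClient) {
        clients.push(row.client);
      }
    }
    if (clients.length === 0) {
      safeToRemove.push(field);
    } else {
      blocked[field] = clients.sort();
    }
  }
  return { safeToRemove, blocked };
}

// Example: one field is unused, the other still has two callers.
const usageReport: FieldUsage[] = [
  { field: "User.name", client: "checkout-web", count: 1200 },
  { field: "User.name", client: "mobile-app", count: 300 },
  { field: "User.nickname", client: "checkout-web", count: 0 },
];
const removalCheck = checkFieldRemoval(["User.name", "User.nickname"], usageReport);
```

A pull request check could fail the build when blocked is non-empty and list the affected clients in its comment, so the author knows exactly which teams to reach out to.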
Implementation in practice
In practice, the easiest approach is to instrument resolver functions that are invoked with each query.
In Node, you can wrap each of your resolver functions with another function that measures the time between the start and end of the call. If you’re using apollo-server with Node, you can simply enable the tracing flag when creating your instance of ApolloServer, then take the “tracing” part of the response and do what you want with it.
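Here is a minimal sketch of the wrapping approach, written without any GraphQL library so it stays self-contained. The Resolver type mirrors the usual (parent, args, context, info) signature, and instrument, instrumentAll, Timing and the report callback are names invented for this example. In a real service, report would forward to your logging or metrics pipeline.

```typescript
type Resolver = (parent: unknown, args: unknown, context: unknown, info: unknown) => unknown;
type ResolverMap = Record<string, Record<string, Resolver>>;

interface Timing {
  field: string;      // e.g. "Query.user"
  durationMs: number; // wall-clock time spent in the resolver
  ok: boolean;        // false if the resolver threw or rejected
}

function isThenable(value: unknown): value is PromiseLike<unknown> {
  return typeof value === "object" && value !== null &&
    typeof (value as { then?: unknown }).then === "function";
}

// Wrap one resolver so that every call reports its duration and outcome.
function instrument(field: string, resolve: Resolver, report: (t: Timing) => void): Resolver {
  return (parent, args, context, info) => {
    const start = Date.now();
    const done = (ok: boolean) => report({ field, durationMs: Date.now() - start, ok });
    let result: unknown;
    try {
      result = resolve(parent, args, context, info);
    } catch (error) {
      done(false);
      throw error; // rethrow so GraphQL error handling is unchanged
    }
    if (isThenable(result)) {
      // Async resolver: report when it settles and pass the outcome through.
      return result.then(
        (value) => { done(true); return value; },
        (error) => { done(false); throw error; }
      );
    }
    done(true);
    return result;
  };
}

// Wrap every resolver in a { TypeName: { fieldName: resolver } } map.
function instrumentAll(resolvers: ResolverMap, report: (t: Timing) => void): ResolverMap {
  const wrapped: ResolverMap = {};
  for (const typeName of Object.keys(resolvers)) {
    wrapped[typeName] = {};
    for (const fieldName of Object.keys(resolvers[typeName])) {
      const original = resolvers[typeName][fieldName];
      wrapped[typeName][fieldName] = instrument(`${typeName}.${fieldName}`, original, report);
    }
  }
  return wrapped;
}
```

Because the wrapper returns the same value and rethrows the same errors, it should be possible to apply it to an existing resolver map without changing behavior.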
Other language bindings (Ex: graphql-java for Java) have options too!
Although instrumenting resolver functions is the easiest way to get started, it’s not 100% accurate. There might be a situation where a “parent” field (Ex: user) is currently broken, execution halts, and the name resolver isn’t invoked. If you go with the resolver approach, you might think that your name field is never requested and conclude that it’s safe to remove. Then you fix the user resolver, clients start receiving name again, and removing it breaks them. Sure! It’s a bit of an edge case, but it’s better to be safe.
A better approach is to evaluate what queries are sent by your clients, rather than what resolvers are executed. This isn’t as straightforward as instrumenting resolvers, and there aren’t any well-known open source solutions out there. If this sounds like an interesting problem that you would like to solve and open source, our community will thank you. :)
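To make the idea concrete, here is a deliberately simplified sketch. It assumes the incoming query has already been parsed into a nested list of selected fields, and that you have a lookup from a parent type and field name to the type that field returns. The Selection and FieldTypes shapes and the collectRequestedFields function are hypothetical stand-ins; a real implementation would walk the query AST from your GraphQL library and would also need to handle fragments, aliases, interfaces and unions.

```typescript
// Simplified stand-in for a parsed query: a field and its sub-selections.
interface Selection {
  name: string;
  selections?: Selection[];
}

// Hypothetical schema lookup: parent type -> field name -> returned type name.
type FieldTypes = Record<string, Record<string, string>>;

// Count every field the client asked for, whether or not its resolver ever runs.
function collectRequestedFields(
  parentType: string,
  selections: Selection[],
  fieldTypes: FieldTypes,
  counts: Record<string, number> = {}
): Record<string, number> {
  for (const selection of selections) {
    const key = `${parentType}.${selection.name}`;
    counts[key] = (counts[key] || 0) + 1;
    const childType = fieldTypes[parentType] && fieldTypes[parentType][selection.name];
    if (selection.selections && childType) {
      collectRequestedFields(childType, selection.selections, fieldTypes, counts);
    }
  }
  return counts;
}

// The query `{ user { name } }`, expressed in the simplified shape above.
const requested: Selection[] = [{ name: "user", selections: [{ name: "name" }] }];
const fieldTypes: FieldTypes = { Query: { user: "User" }, User: { name: "String" } };
const requestedCounts = collectRequestedFields("Query", requested, fieldTypes);
```

Here User.name is counted because the client asked for it, even if the user resolver fails and the name resolver never runs.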
It’s also worth mentioning that Apollo Engine is a really great paid offering that solves for this too. Not only does Engine capture timings and error rates for fields and overall queries, but it has some really nice visualizations too. Check it out!
Thoughts? We would love to hear your team’s thoughts on instrumentation. It’s a topic that is very powerful and interesting but is still ripe for innovation. The best tools still haven’t been invented!
We’re hiring! 👋 If you would like to come work on front-end infrastructure, GraphQL or React at PayPal, DM me on Twitter at @mark_stuart or check out the PayPal Jobs site for teams developing with GraphQL or React.