API design fundamentals: analytics or it didn’t happen
APIs are a very special kind of product, and one key distinguishing factor is that it’s very easy to check how people use them: put a checkpoint at both your input (the request) and your output (the response) and you’re golden.
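The checkpoint idea can be sketched as a tiny wrapper around a handler. This is a minimal, framework-agnostic sketch: the dict-shaped `request`/`response` objects and the `get_user` handler are stand-ins for whatever your web framework actually gives you.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("api.metrics")

def instrument(handler):
    """Checkpoint the request on the way in and the response
    (plus latency) on the way out."""
    def wrapped(request):
        start = time.perf_counter()
        log.info("request method=%s path=%s", request["method"], request["path"])
        response = handler(request)
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("response path=%s status=%s latency_ms=%.1f",
                 request["path"], response["status"], elapsed_ms)
        return response
    return wrapped

@instrument
def get_user(request):
    # Stand-in for a real endpoint handler.
    return {"status": 200, "body": {"id": 42}}
```

Two log lines per call is all it takes to know who calls what, how often, and how fast.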
Truth be told, things do get more involved than that eventually (see below), but the fact remains:
API monitoring offers great bang for the buck: with very little effort you can reap massive benefits.
In the remainder of this post, we will examine both parts of that statement: the effort and the benefit.
Many of us have had to go through it at some point in our careers: a severe performance incident happens and every involved party is rushed onto a call. Without proper end-to-end performance tracking, those calls usually start with a 30-minute, uninterrupted “wasn’t me” round table:
Network engineer: I’m looking at the logs here — it’s not the network!
DBA: I’m looking at the statistics here — it’s not the DB!
Sysadmin: I’m looking at the server metrics — it’s not the server!
Developer: I’m looking at my app metrics — it’s not the app!
Call host: sigh, then surely this issue is a figment of all our users’ imagination…
Or perhaps, in a classic quis custodiet ipsos custodes case (geek kudos to anyone who catches the reference without the link), your org reports uptime as the amount of time you have not detected you are down. So, as long as you do not detect your service is down, it means it’s up and running, right? Heck no. Your uptime must be measured from an end user’s perspective, and it must also count longer-than-acceptable response times as downtime (pro tip: these days the bar is in seconds, not minutes).
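User-perspective uptime is easy to express once you accept that a slow 200 is still “down”. A minimal sketch, assuming external probes feed you `(status_code, latency_seconds)` pairs and a latency budget you pick yourself:

```python
def probe_verdict(status_code, latency_s, budget_s=2.0):
    """Uptime as the user experiences it: an HTTP 200 that took
    longer than the latency budget still counts as 'down'."""
    return status_code == 200 and latency_s <= budget_s

def availability(samples, budget_s=2.0):
    """samples: list of (status_code, latency_s) from external probes."""
    up = sum(probe_verdict(s, t, budget_s) for s, t in samples)
    return up / len(samples)

samples = [(200, 0.3), (200, 0.4), (200, 5.1), (500, 0.2)]
print(f"availability: {availability(samples):.0%}")
```

The probes must run outside your own infrastructure; a watcher that shares the watched system’s fate reports 100% uptime while everything burns.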
What happens when some operations run deep into your infrastructure, chaining calls to multiple services in order to compose a final response?
The answer is to use an id to establish those correlations. One of the most popular techniques is to issue a request id at your entry point and use it for tracing. And yes, there are tools out there to help you with that: most full-blown APM suites have features for it.
While tracing calls by means of a request id is a very good starting point, it is frequently not enough: at some point your event sequence may leave the boundaries of a simple request/response flow (think queued jobs or batch processing), and other correlation means are required. The most robust correlation mechanism usually revolves around entity ids, used whenever they are available.
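Both techniques can be sketched in a few lines. The header name `X-Request-ID` is a common convention rather than a standard, and `log_event` is a hypothetical helper shown only to illustrate tagging events with both kinds of ids:

```python
import uuid

def ensure_request_id(headers):
    """Issue a request id at the entry point if the caller didn't
    send one; propagate it on every downstream call."""
    return headers.get("X-Request-ID") or uuid.uuid4().hex

def log_event(request_id, message, **entity_ids):
    """Tag each event with the request id AND any entity ids
    (order_id, user_id, ...) so events can still be correlated
    once they leave the request/response flow."""
    tags = " ".join(f"{k}={v}" for k, v in sorted(entity_ids.items()))
    return f"request_id={request_id} {tags} msg={message}"

rid = ensure_request_id({})
print(log_event(rid, "payment captured", order_id=981, user_id=42))
```

The entity ids are what let you stitch together the request that queued a job with the worker that processed it hours later.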
People (myself included) get all hyped up about performance monitoring (those graphs are so cool!), so it’s easy to overlook this hidden gem: a core practice of good API design is to base it on actual usage, and this is your #1 opportunity to collect that information from its best possible source: the real world.
Thinking of introducing a breaking change? A look at your metrics will tell you the size of the impacted audience. Track your client ids and you can even reach out to those teams. GraphQL takes this one step further by requiring clients to list the specific fields they want (so you’d know exactly who is using a field you want to retire, for example).
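Sizing the impacted audience is a simple aggregation over your access logs. A sketch, assuming entries have already been parsed into dicts with `client_id` and `endpoint` fields (the field names and the log shape are illustrative):

```python
from collections import defaultdict

def impacted_clients(access_log, endpoint):
    """List the client ids still calling the endpoint you plan
    to break, with their call counts."""
    counts = defaultdict(int)
    for entry in access_log:
        if entry["endpoint"] == endpoint:
            counts[entry["client_id"]] += 1
    return dict(counts)

log_entries = [
    {"client_id": "mobile-app", "endpoint": "/v1/users"},
    {"client_id": "billing",    "endpoint": "/v1/users"},
    {"client_id": "mobile-app", "endpoint": "/v1/users"},
    {"client_id": "billing",    "endpoint": "/v2/users"},
]
print(impacted_clients(log_entries, "/v1/users"))
# {'mobile-app': 2, 'billing': 1}
```

An empty result is your green light to retire the endpoint; a non-empty one is your outreach list.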
Looking to improve API UX? Look for usage patterns: commonly used long query strings point to URLs that could be simpler, and endpoints that are frequently called together point to candidates for merging (or for providing an include mechanism).
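Spotting endpoints that are called together is a pair-counting exercise. A sketch, assuming you have already grouped calls by session (or request id) and normalized paths into templates; the `{id}` templating is an assumption about your routing:

```python
from collections import Counter
from itertools import combinations

def co_called(sessions):
    """Count pairs of endpoints hit within the same session: pairs
    that dominate are candidates for merging or for an `include`
    mechanism."""
    pairs = Counter()
    for endpoints in sessions:
        for pair in combinations(sorted(set(endpoints)), 2):
            pairs[pair] += 1
    return pairs

sessions = [
    ["/orders/{id}", "/orders/{id}/items"],
    ["/orders/{id}", "/orders/{id}/items", "/users/{id}"],
    ["/users/{id}"],
]
print(co_called(sessions).most_common(1))
```

If `/orders/{id}` and `/orders/{id}/items` always travel together, an `?include=items` option would save your clients a round trip.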
Setting things up
The easiest and most convenient way to get started is to leverage services such as New Relic, Dynatrace and AppDynamics (just to name a few), provided their features and pricing model fit your needs. The truth is that it’s usually cheaper to hire these services than to build them on your own, and you probably do not need anything beyond the feature set these tools provide.
If your stack is mostly consistent, most of the big platforms have OSS and commercial tooling geared for it, though beware that maintaining that setup will require extra effort.
It’s also worth mentioning that the next natural step is to put tooling and processes in place to act quickly when these metrics indicate your systems have gone AWOL.
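The simplest form of “acting quickly” is an automated threshold check over a recent window of responses. A minimal sketch, where the 1% error budget is an arbitrary example, not a recommendation:

```python
def should_alert(window, error_budget=0.01):
    """window: response status codes from the last few minutes.
    Fire when the 5xx rate exceeds the error budget."""
    errors = sum(1 for status in window if status >= 500)
    return errors / len(window) > error_budget

# 2 server errors in 100 responses: a 2% error rate blows a 1% budget.
print(should_alert([200] * 98 + [500, 503]))
```

Real alerting pipelines add deduplication, escalation and paging on top, but the core is exactly this comparison.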
In a world where systems are measured by 9’s, it’s a smart move to keep a close watch on what is going on.
But that alone focuses on the FUD instead of the sunny side of monitoring: this is hands down the best way for you to collect real-world, unbiased feedback on how people use your APIs, so why miss out?