NFRS AND SLAS; THE HIDDEN COMPLEXITY OF AN API — API IMPLEMENTATION PART 1
I often have a little chuckle when people who have just discovered APIs think they are something new, considering that APIs have been around since the early days of computing. Of course RESTful APIs are much more recent, but a RESTful API is simply an API which adheres to the REST paradigm. Splitting hairs? Perhaps. It still amuses me though.
I like APIs, in all their forms. Without them I’d not be able to do my job effectively. But over the years I have come to appreciate their strengths and limitations. Assuming the API actually “works”, I think the biggest limitation is documentation. Okay, an API is great if you know what its syntax is and what it’s supposed to do, but how about whether it’s thread-safe? Is the result cached? Are there sensible limits to the size of data to pass in or expect back? Is it okay if I make a burst of 1000 calls, or should I trickle them, or should I use a different API? It can also be useful to know how something is calculated. Take a call to get the machine name: does it come from a system call, an environment variable, or is it hard-coded?
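One way to make those answers visible is to publish them next to the interface itself. The following is only a sketch of that idea: the `NonFunctionalSpec` record, its field names and all of the values are hypothetical, invented here for illustration rather than taken from any real API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NonFunctionalSpec:
    """Hypothetical record of the non-functional facts a caller needs."""
    thread_safe: bool
    cache_ttl_seconds: Optional[int]   # None means results are never cached
    max_request_bytes: int
    max_calls_per_second: int
    data_source: str                   # where the answer actually comes from


# Illustrative values only; a real API owner would fill these in.
GET_MACHINE_NAME_SPEC = NonFunctionalSpec(
    thread_safe=True,
    cache_ttl_seconds=None,
    max_request_bytes=0,
    max_calls_per_second=100,
    data_source="environment variable",
)


def burst_allowed(spec: NonFunctionalSpec, calls: int, seconds: float) -> bool:
    """Return True if `calls` calls spread over `seconds` stay within the documented rate."""
    return calls <= spec.max_calls_per_second * seconds
```

With something like this in hand, the "burst of 1000 calls" question becomes a lookup rather than a guess: under the made-up limit above, `burst_allowed(GET_MACHINE_NAME_SPEC, 1000, 1)` is false, while spreading the same calls over ten seconds is allowed.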
Knowing this helps me write client code. Of course one person’s client is another’s server, so I’m left with the awkward question: do I tell my “clients” the information I find useful? No comment.
Now a RESTful API is often operationally different from a module interface designed to be included within a program’s executable. Like any network API, a RESTful API is designed to run exposed on a network, either publicly on the internet or internally for other applications and APIs to consume. That adds a layer of complexity. Security, resilience, version management and caching all become important considerations, because over a network the relationship between a caller and the called code (or client and server) is much looser and is not enforced by a compiler.
Trying to code these into every API you write is a royal pain. Even when you use frameworks to help, ensuring they’re configured consistently and reliably is at best a challenge. And not all developers are equal, hence we should have batteries of tests to detect and eliminate defects… of course the tests are not always the best… hey ho.
Thankfully that’s why we have API management tools, acting as a proxy between the API execution code and the real world. IMO it’s a bit of a no-brainer to have a layer dedicated to providing specific services so you don’t need to repeatedly code them up in the execution code. It gives you an “umbrella” under which you can apply a management policy uniformly and consistently. Having a place where you can see how your API portfolio is performing and identify bottlenecks and inconsistencies is invaluable. To be able to respond dynamically to issues is powerful.
But herein lies a danger.
And it goes back to my starting comments — that documentation is really important. Actually, let me refine that slightly: accurate documentation is really important. And details beyond simply the interface are important parts of that documentation — the non-functional parts.
NFRs (Nonfunctional Requirements) to consider. Source: techcello
Say for instance a balance check API is being hit very heavily — a new mobile app has been launched and has gone viral. It polls a customer’s balance every few seconds — and the millions of calls which are being made are causing the backend systems to struggle. A quick decision is made to place a response cache on the API with a TTL of 5 minutes — immediately there’s 150 times less load on the backend system. It solves the problem — great!
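To see where a figure like "150 times less load" can come from, here is a minimal sketch of a TTL response cache together with a tiny polling simulation. Everything in it is assumed for illustration: the `TTLCache` class, the fake backend, and in particular the 2-second poll interval, which is just one reading of "every few seconds" that happens to match the 150 figure with a 5-minute TTL.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Minimal response cache: entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, int]] = {}

    def get(self, key: str, fetch: Callable[[str], int]) -> int:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]              # served from cache, backend untouched
        value = fetch(key)               # cache miss or expired entry: hit the backend
        self._entries[key] = (now, value)
        return value


def simulate_polling(poll_interval: int, ttl_seconds: int, duration: int) -> Tuple[int, int]:
    """Return (client_calls, backend_calls) for one client polling one account."""
    now = [0.0]                          # simulated clock, advanced by the loop below
    backend_calls = [0]

    def fetch(account_id: str) -> int:
        backend_calls[0] += 1
        return 100                       # fake balance

    cache = TTLCache(ttl_seconds, clock=lambda: now[0])
    client_calls = 0
    for t in range(0, duration, poll_interval):
        now[0] = float(t)
        cache.get("acct-1", fetch)
        client_calls += 1
    return client_calls, backend_calls[0]
```

Under these assumptions, `simulate_polling(2, 300, 3000)` makes 1500 client calls but only 10 backend calls, a factor of 150. Note what the sketch does not do: nothing tells the cache when a balance changes, which is exactly the gap the rest of the story turns on.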
Shortly afterwards the helpdesk gets inundated with calls such as “Why is my balance incorrect after I’ve just transferred funds?”
You may be thinking, “That could never happen here! Our governance checks and processes would stop that…” If so, well done, bully for you! I suggest you’re in the minority.
I would have hoped that through good planning and design this situation would have been avoided: the need for a cached or read-optimised view of balances would have been identified and built as part of releasing the mobile app. But I know from my own experience, as well as the many conversations I have with customers and peers, that APIs often slip through the cracks when it comes to good planning and design.
I would also have hoped that operations would have realised that changing the SLA or NFR for an API would have impacts elsewhere… but how would they know? If they are simply looking at traffic and seeing thousands of reads per second, how are they to know that some consumers are fine with slightly out-of-date information, but for a few it is absolutely critical? How would they know that the transfer API needed to invalidate the cache for an account?
But it all comes back to design: either doing it well to begin with, or looking at the design to understand the implications. When I talk about the best APIs explaining whether they are thread-safe, or what the expected performance is, what I’m saying is that some of that design information is visible to me.
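The helpdesk problem, and the invalidation that would have prevented it, can be shown in a few lines. This is a toy sketch, not how any particular gateway works: the `BalanceService` class, the account names and the amounts are all hypothetical, and TTL expiry is left out so the stale read is easy to see.

```python
from typing import Dict


class BalanceService:
    """Toy backend plus a gateway-style cache in front of it (all names hypothetical)."""

    def __init__(self, invalidate_on_transfer: bool):
        self.balances: Dict[str, int] = {"alice": 100, "bob": 0}
        self.cache: Dict[str, int] = {}      # TTL expiry omitted to keep the sketch short
        self.invalidate_on_transfer = invalidate_on_transfer

    def get_balance(self, account: str) -> int:
        if account not in self.cache:
            self.cache[account] = self.balances[account]
        return self.cache[account]

    def transfer(self, source: str, target: str, amount: int) -> None:
        self.balances[source] -= amount
        self.balances[target] += amount
        if self.invalidate_on_transfer:
            # Drop cached entries for both accounts so the next read is fresh.
            self.cache.pop(source, None)
            self.cache.pop(target, None)


def balance_after_transfer(invalidate: bool) -> int:
    """Read a balance, transfer 30 out of the account, then read it again."""
    service = BalanceService(invalidate_on_transfer=invalidate)
    service.get_balance("alice")             # warms the cache with 100
    service.transfer("alice", "bob", 30)
    return service.get_balance("alice")
```

Without invalidation the second read still returns 100, the "incorrect balance" the customer phones about; with it, the read returns 70. The point is that the fix lives in the transfer path, which someone watching read traffic alone has no reason to look at.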
When you can “tweak” the management policies — do those tweaks get back into the design and into the documentation?
This example has immediate impact, but what if a new application were being developed using the Get Balance API in the belief that it returned up-to-date information? How would the project feel about a delay caused by needing a new API, a need only discovered during final acceptance testing because the “production-only cache” was not known about?
Dynamic changes are powerful. But they need to be used with caution.
As you might imagine, from my position as Chief Architect at digitalML, I see a different approach.
Let’s assume that the planning missed the need for a new service upfront. Nobody’s perfect, eh?
Operations are alerted by the API management tool that there’s a problem. They see it’s the Get Balance API. They can see that it has no caching enabled, so every call is hitting the backend system. They also see that a number of client applications use this API and, from the consumer contracts, that many require up-to-date information. They also see that the balance is updated by several other APIs, including the Transfer Funds API.
So they add a TTL to the SLA of the design. ignite takes that information and schedules the API management system to add a cache to the next version of the API. Approval is sought from, and given by, the owner of the API. At the same time, a new version of the Transfer Funds API is prepared which has a cache invalidation policy added to its flow.
The new versions of the Get Balance API and the Transfer Funds API are deployed and activated. Being a configuration change, this happens in moments. The changes flow to the documentation so future developers know the SLA of the Get Balance API has changed (though mostly it should be unnoticeable, given the additional cache invalidation). Current users of the API are similarly informed of the changes.
Of course, the important thing here is to use your current technology stack, not to replace it with yet another proprietary stack. Having the runtime configuration “flow” from the design avoids the handovers from design to implementation to runtime execution. It encourages making the change in the right place rather than a “hack” for expediency. It makes it easy to get the design right, and makes it quicker and better to change the design than to hack and hope.
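The general idea of runtime policy flowing from the design can be sketched as below. To be clear, this is not ignite’s data model or any API management product’s policy syntax: the `ApiDesign` record, the `derive_policies` function and the policy strings are invented for illustration, on the assumption that the design records both an API’s cache TTL and which other APIs’ data it updates.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ApiDesign:
    """Hypothetical design record for one API."""
    name: str
    cache_ttl_seconds: Optional[int] = None                    # None means no response cache
    updates_data_of: List[str] = field(default_factory=list)   # APIs whose data this API changes


def derive_policies(designs: List[ApiDesign]) -> Dict[str, List[str]]:
    """Derive gateway policies for each API from the design records."""
    cached = {d.name for d in designs if d.cache_ttl_seconds is not None}
    policies: Dict[str, List[str]] = {d.name: [] for d in designs}
    for d in designs:
        if d.cache_ttl_seconds is not None:
            policies[d.name].append(f"response-cache ttl={d.cache_ttl_seconds}s")
        for target in d.updates_data_of:
            # A writer only needs an invalidation step if the reader is actually cached.
            if target in cached:
                policies[d.name].append(f"invalidate-cache api={target}")
    return policies


designs = [
    ApiDesign("Get Balance", cache_ttl_seconds=300),
    ApiDesign("Transfer Funds", updates_data_of=["Get Balance"]),
]
```

Calling `derive_policies(designs)` gives Get Balance a response-cache policy and, as a consequence of the same change, gives Transfer Funds an invalidation policy. Because both are derived from one design record, the TTL cannot be added to one API without the knock-on effect on the other being considered.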