“We do REST” is not Enough
TL;DR Here are a few things to think about when designing service interface (API) contracts.
The HTTP stack’s place at the center of service network exposure has been a foregone conclusion since the post-AJAX REST revolution a decade ago. It has gotten to the point where the “hello world” application for modern languages and frameworks presents its message via a web server.
I like REST. I think it and its evangelists have really turned the world in a great direction by helping people think about declarative state and by encouraging them to learn existing protocols rather than just replicate functionality over the top. But REST is not enough.
I feel underwhelmed, or maybe just incomplete, most of the time that I talk to people about APIs, read an article about APIs, or read an API’s documentation. It seems like half the time that I ask, “What can you tell me about your service interface?” I’m met with “Oh, well we use REST” and a long trailing pause. You’ve told me nothing. I’m glad you’re using REST and I’d like to know more, so I finally ask, “What else can you tell me?” If I’m lucky I’ll get a high-level description of a few abstractions, maybe a URL, or sometimes a description of their message authentication and authorization scheme. Those are all great details. Far better than, “We use REST.”
I could go on for days about the underlying problems with that kind of uniformity in thinking, but I’d rather write about a more functional issue.
A service interface contract is the composition of the parts of your service that are exposed to consumers: the things that customers need to know in order to use the service, and the things that customers learn from using it. The components of a service interface contract are important to understand because, from a customer’s perspective, the service is broken if any of them changes unexpectedly.
Components of a Service Interface Contract
There are four components to every network service interface contract and I think most developers only ever really consider two at a time:
- Explicit structural (names, parameters, etc.)
- Identity (endpoint, crypto ID, etc.)
- Implicit behavioral (consistency models, durability, etc.)
- Emergent (latency, parameter subdomains, availability, etc.)
Almost everyone nails the first one (once we get past the REST part) and the second is quick to follow.
The explicit structural component of a service interface contract covers REST, all of the nouns and verbs, all of the attributes of each abstraction, mutability, response codes or messages, authentication headers, CORS policies, and so on. These are the things that someone can model in code. Whole collections of books and articles, conferences, and courses are dedicated to this subject. Several interface definition languages (IDLs) like Apache Thrift and Protocol Buffers have been built to tackle API structures. Legions of “REST” service frameworks and HTTP routers litter open source communities.
Identity is simple enough for most people. It is usually delivered as a URL and a certificate, but occasionally “service discovery” systems will tie that identifying information to a more environment-agnostic “service name.”
People begin to struggle with implied behavior. I think everyone understands that APIs exhibit certain behavior but the line blurs a bit when considering what behavior is contractual. Data durability or lifetime, consistency models, and idempotence are all good examples of implied behavior that clients depend on. Most clients will learn (and depend on) the behavior of the system even if these features are poorly documented.
Consider a product or service like Snapchat. The structural interface is “how” you use it; the identity is the name or icon for the app; the implied behavior is that when you take a picture with it and send it to someone, the picture will vanish forever after the specified duration. Suppose that when a person viewed an image it was displayed for only a second when the sender had specified 10 seconds. I’m sure users would claim that the app/service is broken. The same would apply if the pictures had specified lifetimes of 10 seconds but were actually saved and available forever. Breaching an implied behavioral contract can cause the same customer pain as structural or identity contract violations.
Last, the emergent properties of a contract are simultaneously the most subtle and the most ambiguously documented. The devil is in these details for a savvy service owner. I think it’s best to start learning emergent contract properties with an example.
Suppose I have a document service and each document has an ID with a string structure. The service is the ID authority, meaning clients never get to choose the ID. Now suppose that the IDs that my service vends only ever contain numerical characters. This is an emergent contract property. Even in the face of contradicting documentation I, as a responsible service owner, should not modify my service to begin vending alphanumeric IDs without assessing customer impact. Depending on that impact I might be able to do so without running a proper migration campaign. At a minimum I should notify my customers that the change is going to take place.
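To make the risk concrete, here is a minimal sketch of how a client might come to depend on numeric-only IDs, and how a service owner could check a sample of new IDs against that assumption. The function names and the sample IDs are hypothetical, invented for illustration.

```python
def client_parse_id(doc_id: str) -> int:
    """Hypothetical client code: it observed only numeric IDs,
    so it stores them as integers (for example, in an integer column)."""
    return int(doc_id)


def ids_that_break_client(candidate_ids):
    """Return the IDs that the numeric-only client above would fail to parse."""
    broken = []
    for doc_id in candidate_ids:
        try:
            client_parse_id(doc_id)
        except ValueError:
            broken.append(doc_id)
    return broken


# The documentation says "string", but this client learned "integer".
old_style_ids = ["1001", "1002"]
new_style_ids = ["1003", "a7f3"]
print(ids_that_break_client(old_style_ids + new_style_ids))
```

Nothing in the structural contract forbids the alphanumeric ID, yet the client still breaks. That is the customer impact to assess before changing the ID format.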
Emergent contract properties live within structural ambiguities and behavioral infrequencies. Consider a message queue with “at least once” delivery guarantees. If duplicates are delivered infrequently, a non-trivial set of recipients will fail to handle them correctly. While they may be “in the wrong” from the producer’s perspective, the fact is that the producer will have to deal with the support burden. I’ve heard convincing arguments that producers should always send duplicate messages. While this might increase resource overhead, it will also force consumers to handle duplicate messages correctly.
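A rough sketch of what handling duplicates correctly can look like on the consumer side, assuming every message carries a unique ID. The class, the producer function, and the message format are all hypothetical. The in-memory set of seen IDs is a simplification; a real consumer would need a bounded or persistent store.

```python
class DedupingConsumer:
    """Wraps a handler so that each message ID is processed at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen_ids = set()  # assumption: in-memory and unbounded, sketch only

    def receive(self, message_id, payload):
        """Process the payload unless this ID was already handled.
        Returns True if the handler ran, False for a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._handler(payload)
        self._seen_ids.add(message_id)
        return True


def deliver_with_duplicates(consumer, messages):
    """Producer-side sketch of the 'always send duplicates' argument:
    every message is deliberately delivered twice."""
    for message_id, payload in messages:
        consumer.receive(message_id, payload)
        consumer.receive(message_id, payload)


processed = []
consumer = DedupingConsumer(processed.append)
deliver_with_duplicates(consumer, [("m1", "create doc 7"), ("m2", "delete doc 3")])
print(processed)
```

A consumer written this way behaves the same whether duplicates arrive once a year or on every message, which is the point of making them routine.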
My favorite emergent properties are latency and latency variance. It doesn’t matter if a service owner publishes latency statistics or not; clients will learn what to expect by using the service. And if those expectations are not met clients will be calling/emailing/ticketing the service owners. There is no way to represent latency contracts in code. Clients could build in timeouts or service owners could use circuit breakers in a load balancing tier, but that is more of an implementation mechanism than an interface contract.
People will not use a service if a service call is equally likely to complete in 20 ms as in 2000 ms. Humans are all about predictability and patterns. For that reason I think latency variance is usually more important than latency.
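As an illustration with made-up numbers, the two hypothetical services below have the same mean latency, but very different spread. The samples and the helper name are assumptions chosen only to make the contrast visible.

```python
import statistics

# Hypothetical latency samples in milliseconds; both lists average 100 ms.
steady_ms = [100, 105, 95, 100, 100]
erratic_ms = [20, 20, 20, 20, 420]


def summarize(samples_ms):
    """Summarize a latency sample by its mean, spread, and worst case."""
    return {
        "mean": statistics.mean(samples_ms),
        "stdev": statistics.pstdev(samples_ms),  # population standard deviation
        "worst": max(samples_ms),
    }


for name, samples in (("steady", steady_ms), ("erratic", erratic_ms)):
    print(name, summarize(samples))
```

A dashboard that reports only the mean would call these two services equivalent. Clients who hit the 420 ms call would not.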
Consider a service with a variadic lookup interface: a single call that accepts an arbitrary number of IDs and returns all of the matching records.
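As a minimal sketch of such an interface, assume a hypothetical endpoint that takes any number of document IDs in one query parameter. The host, path, and parameter name are illustrative and not taken from a real service.

```python
from urllib.parse import urlparse, parse_qs


def parse_lookup_request(url):
    """Extract the requested IDs from a hypothetical URL of the form
    /documents?ids=<id>,<id>,...  The list can be any length."""
    query = parse_qs(urlparse(url).query)
    raw_ids = query.get("ids", [""])[0]
    return [part for part in raw_ids.split(",") if part]


# One endpoint, wildly different amounts of work per request.
print(parse_lookup_request("https://docs.example.com/documents?ids=17"))
print(parse_lookup_request("https://docs.example.com/documents?ids=17,42,99"))
```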
I see these all the time, especially when backend engineers try to anticipate the needs of people on the front end. I hate them. Seriously, this is a nightmare of an interface. I could write a book about this crap pattern, but just consider how the emergent property of latency variance could limit, or be impacted by, implementation details.
CS101 (I can code) implementation approach: take the list of IDs and retrieve them from the database one at a time. This will deliver what is required, tolerate intermittent database connectivity issues, and could even handle the massive result ordering complexities relatively well. But at the same time this approach guarantees the largest latency variance at the request level. A user who requests one ID will be thrown into the same bucket as those requesting 100 or 1000. What else could we do?
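The linear approach can be sketched as follows, using a hypothetical in-memory store that counts round trips in place of a real database. The class and function names are assumptions for illustration.

```python
class CountingStore:
    """Stand-in for a database: serves single-record reads and counts them."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.round_trips = 0

    def get(self, doc_id):
        self.round_trips += 1  # each call models one trip to the database
        return self._rows.get(doc_id)


def lookup_linear(store, ids):
    """Fetch each requested ID with its own read, preserving request order."""
    return [store.get(doc_id) for doc_id in ids]


store = CountingStore({str(n): f"document {n}" for n in range(1000)})
lookup_linear(store, ["7"])
one_id_cost = store.round_trips

store.round_trips = 0
lookup_linear(store, [str(n) for n in range(100)])
hundred_id_cost = store.round_trips

print(one_id_cost, hundred_id_cost)
```

The number of round trips, and so the request latency, grows with the length of the ID list. Two callers of the same endpoint can see latencies that differ by orders of magnitude.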
CS301 (hey, I know SQL now) implementation approach: take the list of IDs and throw them all into the same SQL query as selection criteria. Now the implementation will get all of the data with one call to the database. The query might be a bit slower but it’s within reason. This is a fantastic approach and will minimize your latency variance. But did you see what you did there? You subtly constrained your implementation options to databases with multi-record retrieval features to support your interface. Not only that, but in order to actually handle the queries as you’d expect, your database will need all sorts of subtle tweaking, like efficient plans for infrequent queries with 100 (or 200, or 300, etc.) selection criteria. What happens in a few years when you need to pivot implementation to handle scaling events?
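A minimal sketch of the single-query approach, using Python's built-in sqlite3 module and an in-memory database. The table name, column names, and sample rows are assumptions made for the example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical documents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO documents (id, body) VALUES (?, ?)",
        [(str(n), f"document {n}") for n in range(10)],
    )
    return conn


def lookup_batch(conn, ids):
    """Fetch all requested IDs with one query, using one bound
    parameter per ID in an IN (...) clause."""
    ids = list(ids)
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, body FROM documents WHERE id IN ({placeholders})"
    return {row[0]: row[1] for row in conn.execute(sql, ids)}


conn = make_demo_db()
print(lookup_batch(conn, ["2", "5", "missing"]))
```

Note how the query text changes shape with the number of IDs. Databases typically cap the number of bound parameters, and each distinct shape may be planned separately, which is part of the tuning burden described above.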
CS501 (everything is a cache) implementation approach: just use a cache, blast it in parallel (because we know how to do parallel requests better than the client, after all), pray for no misses, and make sure support has a runbook to handle complaints. /kidding-not-kidding
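Kidding or not, the fan-out can be sketched with a thread pool. The plain dict below stands in for a remote cache, and the function name and worker count are assumptions. A real cache client would add network latency and its own failure modes.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup_parallel(cache, ids, max_workers=8):
    """Look up every ID in the cache concurrently.
    Returns (hits, misses); misses are the IDs the cache did not hold."""
    ids = list(ids)
    if not ids:
        return {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so values line up with ids.
        values = list(pool.map(cache.get, ids))
    hits = {doc_id: value for doc_id, value in zip(ids, values) if value is not None}
    misses = [doc_id for doc_id, value in zip(ids, values) if value is None]
    return hits, misses


cache = {"1": "document 1", "2": "document 2"}  # stand-in for a remote cache
hits, misses = lookup_parallel(cache, ["1", "2", "3"])
print(hits, misses)
```

The request is only as fast as its slowest lookup, and every miss still has to go somewhere. The latency variance has not gone away; it has moved into the miss path.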
I guess my point in writing all this is that designing, building, and releasing a service interface contract is way more complicated and subtle than “Oh, we use REST.” As I watch “microservice” adoption launch into REST-like levels of hype I just hope people realize that interfaces — the most neglected part of monolithic software development — will now become the most important, difficult, and potentially painful part of service ownership.