Service level agreement for gracefully degrading services
Charly the customer
Charly is the buyer of cloud services. He buys a persistence service or ‘Data Science’ service because his users tell him to, after they have evaluated competing services in trials. In Charly’s view, he buys a single static service. It is the responsibility of the service provider to decide which and how many microservices are used under the hood. The provider can add, remove, change, or replace microservices as long as the static view of the service from Charly’s perspective does not change.
Charly cares about his customers’ (the users’) happiness with the service. Whether or not Charly recommends the service to others depends mostly on feedback from those users.
The SLA of the service is only a concern to Charly in the case of an outage, when he might receive service credits as compensation. At that point the net promoter score (NPS) has already been negatively affected and a small credit compensation will not reverse that.
Ulrike the user of a service
Ulrike does not care about the implementation details of the service or whether the service provider changes something in the background.
She will use some features and functions more often than others and judge the service based on her user experience. Whether she recommends the service to others will depend on the availability of the desired features, whether they perform as expected and return the expected results, as well as other criteria like a well-designed UI or API.
Ulrike is not satisfied when a feature or function she wants to use is unavailable, or when performance is so bad that she still has no result after going to the cafeteria to grab a coffee. Ulrike considers some features more critical than others. For example, if changing her profile picture is temporarily not possible, she just shrugs and moves on to something else. By contrast, suppose she has an important conference call, and five minutes before the call the calendar UI just hangs with a silly hourglass. She can’t get to the passcode and doesn’t know when the calendar will work again. She misses the call and, of course, blames the calendar service. From then on she believes the service is terrible and lobbies Charly to buy a calendar tool from a different vendor. If her experience had instead been to see an older version of the calendar from a cache, she would have been able to dial into the meeting, because it had been scheduled two weeks ago. She might not even have noticed that a microservice was down.
Ulrike does not care about a generic SLA for the service; she considers some features more important than others. Furthermore, availability alone is typically not sufficient to measure her happiness (NPS) with a service. The service also needs to be responsive (usable) when it is available.
Theo the tribe leader
Theo’s mission is to lead the development of services that Ulrike loves and Charly buys because they are very cost competitive. Theo monitors the NPS of his services very closely and understands everything about how his buyers and users experience them. Static measurements of a binary available / not available state are not enough for Theo. He needs to know which features users consider more important than others and focus on the responsiveness of the service components that matter most to them. Theo also decides when it is acceptable to respond with stale information from a cache rather than no information at all, and which transactions to fail when the billing system is down. That is a trade-off between happy customers with unbilled transactions and disgruntled customers.
Let’s talk about the relation between SLA, graceful degradation, and component availability. We’ll use the example of a hypothetical comments service that allows users like Ulrike to comment on all kinds of assets in the Data Science Offering.
Let’s assume that every single component of this service fulfills the 99.95% availability mandate. Note that “available” also means meeting the response time expectations. There is one thing worse than returning an error: returning a slow response. Even though we may not guarantee specific response times as part of a customer SLA, it goes without saying that we should adhere to certain response time limits as part of our availability mandate.
For our hypothetical comments service, let’s use an example architecture similar to many others:
- Shared Nginx as reverse proxy and facade to the outside world
- UI tier to provide user interface
- API tier that models the domain objects and provides CRUD on comments
- Cloudant tier for persisting comments
We can calculate the availability of the comment service based on the architecture outlined above. Every component has an assumed availability of 99.95%. Any request can only succeed if all four components are working correctly. For this reason, Ulrike will experience an availability of 99.95% to the power of four, which is roughly 99.8%.
Availability decreases with the number of components in the critical call chain. The only mitigation is redundancy, which comes at a price. This means that the SLA will focus on hitting the sweet spot between cost and availability.
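This serial-chain calculation is easy to check with a few lines of Python (the 99.95% per-component figure is the assumption stated above):

```python
# Serial availability: a request succeeds only if every component in the
# call chain succeeds, so the individual availabilities multiply.
def serial_availability(*components: float) -> float:
    result = 1.0
    for availability in components:
        result *= availability
    return result

# Four components (Nginx, UI, API, Cloudant), each at 99.95%.
total = serial_availability(0.9995, 0.9995, 0.9995, 0.9995)
print(f"{total:.2%}")  # prints 99.80%
```

Four components at 99.95% each already drop the chain below the 99.95% objective.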
What can be done to improve availability, given that 99.8% falls short of our 99.95% objective? As stated above, redundancy is the solution. One idea is to introduce Redis in the API component as a cache for Cloudant, holding data that has recently been read from Cloudant. When the API can’t talk to Cloudant, or doesn’t get a response fast enough, it can still return a comment from the cache if the cache holds the value.
Since comments don’t frequently change, there is no issue in serving outdated comments. However, as Cloudant is our persistence layer, the availability of the “add a comment” feature does not improve when adding a cache.
What does that mean? We’ve just introduced a graceful degradation mode into our service. If all things work fine, Ulrike can read and write comments. But when Cloudant fails, Ulrike can still read all comments, which may actually be sufficient for the majority of our customers.
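The read path with this fallback could be sketched roughly as follows; `fetch_from_cloudant`, `get_comment`, and the in-memory `cache` dict are illustrative stand-ins, not the actual service code:

```python
cache: dict[str, str] = {}  # stands in for Redis

def fetch_from_cloudant(comment_id: str) -> str:
    # Placeholder for the real Cloudant call; here it simulates an outage.
    raise TimeoutError("Cloudant unreachable")

def get_comment(comment_id: str) -> str:
    try:
        comment = fetch_from_cloudant(comment_id)
        cache[comment_id] = comment  # keep the cache warm on every successful read
        return comment
    except (TimeoutError, ConnectionError):
        if comment_id in cache:
            return cache[comment_id]  # graceful degradation: serve a stale comment
        raise  # no cached value, so the error surfaces to the caller

# With a warm cache, reads survive the outage:
cache["42"] = "Nice notebook!"
print(get_comment("42"))  # prints Nice notebook!
```

Writes would still fail in this sketch, which matches the degradation mode described above: reads survive a Cloudant outage, adds do not.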
What’s the availability for customers who want to read comments? To simplify the math, let’s assume all the comments are in the cache.
Compared to the original scenario, the API can fall back to reading a comment from Redis if Cloudant isn’t reachable, so there will only be an error if Cloudant and Redis are unreachable at the same time. The probability for this event is (1 − 0.9995)², which gives us an availability of 99.999975% for the combined Cloudant+Redis tier. This seems impressive, but after adding the other three components the total availability equates to only 99.85%.
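The redundancy math can be sketched in the same style, under the same independence assumptions:

```python
# Parallel availability: a redundant tier fails only if all of its
# components fail at the same time (assuming independent failures).
def parallel_availability(*components: float) -> float:
    failure = 1.0
    for availability in components:
        failure *= (1.0 - availability)
    return 1.0 - failure

storage_tier = parallel_availability(0.9995, 0.9995)  # Cloudant + Redis
total = storage_tier * 0.9995 ** 3                    # Nginx, UI, API remain serial
print(f"storage tier: {storage_tier:.6%}, total: {total:.2%}")
# prints storage tier: 99.999975%, total: 99.85%
```

The three serial components in front of the storage tier dominate the result, which is why the impressive tier-level number barely moves the total.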
Taking into account that the cache also adds complexity and cost to our service, it’s not an obvious decision whether to include the cache or not.
How can availability be improved further? The impact of the additional cache is diminished by the three other components in the call chain, all of which must succeed before the cache even comes into play. Redundancy added earlier in the chain shortens the critical path, but the earlier the fallback sits, the less functionality it can preserve: if the UI were available 100% of the time but the API failed, we could show empty pages with 100% certainty. With this in mind, we add a CDN cache to our UI tier. This benefits not only availability but also the user experience, since the CDN speeds up asset loading. If the UI tier is down, Nginx forwards users to the CDN, which holds a static version of the UI. In that case, however, no API calls can happen, since the UI tier is needed for those.
To conclude, we’ve introduced another graceful degradation mode into our service. From the CDN, we can ship any static content to Ulrike’s browser. Everything Ulrike already has in her browser cache remains usable, but we can’t show her any comments she doesn’t have cached. So we’re talking about a scenario where Ulrike can still view the site, including comments already in her local cache, but she can’t load new comments or add new ones.
We could go a step further and use browser local storage to queue up failed comment submissions, thereby still delivering the “add a comment” function to Ulrike. Twitter has some good examples of such mechanisms.
Still, the same considerations as with the Redis cache apply: a CDN increases cost and adds complexity to our service. While the availability improvement is very good in this case, the value of the degraded scenario to Ulrike is debatable. It’s not an easy decision in this case either.
Let’s step back and look at the different degradation modes we’ve introduced and how they may tie to an SLA that Charly cares about. We started with the basic scenario that allows Ulrike to add and read comments. This scenario has no fallbacks and therefore it has the lowest availability of all scenarios.
As the first improvement, we’ve added a read cache to use in case the persistence service can’t be reached. Now, Ulrike could read comments, but not add new ones.
A further extension to this could be to also use the cache for caching writes.
We realized that the depth of the call chain plays an important role in how much positive impact redundancy has on overall availability. So we considered giving the UI tier a peer in a CDN, ensuring that Ulrike can at least view the UI and look at everything she already has in her browser cache. We could extend this concept further by queuing comment additions in the browser, so that Ulrike is less likely to notice that comments can’t be saved right now.
Which of these failure modes is referenced directly or indirectly in the SLA depends on the value of the capability that the service brings to the platform. After all, Charly bought a Data Science offering, not just a comment service.
In the specific case of comments and their relation to the overall offering, we should take graceful degradation into account in the SLA. The SLA shouldn’t declare the platform unavailable when the comments service is down. We should focus on what we declare essential to delivering the value of the platform (e.g., Notebooks, Spark). These essential features must remain available even in deep degradation modes. Running these services is therefore more expensive, but Theo can make an informed decision about the associated cost.
For reasons like disaster recovery, the eventual goal is to deploy into multiple regions. While this has the highest associated cost, it also has the biggest impact on availability. One aspect we have omitted so far is failure domains. Every region is a failure domain, so running active/active in at least two domains decouples us from any problems that may occur at the region level. Active/active also imposes new challenges on our services related to the consistency of persistent state. Let’s do the math on what the availability of our comments service looks like if we deployed to two regions and ran active/active. Requests will only fail if both regions are down at the same time; with a per-region availability of roughly 99.8%, and assuming regional failures are independent, that works out to 1 − (1 − 0.998)² ≈ 99.9996%.
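Under the (optimistic) assumption that regional failures are truly independent, the two-region calculation looks like this:

```python
# Two regions active/active: a request fails only if both regions are
# down at the same time (assuming independent regional failures).
region = 0.9995 ** 4             # one region's serial chain, about 99.80%
combined = 1 - (1 - region) ** 2
print(f"{combined:.4%}")         # prints 99.9996%
```

In practice, correlated failures (shared control planes, bad deploys rolled out to both regions) will pull the real number below this idealized figure.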