SRE Leadership: Have Tiered SLAs

Jamie Allen
Site Reliability Engineering Leadership
6 min read · Mar 18, 2020

When you own a service, it’s important to have a very clear idea of who your customer/user will be. Many product teams have activities for setting up customer archetypes, and outlining what success looks like for each of those customers. As SREs, it is incumbent upon us to take those archetypes and define what a successful Service Level Agreement (SLA) for each of them will be, as well.

Even if your organization does not define archetypes, you wouldn’t be a good production service provider if you didn’t have at least one concept of who your user would be. From that, combined with production usage data, we can derive the Service Level Indicators (SLIs), the Service Level Objectives (SLOs), and finally the SLAs as a commitment to those users that we will deliver some specific level of service. Over time, these may change, especially in a distributed microservice world where another service may onboard yours without much thought about what you need from it in order to be a successful provider to it. That isn’t ideal, but it’s a reality of today’s world.

As a service provider, you cannot, and should not, be everything to everyone. I have supported services that were initially created to serve the “long tail” of users (a generalized solution that keeps many services from each implementing something hard and getting it wrong) and that then became primary infrastructure for a core technology with significantly higher requirements. That is a very difficult environment for success as a service provider: your customer’s needs suddenly ramp up a great deal, but your SLA was created for a completely different kind of customer.

As a service provider, you need to have a frank conversation with leadership about whether or not onboarding this user makes sense. Despite the difficulty for your team, it may still be the right move for the greater organization. If so, how should your team support that new customer, and what is the organization’s plan for successfully managing this partnership? Is there a clear data point that identifies when that user should no longer leverage the service you provide? And what do you need from this new user in order to successfully meet that SLA? It may also turn out that the new user should have their own custom solution instead of leveraging your generalized service, because their needs are too great for you to support. You may end up carving out specific functionality or capabilities just to support that one user, at the expense of all other customers. Is that the right move? This is a question that must be revisited at every major feature or design change request. Many of us in the infrastructure world have seen ZooKeeper used in ways it was never intended to be used, such as a distributed key/value store with high availability. By having these discussions, you can avoid “service abuse” by customers looking for a quick, easy solution to their problem that may not be a great fit for your service.

Assuming it is the right move to onboard this customer, it now falls on your team to define what level of service they are comfortable guaranteeing for them. You must review the SLAs you currently have in place, as well as what you need from any service you onboard. As an example of a requirement you might place on a downstream service, imagine a large-scale distributed service architecture where maintenance events frequently take place. Your service may provide customers with lookup information about the location of specific hosts that they own as part of their capacity entitlement. However, if your service is notified by the underlying infrastructure that a maintenance event is about to take place, you may need to communicate that downstream to the affected subscribers. In that case, you need an SLA from *them* in order to provide a timely response to your upstream systems. If the service you are about to onboard wants an SLA from you, they must first meet the contract that you require and implement that interface with timely responses. Failure to do so may mean that their hosts are decommissioned or made unavailable before they have performed some required action on their part.
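To make the reciprocal contract concrete, here is a minimal sketch of how the acknowledgement deadline you require from a subscriber could be tracked. The `MaintenanceNotice` class, its field names, the host name, and the 15-minute deadline are all hypothetical and chosen only for illustration; a real system would take these from its own notification interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MaintenanceNotice:
    """A maintenance event forwarded to one subscriber (hypothetical shape)."""
    host: str
    sent_at: datetime
    ack_deadline: timedelta              # the SLA we require from the subscriber
    acked_at: Optional[datetime] = None  # set when the subscriber confirms it is ready

    def ack_status(self, now: datetime) -> str:
        """Classify the notice as 'met', 'pending', or 'breached'."""
        due = self.sent_at + self.ack_deadline
        if self.acked_at is not None:
            return "met" if self.acked_at <= due else "breached"
        return "pending" if now <= due else "breached"


# Illustrative values only: a notice sent at noon with a 15-minute deadline.
sent = datetime(2020, 3, 18, 12, 0)
notice = MaintenanceNotice("host-17", sent, ack_deadline=timedelta(minutes=15))
print(notice.ack_status(sent + timedelta(minutes=5)))   # still inside the window
notice.acked_at = sent + timedelta(minutes=10)
print(notice.ack_status(sent + timedelta(minutes=30)))  # acknowledged in time
```

A subscriber whose notices are repeatedly "breached" is not holding up its side of the agreement, which is the kind of data you would bring to a conversation about which SLA you can offer in return.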

Tiered SLAs give your team the ability to structure your relationship with different kinds of customers in different ways.

  • The Top Tier: Using the previous example, I may be providing critical infrastructure components with information about the location of their hosts, and they require very stringent SLAs with minimal failure to service requests. In turn, they must be willing to respond to maintenance/decom requests very quickly, so that we can notify our upstream infrastructure provider that we are ready.
  • The Medium Tier: These are customers for whom an outage would be painful, but not critically damaging to the business. We make smaller demands of them in exchange for that level of service.
  • The Lowest Tier: These are customers for whom an outage is a non-event. These are typically batch jobs or analytics platforms, where it may be an issue, but they can restart quickly elsewhere with minimal impact. These are almost never services that are directly leveraged by a customer of the overall organization/corporation.

It is possible that your service will have no SLA. This is not an ideal scenario, but it is one I typically see in a few critical services at large-scale cloud providers and companies with the greatest scale imaginable. Imagine a service discovery layer, which is usually some mega implementation of the observer pattern. You update information about the hosts where your service is currently deployed, and that has to propagate out to all of the reverse proxies that support that service, which could be tens of thousands of hosts. In those cases, an SLA about how long it will take before all reverse proxies are updated would be a guessing game for various reasons: network traffic, host request saturation, etc. We accept that an SLA isn’t provided so long as the system continues to work in concert with our location updates, but this is also why we do “rolling” updates of our service with canary deployments, so that we don’t update all locations at once.

So how do you find the appropriate values of SLAs for your customers? This requires data. If we use an SLI such as “request error rate,” we can track the errors for all customers and plot them in a graph. Hopefully, we will see striations in the data that make clear how we can separate them. (I have seen some teams prefer to express SLAs as a histogram for metrics such as these, with a different percentile at different throughputs, which is very useful for SLIs such as latency.) For the error-rate SLI, the resulting tiers might look like this:

  • Top Tier: 98% of requests will not fail due to internal server errors
  • Medium Tier: 95% of requests will not fail due to internal server errors
  • Lowest Tier: 90% of requests will not fail due to internal server errors
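As a rough sketch, the three example targets above can be turned into a compliance check over a measurement window. The function names and the sample request counts are assumptions made for illustration; the percentages are the ones from the list. Integer arithmetic is used so that a result sitting exactly on the threshold is not lost to floating-point rounding.

```python
# Tier targets copied from the example list above: the minimum percentage of
# requests that must not fail due to internal server errors.
TIER_TARGET_PERCENT = {"top": 98, "medium": 95, "lowest": 90}


def meets_sla(tier: str, total_requests: int, server_errors: int) -> bool:
    """True if the window's success percentage is at or above the tier target."""
    if total_requests == 0:
        return True  # assumption: a window with no traffic counts as compliant
    successes = total_requests - server_errors
    # Compare successes / total >= target / 100 without using floats.
    return successes * 100 >= TIER_TARGET_PERCENT[tier] * total_requests


def error_budget(tier: str, total_requests: int) -> int:
    """Whole number of failed requests the tier tolerates in this window."""
    return total_requests * (100 - TIER_TARGET_PERCENT[tier]) // 100


# Hypothetical window: 10,000 requests, 350 internal server errors.
for tier in TIER_TARGET_PERCENT:
    print(tier, meets_sla(tier, 10_000, 350), error_budget(tier, 10_000))
```

With these sample numbers the window would satisfy the medium and lowest tiers but not the top tier, which tolerates at most 200 failures per 10,000 requests.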

If an obvious tiering striation of the data isn’t visible, you have to think in terms of what you’re comfortable guaranteeing for each level. It may be the case that there are only two tiers of service, not three. I don’t recommend having more than 3–4 tiers, however. You don’t want your service agreements to be so complex that you support all of them badly. And, per the Art of SLOs, if your SLIs aren’t clearly describing when a system is starting to fail due to metric variance, you may not have an effective SLI.
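When the data does show striations, one simple way to look for them programmatically is to sort customers by observed error rate and cut at the widest gaps between neighbors. This is only a sketch of that idea, not a method the tiering discussion above prescribes; the function name, the customer names, and the error rates are all made up for illustration.

```python
def split_by_largest_gaps(error_rates: dict[str, float], tiers: int) -> list[list[str]]:
    """Group customers into `tiers` bands by cutting at the widest gaps
    between adjacent error rates (lowest error rate first)."""
    ranked = sorted(error_rates, key=error_rates.get)
    if tiers < 1 or tiers > len(ranked):
        raise ValueError("tiers must be between 1 and the number of customers")
    # Gap i sits between ranked[i] and ranked[i + 1].
    gaps = [
        error_rates[ranked[i + 1]] - error_rates[ranked[i]]
        for i in range(len(ranked) - 1)
    ]
    # Indices of the (tiers - 1) widest gaps, turned into slice boundaries.
    widest = sorted(range(len(gaps)), key=gaps.__getitem__, reverse=True)[: tiers - 1]
    bounds = [0, *sorted(i + 1 for i in widest), len(ranked)]
    return [ranked[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical observed error rates per customer.
observed = {
    "billing": 0.004,
    "search": 0.006,
    "reports": 0.031,
    "exports": 0.035,
    "batch-etl": 0.082,
}
for band in split_by_largest_gaps(observed, tiers=3):
    print(band)
```

If the widest gaps are not much larger than the others, the split is arbitrary, and that is the situation where you fall back on what you are comfortable guaranteeing for each level.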

Your SLAs are a mechanism for clearly setting expectations about what your customers can expect from you. But you do not want to be everything to everyone. Make sure to protect your team, and your customers, by providing them with SLAs that make sense for the level of service they need. And don’t be afraid to use these metrics to define what a good citizen in your ecosystem looks like as well.

Jamie Allen
Site Reliability Engineering Leadership

SRE CTO. Ex-Software engineering leader behind Starbucks Rewards and MOP. Ex-Facebook SRE leader.