SLI Deep Dive

Jamie Allen
Site Reliability Engineering Leadership
Oct 18, 2021


When I talk to people about Service Level Indicators, there is a fair amount of confusion about the work your teams must do to implement them successfully. In my talks explaining SRE, I’ve mentioned that SLIs are the “what”: the measures of your system’s availability that you then track through adherence to the SLOs you set for them. I’ve stated in my How To Implement SRE In Your Organization post that you could spend a whole quarter having your teams identify SLIs for the services and applications they support. However, just identifying what you will measure is not enough for a team to have completed that effort. In this post, I’ll outline the expectations for completely delivering SLIs.

How Do You Define SLIs?

It is paramount that you track the metrics that best demonstrate success for your users/customers, and there should not be more than three or four of them. SLIs can exist in a functional application, and they can also exist in a managed service or platform that those applications use, such as managed Kubernetes clusters that all services in an organization leverage (operated by a separate platform team). Teams that deploy on services provided by others should learn which SLIs those upstream dependencies use to measure their own success, so they can be confident that the services they depend on are focused on the areas that matter most to them as customers. Make sure you have these conversations with the teams providing you with managed services or capabilities.

For many services and applications, there is a tendency to be lazy about defining SLIs and simply adopt the Four Golden Signals: Throughput, Latency, Saturation, and Error Rate. On its face, this seems to make sense. But out of those four values, which ones directly affect the customer experience? Consider a Customer service that is the single source of truth for customer data inside a large consumer products company. In this example it exposes a web-based HTTP RESTful JSON API for simplicity’s sake, as the technical details aren’t that important. Should each of the Four Golden Signals be an SLI?

  • Throughput: No. Throughput can be an underlying cause of a diminished customer experience; as it rises it can affect users through saturation/scaling issues (no capacity left to serve them) or increased latency. But since it is not a direct measure of customer success, it is not an SLI.
  • Latency: Yes. This is a metric that directly impacts your users and should be tracked. Measuring latency isn’t a simple task, as it should be considered at various throughputs. You do not care whether the slowness comes from saturation, upstream database issues, or network routing problems; those are secondary concerns that you dig into when issues arise via runbooks (or, hopefully, automated responses and mitigations, because Runbooks Are Toil). With latency, you do not want to mask “long tail” experiences by using averages. Measure your latency at various throughputs in your performance testing and use those readings to come up with your SLOs: for example, what are the 95th, 99th, and 99.9th percentiles at 1,000 requests per second, versus 5,000, versus 25,000? This creates a histogram view of latencies that is far more descriptive of the operational characteristics of your service (see the sketch after this list). For more information, please see Gil Tene’s excellent talks and writings about “Coordinated Omission.” You can Google and find tons of presentations he’s done on the topic, but this is a decent primer.
  • Error Rate: Yes. This also has a direct impact on the customer/user experience, as it represents the Request Success Rate of the API. How often does your server return an HTTP 500-level response code? Those are the easy ones. But this metric also covers how often your service returns the wrong data for an API call, which is far more insidious. For example, imagine receiving a request for a Customer’s information from a user who does not have authorization to access that data, or looking up that data and accidentally returning someone else’s information. Those cases are much harder to detect in production, so make sure you have thorough functional tests to catch them before releasing new versions.
  • Saturation: No. Saturation is not a customer-facing metric, though when it is very high it can indirectly manifest as latency and error-rate problems. We want our SLIs to reflect the direct experience of our users, not the underlying root cause. Saturation is also the hardest of the Four Golden Signals to get right, as it is not obvious which metric you should use to decide when to scale your service out to more hosts or in to fewer. (As an aside, I don’t like saying “scale up/down” because it implies vertical scaling, which is host size, as opposed to horizontal scaling to more or fewer hosts.) You have to develop performance tests that execute in an environment that exactly mirrors production to effectively define the saturation metric you care most about, or you can watch production deployments and define it after the fact, which happens a lot because so few organizations have test environments that truly match production. I frequently see people talk about memory and disk as scaling metrics, but most of my deployments have used concurrent connections as the saturation scaling metric.
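To make the two “yes” signals concrete, here is a minimal sketch (Python is my choice here, not something any particular tooling prescribes) of computing an Error Rate SLI and the throughput-bucketed latency percentiles described above from raw request records. The record fields, the 5xx-only failure definition, and the one-second window are simplifying assumptions; wrong-data errors like the authorization example above will never show up in a status code.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Request:
    timestamp: float   # seconds since epoch, as observed at the API edge
    latency_ms: float  # end-to-end latency for this request
    status_code: int   # HTTP status returned by the Customer API

def error_rate(requests: List[Request]) -> float:
    """Error Rate SLI: fraction of requests answered with a 5xx response.
    (Wrong-data responses are invisible here -- catch those with functional tests.)"""
    if not requests:
        return 0.0
    failures = sum(1 for r in requests if r.status_code >= 500)
    return failures / len(requests)

def latency_percentile(requests: List[Request], pct: float) -> float:
    """Latency at the given percentile (95.0, 99.0, 99.9) -- never an average."""
    ordered = sorted(r.latency_ms for r in requests)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100.0))
    return ordered[index]

def tail_latency_by_throughput(requests: List[Request],
                               window_s: float = 1.0) -> List[Dict[str, float]]:
    """Bucket requests into fixed windows, then report throughput and tail latency
    per window: the histogram-style view of latency at different request rates."""
    windows: Dict[int, List[Request]] = {}
    for r in requests:
        windows.setdefault(int(r.timestamp // window_s), []).append(r)
    rows = []
    for _, reqs in sorted(windows.items()):
        rows.append({
            "rps": len(reqs) / window_s,
            "p95_ms": latency_percentile(reqs, 95.0),
            "p99_ms": latency_percentile(reqs, 99.0),
            "p99_9_ms": latency_percentile(reqs, 99.9),
        })
    return rows
```

In practice these numbers would come out of your metrics pipeline rather than ad-hoc code, but the shape of the calculation is the same.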

Going back to this example, we now have two SLIs for our Customer web service. What others might we consider that directly impact our user experience? I’ve seen posts where people describe Durability and Response Time as SLIs separate from the Four Golden Signals, but I see those as captured by the user experience in Error Rate and Latency respectively, so I consider them secondary concerns. With SLIs we want to minimize the count, because they are the items we watch most closely and alert on, and we should never have more than three or four of them. That said, if someone wanted to use Saturation as an SLI for a web or event-driven service, I wouldn’t argue too much with them, because it is operationally important and often worth keeping a close eye on. Sometimes being too religious about what the team should or shouldn’t do is a hindrance to moving on, so pick the battles you really want to fight and the hills you wish to die on with care.

Web and event-driven services are relatively easy to define SLIs for. What can be very troublesome is providing a managed service to multiple tenants, with additional complexity when you have upstream dependencies of your own. You may have heard about OSS projects removing external dependencies from their software because of the complexity those dependencies introduce and the lack of control over their performance. A famous example is Kafka removing its dependency on ZooKeeper. ZooKeeper wasn’t the problem itself; it is a well-respected and proven (probably the MOST proven) distributed consensus implementation. But operationally, it added complexity and inefficiency to maintaining and running the service. I can empathize, as I’ve operated platform services that depended on ZooKeeper to hold a shared distributed lock declaring which host in a cluster was the primary and which were the secondaries. If a primary replica went down, I needed to promote a secondary to be the new primary, and I could only do that within the confines of ZooKeeper’s operational capabilities, which did not meet the needs of my most stringent users. In cases like that, where the operational performance of something so specific mattered to my users, I accepted Primary Replica Promotion Latency as an SLI for the platform service, and I think this is fairly common for any service in the distributed/sharded storage domain.
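As a rough illustration of what such a platform SLI can look like in code, here is a small sketch of recording Primary Replica Promotion Latency. The `cluster` and `metrics` objects are hypothetical stand-ins for whatever coordination layer and telemetry client you actually use; the point is that the thing your users feel (how long a shard was without a primary) is what gets measured.

```python
import time

def promote_and_record(cluster, metrics) -> float:
    """Promote a secondary to primary and record how long it took as an SLI sample.
    `cluster.promote_secondary()` and `metrics.observe()` are hypothetical."""
    start = time.monotonic()
    cluster.promote_secondary()  # e.g. win the lock in ZooKeeper and flip roles
    elapsed_s = time.monotonic() - start
    metrics.observe("primary_promotion_latency_seconds", elapsed_s)
    return elapsed_s
```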

Many people think in terms of “uptime” and whether your service is “up” or “down,” but that will show up in Error Rate. Uptime captures YOUR experience, not that of your users, and is therefore the operational equivalent of navel gazing. Focus on the metrics that matter most to your most important users, and you will be successful.

Why Do SLIs Have Values?

SLIs need to be monitored and observed. You need to know their range of values so you can define an SLO for them. That means it is not enough for your teams to have identified which SLIs they are going to use; they must also have instrumented them and begun collecting data about their operational ranges, so that they understand what the SLIs look like from their customers’ viewpoint. They must not only monitor the SLIs (see their current real-time state) but also have observability set up for them (queryable historical data over time). It is from these readings that they will be able to craft achievable SLOs and realistic SLAs. So when your teams say they “have their SLIs,” always ask whether those SLIs are fully monitored and historically observable; otherwise the task is not complete. When they present their proposed SLIs, they should have a clear idea of the historical floor and ceiling values for each metric, at least for the timeframe they have watched them.
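As a sketch of what “knowing the floor and ceiling” can look like, assuming you can export historical SLI readings from your observability store (say, one p99 latency value per five-minute window), a summary like the following is enough to ground an SLO proposal:

```python
from typing import Dict, List

def summarize_sli_history(samples: List[float]) -> Dict[str, float]:
    """Summarize an SLI's observed operational range so an achievable SLO can be
    proposed. `samples` is a list of historical readings for one SLI."""
    if not samples:
        raise ValueError("no SLI history collected yet -- the task is not complete")
    ordered = sorted(samples)
    n = len(ordered)
    return {
        "floor": ordered[0],
        "ceiling": ordered[-1],
        "p50": ordered[n // 2],
        "p95": ordered[min(n - 1, int(n * 0.95))],
        "p99": ordered[min(n - 1, int(n * 0.99))],
    }
```

If the ceiling of your p99 latency over the last month was 240 ms, an SLO of “p99 below 300 ms” is defensible, while one of 100 ms is wishful thinking. The specific numbers here are invented for illustration.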

How Do You Monitor SLIs?

It may seem self-evident, but this is one of the areas that grows in complexity with your footprint. I have supported services that had dedicated deployments for specific important customers while simultaneously running multi-tenant versions of the service for the long tail of customers whose requirements weren’t as stringent. On top of that, do you have dashboard views of each host, each availability zone, or each physical region the service is deployed to? Even with only two SLIs for our service, our dashboards can become very complex very fast due to the deployments we need to track.

When it comes to my most important customers/users, I want to know at a regional level if there is an issue in their deployment. For less important customers, I’m okay with rolling up the operational view to show that there are no issues in any SLI globally. The difference is the importance of the customer. Just as in my post about how SLAs Should Be Tiered, your most important users deserve the highest level of service, and how you track your SLIs on their behalf should reflect that. Consider that when you create your dashboards.
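One way to express that tiering, sketched here with invented sample and tier shapes, is to aggregate the same SLI differently depending on who is consuming the dashboard: per region for your most important customers, a single global rollup for everyone else.

```python
from collections import defaultdict
from typing import Dict, List

def rollup_error_rate(samples: List[dict], top_tier: bool) -> Dict[str, float]:
    """Aggregate error-rate samples per region for top-tier customers, or into one
    global number otherwise. Each sample is assumed to look like
    {"region": "us-west-2", "errors": 12, "requests": 48000}."""
    if top_tier:
        totals = defaultdict(lambda: [0, 0])
        for s in samples:
            totals[s["region"]][0] += s["errors"]
            totals[s["region"]][1] += s["requests"]
        return {region: errs / reqs for region, (errs, reqs) in totals.items()}
    errors = sum(s["errors"] for s in samples)
    requests = sum(s["requests"] for s in samples)
    return {"global": errors / requests}
```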

What About the Root Causes of SLI Blips?

I mentioned earlier that some things are root causes for why an SLI is not performing well operationally. For example, we might note that latency has gotten considerably worse for our users in the past 15 minutes, and while we haven’t yet broken our SLO, we can start looking into why. Latency can come from several different factors, and they may not be within your service. You may be collecting latency data in the user’s client (such as Google Analytics embedded in a web page), and it could be the user’s connection to the Internet that is the problem. If you have a secondary dashboard for the Latency SLI of the service, you can quickly see whether it is your own network connection, database latency, or some other concern. You’ll build increasingly useful secondary dashboards for each SLI over time that will help you quickly ascertain the root cause of an issue, or at least point you in the right direction. Try to understand your upstream dependencies well before you go live and predict what might be useful, but remember that perfection is the enemy of good enough. It’s okay not to have all of these in place before you go live, but be prepared to invest in making them better based on the performance of your system in the real world.
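A secondary dashboard is ultimately just the same latency broken down by where the time went. Here is a minimal sketch, assuming your tracing or APM tool can give you per-component timings for a request; the component names and numbers are hypothetical.

```python
from typing import Dict

def latency_breakdown(timings_ms: Dict[str, float], total_ms: float) -> Dict[str, float]:
    """Attribute end-to-end latency to its components so a secondary dashboard can
    show whether the problem is your service, an upstream dependency, or the
    user's own network."""
    shares = {name: ms / total_ms for name, ms in timings_ms.items()}
    shares["unattributed"] = max(0.0, 1.0 - sum(shares.values()))
    return shares

# Hypothetical example: a 480 ms request where 300 ms was database time points at
# the upstream dependency rather than your own code or the user's network.
print(latency_breakdown(
    {"client_network_ms": 60.0, "service_ms": 90.0, "database_ms": 300.0},
    total_ms=480.0,
))
```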

I hope that clears up some of the questions you might have about SLIs and how to define them. Please do ask questions; I’m happy to answer them.
