In the first of a two-part series, Lucian Craciun and Dave Sanders share their rationale for implementing Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and explain how these have helped to unify teams and departments around a common goal. They also outline the steps that they, and other organisations, can take to put the principles into practice.
The Problem: Seeing the Signal Through Noisy System Alerts
Alert fatigue is a real thing. Until recently, our technical team — including senior leaders — regularly received a high volume of alerts on deep system metrics, such as disk space for specific storage arrays and CPU utilisation for individual servers. While there was a time when technology managers needed to actively monitor these kinds of granular metrics, today, these alerts just create distracting noise. For example, in a dynamic cloud environment, if a disk gets full, it shouldn’t trigger an alert because that disk will automatically get replaced.
These noisy, non-contextualised alerts were becoming a major problem for us — and they were causing us to lose focus on what was really impacting the business. To assess and maintain the health of our distributed systems running high-scale services on cloud platforms, we needed a new performance monitoring strategy that could deliver meaningful and actionable information.
The Solution: Monitoring with SLIs and SLOs
We weren’t alone in our frustration with noisy alerts. Google published the Site Reliability Engineering (SRE) book in 2016, followed by The Site Reliability Workbook in 2018. Both of these books revealed a compelling, new way to manage large-scale, dynamic, customer-facing systems. Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are a key pillar of SRE and are the principal tool for eliminating needless alerts and focusing on what matters to the business.
The premise of SLIs/SLOs is that all teams — product, architecture, development, and platform — need to look at services from the customer’s perspective. Instead of basing your SLO on the idea that “this service must be up and running because it has to be,” you try to meet your SLO so you can make your customers happy. SLIs/SLOs completely shift the mindset from “I’m responsible for Service X in a very complex, vague backend environment way” to “If I don’t meet this SLO my customer is going to be unhappy.” For us, that was when the light bulb turned on.
At many companies, including The Telegraph, architecture and ops/platform teams care deeply about ensuring that services meet nonfunctional requirements, such as security, availability, and reliability. Architects and ops teams can be perceived as a nuisance to developers when they bang on about: “You have to make sure your service is performant, secure, and available” or “This component must respond within two seconds.” SLIs/SLOs are a tool that totally changes this dynamic by unifying teams around a shared goal: improving the customer experience.
SLIs/SLOs are an excellent lever to get buy-in from developers to build services that are more available, more functionally correct, and faster. Together, product owners, architects, and developers define the top 2–4 behaviours they want from each of their services (as experienced by customers) and negotiate an appropriate objective. In the end, the SLO requirement is baked into the development team’s sprints, and developers no longer need to incorporate seemingly arbitrary demands from architects and platform engineers.
We have a well-established DevOps culture and practice at The Telegraph. Google’s SRE thought leaders say, “SRE implements DevOps,” so we see SRE practices as a next step in the evolution of our DevOps journey. By implementing SLIs and SLOs, we are reinforcing our DevOps mindset and culture and building a strong foundation for adopting other SRE practices.
Throughout 2017 and 2018 we were in learning mode. We benefited from discussions with our colleagues at Google and utilised the resources they offered, such as classes and tutorials. By the end of 2018, we were ready to implement the change.
Our goal was to implement SLIs and SLOs for each of the new services we launched or re-launched in 2019 (which ultimately turned out to be 14 services). And our ongoing goal is to implement SLIs and SLOs for our existing 100+ services as we modify them over time. Our CTO sponsored this initiative, and our Architecture team drove this project with support from the rest of the technology organisation.
Below, we’ll outline our general approach to specifying and implementing SLIs, and walk through three concrete examples.
Phase 1: SLI Specification (Product + Development + Architecture Teams)
Typically, our Product and Development teams get paired with a solution architect when they start a new project. An important part of the architect’s role is to ensure the service meets key nonfunctional requirements. To accomplish this, the architect facilitates discussions between product and engineering to ensure appropriate SLIs/SLOs are incorporated into each project implementation.
Step 1: Define the service boundary
To effectively measure a service, you have to define the boundary of that service. A service can be pretty much anything that provides functionality to a user. For us, this means that each of the following would be considered a separate service:
- an API
- the telegraph.co.uk site
- the login page
- the registration page
- the mobile app
- the RSS feed
- the journalists’ Authoring Tool
- and many more
Step 2: Determine the 2–4 behaviours for each service that matter most to users
The next step is to identify how you want each service to behave for users, factoring in business context and user expectations. The Product team’s input is key at this stage because they act on behalf of the customer. After this, we define metrics that will accurately measure the desired behaviour of the service from the perspective of the user.
Each SLI is calculated as the number of “good” events (however defined by the SLI specification) divided by the total number of “valid” events over some period of time, expressed as a percentage:

SLI = (good events / valid events) × 100%
A “valid” event is contextual and can vary — it is up to the technical team to define as part of the SLI specification. For example, an HTTP 401 (unauthorised) response might be a “valid” response for an identity API and considered as part of the SLI, but the same response might be unimportant or ignored for a secured content API.
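To make the good/valid calculation concrete, here is a minimal sketch in Python. The function and the status-code rules are illustrative, not our production tooling; it simply shows how the same traffic yields different SLIs depending on how each service defines validity, using the 401 example above.

```python
# Sketch: computing an SLI as the percentage of "good" events out of
# all "valid" events. Classification rules are illustrative only.
def sli(statuses, is_valid, is_good):
    """statuses: a window of HTTP response codes for one service."""
    valid = [s for s in statuses if is_valid(s)]
    good = sum(1 for s in valid if is_good(s))
    return 100.0 * good / len(valid) if valid else None

statuses = [200, 200, 401, 503, 200, 401]

# Identity API: a 401 is a valid, expected response (correctly rejected
# credentials), so it counts as a good event; 5xx responses are bad.
identity = sli(statuses, is_valid=lambda s: True, is_good=lambda s: s < 500)

# Secured content API: 401s are ignored entirely (not valid events);
# only 5xx responses count as bad.
content = sli(statuses, is_valid=lambda s: s != 401, is_good=lambda s: s < 500)
```

With this sample window, the identity API scores 5 good out of 6 valid events, while the content API scores 3 good out of 4, because the two 401s drop out of its denominator altogether.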
Step 3: Negotiate the objective for each service behaviour
The SLO is the target percentage of good events out of the total number of valid events (over the relevant time period). Defining SLOs is a fascinating exercise, because it involves a trade-off between realistic user expectations and the engineering effort required to meet those expectations. We can attempt to design a service with 100% uptime but we might need years to do it — or we can settle on three 9s of availability (99.9% uptime) and build it in a month with 10x fewer resources. We also need to consider other factors, such as the anticipated release schedule and how each release can impact the service. Negotiating SLOs requires us to weigh resources, time, and business objectives.
The “error budget” is the number or proportion of “bad” events (out of the total valid events) that can occur before the SLO is breached. In theory, your SLIs should have reasonably comfortable error budgets separating them from their target SLOs. These error budgets can be spent through the inevitable errors that come with new releases, enhancements, and upgrades.
Phase 2: SLI Implementation (Architecture + Platform Teams)
As we move from specifying to implementing our SLIs/SLOs, the work shifts from the Product team to the Development and Platform teams. The architect remains the lynchpin, ensuring that SLIs/SLOs are fully implemented.
Step 4: Define a concrete way to measure SLIs
Inevitably, a single service will depend on many others (e.g., your CDN, Identity and Access Management provider, or API Gateway). And if you measure your service over the Internet, any latency measurements will include a lot of network hops in the middle that have nothing to do with the performance of your service. As such, these measurements will be misleading. That said, it’s important to strike a balance between isolating your service as much as possible and measuring its performance from your users’ perspective.
Step 5: Build dashboards with alerts
We’ve been building various SLI dashboards in Datadog that are customised to the roles of various individuals and how much information they need to do their jobs. For instance, the Platform team needs to pay close attention to each SLI. If an SLI isn’t meeting its objective, the chart switches from green to red on the dashboard, and alerts are triggered. Members of the Platform team — including the head of the team — look at these dashboards every day. The head of the Platform team (Lucian) has a personal dashboard in Datadog that displays the real-time status of every SLI/SLO we’ve built and every rollup SLI/SLO. (Rollups are useful for summarising service health. Each rollup reports multiple SLIs as one: You simply aggregate the number of “good events” and divide by the total number of valid events.)
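The rollup calculation described above can be sketched in a few lines. Note that a rollup aggregates the raw event counts rather than averaging the individual SLI percentages, which would give a low-traffic service the same weight as a high-traffic one. The function and sample counts below are illustrative assumptions, not our Datadog configuration:

```python
# Sketch of a rollup SLI: sum the "good" and "valid" event counts across
# services, then recompute a single percentage from the totals.
def rollup_sli(services):
    """services: list of (good_count, valid_count) pairs, one per SLI."""
    good = sum(g for g, _ in services)
    valid = sum(v for _, v in services)
    return 100.0 * good / valid if valid else None

# Hypothetical counts for three services over the same window.
counts = [(995, 1000), (48, 50), (9900, 10000)]
overall = rollup_sli(counts)
```

Averaging the three percentages here would give 99.1%, while the count-weighted rollup gives roughly 99.03%, because the largest service dominates the totals.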
SLIs and SLOs are a powerful way to communicate service health with stakeholders. We’re currently building a dashboard for our CTO. While the Platform team needs to monitor SLIs for each component of the platform, our CTO simply needs to see a high-level summary of platform health (in effect, a giant rollup of all those SLIs). We’ll aggregate the data into an overall platform SLI, and below that we’ll likely display one key indicator for each underlying service. If anything is over the error budget it’ll go red so the CTO can notice it right away.
Step 6: Get confidence in the SLI measurements
At this stage, using error budgets as a basis for decision making is still not possible. Once you are actually reporting SLIs, it takes time to build confidence in what you’re reporting, and then to align team behaviour with what the numbers show.
Inevitably, you’ll need to work out some kinks before you can build confidence in what’s being reported. For example, it’s important to make sure that invalid events are being filtered out of the SLI measurements. When we first started, some SLIs were reporting values that didn’t necessarily align with what was actually happening on the customer side. So we dug a bit deeper and discovered that we were mistakenly considering some events to be valid when they were actually invalid. An example might be an HTTP 404 where a user has put in an invalid URL. We might not want to consider these cases (user error) in our SLI/SLO.
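The effect of misclassifying events is easy to demonstrate. The sketch below (illustrative code, not our implementation) shows how counting user-error 404s as valid bad events understates a service’s health, and how filtering them out of the denominator changes the reported SLI:

```python
# Sketch: the same traffic yields different SLIs depending on which
# events are treated as "valid". Here, 404s caused by user-typed bad
# URLs are arguably not the service's fault.
statuses = [200] * 95 + [404] * 3 + [500] * 2

def sli(statuses, is_valid, is_good):
    valid = [s for s in statuses if is_valid(s)]
    good = sum(1 for s in valid if is_good(s))
    return 100.0 * good / len(valid)

# Naive: every request is valid, anything non-2xx is bad.
naive = sli(statuses, is_valid=lambda s: True, is_good=lambda s: s < 400)

# Filtered: user-error 404s are excluded from the valid events entirely.
filtered = sli(statuses, is_valid=lambda s: s != 404, is_good=lambda s: s < 400)
```

With this sample window the naive SLI reports 95.0%, while the filtered SLI reports about 97.9%; the two-point gap is entirely user error, not service misbehaviour.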
Getting confidence in the veracity of SLI measurements is a big hurdle. Without overcoming that roadblock, it’s difficult to change behaviour. Fortunately, we’ve arrived at a point where we genuinely believe the numbers being reported in the dashboards. However, at this stage, we still aren’t prepared to take any additional risks just because we are under our error budget.
Phase 3: SLI Operationalisation (All Teams)
Now, we need to embed SLIs into every team’s workflow. It’s one thing to have accurate measurements and pretty dashboards — it’s another thing entirely to use these tools to continuously monitor and improve the customer experience. In the steps below, we’ll explain how we delivered on our SLIs.
Step 7: Set alerts, escalation paths, and regular reviews based on SLIs/SLOs
Today, nobody needs to get low-level alerts. Instead, for any new service, the technical team prioritises user-focused SLI measurements. If one of the indicators goes down for a certain period of time, they consider that service down — an on-call engineer immediately gets alerted and an incident is created. We never wait for an SLO to be breached before we loop in the engineers. The goal is to spend the error budget on releases (if downtime is required) or any other A/B tests the team might want to run, and generally use the error budget on business value projects rather than incidents. When the Platform team gets alerted, the head of the team helps serve as an escalation point, rather than receiving alert notifications about SLIs and SLOs.
Both of us report to The Telegraph’s CTO, Toby Wright. We share SLI/SLO results with him by reviewing a snapshot of the Datadog dashboard in our weekly team meeting. When appropriate, these SLI/SLOs are used as a basis for escalations. Since they display hard data that directly correlates with customers’ experiences, they paint an unambiguous picture of where to focus our efforts when we experience problems. They are also useful for investigating issues reported by our contact centre. When our customers can’t log in to the website, they certainly let us know about it!
Step 8: Automate SLI/SLO implementation for new services
Currently, once the Product, Development, and Architecture teams define everything they need, the Platform team runs its scripts and creates the SLI/SLO dashboards in Datadog. At the moment, the Platform team automatically creates one alert per SLI and the team is able to go into Datadog and add more alerts (for example, warnings about error budgets indicating that SLOs are close to getting breached). However, the Platform team is layering in more automation and building SLI/SLO dashboard templates in order to scale this initiative.
The Platform team is currently building a self-service internal Platform as a Service (PaaS) that bakes SLIs/SLOs right into the deployment process. The goal is for the developers to specify what they need (e.g., “I have this new API, these are the SLIs, and these are the SLOs. Click.”) — and then the PaaS will create that infrastructure for them. The SLIs/SLOs should be defined as code and stored in the same repository as the application code. The Datadog infrastructure (integrations, alerts, and dashboards) will be defined as a JSON or YAML config file in the same code repository, and a Jenkins pipeline will actually take that config and turn it into Datadog dashboards and alerts.
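As a rough illustration of what SLIs/SLOs defined as code might look like, here is a hypothetical YAML fragment. The schema and field names are invented for this sketch; they are not Datadog’s configuration format or our actual pipeline definition:

```yaml
# Hypothetical SLI/SLO definition, stored in the application's repository
# and consumed by a CI pipeline that provisions dashboards and alerts.
service: registration-api
slis:
  - name: availability
    good: "http.status_code < 500"
    valid: "http.status_code != 404"   # exclude user-error 404s
    slo:
      target: 99.9
      window: 30d
  - name: latency
    good: "http.latency_ms < 2000"
    valid: all
    slo:
      target: 99.0
      window: 30d
```

Keeping this file next to the application code means SLI/SLO changes go through the same review and deployment process as any other change.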
Step 9: Use error budgets to negotiate updates and new releases
We mentioned earlier that we prefer to burn our error budgets through business value–adding updates and releases rather than incidents. We’re just beginning to reach this final stage of maturity in which we strategically use error budgets as a key input for our release roadmap (negotiating updates and new releases), as well as for periodically renegotiating SLOs.
Part 2: Real-life examples of SLIs and SLOs in action
In Part 2, Dave and Lucian will describe the scenarios where SLIs and SLOs have been implemented, how they took shape, and the benefits they have brought to The Telegraph.
Lucian Craciun is Head of Technology — Platforms & Engineering at The Telegraph.
Dave Sanders is Head of Technology — Newsroom at The Telegraph.