For engineering teams, it is important that service levels are consistent with business requirements. To track these, teams can use Service Level Objectives (SLOs), which set specific targets for measured service level indicators. Google’s Site Reliability Engineering book, published in 2016, shed some light on SLOs, but we haven’t seen significant adoption of the practice until now. We think teams have recently increased implementation of SLOs because microservice adoption has reached a critical mass inside their businesses, which in turn requires greater observability. We believe SLOs will become a best practice because they: 1) offer service health visibility, 2) facilitate data-driven product development decisions, and 3) lessen alert fatigue. While SLOs are best known as a tool for SRE teams, some businesses are adopting them across the entire engineering organization. Sharing this data across the organization gives all participants a clear understanding of the system’s reliability.
Before we dig in, there are four interrelated terms individuals interested in tracking service levels should be aware of: Service Level Indicator (SLI), SLO, Service Level Agreement (SLA), and error budgets.
- SLI is a quantitative measure of some aspect of the level of service. Examples of SLIs include request latency, error rate, system throughput, availability, yield, and durability.
- SLO is a target value or range of values for a service level that is measured by an SLI. For example, an SLO could be an average latency per request under 200 milliseconds.
- SLA is an explicit or implicit contract with users that defines the service standard the provider is obligated to meet. Often, the consequences for missing an SLA are financial, like an account credit or reimbursement. For example, an average latency per request of 500 milliseconds could result in a payout to a customer if the SLA states it was supposed to be 300 milliseconds max. An SLA can map to an SLO and is typically looser than the SLO: in the examples here, the SLA allows 300 milliseconds while the SLO targets 200 milliseconds.
- An error budget indicates how much time an SLI can be outside the appropriate range before it breaches the SLO. Error budgets help teams determine whether they are on track to hit targets and whether development velocity is appropriate for stated performance and stability goals.
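To make these terms concrete, here is a minimal sketch in Python of how an availability SLI, an SLO target, and an error budget relate arithmetically. The class and function names (`Slo`, `availability_sli`, `error_budget_minutes`, `budget_remaining`) and the numbers are illustrative assumptions, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical availability SLO over a rolling window."""
    target: float      # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int   # length of the evaluation window


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_minutes(slo: Slo) -> float:
    """Time-based error budget: minutes of allowed unavailability in the window."""
    window_minutes = slo.window_days * 24 * 60
    return (1.0 - slo.target) * window_minutes


def budget_remaining(slo: Slo, good_requests: int, total_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if breached)."""
    allowed_bad = (1.0 - slo.target) * total_requests
    actual_bad = total_requests - good_requests
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    slo = Slo(target=0.999, window_days=30)
    print(f"Budget: {error_budget_minutes(slo):.1f} minutes per {slo.window_days} days")
    print(f"SLI: {availability_sli(999_500, 1_000_000):.4f}")
    print(f"Budget remaining: {budget_remaining(slo, 999_500, 1_000_000):.0%}")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of allowed unavailability, and a service that has failed 500 of 1,000,000 requests has spent about half of its budget.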
In speaking with operators, we hear again and again that one of the biggest challenges with adopting SLOs is the first step: defining the metrics. Additionally, because multiple SLIs can feed an SLO, it can be tricky to decide how much weight each SLI should carry. Once those hurdles are overcome, however, SLOs have numerous benefits: 1) providing service health visibility, 2) facilitating data-driven product development decisions, and 3) lessening alert fatigue.
1) Fundamentally, SLOs demonstrate a service’s health. Without an SLO, teams lack a programmatic mechanism to state the acceptable level of downtime for the service or to tell whether there is a significant issue. Often called “the reliability metric,” SLOs can shine a light on issues that do not result in a full incident. They can operate as a magnifying glass and help prevent future issues.
2) SLOs are a mechanism for operations/infrastructure teams to help inform product decisions through data. The metrics can be used to demonstrate the tradeoff between developing new features and burning down technical debt. SLOs offer a data-driven way to decide whether engineers should focus on shoring up existing features or building new ones. In short, SLOs inform engineering investment decisions.
3) SLOs have the ability to reduce alert fatigue. By tying policies around alert triggers to SLOs, teams can naturally pare down signals to the few that truly matter.
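One common way to tie alert policies to SLOs is burn-rate alerting, as described in Google’s Site Reliability Workbook: page only when the error budget is being spent much faster than the SLO allows. The sketch below is an assumption about how a team might express that rule, not a description of any vendor’s implementation. The function names are hypothetical, and the default threshold of 14.4 corresponds to spending 2% of a 30-day budget within one hour.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both a long and a short window burn faster than the threshold.

    The short window keeps the alert from firing after the problem has stopped.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% of requests failing against a 99.9% SLO burns budget about 20x too fast.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # The long window still looks bad, but the short window has recovered.
    print(should_page(0.02, 0.0005, slo_target=0.999))
```

Because the rule is expressed in terms of budget consumption rather than raw error counts, a brief blip that barely dents the budget does not page anyone, which is where the reduction in alert noise comes from.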
Preventative efforts to improve SLO performance include load testing and chaos engineering (for example, with Gremlin) to find weaknesses in the system. These techniques help unearth underlying factors negatively impacting an SLO and catalyze remediation.
Most of the teams we spoke with built homegrown SLO tracking by connecting to existing observability solutions. Seeing businesses build internal SLO systems, vendors such as Datadog, Squadcast, ChaosIQ, and Blameless, among others, have started offering SLO functionality.
SLOs are becoming a core piece of the stack for engineering teams. We are excited to watch as SLOs increase in popularity. Further, we believe there is an opportunity to tie SLOs to product features so teams can understand the reliability of different aspects of the product. From there they can correlate this data to support tickets to further understand the product components that need engineering help the most.
If you or someone you know is working on an SLO tracking startup or adjacent offering, it would be great to hear from you. Comment below or email me at firstname.lastname@example.org to let us know.