SLO Management

Published in DraftKings Engineering · 5 min read · Sep 19, 2023

Error Budget — Motivation
Terminology
Choosing SLI Metrics
Steps to Implement SLOs and SLIs
Common Pitfalls and suggestions on how to avoid them
Examples

Service Level Objective (SLO) management focuses on measuring and improving the reliability of services by setting and meeting specific targets for performance and behavior. This approach helps align the expectations of service providers and customers. It provides a framework for improving service reliability since service providers can define clear performance targets and metrics that meet the needs of their customers. This ensures transparency and accountability and enables effective communication and collaboration, fostering a mutually beneficial relationship based on trust and understanding. SLO management allows service providers to prioritize investments and improvements based on business needs, meet customer expectations, and provide a better overall user experience.

Error Budget — Motivation

Product development performance is mainly evaluated on product velocity, which incentivizes pushing new code as quickly as possible.
On the other hand, site reliability engineering (SRE) performance is evaluated on the reliability of a service, which creates an incentive to push back against a high rate of change.
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a specific time window.
This metric removes the politics from negotiations between SRE and product teams when deciding how much risk to allow, which makes it easier to choose between stability work and a new feature.

Terminology

SLI (Service Level Indicator): a metric that represents a specific aspect of the performance or behavior of a service. Examples include the load time of a web page; the error rate, availability, and latency of an API; and consumer lag, processing time, and failed messages for queues.
SLO (Service Level Objective): a target set for one or more SLIs, representing a level of performance or behavior that a service should meet over a given period. For example, a web page load time of less than 2 seconds or an API error rate of less than 1%. An example of an SLO that combines more than one SLI is a 99% success rate with a response time of less than 500ms for an API request.
SLA (Service Level Agreement): a legal contract that outlines the agreed-upon service levels between a service provider and their customer. It typically includes specific targets for SLOs and outlines the consequences if they are unmet. Other terms and conditions may also be included in the SLA.
Error Budget: the amount of acceptable errors or downtime within a specific period that a service can experience without violating its SLO. The error budget represents the tradeoff between service reliability and innovation, as it allows a service provider to invest resources in new features or improvements while still ensuring that customer expectations are met.

Choosing SLI Metrics

Too many SLI metrics can prevent engineers from focusing on the most critical performance indicators. System boundaries are the points where components expose capabilities to customers, such as authentication. Concentrating on the boundary inherently captures the performance of the various elements involved in delivering those capabilities to customers. For internal/infrastructure components, e.g., storage, caching, and queues, the choice of SLI depends on the component type.

Steps to Implement SLOs and SLIs

The basic steps for selecting and implementing SLI/SLO metrics are:

Identify a helpful system boundary on your platform
Identify the associated customer-facing capabilities at that boundary, for example, a front-end-facing API
Determine what it means for these capabilities to be available for the customer, for example, low latency or fast response time
Use the definition of availability to define one or more SLI metrics, for example, latency < 50ms, response time < 150ms, and response code != 5xx
Start measuring the SLI metrics to get a baseline performance percentage
Based on the baseline performance, define an SLO for each capability

Each logical instance of the platform should get its own SLO, for example, Domain-X Availability
Multiple SLIs for a single capability should be combined into a single SLO for that capability, for example, Domain-X response time < 100ms and latency < 50ms

Track how the SLI performs against the SLO over time
Track SLI correlation with customer satisfaction indicators
Calibrate the SLI data until it matches customer satisfaction and meets the SLO
Use data analysis tools to identify trends and opportunities for improvement
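
The measurement steps above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the `Request` record and its field names are hypothetical, and the thresholds are the example values from the list (latency < 50ms, response time < 150ms, response code != 5xx). In practice the data would come from a monitoring tool rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape, assumed for this sketch only.
    latency_ms: float
    response_time_ms: float
    status_code: int


def is_good(req: Request) -> bool:
    """A request counts as good only if it satisfies every SLI condition."""
    return (
        req.latency_ms < 50
        and req.response_time_ms < 150
        and not 500 <= req.status_code <= 599
    )


def baseline_percentage(requests: list[Request]) -> float:
    """Share of good requests, as a percentage of all requests."""
    if not requests:
        raise ValueError("no requests to measure")
    good = sum(1 for r in requests if is_good(r))
    return 100.0 * good / len(requests)


# Made-up sample data: two good requests and two bad ones.
sample = [
    Request(20, 90, 200),
    Request(45, 140, 200),
    Request(70, 120, 200),  # latency above 50ms
    Request(30, 100, 503),  # 5xx response code
]
print(baseline_percentage(sample))  # 50.0
```

Combining the conditions with `and` mirrors the suggestion to fold multiple SLIs for one capability into a single SLO: a request is either good or bad for the capability as a whole.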

Tools — Many tools are available for monitoring and measuring SLIs and SLOs, such as Prometheus, Grafana, and Datadog. These tools allow service providers to track key metrics in real time, set alerts for when targets are not met, and analyze data to identify trends and opportunities for improvement.

Common Pitfalls and suggestions on how to avoid them

Setting unrealistic or unachievable SLOs

Setting overly ambitious or unachievable SLOs can lead to frustration and demotivation. It is crucial to set SLOs that are both challenging and realistic.

Start by analyzing historical data and understanding the service’s current performance.
Consider the tradeoffs between user experience, cost, and operational feasibility. Regularly review and adjust SLOs as needed.

Insufficient capacity planning

Failure to anticipate and plan for future capacity needs can result in SLO violations during peak usage periods. It is essential to regularly assess the service’s capacity requirements and plan accordingly.

Monitor usage patterns, conduct load testing, and implement appropriate scaling mechanisms to handle increased demand.
Continuously analyze and optimize capacity to prevent capacity-related SLO violations.

Neglecting customer expectations and feedback

SLOs should align with customer expectations and business requirements. Neglecting customer feedback and failing to understand evolving expectations can lead to dissatisfaction even if the service meets its technical SLOs.

Actively engage with customers, gather feedback, and involve them in setting and refining SLOs.
Regularly communicate service performance and seek feedback to ensure customer satisfaction.

Lack of accountability and ownership

SLO management requires clear accountability and ownership across teams and individuals. With clear ownership, driving improvement and resolving issues effectively becomes easier.

Define roles and responsibilities for SLO management, ensuring that individuals or teams are accountable for meeting the defined objectives.
Foster a culture of ownership and encourage collaboration to drive continuous improvement.

Examples

We want to measure user satisfaction, and we will do it by defining SLIs, SLOs, and thresholds that will indicate user satisfaction.

SLO:

Login is available 99.9% of the time over a period of 28 days
99.9% availability results in an allowance of 40.32 minutes of downtime per 28 days (28*24*60*0.1/100)
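
The same arithmetic can be wrapped in a small helper so it can be reused for other targets and windows. This is a minimal sketch; the function name `downtime_allowance_minutes` is made up for illustration.

```python
def downtime_allowance_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of downtime allowed by an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


# The Login SLO from this example: 99.9% over 28 days.
print(round(downtime_allowance_minutes(99.9, 28), 2))  # 40.32
```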

SLIs to support the SLO:

99th percentile requests are successful

AND

99th percentile latency is less than 2s
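
To make the latency condition concrete, here is a rough sketch of a 99th percentile check against the 2s threshold. It assumes the nearest-rank percentile method, which is only one of several common definitions, and the latency samples are made up; monitoring tools may compute percentiles differently (for example, from histogram buckets).

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_sli_met(latencies_s: list[float], threshold_s: float = 2.0) -> bool:
    """True when the 99th percentile latency is below the threshold."""
    return percentile(latencies_s, 99) < threshold_s


# Hypothetical login latencies in seconds.
mostly_fast = [0.3] * 198 + [3.5] * 2  # 1% of requests are slow
too_slow = [0.3] * 197 + [3.5] * 3     # 1.5% of requests are slow

print(latency_sli_met(mostly_fast))  # True
print(latency_sli_met(too_slow))     # False
```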

SLO thresholds

99.9% — user very satisfied, gets excellent service, can log in and bet.
95% — user satisfied but needs to wait sometimes
80% — user dissatisfied

Between April 3, 2022, and May 1, 2022, we have the following measurements:

Total requests: 1,000,000
Total successful requests (matching SLI conditions): 998,000

Availability(%) = 998000/1000000 = 99.8%

At this point the error budget is exhausted (99.8% is below the 99.9% target), so further deployment is banned or must be reviewed

In a second scenario over the same period (April 3, 2022, to May 1, 2022), we have the following measurements:

Total requests: 1,000,000
Total successful requests (matching SLI conditions): 999,500

Availability(%) = 999500/1000000 = 99.95%
At this point 50% of the error budget is burned and 0.05% remains, so we can deploy
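
The two availability calculations above can be reproduced with a short Python sketch. The function name and the `can_deploy` rule (deploy only while failures stay under the budget) are simplifying assumptions for illustration; the actual policy, such as requiring a review instead of an outright ban, is up to each team.

```python
def error_budget_report(total: int, successful: int, slo_percent: float) -> dict:
    """Summarize availability and error budget consumption for one SLO window."""
    failed = total - successful
    # Number of failed requests the SLO tolerates in this window.
    allowed_failures = total * (100.0 - slo_percent) / 100.0
    return {
        "availability_percent": round(100.0 * successful / total, 2),
        "budget_burned_percent": round(100.0 * failed / allowed_failures, 2),
        "can_deploy": failed < allowed_failures,  # assumed policy, see lead-in
    }


# 998,000 successful requests out of 1,000,000 with a 99.9% target.
print(error_budget_report(1_000_000, 998_000, 99.9))
# 999,500 successful requests out of 1,000,000 with a 99.9% target.
print(error_budget_report(1_000_000, 999_500, 99.9))
```

With a 99.9% target and 1,000,000 requests, the budget is 1,000 failed requests: 2,000 failures burn 200% of it, while 500 failures burn 50%.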

