How to sell SLOs to Engineering Directors

Thomas Césaré-Herriau
Published in Brex Tech Blog
7 min read · Dec 18, 2020

We recently started our Service Level Objectives (SLOs) journey at Brex and thought it would be valuable to share our learnings with this community. This blog post is a redacted internal memo that aimed to familiarize its audience with SLOs, explain the value of an SLO culture, and describe how we would implement and roll them out.

We hope you find it helpful.

As an aside: I internally nicknamed the memo S.L.A.E.I.O.U after this excellent Brazilian disco track. I recommend listening to it while reading!

From: thomas@

To: engineering-directors@

Date: 06/15/2020

Subject: S.L.A.E.I.O.U — An SLO Strategy proposal

Executive Summary

To successfully scale with our customer base, three aspects are critical to maintaining and further improving trust:

  1. improving reliability;
  2. controlling risk; and
  3. increasing iteration speed.

All three are achievable only if we have an accurate way to measure our services. This is best accomplished by defining and capturing Service Level Objectives (SLOs).

With SLOs, we will be able to estimate the business impact of our services’ reliability and decide which services can target lower reliability in exchange for faster iteration, so that we accurately understand and control the risks we take.

Background

To achieve the above, we know we need to:

  • distinguish High Reliability from High Velocity projects, and iterate as fast as the risk tolerance of each allows.
  • continue to increase Brex’s overall reliability as we expand our customer base.

Although early Brex was built as a modular monolith, we have started several efforts to decouple our services, improve the reliability of our core systems, and increase overall developer velocity. However, a key requirement for achieving these goals is the ability to track how well our systems are behaving from our users’ perspective, so that we can make data-backed decisions based on the type of project.

This is where SLOs come in.

In a nutshell, an SLO is an agreed-upon target for a quantitative indicator of how well our service is behaving from our users’ perspective.

Glossary

Critical User Journeys, SLIs, SLOs, SLAs and error budgets are key concepts from Software Risk Management introduced by Google SRE and widely used throughout the technology industry. Here are short definitions:

Critical User Journey: any user facing core feature, for example:

  • Credit card authorization
  • Displaying transactions in the dashboard
  • Receiving a transaction approved SMS

Service Level Indicators (SLIs) are measurements used to determine whether a system is healthy or available from the perspective of the user (e.g. “Percentage of succeeded requests” or “Latency of an action”).

Put another way, an SLI is a quantifiable measure of reliability.

Service Level Objectives (SLOs) define a health or availability goal for a system based on an SLI (e.g. “99.5% of requests succeed” or “99% of requests take less than 200ms”).

Service Level Agreements (SLAs) generally define a legally binding contract with a paying customer about a set of SLOs (e.g. “If less than 99.5% of requests to your service succeed, a refund will be issued”).

Error Budget, derived from an SLO, is the amount of time per period (usually 7, 28, 30 or 90 days) during which a service can violate its target SLO.

A 99.9% SLO has a 0.1% error budget over the period used.
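To make the arithmetic concrete, here is a minimal sketch of converting an SLO target into a time-based error budget. The function name `error_budget` and its parameters are hypothetical and chosen for illustration; the sketch assumes the budget is simply the complement of the target applied to the whole period.

```python
from datetime import timedelta


def error_budget(slo_target: float, period_days: int) -> timedelta:
    """Return how long a service may miss its target within the period.

    slo_target is a fraction (0.999 for a 99.9% SLO); period_days is the
    length of the window (for example 7, 28, 30 or 90).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    budget_fraction = 1.0 - slo_target  # 99.9% target -> 0.1% budget
    return timedelta(days=period_days) * budget_fraction


# A 99.9% SLO over 30 days leaves about 43 minutes of budget.
budget = error_budget(0.999, 30)
print(f"{budget.total_seconds() / 60:.1f} minutes")
```

The same fraction can be applied to a request count instead of a duration when the SLI is request-based rather than time-based.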

Service Scorecard: a methodology to quantify the operational quality of a service. It generally uses checks or measures such as:

  • Does the service have an oncall rotation assigned to it?
  • Does the service have alerts defined?
  • Does the service have a dashboard?
  • Was the service’s architecture reviewed and approved?
  • Test coverage
  • etc.

Goals

  • All Critical User Journeys have defined and associated SLOs.
  • SLOs are used as part of OKRs and project planning processes.
  • High Reliability / High Velocity projects are accurately described with corresponding high / low SLOs.
  • Supporting services that do not directly expose a user-facing feature also have defined SLOs.
  • As a Brex engineer, I can instantly understand the health of my systems.
  • As a Brex employee, I can instantly understand if our product is operating as expected, and if not, what areas are experiencing issues.
  • As a Brex customer, I can access a status dashboard that clearly describes the current status of Brex products.

Why SLI, SLO and SLA matter

Business Impact

Capturing SLOs across our main Critical User Journeys will allow us to tie reliability and availability to a dollar value. As our business grows, availability that stays flat will have a growing business impact.

Let’s take a look at a simple example: an online ecommerce business, and the reliability of the payment flow that runs once a customer has put items in their cart and is ready to pay for their order. The SLO is defined as 99% of initiated payments being successfully processed. If we don’t process the payment, the customer gets an error and, for the purpose of this example, we assume they give up on their order.

  • Initially, we process around $1M of orders every 30 days.
  • For the purpose of the exercise, we assume they are distributed evenly over those 30 days.
  • We expect to fail to process 1% of them, so that’s $10,000 of lost revenue every month.
  • It may not be worth dedicating engineering resources to “add a 9”.
  • Now, if we were to grow our GMV to $10M a month we would then be losing $100,000 in revenue every month.
  • Probably time to dedicate or hire engineers for this problem!
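The example above can be sketched in a few lines. The function name `monthly_revenue_at_risk` is hypothetical, and the sketch keeps the memo's simplifying assumptions: every failed payment is an abandoned order, and order volume is spread evenly over the period.

```python
def monthly_revenue_at_risk(monthly_volume_usd: float, slo_target: float) -> float:
    """Revenue expected to be lost per month if the service exactly meets its SLO.

    Assumes each failed payment is a lost order (no retries) and that the
    failure rate equals the full error budget, 1 - slo_target.
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be in [0, 1]")
    # Round to cents to hide floating-point noise in the subtraction.
    return round(monthly_volume_usd * (1.0 - slo_target), 2)


# Same 99% SLO, two different business sizes.
for volume in (1_000_000, 10_000_000):
    print(volume, monthly_revenue_at_risk(volume, 0.99))
```

Running the same function with a 99.9% target shows what "adding a 9" would recover at each business size, which is the number to weigh against the engineering cost.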

The part that is not captured in this example is the customer trust that we erode every time we decline a transaction inaccurately. This is harder to quantify, but will definitely have an amplified impact as we expand our customer base.

Tactical impact

The goal of implementing SLOs is to make informed decisions about what systems we invest engineering resources in, and how.

Because we can tie a specific SLO to a business value, Product Managers and Engineering Managers who are planning projects will know how many resources to allocate to tech debt and reliability improvements, and how many to dedicate to feature development.

This will also be key to distinguishing High Reliability and High Velocity services clearly and with data:

  • High Reliability services require a high SLO (>=99%)
  • High Velocity services are OK with a low SLO (<99%)

During the design phase, SLOs and SLAs are key to understanding the availability implications of depending on external services (AWS RDS, AWS DynamoDB, Amazon MSK) as well as internal services (Events Infrastructure, other dependencies). SLOs clearly define the assumptions engineers can make when depending on other systems. When every dependency publishes an SLO or SLA, it becomes much easier to decide which ones a new service can rely on, given the level of availability it must achieve.
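As a rough illustration of why dependency SLOs matter at design time, the sketch below estimates the best availability a service can offer when it needs all of its dependencies to serve a request. The function name and the availability figures are hypothetical (they are not the published SLAs of any AWS product), and the model assumes hard, serial dependencies that fail independently, which is a simplification.

```python
from math import prod
from typing import Mapping


def composite_availability(own: float, dependencies: Mapping[str, float]) -> float:
    """Upper bound on availability when every dependency is required.

    Assumes failures are independent, so availabilities multiply.
    """
    return own * prod(dependencies.values())


# Illustrative numbers only.
deps = {"database": 0.9995, "event_bus": 0.999}
ceiling = composite_availability(0.999, deps)
print(f"{ceiling:.4%}")
```

Under these assumptions the result is always lower than the weakest single component, so a service cannot promise a higher SLO than its required dependencies allow without adding redundancy, caching, or graceful degradation.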

Operational impact

“It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors” — Google SRE

Traditionally, alerts and customer reports are used to understand whether services are functioning or not. While this helps detect failures, it does not indicate whether we are serving our customers well over time. Alerts help us mitigate risks, not measure them.

Properly defined SLOs will increase developer velocity by increasing engineers’ confidence in releasing new changes and in interacting with other services, for the following reasons:

  • Clearly defined SLOs will make sure that systems have well defined capabilities and track the performance of those.
  • SLOs help to ensure that performance of the underlying system stays consistent as the system evolves.
  • SLOs are similar to testing, but at runtime in production: they measure whether our Critical User Journeys are behaving as intended.
  • SLO-derived alerts quickly and accurately capture user-visible issues.
  • By setting two levels of SLOs (one external and one, tighter, internal) we can prevent issues from becoming user-visible.
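The two-level idea in the last bullet can be sketched as follows. All names here (`SloStatus`, `success_ratio`, `evaluate`) and the thresholds are hypothetical; the sketch assumes a request-based SLI (the share of successful requests) and treats a window with no traffic as healthy.

```python
from enum import Enum


class SloStatus(Enum):
    HEALTHY = "healthy"
    INTERNAL_BREACH = "internal_breach"  # early warning, not yet user-facing
    EXTERNAL_BREACH = "external_breach"  # the published target is violated


def success_ratio(good_requests: int, total_requests: int) -> float:
    """Request-based SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return good_requests / total_requests


def evaluate(sli: float, internal_target: float, external_target: float) -> SloStatus:
    """Compare an SLI against a tighter internal and a looser external SLO."""
    if internal_target < external_target:
        raise ValueError("the internal target must be at least as strict")
    if sli < external_target:
        return SloStatus.EXTERNAL_BREACH
    if sli < internal_target:
        return SloStatus.INTERNAL_BREACH
    return SloStatus.HEALTHY


# 99.6% success sits between the external (99.5%) and internal (99.9%) targets.
print(evaluate(success_ratio(9_960, 10_000), 0.999, 0.995))
```

Alerting on `INTERNAL_BREACH` gives the owning team time to react while the externally visible objective is still being met.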

Capturing in real time how well our systems are behaving along our Critical User Journeys will allow us to provide detailed status pages to internal teams, helping them quickly understand a reported user issue.

Plan

The plan below has been modified to be more generally applicable:

In phase one, focus on one well-known high reliability team and one well-known high velocity team to define and implement an initial set of SLOs. This is a great opportunity to build rapport and develop tooling and documentation to help subsequent teams.

Remember to add an SLI/SLO section to your design doc templates to create self-reinforcing processes!

In phase two, identify a subset of services/features to have SLOs defined and work with management to prioritize this work. The output of phase one should help other teams self-serve this work. This will continue to increase the organization’s knowledge and practice, slowly building the culture.

In phase three, require all services to have SLOs defined before launching to production.

Alternatively, a forcing function such as a Service Scorecard can be used.

Why now?

SLOs are a cultural shift. They are about understanding our systems from our users’ perspective and ensuring that what we build provides the quality of service our users deserve. They are about measuring and tracking what matters.

All teams have an understanding of what their services do; SLOs are about formalizing that understanding. It is easy to start and hard to get right at first, and it requires iterating until we understand what to track and how, and how an SLI relates to the quality of service our users perceive.

As such, the earlier we start defining and using SLOs the better.

An SLO-th (Image by Minke Wink from Pixabay)

That’s it! A version of this memo was used to successfully gain buy-in and schedule time for creating SLOs across the engineering organization. In the following posts, we’ll talk about the progress, obstacles, and learnings of our SLO journey.

Key takeaways

  • Use the language Directors care about: be business centric.
  • Explain the value of SLOs at different levels: strategic, tactical and operational.
  • Ensure you are able to provide close support to early adopters, and go for a gradual rollout with a few pilot teams initially. Be customer centric.
  • Rolling out SLOs sooner rather than later will allow a culture of reliability and risk management to slowly take root, and will build the foundation for successfully managing massive growth of your customer base.

SLOs are a powerful tool to increase your engineering organization’s maturity in terms of reliability and risk management: use them!

Are you passionate about SLOs? Or interested in building reliable products in a fast-paced environment? Come join us at Brex!
