Supporting Data Driven Change With SLOs

Jonathan Funk
6 min read · Nov 23, 2022

In Support of Change

Wherever you find yourself in the digital landscape, whether as a developer, site reliability engineer (SRE), or product owner (PO), you have likely encountered digital transformation initiatives in some form. If I could give my younger self some advice, it would be to center those transformation conversations around Service Level Indicators (SLIs) and Service Level Objectives (SLOs) when discussing the whys and hows.

Transformation initiatives could involve a process or workflow change on a team, a switch of SaaS tooling, a critical change to the underlying infrastructure powering your workloads, and so on. Change can be challenging, and that certainly holds true when replacing a technology that a team depends on, or one that is currently generating revenue or value for the organization. It’s a balancing act: maintaining service quality and reliability for your users while bringing in the new technology and ensuring its adoption is a success. After all, that success, and its continuation, is why you’re going to all this effort in the first place.

I believe that both SLIs and SLOs can add the necessary data to help drive change, and improve the long-term success of those changes. Based on a personal experience of mine, let’s use a simple team-level change as an example of how SLI and SLO data can help improve software deployment pipelines.

Identifying the Need for Change

Let’s say our highly productive agile team produces web apps consumed by external users. The team consists of developers, SREs, and a product owner. The team closely follows agile principles, regularly delivering high-value features to end users in sprints. We have containerized development environments and highly available deployments, and we leverage continuous integration and continuous deployment (CI/CD) tooling and practices.

As an SRE on the team, something I do regularly is interview team members and listen to their perspectives on their workflows:

  • What are their pain points?
  • Where does velocity feel slow?
  • What does ‘good’ look like?

Even though process automation and reliability are typically not visible to end users, there’s an implicit user experience behind them, and being in tune with that experience is a key aspect of SRE work.

While interviewing developers on the team, I noticed a recurring theme: CI/CD is too slow to provide feedback when developers integrate changes, which decreases feature velocity and ultimately means fewer features shipped. Based on those conversations and a look at the pipeline logs, we identified that feature branch build deployments to development infrastructure (pre-staging) can take up to 50 minutes, meaning developers wait up to 50 minutes to learn whether their changes work with the rest of the system. This has cascading effects, such as features stacking up at the end of the sprint and production releases becoming less frequent and riskier.

Figure 1: Old CI/CD

Note for the reader: I’m using GitHub Actions to run these example pipelines. The mention of slow pipelines is not representative of Actions as a product; these are purely examples.
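If you wanted to confirm those run times programmatically rather than eyeballing the logs, a minimal sketch like the one below could pull recent runs from the GitHub Actions REST API and print how long each took. The repository name and token handling are hypothetical placeholders, not part of the team’s actual setup.

```python
import os
from datetime import datetime

import requests

# Hypothetical repository; substitute your own owner/repo and a token with read access.
REPO = "my-org/web-app"
API_URL = f"https://api.github.com/repos/{REPO}/actions/runs"
HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
}


def duration_minutes(run: dict) -> float:
    """Approximate a run's duration as updated_at minus run_started_at."""
    started = datetime.fromisoformat(run["run_started_at"].replace("Z", "+00:00"))
    finished = datetime.fromisoformat(run["updated_at"].replace("Z", "+00:00"))
    return (finished - started).total_seconds() / 60


# One page of the most recent completed runs is enough for a quick spot check.
response = requests.get(
    API_URL, headers=HEADERS, params={"status": "completed", "per_page": 50}
)
response.raise_for_status()

for run in response.json()["workflow_runs"]:
    branch = run["head_branch"] or "(unknown branch)"
    print(f"{branch:<30} {duration_minutes(run):6.1f} min")
```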

Create a Proof-Of-Concept

Now that we’ve identified a problem affecting the team, we need to determine the ‘how’, which typically begins with a proof of concept (POC) of the new solution design. For this POC we’ll create a new CI/CD design and compare it with the old pipeline. CI/CD can obviously get quite complex, so we’ll keep things simple; in practice, the new design could involve changes such as larger compute resources, parallel builds, caching, a different workload provider, and so on.

Figure 2: New CI/CD POC

But how are we going to compare the two designs and relay the outcomes to the PO in a clear manner (without relying on screenshots of terminal output or log files)? That’s where SLIs and SLOs come in! We’re going to start tracking metrics on both pipelines and measure that data over time to share with our PO. This adds the much-needed data for deciding whether the experiment is working and worth investing more time in.

Put An SLO On It

SLIs are like the speedometer in a car, measuring our speed. The SLO sets the speed limit, and the error budget determines how often we’re allowed to exceed that limit without getting in trouble.

In technical terms, an SLI pulls data from an API and stores it as a time series, which we then measure against an SLO. We’ll pick a simple SLI for our pipelines: the average run time over the last 15 days. The RunWhen platform already has open-source SLIs for pulling this data, so we’ll deploy those like so:

Figure 3: Open source SLIs
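Those SLIs are open source, so I won’t reproduce them here, but conceptually the metric boils down to something like the sketch below: given a time series of run completion times and durations, report the average run time over the trailing 15 days. The sample data and function name are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

# Illustrative time series: (run completion time, duration in minutes).
samples = [
    (now - timedelta(days=13), 48.2),
    (now - timedelta(days=7), 51.7),
    (now - timedelta(days=2), 49.9),
]


def average_runtime(samples, window_days=15, as_of=None):
    """SLI: average pipeline run time (in minutes) over the trailing window."""
    as_of = as_of or datetime.now(timezone.utc)
    cutoff = as_of - timedelta(days=window_days)
    recent = [duration for finished_at, duration in samples if finished_at >= cutoff]
    return sum(recent) / len(recent) if recent else None


print(f"15-day average run time: {average_runtime(samples):.1f} min")
```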

With our SLIs now pulling metric data from our pipelines, we can attach an SLO to them! But how do we define and calculate an SLO? We need to consider how long a developer is willing to wait for the CI/CD process to complete; in this example the ideal target completion time is 3 minutes, which will be our threshold when configuring the SLO.

Figure 4: Configurable Multi-Window Multi-Burn SLO

When defining our SLO we configure the time range, the threshold, and the objective, which together produce an error budget to go along with the SLO. The time range is how many days of history we want to consider for metrics. The threshold is our “speed limit”, and the objective percentage determines how much of the time we need to be within that threshold. You’ve probably heard of services having a number of 9’s to represent their reliability, where three 9’s is 99.9%. In this case an objective of 99.9% allocates an error budget of 0.1% over a 30-day month, or roughly 43 minutes during which we can be outside the threshold.
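As a quick sanity check on that arithmetic, here is a tiny sketch that turns an objective and a time window into an error budget; the objectives listed are just examples.

```python
def error_budget_minutes(objective_pct: float, window_days: int) -> float:
    """Minutes within the window that are allowed to fall outside the SLO threshold."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - objective_pct / 100)


# 99.9% over 30 days leaves about 43 minutes of budget; the 90% objective
# we end up using for the pipelines leaves roughly three days.
for objective in (99.9, 99.0, 90.0):
    print(f"{objective}% over 30 days -> {error_budget_minutes(objective, 30):.1f} min of budget")
```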

So, what values do we pick for these SLOs?

Well, just like code, SLOs can be iterated on until you find the right fit. It’s typically suggested to start loose and tighten them as you get a feel for the services they cover. In this scenario, we’ll assume I’ve agreed on the SLOs with my PO and team: the pipeline’s average runtime for the last 15 days should be within 3 minutes, 90% of the time.

Now that we’ve got SLOs on our pipeline metrics, we can finish off our POC by presenting this data to the team. We’re now getting the team involved with Social Reliability Engineering! We can compare how the pipelines are performing and decide whether the new design warrants budgeting for further implementation. Providing these visual aids and factual data gives our PO added confidence in the POC results and gets them involved in SRE work that would typically be relegated to a set of terminal screenshots.

Figure 5: SLOs on our SLIs
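To make the comparison behind Figure 5 concrete, here is a small sketch of the evaluation: for each pipeline, take its SLI samples (the 15-day average runtime measured at regular intervals), compute the fraction that fall within the 3-minute threshold, and check that fraction against the 90% objective. The numbers are made up for illustration.

```python
THRESHOLD_MINUTES = 3.0  # agreed target completion time
OBJECTIVE = 0.90         # fraction of samples that must fall within the threshold

# Illustrative SLI samples (15-day average runtime, in minutes), one per measurement interval.
pipelines = {
    "old CI/CD": [49.8, 50.3, 48.7, 51.2, 50.0, 49.5, 50.8, 49.1, 50.4, 49.7],
    "new CI/CD POC": [2.7, 2.9, 2.8, 3.2, 2.6, 2.8, 2.5, 2.9, 2.7, 2.8],
}

for name, samples in pipelines.items():
    within = sum(1 for s in samples if s <= THRESHOLD_MINUTES) / len(samples)
    verdict = "meets" if within >= OBJECTIVE else "misses"
    print(
        f"{name}: {within:.0%} of samples within {THRESHOLD_MINUTES:.0f} min "
        f"-> {verdict} the {OBJECTIVE:.0%} objective"
    )
```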

And there you have it: we’ve supported our POC and changes with factual data through SLIs and SLOs! By getting the team involved in our SRE work through visualizations, SLIs, error budgets, and collaboration on SLO definitions, they can be more confident in the results of the POC, which makes it more likely to move forward and receive budget. And because we started SLIs and SLOs during the POC, the implementation that follows carries a reliability-first mindset, with metrics throughout its lifecycle so the team can see how it’s performing. Ultimately, SLOs have helped us create a more successful POC that promotes data-driven change.
