Driving Business Value Through SRE

Jamie Allen
Site Reliability Engineering Leadership
Apr 22, 2022

I’ve had some experiences lately that made me think about what we’re trying to achieve as SRE leaders. In this space, it is very easy to be consumed by the concept of availability: we become laser-focused on improving the SLIs that define availability for a service, and on the MTTF/MTTR (Mean Time to Failure/Mean Time to Repair) metrics that reflect our trending improvement and the ROI of our SRE efforts. And that’s good, in that we need to keep a vigilant eye on these critical success criteria and KPIs.

However, we need to keep the big picture in mind. It doesn’t matter what our service is doing if the customer experience is negatively affected. At the end of the day, nothing is more important than that. If our service is available but, for whatever reason, the customer cannot use it, that availability doesn’t matter. I’ve discussed this before when I’ve said that uptime doesn’t matter.

As SRE leaders, we are not delivering value to our organizations if our customers are not successfully using our systems for the capabilities and experiences that drive business value. We can have services that are available per our SLIs (Service Level Indicators), but we have to extend our viewpoint outside of the technical sphere and focus on reliably enabling critical business experiences. We need to partner with our business stakeholders and regularly discuss how well we as an organization are achieving our business goals. We must endeavor to improve the experience of our most external users first, and prioritize our efforts through that lens.

We should already be doing this in the definition of our SLIs and SLOs (Service Level Objectives), to some extent. SREs and software engineers delivering new features should be working with our Product teams to align on what those metrics should be, and what achievable expectations for them are based on our previous history. What I’m talking about is at an even higher level, where we are partnering with our organizational leaders and executive stakeholders to understand the strategies and capabilities they are enabling to grow the business. Only in that way can we truly measure the success of our efforts, and the value that they drive for the organization. We need to think about the big picture, and align our efforts to those of the entire company.
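To make the SLI/SLO conversation with Product concrete, here is a minimal Python sketch of how a team might check a proposed SLO against its own history. The `Window` shape, the function names, and the numbers are all hypothetical and purely for illustration; what counts as a "good" request is exactly the thing to agree on with Product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts for one measurement window (hypothetical data shape)."""
    good: int   # requests where the customer got what they came for
    total: int  # all requests the customer attempted


def sli(windows: list[Window]) -> float:
    """SLI: fraction of good requests across all windows."""
    total = sum(w.total for w in windows)
    if total == 0:
        raise ValueError("no traffic in the given windows")
    return sum(w.good for w in windows) / total


def error_budget_remaining(windows: list[Window], slo_target: float) -> float:
    """Share of the error budget left unspent; negative means overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total = sum(w.total for w in windows)
    if total == 0:
        raise ValueError("no traffic in the given windows")
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = sum(w.total - w.good for w in windows)
    return 1.0 - actual_bad / allowed_bad


# Illustrative history only: three windows of 10,000 requests each.
history = [Window(9990, 10000), Window(9950, 10000), Window(9995, 10000)]

achieved = sli(history)
budget_at_995 = error_budget_remaining(history, 0.995)
budget_at_999 = error_budget_remaining(history, 0.999)
```

With this made-up history, a 99.5% target leaves budget to spare while a 99.9% target is already overspent, which is the kind of "achievable expectation" the history should inform. None of this says whether the metric reflects an experience the business actually cares about; that still takes the conversation.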

I’m fortunate to be able to talk to people who have very diverse viewpoints about what SRE is. While I am a strong believer in the Google books and processes, I also acknowledge that others have different perspectives that can positively influence my own. Not every organization wants to be (or can be) Google, and we should craft SRE charters and strategies that meet each organization’s needs without being dogmatic. There are many approaches people can take based on their organization’s priorities, and SRE is a flexible enough concept that we can apply it in a way that makes sense for each of them.

One area where I frequently find myself doing this is in the terminology itself. I see tremendous value in using a common language for SRE so that we all clearly understand what we’re talking about, similar to a Ubiquitous Language in Domain-Driven Design. In the domain of SRE, Google has laid out a well-defined taxonomy for our space. But I care far less about whether someone uses the exact terms than about whether they follow the best practices those terms embody. As an example, if a team does not have SREs working with them, and they’re getting lost in the terminology around the concepts I want to introduce, I’m perfectly fine calling SLIs “core metrics” for a service, so long as they reflect 2–3 user/customer-relevant monitored values that show how well we are serving those users. We can follow the principles and build the SRE capabilities without using the explicit terminology, because it only matters to me that we drive better outcomes.

My point is, let’s focus on driving positive business outcomes rather than on the minutiae of what we’re trying to do. In this way, we drive more business value for our organizations and improve our ability to deliver services that support it.

Jamie Allen
Site Reliability Engineering Leadership

SRE CTO. Ex-Software engineering leader behind Starbucks Rewards and MOP. Ex-Facebook SRE leader.