SRE Strategy

Jamie Allen
Site Reliability Engineering Leadership
5 min read · Jan 27, 2021

I received an interesting question via LinkedIn about SRE charters last month, and I thought it was worth writing about. In this particular case, the reader was interested in what a new team should look like, and how to define an SRE Strategy document. If you’re interested only in the second part, scroll past this section.

How do I start my SRE team?

To answer the first part, the unsurprising answer is that it depends on the needs of the organization. If your immediate priority is to solve a major systems problem affecting your reliability, you need a specific kind of visionary engineer to effect a solution. Hiring someone who is good at fixing things may help address the problem in the short term, but it won’t necessarily lead to the long-term outcomes that you are seeking.

As an example, imagine you are a very large company deploying many images across a massive global footprint every day via continuous deployment. The images may be sizable, and every time you make an update, you have to stage and deploy those images to all relevant hosts around the world. This has a cost in network traffic, and may become infeasible above a certain scale. A visionary could look at this problem and see the opportunity to leverage something like Btrfs to download only image diffs, thus reducing the network traffic and latency of deployments. This is how Facebook is tackling the problem.

If, on the other hand, you need someone who is a wizard at tackling the holes in your team’s observability and finding new ways to reduce your team’s MTTR, you will want a different archetype, an observer. If you need someone who is good at plugging holes in the system in real time so you can devise longer-term fixes, you need a fixer. And if you need someone who can manage the communications between your software engineers and systems administrators, you need a facilitator.

Once you’ve tackled that immediate issue, follow the steps in my blog about adopting SRE. But always remember to be data-driven in your discussions about how to proceed and prioritize.

SRE Strategy

This question interested me a great deal, because a lot of people think about SRE not as an organizational strategy, but as a tactical approach to address an issue. The first thing an organization has to think about is how SRE maps to its own processes and governance, and how to reconstruct the Google SRE model to fit its needs.

MTTF/MTTR historical tracking, SLIs/SLOs and observability, and Toil Budgets are all table stakes, the bare minimum you need to be doing to be an SRE organization. From there, you need to make decisions about how you will operate as a team:

  • Will you have SLAs, and if so, what is the purpose of them to your users? What happens if you break them?
  • Will you adopt Error Budgets and automatically shift your team towards reliability work (increasing MTTF) when an SLO is broken, or will you make it a decision point for product and engineering to collectively decide whether that is the right call? If you do the latter, is there another threshold at which reliability must be the focus, regardless of what features have been promised by some specific date? If you do adopt them, will you take unused Error Budget time for a month and use it to take yourself out of production intentionally, as Google recommends?
  • Will your organization adopt a percentage of time that your SREs must spend on coding work as opposed to reliability work, and if so, what will that percentage be? How will you measure it, and what will you do if that commitment to the engineers is not being met?
  • Will you adopt Chaos Engineering? If so, will you do it in test environments, or in production? Will you run it all of the time, or only when the team is running a drill?
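To make the Error Budget question above concrete, here is a minimal sketch of a two-threshold policy, written in Python. It assumes a request-based SLI (successful requests divided by total requests over a window). The names `ErrorBudgetPolicy`, `budget_spent`, and `decide`, and the threshold values, are hypothetical and chosen for illustration only; your own strategy document would define the real numbers and outcomes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetPolicy:
    # All values here are illustrative assumptions, not recommendations.
    slo_target: float               # e.g. 0.999 means 99.9% of requests must succeed
    review_threshold: float = 1.0   # share of budget spent that triggers a joint decision
    freeze_threshold: float = 1.5   # share spent at which reliability work is mandatory


def budget_spent(policy: ErrorBudgetPolicy, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget consumed in the window."""
    if total_requests <= 0:
        return 0.0
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - policy.slo_target) * total_requests
    if allowed_failures <= 0:
        # A 100% SLO leaves no budget: any failure overspends it.
        return float("inf") if failed_requests > 0 else 0.0
    return failed_requests / allowed_failures


def decide(policy: ErrorBudgetPolicy, total_requests: int, failed_requests: int) -> str:
    """Map budget consumption to one of three hypothetical outcomes."""
    spent = budget_spent(policy, total_requests, failed_requests)
    if spent >= policy.freeze_threshold:
        return "reliability-only"   # the hard threshold: features wait
    if spent >= policy.review_threshold:
        return "joint-review"       # product and engineering decide together
    return "feature-work"


if __name__ == "__main__":
    policy = ErrorBudgetPolicy(slo_target=0.999)
    for failures in (500, 1200, 2000):
        print(failures, decide(policy, 1_000_000, failures))
```

The point of the second threshold is that the collective decision has a ceiling: below it, product and engineering negotiate; above it, the document has already decided for them.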

Your SRE Strategy Must Be Revisited

Any well-run organization should have strategic goals and OKRs that define how you will conduct business in the next 3–5 years, and from that, a Digital Strategy needs to be defined. Out of that Digital Strategy, several offshoots should be clarified to meet it, such as the Data Strategy and Cloud Strategy. There should also be an SRE Strategy.

Here’s an example of what I’m talking about. Imagine an airline has a rewards/miles program that doesn’t allow customers to redeem on partner airlines. This is a major gap compared to their competitors, and given the measurable value in customer loyalty of a strong Rewards program, the airline crafts a business strategy for implementing a revised Rewards program that supports partner redemptions within 3 years. This sets in motion multiple other parts of the organization to meet that goal, such as marketing, product, and technology. Together, they devise a Digital Strategy to meet that organizational directive, which will define the who, what, where and how of that effort (why and when are knowns).

  • Who: Which specific teams will be tasked (or formed) to do specific aspects of this work?
  • What: What work do we envision, such as new services and integrations, to support this effort?
  • Where: How will we deploy this, and for what markets?
  • How: Will we need new hosts in our data center, or will we build this in the cloud?

Based on those decisions, the organization can begin crafting a Data Strategy and a Cloud Strategy: what cloud provider(s) we will use, how much we will leverage managed services versus deploying our own platform of tools, etc.

Similarly, the SRE group should think about how meeting the Digital Strategy affects them — does it define requirements that impact the operational profile of their services? Does it require that services are deployed in new environments that will require additional tooling support to fit within their existing tool chain? Are programming language and database decisions being made that will require new expertise on the team? The SRE Strategy should reflect those new needs.

Every organization should be intentional about the ways they build products and go to market. The same is true for the engineering groups that support those efforts, and SRE is no different. Have well-understood documents/wikis that explain how you will approach SRE, so that everyone on your team knows what to do in any situation they come across. And if that isn’t true when a situation arises, use that as an opportunity to further augment your strategy. This empowers us as engineers to make decisions on the fly, in the heat of the moment, without having to ask leadership or peers what we should do.

Jamie Allen
Site Reliability Engineering Leadership

SRE CTO. Ex-Software engineering leader behind Starbucks Rewards and MOP. Ex-Facebook SRE leader.