How to Implement SRE In Your Organization

Jamie Allen

Published in Site Reliability Engineering Leadership

10 min read · May 12, 2020

Updated February 15, 2023

TL;DR

You can implement a lot of the value of SRE without SREs by following these steps:

In the first quarter, ensure you have captured MTTF/MTTR metrics by service/application and roll them up to the organization level, so you can measure progress as you embark on this effort. Track them to observe the value of your efforts: is MTTF going up, and is MTTR going down?
Next quarter, have teams create toil backlogs and prioritize the work relative to feature delivery; also create your Production Readiness Review baseline (this will evolve over time)
Next quarter, have teams identify SLIs
And in the final quarter, have teams set realistic and achievable SLOs for those SLIs
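
As a rough illustration of the first step, here is a minimal Python sketch that computes MTTF and MTTR per service from a list of incidents. The `Incident` record, its field names and the `rollup_by_service` helper are all hypothetical, and the definitions used (MTTR as average outage duration, MTTF as uptime in the window divided by the number of failures) are one common simplification rather than the only way to measure these.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; adapt to whatever your incident tracker exports.
    service: str
    started: datetime
    resolved: datetime


def mttr(incidents):
    """Mean time to repair: the average outage duration."""
    if not incidents:
        return None
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return downtime / len(incidents)


def mttf(incidents, window_start, window_end):
    """Mean time to failure: uptime in the window divided by the failure count."""
    if not incidents:
        return None
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return (window_end - window_start - downtime) / len(incidents)


def rollup_by_service(incidents, window_start, window_end):
    """Group incidents by service and report both metrics for each one."""
    grouped = defaultdict(list)
    for incident in incidents:
        grouped[incident.service].append(incident)
    return {
        service: {
            "mttf": mttf(items, window_start, window_end),
            "mttr": mttr(items),
        }
        for service, items in grouped.items()
    }
```

Running the same roll-up each quarter and comparing the results gives you the trend line: MTTF should rise and MTTR should fall.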

Within one year, you will have the structure in place to enable SREs that you bring into your teams. And when you do hire SREs, make sure you focus efforts to get the right SRE archetype to meet your priorities/requirements first.

Secondly, focus on adding SREs to the areas most important to your business. Do not try to add SREs to every team — define your p0/Tier1 customer experiences that are most critical to your business, and attach SREs to the service/application teams that support those experiences. When you have limited staff and budget, apply it where it will drive the most impact.

The Full Post

Google’s SRE site has a post on how SRE teams are organized, and how to adopt SRE. I read this post some time ago, and I found myself confused by how they structured it. I think the content is absolutely right, but I had to wrap my head around the way they presented the relationship between the approaches/models. In this post, I’ve tried to parse it into an approach that makes sense to me, and hopefully for you.

There are two primary approaches to how SRE teams are set up within organizations, which the Google article addresses in Item 5 (Embedded) and Item 6 (Consulting).

The embedded model

The Embedded model means that SRE teams are attached to a product/application/service team full-time. The product has one or more Software Engineering (SWE) managers, and one or more Site Reliability Engineering (SRE) managers, dedicated full-time to its success. This can be a winning approach for teams with a very complex implementation, where the reliability work will never quite be done. It is also a good model for teams running critical infrastructure, such as a data storage implementation, or critical underpinning infrastructure being provided as a service to other teams.

The trade-off here is that Embedded SRE teams struggle with their identity. Are they members of the SRE organization, or are they members of the product team they support? The answer is yes to both, which can mean that they never quite feel like they’re part of either. They need support from an organization-wide SRE group with whom they can share ideas and commiserate. They need their own backlog independent of that of the product’s SWE team. And they need the ability to influence the delivery of SWE product features when the Error Budget is broken.

There is an organizational trade-off, as well. If these teams are well-run, they don’t have issues attracting and retaining top SRE talent. Teams that are not well-established, well-run or well-staffed have significant issues attracting and retaining SREs. You can’t easily shift your top people to projects that need the most help because top people are dedicated to specific products, and taking them away means losing their Subject Matter Expert (SME) value. Plus, what makes a good SRE in one functional/product team may not be relevant in another. For example, someone who is an excellent SRE on a service discovery infrastructure team might not have the right skills to be as effective on a storage team.

The embedded model works well in areas where you need dedicated SREs with specific SME skills. It isn’t effective as a model for an entire company because other teams can suffer from its inflexibility.

I know these acronyms are getting confusing, so let’s recap:

SWE: Software Engineer/Engineering

SRE: Site Reliability Engineer/Engineering

SME: Subject Matter Expert

The consultative model

The Consultative model means that you have a core group of SREs who address the most important areas of the business on a priority basis. Your company creates and/or hires a pool of SREs with general SRE skills (able to identify and implement SLIs/SLOs/Error Budgets/Observability and reduce toil). You can then place them on an as-needed basis to different product/application/service teams to provide uplift to them, and then retarget them when they’ve helped that team attain certain goals in key metrics.

This has to be a data-driven discussion to work. As an organization, you have to apply key metrics for team success meaningfully. That means you have to define a maturity model that is applicable for each kind of service. For legacy services, the goal should be to reach certain key levels of maturity, and you tackle them by prioritizing the teams that are struggling the most with toil, alert fatigue and outages/incidents. As things smooth out for a team through the application of that minimum set of maturity model capabilities, you can retarget SREs to help other teams.
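
To make the prioritization concrete, here is a minimal Python sketch that ranks teams by how much they are struggling. Everything in it is an assumption for illustration: the `TeamHealth` fields, the equal weights, and the choice to normalize each metric against the worst team. Your own metrics and weights will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamHealth:
    # Hypothetical metrics standing in for toil, alert fatigue and incidents.
    team: str
    toil_hours_per_week: float
    pages_per_week: float
    incidents_per_quarter: float


# Assumed weights; tune them to reflect what hurts your organization most.
WEIGHTS = {
    "toil_hours_per_week": 1.0,
    "pages_per_week": 1.0,
    "incidents_per_quarter": 1.0,
}


def rank_by_need(teams):
    """Return teams sorted from most to least in need of SRE help."""

    def score(team):
        total = 0.0
        for field, weight in WEIGHTS.items():
            # Normalize against the worst team so metrics are comparable.
            peak = max(getattr(t, field) for t in teams)
            if peak > 0:
                total += weight * getattr(team, field) / peak
        return total

    return sorted(teams, key=score, reverse=True)
```

The ranking is only a starting point for the discussion; the teams at the top of the list are the ones to talk to first, not an automatic assignment.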

The benefit here is that you get consistency in implementation across multiple teams, and a cohesive team of SREs who have a sense of “team.” They won’t be SMEs for a specific product, but over time they will have a broader sense of how the entire organization works than those who only work in one area. That has other tangible benefits, particularly when broad architectural decisions are being reconsidered or changed. And, your company benefits by having top problem areas addressed first.

The trade-off here is that these “general purpose” SREs can struggle in very low-level, systems-oriented teams where specific SME skills matter. Also, there can be friction between SREs and SWE teams with different mandates, so having team buy-in and alignment is very important.

The Product-Centric Model

This is a concept described briefly in the TL;DR above, and is not part of the Google approach to adopting SRE. I’ve worked with some organizations that try to “boil the ocean” in their approach to SRE, wanting to hire embedded site reliability engineers for each application and service team. That becomes difficult and expensive to scale. Instead, think about the highest priority customer experiences that you are trying to drive, and dedicate embedded SREs to the service teams that support those experiences. In doing so, you are prioritizing the resilience of those systems that have the highest business value. Scale out from there over time.

Which approach is better?

None of the above is necessarily better; you want the benefits of all of them. Have SRE SMEs dedicated to critical infrastructure groups, and consultant SREs providing temporary help to teams that need assistance meeting the company baseline for operational success.

In each case, you need to define your maturity model so you have a clear expectation of what success in each area looks like. This can also be the basis of your Production Readiness Review checklist. The maturity model should not be overlooked. Just like defining what success looks like in your products, you need similar visibility into your SRE process. See below for information about defining a maturity model.

But you’ll have to find ways to make your SREs in different capacities feel like part of one cohesive SRE group. You’ll have to organize events and get-togethers that build SRE camaraderie and esprit de corps. Try to avoid having all of these events involve alcohol, as such events aren’t inclusive.

Understand your archetypes

A difficult part of this problem is that you may not be putting the right kind of SREs onto each project team. There are different kinds of SREs, and each kind has value in the right context.

Some companies have adopted archetypes to identify the value proposition of a specific SRE, such as (hat tip to Facebook Production Engineering for some of these archetypes):

The Fixer, someone who is very good at digging into the guts of a system, identifying the root cause of an issue and addressing it
The Automator, someone who loves replacing annoying people/tasks with small scripts; they tackle toil and busywork because they derive enjoyment in streamlining processes
The Facilitator, someone who is exceptionally strong at working with other teams and team members to get things done, frequently someone who builds toil backlogs working with SWEs and product owners; these individuals make good team leads
The Visionary, someone who has a long-term view of what success could be if only we did X, Y and Z

You don’t want a team that is lopsided with too many of one kind of archetype. Ideally, you’d have at least one of each. Define your archetypes, and have your SREs identify themselves and each other, so that you can make sure you don’t just have a team full of Fixers. They are worth their weight in gold when you need them, but they can’t be the basis of the entire team.

Also, never forget the importance of each role type. When nothing is going wrong, it’s easy to not remember the value of having a Fixer around, because they might not also be very good at automation, addressing toil, etc. If they do not feel valued, they may leave your team/company, and then your team may struggle if an incident occurs afterward.

What should be in my organization’s SRE maturity model?

That will vary across organizations. Start by addressing common areas of concern, such as:

Security
Observability
Availability
Release Engineering
Capacity Planning
Incident Management

Add areas that are specific to your business, or areas that are specific to a kind of team (such as storage). Ask your teams to contribute to the model, and then to prioritize these concerns, as well as the items inside of them.
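
One way to make the model usable as a checklist is to write it down as data. The Python sketch below uses the areas listed above; the individual checklist items are invented placeholders, and the `area_scores` and `gaps` helpers are hypothetical names, so treat this as a starting shape rather than a recommended model.

```python
# Areas come from the list above; the items under each are illustrative only.
MATURITY_MODEL = {
    "Security": ["secrets stored in a vault", "inter-service traffic encrypted"],
    "Observability": ["SLIs defined", "dashboards exist"],
    "Availability": ["SLOs published"],
    "Release Engineering": ["automated rollback"],
    "Capacity Planning": ["load test within last year"],
    "Incident Management": ["on-call rotation staffed", "blameless postmortems"],
}


def area_scores(completed):
    """Fraction of checklist items a service has met, per area."""
    return {
        area: sum(item in completed for item in items) / len(items)
        for area, items in MATURITY_MODEL.items()
    }


def gaps(completed):
    """Outstanding items per area, omitting areas that are fully met."""
    outstanding = {}
    for area, items in MATURITY_MODEL.items():
        missing = [item for item in items if item not in completed]
        if missing:
            outstanding[area] = missing
    return outstanding
```

Passing in the set of items a service has completed yields both a per-area score and the list of work still to do, which can feed directly into that team’s backlog.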

Should teams be forced to adopt SRE principles?

This is a tough one, because many feel that SREs should only be applied to teams that have asked to be onboarded. For organizations that have very high technical skill like Google, this approach can work. But if your organization isn’t like a Google, this can be a tougher sell. In these cases, the top-down approach is fine. Do it with empathy, as it will be a transition for your teams. And don’t try to do it all at once. See the “How do I adopt SRE principles if I can’t hire SREs?” section for how to stage it into your organization.

Should SRE teams be able to disengage from teams?

Absolutely. It is possible that some product teams won’t work with SRE teams in good faith. SRE teams need to be respected and valued by the teams they partner with, and if that’s not happening, don’t force it. Be particularly cognizant of why this is happening: is it because the product team won’t listen to the SREs, or because the SRE team isn’t listening to the product team, or (most likely) both? It could be a clash of personalities, which happens sometimes. Don’t try to force the issue; pull back the SRE team, and work with the SWEs to help them understand what the organization is trying to achieve and how SREs get them there. Using your maturity model and Production Readiness Review should help with this, as they will see there are things that need to be done that SREs can help them with.

How do I adopt SRE principles if I can’t hire SREs?

This is a tough spot to be in, as SREs are expensive and all of the top-paying companies are vying for their talents. How can a traditional corporation adopt the SRE model?

In these cases, you don’t have to hire SRE teams to get the value and benefit of SRE practices. Start by having your teams take one step in a measured timeframe. For example, as part of your teams’ backlogs, task them to identify their top SLIs during a quarter, and then have them define SLOs the following quarter. In the third quarter, have all of them report on how well they’re meeting their Error Budgets in a cohesive way that gives visibility to your organization. This last one may take more than a quarter, as it may require new infrastructure components for logging/counters/tracing/observability to be implemented, but try to continue the quarterly momentum by establishing clear goals for each quarter. Within a year, you should have measurable visibility into your organization’s reliability and service health.

If you aren’t meeting your quarterly targets, or if the adoption across your organization is uneven, try to find out what is going wrong and address it. Is it because they’re struggling to identify SLIs? Do they not know how to calculate SLOs? Are they being too optimistic or pessimistic?
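
For teams unsure how SLIs, SLOs and Error Budgets fit together, a small worked example can help. The Python sketch below assumes a request-based availability SLI (good requests divided by total requests); the function names and the sample numbers are made up for illustration.

```python
def availability_sli(good_requests, total_requests):
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # No traffic, so nothing failed.
    return good_requests / total_requests


def error_budget_remaining(good_requests, total_requests, slo):
    """Fraction of the Error Budget left; negative means it is overspent.

    The budget is the number of failures the SLO tolerates:
    (1 - slo) * total_requests.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 1.0  # No traffic in the window, so no budget was spent.
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


# Example: 1,000,000 requests with a 99.9% SLO allow about 1,000 failures.
# With 250 failures, roughly 75% of the budget remains.
remaining = error_budget_remaining(999_750, 1_000_000, slo=0.999)
```

An SLO that leaves the budget permanently untouched is probably too pessimistic, and one that is overspent every period is probably too optimistic, which gives teams a concrete way to check the targets they have set.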

Continue to refine by building and measuring your teams with an SRE maturity model, as discussed above. Use that model as a Production Readiness Review for new products being stood up, and as a regular annual checkpoint for legacy services already in production.

What if my organization is “bottom up”?

I have worked in companies that claimed they were bottom up. What I typically saw in “bottom up” companies is that while engineers were empowered to define their own backlog and work, if they didn’t do things that the organization saw as “impactful,” they got poor reviews. Managers of those teams were expected to “influence” their teams to do what the organization wanted them to do. Even so, there were occasional top-down mandates, because there absolutely have to be when something important has to be done across the organization (for example, a mandate that every inter-service communication has to be encrypted by a certain date).

If your organization works this way, make reliability part of your evaluation criteria. By that, I don’t mean track the number of incidents and reward those with the fewest. But make it part of your evaluation criteria that teams show measurable improvement in an apples-to-apples way. And, while staying blameless, incentivize top SREs to help the teams that need the most assistance by rewarding high-impact delivery. This can’t be lip service.

And for teams that are struggling the most, have empathy for them. Don’t consider them failures because they can’t seem to improve, as there may be organizational reasons why they can’t. For example, what if their service/product is being abused by certain users, which is making their availability metrics look poor? Find creative ways to help them, and listen to them when they are trying to communicate their pain. Getting frustrated with them won’t solve the issue.

Conclusion

Hopefully, this post has given you meaningful guidance about how to implement SRE in your organization. Please let me know if you have any feedback or additional ideas.