SRE Governance

Jamie Allen
Site Reliability Engineering Leadership
11 min read · Oct 7, 2020

It’s hard to imagine a major organization deploying production databases without formal rules and standards for how that must be done. Governance is the set of processes we put in place to make sure we are doing the right things at the right time, and to prevent anyone from causing issues by circumventing them.

SREs are a unique group in the world of computers/systems/software engineering, in that we aren’t always held to the same standard of governance that you would expect for other areas. I’m unsure how many companies that implement SRE approaches and principles put methodologies in place to maximize the value and ROI of having done so. The one area of governance in SRE that is prevalent is the idea of a Production Readiness Review (PRR), where software to be deployed into production is reviewed to ensure that it meets the “must-have” criteria defined for that organization. But what other areas of governance should exist?

There is no standard answer to this question to help leaders measure the impact of SREs and ensure they’re doing the right things, but here are some activities (including PRR information) that I recommend to get you started.

SRE Center of Enablement (CoE)

SRE leaders in an organization should identify key leaders with the engineering skill and collaboration EQ to form a group responsible for ensuring that alignment exists across the enterprise. Some organizations already have groups like this, and may call it a Center of Excellence or something similar. These individuals will lead the effort to ensure that teams adopt standardized approaches to SRE, so it is important that the group comprises a diverse set of voices across monitoring, automation, capacity planning, incident management, and program management (yes, program management, so that someone can track and report on the activities and outcomes of the group). This group will engage in activities such as the ones I identify below:

SLO Review

  • Who is reviewed: Any team that owns a service/application/product/platform in production
  • Who reviews them: SRE Center of Enablement members and senior engineering leaders from each team
  • When: Every six months

As organizations adopt more and more SRE principles, it’s important that they ensure that similar services adopt similar metrics for apples-to-apples comparisons. SLO Review will ensure that knowledge about why specific SLOs were adopted is shared among teams, and will inform the decisions of other teams with similar considerations.

It is important for teams to make sure that the SLO discussion is not ignored or considered low priority just because most services are request/response-based and have similar SLOs. Think about the tiering of services by Tier Zero and higher, where the priority should influence the values of these SLOs and the importance of meeting them.
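To make the tiering point concrete, here is a minimal sketch of how an availability SLO translates into a downtime budget. The tier numbers and SLO targets in `TIER_SLO_TARGETS` are illustrative assumptions of mine, not values prescribed anywhere; substitute whatever your SLO Review agrees on.

```python
# Hypothetical tier-to-SLO mapping; these targets are illustrative assumptions only.
TIER_SLO_TARGETS = {0: 0.9999, 1: 0.999, 2: 0.99}


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)


for tier, target in sorted(TIER_SLO_TARGETS.items()):
    budget = allowed_downtime_minutes(target)
    print(f"Tier {tier}: SLO {target:.2%} allows about {budget:.1f} minutes down per 30 days")
```

Seeing the budgets side by side tends to make the SLO conversation less abstract: a tighter target for a Tier Zero service leaves far less room for error than the same request/response service at a lower tier.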

Production Readiness Review (PRR)

  • Who is reviewed: Any team that owns a service/application/product/platform in production
  • Who reviews them: SRE Center of Enablement members and senior engineering/architecture leaders
  • When: Every two years (at least that often)

Let’s start with what is known and standardized. The Google SRE book is prescriptive about having these, and even gives a few different kinds of PRRs that represent different stages of maturity or the role of the software being deployed. A PRR should contain the base information that all SREs care about, such as SLIs, SLOs, SLAs, the Four Golden Signals, etc. If you need information about what those are, please see my video presentation about What is SRE? Or just review the slide deck.

This is a good start, but while it gives a few examples of the kinds of items that you should consider for your own PRR checklist, it’s neither comprehensive nor tailored to your organization’s priorities and needs. Every organization intending to adopt this approach must compile what they consider absolute “must-haves” for any system that will go into production, whether or not that involves compliance rules specific to your industry. If you’d like a set of best practices to start from, I recommend looking at this excellent Production Readiness Checklist from the team at Gruntwork. This kind of checklist can be automated into a compliance and validation platform in your continuous delivery/deployment pipeline.
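As a rough illustration of what automating the checklist could look like, here is a minimal sketch of a pipeline gate. The `ChecklistItem` type, the item keys, and the `evaluate_prr` function are all hypothetical names I made up for the example; your own must-haves will differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    key: str
    description: str
    must_have: bool  # True blocks the deployment, False is only a recommendation


# Hypothetical checklist items, for illustration only.
PRR_CHECKLIST = [
    ChecklistItem("slos_defined", "SLIs and SLOs are documented", True),
    ChecklistItem("alerting_configured", "Alerts cover the Four Golden Signals", True),
    ChecklistItem("runbook_linked", "A runbook is linked from each alert", False),
]


def evaluate_prr(answers: dict[str, bool], checklist=PRR_CHECKLIST):
    """Return (passed, missing_must_haves, missing_recommended) for a team's answers."""
    unmet = [item for item in checklist if not answers.get(item.key, False)]
    missing_must = [item.key for item in unmet if item.must_have]
    missing_recommended = [item.key for item in unmet if not item.must_have]
    return (not missing_must, missing_must, missing_recommended)


passed, blockers, suggestions = evaluate_prr(
    {"slos_defined": True, "alerting_configured": False}
)
print("PRR gate passed:", passed)
print("Blocking items:", blockers)
print("Recommended follow-ups:", suggestions)
```

Separating blocking must-haves from recommendations keeps the gate strict where the organization has decided to be strict, while still surfacing the rest for the review meeting.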

What’s important to know about the PRR is that it is a team-level endeavor. Senior engineers and managers from multiple teams should attend this and provide feedback about the work the team has done before going live in production. But the PRR itself should also be a living document that evolves over time, and teams should expect to regularly undergo this review process, at least once every 2 years. Even mature teams that have been in production for years can learn as the PRR evolves.

Also, as a team endeavor, this effort does not give visibility into the overall success of your SRE investment across the organization. For that, you need additional organization-level activities that will provide overall feedback about the return on your investment.

When a PRR is to be applied to a specific service or application, send the checklist to the team one month in advance and ask them to prepare their answers for the review meeting. The meeting will likely be 2 hours in length, and should be held in a blameless fashion. Remember that legacy systems and those in sustainment mode may struggle to keep up with changing organizational requirements for production. This effort is not meant to make anyone feel like they’re not doing their job, but to help identify and prioritize where production requirements should be relative to where they are.

PRR Definition and Evolution

  • Who is reviewed: The Production Readiness Review contents
  • Who reviews them: SRE Center of Enablement members and engineering/architecture leaders in the organization
  • When: Every six months

The PRR to be used by each team should not be defined once and then ignored. New technologies will be adopted, new approaches and processes will emerge, and best practices will change. As a result, it is important that the SRE leaders within the organization allocate time to revisit the PRR contents and ensure that the checklist items are still relevant, and consider new items to include.

Note that large enterprises typically have more than one PRR, depending on the group/org within. For example, a microservices team shouldn’t follow the same PRR as an analytics team, a storage group, or a platform team. There may be some core checklist items in each, but they should have mission-specific criteria as well. Create several that suit the needs of your organization, but don’t create dozens.

Blameless MTTF/MTTR Review

  • Who is reviewed: All teams in a group that operate a service/application/product/platform in production
  • Who reviews them: SRE Center of Enablement members and engineering leads of each team
  • When: Every three months

Each team that deploys to production must compile historical information about Mean Time to Failure (MTTF) and Mean Time to Repair (MTTR). If it is not historically available from existing tools/data, then they must begin tracking it going forward. This will establish a baseline of performance from which trending information can be conveyed, allowing teams and their leadership to understand where they were, where they are now, and how they are trending going forward. This is different from Incident Review, which is a must-have for any organization deploying systems to production. The MTTF/MTTR review sits at a higher level of visibility, making clear how successful you are at adopting SRE and how much it is improving your operational performance.
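For teams starting from raw incident records, here is a minimal sketch of how the two baselines could be computed. It assumes each incident is a `(started, resolved)` pair of timestamps recorded per team, and it treats time to failure as the healthy gap between one incident’s resolution and the next incident’s start; your tooling may define these differently. The team name and timestamps are made up.

```python
from datetime import datetime, timedelta
from statistics import mean


def mttr(incidents):
    """Mean Time to Repair: the average of (resolved - started) across incidents."""
    return timedelta(seconds=mean((end - start).total_seconds() for start, end in incidents))


def mttf(incidents):
    """Mean Time to Failure: the average healthy gap between a resolution and the next start.

    Needs at least two incidents, otherwise there is no gap to average.
    """
    ordered = sorted(incidents)
    gaps = [
        (next_start - prev_end).total_seconds()
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return timedelta(seconds=mean(gaps))


# Hypothetical data, keyed by team rather than by any individual.
team_incidents = {
    "checkout": [
        (datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 1, 2, 0)),
        (datetime(2020, 1, 11, 2, 0), datetime(2020, 1, 11, 3, 0)),
    ],
}

for team, incidents in team_incidents.items():
    print(f"{team}: MTTR={mttr(incidents)}, MTTF={mttf(incidents)}")
```

Computing the same figures each quarter, from the same definitions, is what makes the trend meaningful; the absolute numbers matter less than their direction.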

One very important cultural aspect of this approach is that the collection and usage of this data must be for good, not evil. Many organizations use data about incidents to blame people involved. All information gathered needs to be collected at a team level, not by individual names or IDs. And when anyone in the room asks a question about who did what, the answer needs to reflect the responsibility of the entire team, not any one person or role. This is easier said than done, as I’ve seen organizations that are supposedly blameless mention “an intern” in an incident review. Imagine how the hiring review process might be affected for that group when they graduate from university.

These reviews need to be held on a quarterly basis, both internally to each team by the managers supporting them, and then upward to the group leadership. When done properly, everyone gets a picture of whether or not SRE is helping the organization address how often failure occurs, and helping teams mitigate incidents more quickly. If yes, that’s great. If not, the question needs to be asked of why the numbers aren’t improving, without assessing blame to the situation. There are always “reasons,” and many of them represent the priorities the organization has chosen about addressing production availability issues relative to feature delivery. That’s all okay, but it’s important that this process help make everyone aware of those decisions, and the impact of them.

Another important thing to keep in mind is that while we want our reviews to be blameless, that does not mean there should not be accountability. Every team needs to show improvement, or have valid reasons for why improvement is not occurring or why things are getting worse. It is incumbent upon leadership to address these reasons, and make decisions about whether or not they are still valid. If those reasons are not valid, and leadership has effectively communicated that the reasons are not valid to the team and given them the mandate and time to address them, why is no improvement being seen months after the fact? Again, there may be valid reasons for that, but people do still need to be accountable for improvement unless it is unreasonable to assume that improvement is possible. And if that is the case, new approaches need to be investigated.

Teams should also track the Cost To The Business metric for incidents, especially those with severities that are revenue-impacting. I will be posting a new article in early 2024 about MTTx and how to measure them, so stay tuned. It’s in draft status and being reviewed at this time.

Tools Standardization

  • Who is reviewed: All teams in a group that operate a service/application/product/platform in production that use core tools to automate, deploy, and observe their systems
  • Who reviews them: SRE Center of Enablement members and the engineering leads of each team
  • When: Every six months

Most large enterprises have a sizable number of tools they use for automation, monitoring/observability, platform services, and more. The SRE Center of Enablement should be thinking about how to standardize these toolchains for cost management and better operational alignment.

These discussions can be very political in nature, as it is also typical that engineers and groups have favorite tools for various reasons and specific needs. That should not be discounted; specific value and needs shouldn’t be discarded for sameness just for the sake of having everyone use the same tools. These discussions should capture the reasons tools exist and make that information less tribal in nature. The goal should be to align on a core set of tools that meet something akin to the 80/20 rule — if you don’t have special needs, ensure that teams are using the same tools and in the same way. This will support easier movement between teams, as well.

System Critical Dependency Review

  • Who is reviewed: All teams in a group that operate a service/application/product/platform in production
  • Who reviews them: SRE Center of Enablement members and the engineering leads of each team
  • When: Every six months

Another regular leadership activity that should take place is the review of systemically critical dependencies in the system. Technical leadership, both managers and top engineers, should meet and review infrastructure and services to identify what components are critical path across multiple other services and solutions. This activity allows the team to perform three important tasks:

  • Understand who “owns your uptime,” and have a clear understanding of the architectural choices that led to this situation. Is it something your team can live with? Has it had critical outages that have affected your own availability, and by how much? Is this a Single Point of Failure (SPOF), and are there redundancies that should exist but do not? Are there mitigation plans in place to deal with an outage in these systems? And does everyone in the room who has a dependency on this system know how to identify and communicate when they are experiencing an incident with this tool?
  • Put together a mitigation plan of attack for this dependency, where you assign certain people or teams to tackle tasks that could help address the situation. If this has already been done, review how successful the mitigation plan has been, and possibly go in a new direction or amend the plan and assign new tasks if need be.
  • Attack the curse of tribal knowledge. The very existence of this regular review helps everyone understand how systems work collectively. This becomes important as new changes are made, such as when new infrastructure is brought online and critical dependencies may not be deployed physically close to that infrastructure, or when changes have to be made to maintenance strategies and plans.
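If your teams can each list what they depend on, a first pass at finding shared critical-path components can be sketched as below. The service names and the `shared_dependencies` helper are hypothetical, and the sketch only looks at direct dependencies; it is a conversation starter for the meeting, not a substitute for it.

```python
from __future__ import annotations

from collections import defaultdict


def shared_dependencies(
    service_deps: dict[str, set[str]], min_dependents: int = 2
) -> dict[str, list[str]]:
    """Map each dependency to the services that rely on it, most widely shared first."""
    dependents = defaultdict(set)
    for service, deps in service_deps.items():
        for dep in deps:
            dependents[dep].add(service)
    # Sort by number of dependents (descending), then by name for stable output.
    ranked = sorted(dependents.items(), key=lambda item: (-len(item[1]), item[0]))
    return {dep: sorted(services) for dep, services in ranked if len(services) >= min_dependents}


# Hypothetical dependency map gathered from each team.
service_deps = {
    "checkout": {"auth", "payments-db"},
    "search": {"auth"},
    "profile": {"auth", "payments-db"},
    "reports": {"warehouse"},
}

for dep, services in shared_dependencies(service_deps).items():
    print(f"{dep} is on the critical path for: {', '.join(services)}")
```

Anything near the top of this list is a candidate for the SPOF and mitigation questions above.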

For this approach to be successful, all of the teams that manage critical dependencies for other systems, such as platforms and databases, must adopt a service delivery mindset. See my post about tiered SLAs and how that can help provide the right level of service for each user.

Every six months is a higher frequency than I initially considered for this activity. The first meeting is the hardest, as critical dependencies are first identified and reasoned about. But the follow-on meetings, where new dependencies are considered and progress against tasks that should mitigate risk is discussed, are just as important, if not as long.

Toil Review

  • Who is reviewed: All teams in a group that operate a service/application/product/platform in production
  • Who reviews them: SRE Center of Enablement members and the engineering leads of each team
  • When: Every three months

This is another quarterly review for leadership and senior engineers, where the SRE backlog for each team is discussed. The reason for making this a meeting is for everyone to listen and understand the toil each other team is looking to address, in case there are commonalities in their own backlogs. Everyone should be looking for efficiencies and areas where the organization can collectively improve, not one team at a time. This meeting doesn’t have to take a long time, but every team should present their backlog, and highlight similarities as they go.

As always, none of these activities are worthwhile if leadership doesn’t demonstrate their importance by actively participating and encouraging other leaders to show up and contribute. Just like Incident Review, if you don’t make it clear that this is important to you, the people you support will also adopt this attitude, and you will get none of the value for which you are heavily investing. Make it a priority to attend and actively contribute to the conversations in each of these activities.

What Does a Governance Activities Calendar Look Like, and Who Performs Them?

A calendar of monthly activities should be defined where all CoE members meet and perform the scheduled work. The first time that they occur (especially the critical dependency reviews) will be arduous, but they will be easier from that point forward. Also, align the schedule to your fiscal calendar, so that the MTTx Reviews take place after a fiscal quarter ends, which will help with the Cost to the Business metric tracking.

For an organization whose fiscal calendar aligns to the calendar year, an example of a governance schedule could look like this:

  • January: MTTx Review
  • February: Toil Backlog Review
  • March: Critical Dependency Review
  • April: MTTx Review
  • May: Toil Backlog Review
  • June: Tool Standardization, PRR Evolution Review, and SLO Reviews
  • July: MTTx Review
  • August: Toil Backlog Review
  • September: Critical Dependency Review
  • October: MTTx Review
  • November: Toil Backlog Review
  • December: Tool Standardization, PRR Evolution Review, and SLO Reviews
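The schedule above follows directly from each activity’s cadence, so it can be generated rather than maintained by hand. Here is a minimal sketch; the `CADENCES` table encodes the example calendar above as (activity, first month, interval in months), and the names are my own. Shift the first months if your fiscal year does not start in January.

```python
import calendar

# (activity, first month, interval in months), taken from the example schedule above.
CADENCES = [
    ("MTTx Review", 1, 3),
    ("Toil Backlog Review", 2, 3),
    ("Critical Dependency Review", 3, 6),
    ("Tool Standardization", 6, 6),
    ("PRR Evolution Review", 6, 6),
    ("SLO Reviews", 6, 6),
]


def governance_calendar(cadences=CADENCES):
    """Return a dict mapping month number (1-12) to the activities held that month."""
    schedule = {month: [] for month in range(1, 13)}
    for activity, first_month, interval in cadences:
        for month in range(first_month, 13, interval):
            schedule[month].append(activity)
    return schedule


for month, activities in governance_calendar().items():
    print(f"{calendar.month_name[month]}: {', '.join(activities)}")
```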

Jamie Allen
Site Reliability Engineering Leadership

SRE CTO. Ex-Software engineering leader behind Starbucks Rewards and MOP. Ex-Facebook SRE leader.