SRE Governance

It’s hard to imagine a major organization deploying production databases without formal rules and standards about how that must be done. Governance represents the processes we put in place to make sure we are doing the right things at the right time, and to prevent anyone from causing issues by circumventing them.

SREs are a unique group in the world of software and systems engineering, in that we aren’t always held to the same standard of governance you would expect in other areas. I’m unsure how many companies that implement SRE approaches and principles put methodologies in place to maximize the value and ROI of having done so. The one area of SRE governance that is prevalent is the Production Readiness Review (PRR), in which software to be deployed into production is reviewed to ensure that it meets the “must-have” criteria defined for that organization. But what other areas of governance should exist?

There is no standard answer to this question to help leaders measure the impact of SREs and ensure they’re doing the right things, but here are some activities (including PRR information) that I recommend to get you started.

Production Readiness Review (PRR)

  • Who is reviewed: Any team that owns a service/application/product/platform in production
  • Who reviews them: Organizational management and senior engineering leaders
  • When: At least once every two years

Let’s start with what is known and standardized. The Google SRE book is prescriptive about having these, and even describes a few different kinds of PRRs that correspond to different stages of maturity or roles of the software being deployed. A PRR should cover the base information that all SREs care about: SLIs, SLOs, SLAs, the Four Golden Signals, and so on. If you need a refresher on what those are, please see my video presentation about What is SRE? Or just review the slide deck.

This is a good start, but while it gives a few examples of the kinds of items you should consider for your own PRR checklist, it is neither comprehensive nor tailored to your organization’s priorities and needs. Every organization intending to adopt this approach must compile what it considers the absolute “must-haves” for any system that will go into production, including any compliance rules specific to its industry. If you’d like a set of best practices to start from, I recommend the excellent Production Readiness Checklist from the team at Gruntwork. A checklist like this can be automated into a compliance and validation gate in your continuous delivery/deployment pipeline, as sketched below.
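To make that concrete, here is a minimal Python sketch of what an automated PRR gate might look like as a pipeline stage. The checklist items and the service metadata are invented for illustration; substitute your organization’s own must-haves and wherever your deployment metadata actually lives.

```python
# A minimal sketch of automating a PRR checklist in a CD pipeline.
# The checklist items and the service descriptor are hypothetical;
# substitute your organization's own must-haves and metadata source.

import sys

# Example service metadata, as might be read from a deployment manifest.
service = {
    "name": "checkout-api",
    "slo_defined": True,
    "alerting_configured": True,
    "runbook_url": "",          # empty: no runbook registered
    "oncall_rotation": "team-checkout",
}

# Each must-have pairs a description with a predicate over the metadata.
CHECKLIST = [
    ("SLOs defined",         lambda s: s.get("slo_defined", False)),
    ("Alerting configured",  lambda s: s.get("alerting_configured", False)),
    ("Runbook registered",   lambda s: bool(s.get("runbook_url"))),
    ("On-call rotation set", lambda s: bool(s.get("oncall_rotation"))),
]

failures = [name for name, check in CHECKLIST if not check(service)]

if failures:
    print(f"PRR gate FAILED for {service['name']}:")
    for name in failures:
        print(f"  - missing: {name}")
    sys.exit(1)  # a non-zero exit fails the pipeline stage
print(f"PRR gate passed for {service['name']}")
```

The key design choice is that a missing must-have fails the pipeline with a non-zero exit code, so it blocks the deploy rather than just producing a report someone may or may not read.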

What’s important to know about the PRR is that it is a team-level endeavor. Senior engineers and managers from multiple teams should attend and provide feedback about the work the team has done before it goes live in production. But the PRR should also be a living document that evolves over time, and teams should expect to undergo this review regularly, at least once every two years. Even mature teams that have been in production for years can learn as the PRR evolves.

Because it is a team-level endeavor, however, the PRR does not give visibility into the overall success of your SRE investment across the organization. For that, you need additional organization-level activities that provide feedback about the return on your investment.

Blameless MTTR/MTTF Review

  • Who is reviewed: All teams in a group that operate a service/application/product/platform in production
  • Who reviews them: Organizational leadership and the peers of each team
  • When: Every three months

Each team that deploys to production must compile historical information about Mean Time to Failure (MTTF) and Mean Time to Repair (MTTR). If that history is not available from existing tools and data, the team must begin tracking it going forward. This establishes a baseline of performance from which trends can be conveyed, allowing teams and their leadership to understand where they were, where they are now, and where they are heading. This is different from Incident Review, which is itself a must-have for any organization deploying systems to production: this review sits at a higher level of visibility, making clear how successfully you have adopted SRE and how much it is improving your operational performance. A minimal sketch of how these metrics might be computed follows.
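As a rough illustration, here is a Python sketch of deriving MTTR and MTTF from incident records. The incidents and their shape are invented; in practice the timestamps would come from your incident-tracking tooling.

```python
# A minimal sketch of computing MTTR and MTTF from incident history.
# Incident records here are hypothetical (start/end datetimes for one
# service); real data would come from your incident-tracking tool.

from datetime import datetime, timedelta

incidents = [  # (start, resolved), sorted by start time
    (datetime(2023, 1, 10, 4, 0),  datetime(2023, 1, 10, 5, 30)),
    (datetime(2023, 2, 2, 14, 0),  datetime(2023, 2, 2, 14, 45)),
    (datetime(2023, 3, 21, 9, 0),  datetime(2023, 3, 21, 11, 0)),
]

# MTTR: mean duration from incident start to resolution.
repair_times = [end - start for start, end in incidents]
mttr = sum(repair_times, timedelta()) / len(repair_times)

# MTTF: mean time from one incident's resolution to the next one's start.
uptimes = [incidents[i + 1][0] - incidents[i][1]
           for i in range(len(incidents) - 1)]
mttf = sum(uptimes, timedelta()) / len(uptimes)

print(f"MTTR: {mttr}")  # 1:25:00 for the sample data
print(f"MTTF: {mttf}")
```

Computing these per quarter over the same data gives you the trend line the review is really about: not the absolute numbers, but whether they are moving in the right direction.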

One very important cultural aspect of this approach is that the collection and usage of this data must be for good, not evil. Many organizations use data about incidents to blame the people involved. All information gathered needs to be collected at a team level, not by individual names or IDs. And when anyone in the room asks who did what, the answer needs to reflect the responsibility of the entire team, not any one person or role. This is easier said than done; I’ve seen organizations that are supposedly blameless mention “an intern” in an incident review. Imagine how that group’s hiring review might go when the intern graduates from university and applies for a full-time role.

These reviews need to be held on a quarterly basis, both internally within each team by the managers supporting it, and then upward to group leadership. When done properly, everyone gets a picture of whether SRE is helping the organization reduce how often failure occurs and helping teams mitigate incidents more quickly. If yes, that’s great. If not, the question of why the numbers aren’t improving needs to be asked, without assigning blame. There are always “reasons,” and many of them reflect the priorities the organization has chosen regarding production availability relative to feature delivery. That’s all okay, but it’s important that this process make everyone aware of those decisions and their impact.

Another important thing to keep in mind is that while we want our reviews to be blameless, that does not mean there should not be accountability. Every team needs to show improvement, or have valid reasons why improvement is not occurring or why things are getting worse. It is incumbent upon leadership to address these reasons and decide whether they are still valid. If the reasons are not valid, and leadership has communicated that clearly to the team and given it the mandate and time to address them, why is no improvement being seen months later? Again, there may be valid reasons for that, but people still need to be accountable for improvement unless it is unreasonable to assume improvement is possible. And if that is the case, new approaches need to be investigated.

System Critical Dependency Review

  • Who is reviewed: All teams in a group that operate a service/application/product/platform in production
  • Who reviews them: Organizational leadership and the peers of each team
  • When: Every six months

Another regular leadership activity that should take place is a review of the systemically critical dependencies in your systems. Technical leadership, both managers and senior engineers, should meet to review infrastructure and services and identify which components sit on the critical path of multiple other services and solutions. This activity allows the group to accomplish three important things:

  • Understand who “owns your uptime,” and have a clear understanding of the architectural choices that led to this situation. Is it something your team can live with? Has it had critical outages that affected your own availability, and by how much? Is it a Single Point of Failure (SPOF), and are there redundancies that should exist but do not? Are there mitigation plans in place to deal with an outage in these systems? And does everyone in the room who depends on this system know how to identify and communicate when they are experiencing an incident with it? (A sketch of one way to surface such dependencies appears after this list.)
  • Put together a mitigation plan for this dependency, assigning specific people or teams to tasks that could help address the situation. If this has already been done, review how successful the mitigation plan has been, and amend the plan, change direction, or assign new tasks as needed.
  • The very existence of this semi-annual review helps attack the curse of tribal knowledge, and helps everyone understand how the systems work collectively. This becomes important as new changes are made, such as when new infrastructure is brought online and critical dependencies may not be deployed physically close to it, or when maintenance strategies and plans have to change.
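As a rough illustration, here is a Python sketch of surfacing candidates for this review by inverting a service dependency map. The services and the map itself are invented; in practice the map might be derived from service-mesh telemetry, deployment manifests, or the teams’ own architecture documentation.

```python
# A minimal sketch of identifying systemically critical dependencies.
# The dependency map below is hypothetical; substitute real data from
# telemetry, manifests, or architecture docs.

from collections import defaultdict

# service -> the components it depends on
depends_on = {
    "checkout-api": ["postgres-main", "auth-service", "redis-cache"],
    "search-api":   ["elasticsearch", "redis-cache"],
    "auth-service": ["postgres-main"],
    "billing-api":  ["postgres-main", "auth-service"],
}

# Invert the map: dependency -> the services that rely on it.
dependents = defaultdict(set)
for service, deps in depends_on.items():
    for dep in deps:
        dependents[dep].add(service)

# Anything on the critical path of more than one service belongs on the
# review agenda, and is a potential SPOF if it has no redundancy.
for dep, users in sorted(dependents.items(), key=lambda kv: -len(kv[1])):
    if len(users) > 1:
        print(f"{dep}: critical for {len(users)} services -> {sorted(users)}")
```

For the sample data this flags postgres-main first (three dependents), which is exactly the kind of component whose ownership, redundancy, and mitigation plans the meeting should examine.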

For this approach to be successful, all of the teams that manage critical dependencies for other systems, such as platforms and databases, must adopt a service delivery mindset. See my post about tiered SLAs and how that can help provide the right level of service for each user.

Every six months is a higher frequency than I initially had in mind for this activity. The first meeting is the hardest, as critical dependencies are identified and reasoned about for the first time. But the follow-on meetings, where new dependencies are considered and progress against risk-mitigation tasks is discussed, are just as important, even if not as long.

Toil Review

  • Who is reviewed: All teams in a group that operate a service/application/product/platform in production
  • Who reviews them: Organizational leadership and the peers of each team
  • When: Every three months

This is another quarterly review for leadership and senior engineers, in which each team’s SRE backlog is discussed. The reason for making this a meeting is for everyone to hear and understand the toil every other team is looking to address, in case there are commonalities with their own backlogs. Everyone should be looking for efficiencies and areas where the organization can improve collectively, not one team at a time. This meeting doesn’t have to take long, but every team should present its backlog and highlight similarities as they go. A sketch of one way to surface those overlaps ahead of the meeting follows.
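As a rough illustration, here is a Python sketch of finding overlap across team backlogs before the meeting. The teams and backlog items are invented, and exact-string matching stands in for whatever categorization or tagging of toil your organization actually uses.

```python
# A minimal sketch of surfacing shared toil across team backlogs.
# Teams and items are hypothetical; real backlogs would be pulled from
# each team's tracker, ideally with a shared tagging scheme.

from collections import defaultdict

backlogs = {
    "team-checkout": ["manual cert rotation", "flaky deploy rollbacks"],
    "team-search":   ["manual cert rotation", "log disk cleanup"],
    "team-billing":  ["log disk cleanup", "manual cert rotation"],
}

teams_by_item = defaultdict(list)
for team, items in backlogs.items():
    for item in items:
        teams_by_item[item].append(team)

# Items appearing in more than one backlog are candidates for a single
# organization-level fix instead of duplicated per-team effort.
for item, teams in sorted(teams_by_item.items(), key=lambda kv: -len(kv[1])):
    if len(teams) > 1:
        print(f"'{item}' appears in {len(teams)} backlogs: {teams}")
```

Here “manual cert rotation” shows up in all three backlogs, which is precisely the kind of shared toil one platform-level automation effort could eliminate for everyone at once.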

As always, none of these activities is worthwhile if leadership doesn’t demonstrate their importance by actively participating and encouraging other leaders to show up and contribute. Just like Incident Review, if you don’t make it clear that this is important to you, the people you support will adopt the same attitude, and you will get none of the value you are investing so heavily to obtain. Make it a priority to attend and actively contribute to the conversations in each of these activities.
