Hope Is Not a Strategy

This is the first of a series of articles I intend to write and publish, each one drawing from a part of the SRE book.

I have been a Site Reliability Engineer here at Google for 6 years. I feel like I’ve learned a lot, and still have lots to learn about how to run large systems, how to be an SRE, and how Google works. The opinions stated here are my own, not those of my company.

The SRE book was authored by my colleagues at Google, to try to give an accurate picture of what the SRE role is, where it came from, and the lessons Google learned. The book was released with a BY-NC-ND Creative Commons license, so if you’re interested you can go and read the entire thing (DRM-free of course).

For this series of articles I will be quoting directly from the text of the SRE book, and adding my own commentary. Starting immediately with Chapter 1: The Introduction:

“Hope is not a strategy.”
-Traditional SRE saying
It is a truth universally acknowledged that systems do not run themselves. How, then, should a system — particularly a complex computing system that operates at a large scale — be run?

Yes, it is actually a traditional SRE saying. Among my work friends, the challenge/response version of it is “But it is a service that provides a packet interface from production to the Internet.” This is because of our great love of pedantry, language jokes and choosing terrible names for our products.

We say “Hope is not a strategy” when we mean: we need to apply best practices, instead of just launching software and new features and trusting that they will be successful. We use it to call out anyone who is letting something happen (such as a launch, or running a system) without applying the proper principles and best practices.

The Sysadmin Approach to Service Management

Historically, companies have employed systems administrators to run complex computing systems.
This systems administrator, or sysadmin, approach involves assembling existing software components and deploying them to work together to produce a service. Sysadmins are then tasked with running the service and responding to events and updates as they occur. As the system grows in complexity and traffic volume, generating a corresponding increase in events and updates, the sysadmin team grows to absorb the additional work. Because the sysadmin role requires a markedly different skill set than that required of a product’s developers, developers and sysadmins are divided into discrete teams: “development” and “operations” or “ops.”

As a disclaimer: I have never actually worked as a systems administrator, but I have often had ops duties attached to my work as a developer. I never self-identified as ‘devops’.

The sysadmin model of service management has several advantages. For companies deciding how to run and staff a service, this approach is relatively easy to implement: as a familiar industry paradigm, there are many examples from which to learn and emulate. A relevant talent pool is already widely available. An array of existing tools, software components (off the shelf or otherwise), and integration companies are available to help run those assembled systems, so a novice sysadmin team doesn’t have to reinvent the wheel and design a system from scratch.

I guess the key point here is that sysadmin teams don’t necessarily have to design new systems from scratch. I can see a sysadmin simply picking from the established and well-documented solutions that already exist.

To contrast with the work I do in SRE, especially at Google: the tools and systems both have to be built “from scratch,” and there is certainly a lot of on-the-job learning required to understand what makes or breaks a system that has to scale to a large degree.

The sysadmin approach and the accompanying development/ops split has a number of disadvantages and pitfalls. These fall broadly into two categories: direct costs and indirect costs.
Direct costs are neither subtle nor ambiguous. Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.

Have you ever been sold on a tech solution with phrases like: “Serving system X requires 1 admin per 5 servers, but system Y is being run with 1 admin per 20 servers!”?

Thinking about that time in my career makes me cringe a little. I strongly believe that a healthy approach to systems keeps the number of people required to run the system constant, rather than growing simply because there are more users.
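The direct-cost argument above can be sketched as a toy model. All the numbers and function names here are hypothetical, purely to illustrate the difference between staffing that scales with fleet size and staffing that stays flat:

```python
def manual_ops_staff(servers: int, servers_per_admin: int = 20) -> int:
    """Manual change management and event handling: each admin can
    hand-hold only so many servers, so headcount grows linearly."""
    return -(-servers // servers_per_admin)  # ceiling division

def automated_ops_staff(servers: int, base_team: int = 6) -> int:
    """With automation absorbing the per-server toil, the team size
    stays roughly constant regardless of fleet size (a simplification)."""
    return base_team

for fleet in (100, 1_000, 10_000):
    print(fleet, manual_ops_staff(fleet), automated_ops_staff(fleet))
```

Under the manual model, growing the fleet 100x grows the team 100x; under the automated model, the marginal cost of another server is close to zero. That gap is the direct cost the book is describing.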

The indirect costs of the development/ops split can be subtle, but are often more expensive to the organization than the direct costs. These costs arise from the fact that the two teams are quite different in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and possibilities for technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually, trust and respect. This outcome is a pathology.

“This outcome is a pathology.” This makes me think of the BOFH, whose stories I have read since I was a teenager, thinking about how c0ol it was for this guy to mess around all day in his computing cave, filled with righteous anger at any users and developers who might dare disturb him.

As an adult I know that the BOFH is a totally dysfunctional figure, and yet it’s still interesting to see that the world described in those stories has solid parallels even today.

Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension.

The tension between “It works, Ship It!” and “Don’t change anything!” is real. This can provoke high tempers and sleepless nights.

Both groups understand that it is unacceptable to state their interests in the baldest possible terms (“We want to launch anything, any time, without hindrance” versus “We won’t want to ever change anything in the system once it works”). And because their vocabulary and risk assumptions differ, both groups often resort to a familiar form of trench warfare to advance their interests. The ops team attempts to safeguard the running system against the risk of change by introducing launch and change gates. For example, launch reviews may contain an explicit check for every problem that has ever caused an outage in the past — that could be an arbitrarily long list, with not all elements providing equal value. The dev team quickly learns how to respond. They have fewer “launches” and more “flag flips,” “incremental updates,” or “cherrypicks.” They adopt tactics such as sharding the product so that fewer features are subject to the launch review.

I think you can quite easily call this another pathology: any arrangement in which two teams have to fight one another to achieve their goals will end up here.

It’s not necessarily a pathology that everyone encounters. There are often ways for the relationship to be mutually beneficial, and provided the ops load is not too heavy and the quality of the software is sufficiently high, it can be very productive!

As the SRE book goes on, we’ll read about positive ways for these two groups to engage with each other rather than standing in direct opposition: what the dev side really wants is to iterate on an excellent product, and the ops side wants it to run smoothly without interruption. Both can be achieved.