Site Reliability Engineering is Operations

I saw two emotionally charged opinions on Twitter this week about SRE and Operations. They really made me think.

The first tweet I saw was from @tmclaughbos, in which he makes the impassioned case that doing a Kubernetes (EKS) migration is not an SRE activity.

The thing that really made me think was Tom saying “That sounds like just an operations team.”

I am a Site Reliability Engineer at Google and have run some of the world’s largest applications at scale since 2011. Site Reliability Engineering does Operations.

Later that week I saw Mar⚡️us write about the dev/ops split in SRE going badly. Mar⚡️us is referring to the principle that an SRE team should spend less than 50% of its time doing ‘Toil’ or repetitive work (SRE Book Chapter 5: Eliminating Toil, “Why Less Toil Is Better”):

This short story speaks volumes to me. It’s a failure mode of the SRE engagement that I’ve not seen this starkly, but I have certainly seen things like it before: the team runs the system, they muck about in the rest of their time without being especially productive, but it looks like work so nobody complains.

In the previous thread with Tom McLaughlin, I pulled out the somewhat grandiose quote from Benjamin Treynor Sloss:

Tom questioned whether a migration to EKS (Kubernetes) was an SRE activity, because it sounded like the work of an Ops team. Mar⚡️us points out that it’s possible for an SRE team to just do a bad job of operations if they make enough noise about Service Level Objectives. (I’m paraphrasing: their actual words are in the embedded tweets above. Please read them and the context threads for their stated positions.)

Boris the SLO Loris Approves of using Service Level Objectives

My position on this is: Site Reliability Engineering must do Operations, but a well-functioning SRE team must do those operations mindfully and with respect to their actual goal, which hasn’t been stated here:

SRE help the entire organisation take appropriate risks.

  • When things are reliable: move faster, take more risks.
  • When things aren’t so reliable: move slower, mitigate problems, take fewer risks.
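
One common way to turn those two bullets into something a team can act on is an error budget: the amount of unreliability the Service Level Objective permits in a given window. The sketch below is only an illustration of that idea, not a description of how any particular team does it. The names `SloWindow`, `error_budget_remaining` and `risk_posture`, and the 25% caution threshold, are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < window.slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


def risk_posture(window: SloWindow, caution_threshold: float = 0.25) -> str:
    """Map the remaining budget to a risk posture (threshold is an assumption)."""
    remaining = error_budget_remaining(window)
    if remaining <= 0:
        return "slow down: mitigate problems, take fewer risks"
    if remaining < caution_threshold:
        return "proceed carefully: the budget is nearly spent"
    return "move faster: take more risks"


if __name__ == "__main__":
    healthy = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=100)
    burnt = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
    print(risk_posture(healthy))
    print(risk_posture(burnt))
```

The point of a calculation like this is that the decision to speed up or slow down comes from measured reliability rather than from whoever argues loudest. The middle "proceed carefully" band is an extra I added; the two bullets above only describe the two ends.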

An EKS migration is a huge risk. If the folks responsible for it measure performance, are mindful of the problem space, and are directly engaged, they are in an excellent position to make sure that the appropriate risks are taken.

This has to be bracketed by what Mar⚡️us was pointing out: an SRE team has to show results beyond their operations. At the end of your reporting period: What projects did the SRE team complete? What toil did they reduce? What’s their plan for the next major risk facing the business, like a migration, new system deployment, or their next 10x userbase?

What do I think about SRE sounding like an ops team?

Site Reliability Engineering do Operations but are not an Operations Team.

Thanks for reading. Please feel free to comment here or chat with me on Twitter.