Five whys you can’t do SRE and why you can
In the past 11 years, I’ve heard many reasons why some companies can’t or should not have anything to do with Site Reliability Engineering (SRE). Here is a list of 5 reasons why you may believe you can’t start and why you might be misguided:
- We are not Google
- We are not as big as Google
- We are not developing software quite like Google
- We would like to implement DevOps first
- We’ll automate our job away
We are not Google
While Google has given the role a name, shaped it to its needs and then released two books on the subject, it’s pretty clear that several SRE best practices, as they are defined today, were implemented before Google SRE was founded, both in tech companies and elsewhere. Blameless postmortems and incident management are two examples.
“To ensure that Safety Board investigations focus only on improving transportation safety, the Board’s analysis of factual information and its determination of probable cause cannot be entered as evidence in a court of law.” — NTSB
“ICS consists of a standard management hierarchy and procedures for managing temporary incident(s) of any size. ICS procedures should be pre-established and sanctioned by participating authorities, and personnel should be well-trained prior to an incident.” — ICS
This is when folks tend to tell me “Fine, this is not Google only” and then they may adjust their reasoning to “We are not as big as Google [or a gov’t agency]”.
We are not as big as Google
For the record, Google SRE is not as big as Google. It is a small fraction of the company (headcount-wise) and it started much smaller. Several other companies have started SRE teams early on as well.
It’s also true that many companies have never named their team or role “SRE” in the past but have adopted at least a couple of the SRE best practices, often unknowingly.
This is the “How do you think I’ve got this job?” part of my argument. Automation, non-abstract large scale system design, RCS for config changes and the understanding that we could neither afford 100% uptime nor page on single vserver (!) issues were pretty much standard in my “SysAdmin” team back in 2003! This was 4 years prior to my first Google interview.
The “I was doing parts of SRE before Google SRE was founded” story is not uncommon from what I’ve heard from other SREs.
We are not developing software quite like Google
At this point, folks tend to tell me that SRE is for web related and/or backend related services. The extreme and relatively new example is “We just have a mobile app, everything else is serverless”, and a more common one is “We can’t roll back or freeze”.
This reason can generally be rephrased as “I can’t implement all the SRE best practices as they have been described to me, so I won’t look into any of them further”. I’d recommend looking into how SRE approaches incident management, capacity planning, toil and of course error budgets. It may be that on an error budget violation you can’t freeze and you can’t roll back to mitigate an outage. That’s ok. We’ve all been there before.
Many of my colleagues have argued that adopting SRE is a journey and that a gradual rollout is fine. I agree. You may also uncover more interesting or simply better ways to define and guarantee the level of reliability required for your customers and your business.
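If error budgets sound abstract, here is a minimal sketch of the arithmetic behind them. It assumes a request-based SLO; the function name and its parameters are made up for illustration and do not come from any particular tool.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated.
    All names here are hypothetical, chosen only for this sketch.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so nothing was spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over one million requests tolerates about 1,000 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"error budget remaining: {remaining:.0%}")
```

Even if you can’t freeze or roll back when the number goes negative, knowing it gives you and your developers a shared fact to prioritize against.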
We would like to implement DevOps first
This is usually due to a valid prioritization call where folks have decided to automate workflows via CI/CD before staffing a dedicated SRE team in order to start adopting SRE best practices.
My argument in this case is to remind people that SRE is an opinionated implementation of DevOps and that not all products and services at Google have a dedicated SRE team either. With that being said, in the absence of a dedicated SRE, developers (aka Software Engineers) at Google will tend to work as… DevOps without SREs, who happen to know and leverage some SRE best practices themselves. Let that sink in… Generally speaking, Google developers stick to blameless postmortems, the same incident management process, etc.
In other words, your operators, no matter what you call them, can and should adopt SRE best practices eons before you get to staff your first SRE team. The rationale is simple: it is more cost effective to do so. It is so much more cost effective that the last argument is “We will automate our job away”. You won’t, at least not fully, but we will cover that in a moment.
I owe you short answers on (a) how to decide whether a team has implemented enough SRE best practices to be called an SRE team and (b) an index of best practices outside the realm of Google’s SRE related books.
We’ll automate our job away
As a developer implementing SRE best practices, you won’t. To be fair, no developer has told me that. There will be plenty of feature development and bug fixes to work on while you are busy managing incidents, implementing error budgets and a more SRE-like approach.
Within some operator groups, I see people confusing automation with the elimination of workload altogether. SRE related automation is often, but not always, a replacement of manual labor that allows much more work to be done with a lower rate of failures attributable to human error. Building and maintaining the automation is neither free nor cheap, even for very experienced SREs. Automation and elimination of workloads are different things. SREs tend to aim to scale their positive impact through both, but I have never heard an SRE say that one has to automate everything, or most things, as the first SRE best practice, blocking the adoption of any other.
There’s also a whole set of SRE best practices which have less to do with automation. For instance, training for and managing a production incident or reasoning about systems design end-to-end in order to influence and work with developers on their specific components to improve the overall reliability.
As an operator, do you have SLO burn rate based monitoring for your services? Do you have an incident management process and have you established training for it? Do you review the overall system design with your developers/vendors/partner teams and are you able to influence better outcomes in order to defend the SLO? I’m doubtful that doing any of these would lead to unemployment but if it happens to you, send me a DM on Twitter and I’ll help you out with your resume. There are several companies out there looking for someone with your skills!
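For readers who have not met burn rate based monitoring yet, the sketch below shows one way the idea can be expressed. It is an illustration under assumptions, not a drop-in alerting rule: the function names, the two-window check and the default threshold are all choices made for this example. The value 14.4 is used only because, at that rate, one hour consumes 2% of a 30-day error budget (14.4 / 720 hours = 0.02).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent.

    1.0 means the budget would last exactly the SLO period;
    2.0 means it would be gone in half that time.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window (for example one hour) shows the problem is significant;
    the short window (for example five minutes) shows it is still happening.
    Window lengths and the threshold are assumptions for this sketch.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% of requests failing against a 99.9% SLO is a burn rate of about 20.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # The long window is still bad but the short window has recovered.
    print(should_page(0.02, 0.0005, slo_target=0.999))
```

Note that nothing here pages on a single server being unhealthy; the alert is tied to what customers experience relative to the SLO.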
In conclusion, I believe that the requirements to start an SRE team or organization within your company often get conflated with what it takes to start adopting SRE best practices. You should do the latter first.