Google’s Approach
I am a Site Reliability Engineer at Google, annotating the SRE book in a series of posts. The opinions stated here are my own, not those of my company.
Continuing with: Google’s Approach to Service Management: Site Reliability Engineering.
Conflict isn’t an inevitable part of offering a software service. Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.
I feel like this is underselling what good system administrators do. A good sysadmin does write software to accomplish what their peers might do manually. The idea of sysadmins replacing people with small shell scripts is not new!
Keeping an open mind here: I think it’s important to think about this in terms of a contrast and comparison, not painting things as black and white.
I do interview Site Reliability Engineers at Google. Quite often we interview and hire people who self-describe as System Administrators, because they have the requisite skills and experience. But sometimes we see a resume, go to interview the candidate, and it turns out that their day-to-day role is mostly about putting install media in new servers and rotating backup tapes. We want the other kind: people who craft automated network installers and send verifiable incremental snapshots over the network.
What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a “Production Team” of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.
Here the author is speaking in the first person about his own experience. This is Ben Treynor Sloss, who is in my own reporting chain. He is a very hands-on, technical person who does care deeply about how the organisation acts and behaves.
A primary building block of Google’s approach to service management is the composition of each SRE team. As a whole, SREs can be broken down into two main categories.
50–60% are Google Software Engineers, or more precisely, people who have been hired via the standard procedure for Google Software Engineers. The other 40–50% are candidates who were very close to the Google Software Engineering qualifications (i.e., 85–99% of the skill set required), and who in addition had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.
I am the latter here: in our nomenclature I am a ‘Systems Engineer’, not a ‘Software Engineer’. We abbreviate these as ‘SRE-SE’ and ‘SRE-SWE’.
Because of my previous work experience, where I dealt at least a little bit with every layer of the stack, I ended up having all the keywords on my resume that our recruiters are told to look for under “useful to SRE but is rare for most software engineers.” I am now a Senior Systems Engineer here at Google.
I like being a Systems Engineer. Our job description puts more emphasis on solving problems, and not necessarily on writing code to do it, which is a positive incentive.
Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems. Within SRE, we track the career progress of both groups closely, and have to date found no practical difference in performance between engineers from the two tracks. In fact, the somewhat diverse background of the SRE team frequently results in clever, high-quality systems that are clearly the product of the synthesis of several skill sets.
The result of our approach to hiring for SRE is that we end up with a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated. SREs also end up sharing academic and intellectual background with the rest of the development organization. Therefore, SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.
There’s a potential for dysfunction here: without good project management, you end up with too many solutions for similar problems. This is something to watch out for.
When you have folks with software expertise, sometimes they end up writing one-off bespoke solutions that only apply to the job in front of them, without looking left and right and figuring out what’s important for the whole organisation.
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. Eventually, a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.
When I read that “operational load will grow with traffic,” I have to think about the concept for a while before it makes sense to me.
Imagine if every new software release had to be installed by logging into each server, installing the new version, and restarting the binary. This operational load is fine for a small number of servers. But with thousands of servers, a hands-on operational approach would take a long time to get through them all.
The SRE challenge is to make updating a service running on 1000+ machines in each of 15 datacenters worldwide no more operationally demanding than updating a small service running on just 3 machines.
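To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. The five minutes per server and the thirty minutes of human attention for an automated rollout are numbers I made up for illustration; they are not figures from Google or from the book.

```python
# Back-of-the-envelope comparison of human effort per release.
# Both constants are illustrative assumptions, not measured values.

MINUTES_PER_SERVER_BY_HAND = 5     # assumed: log in, install, restart
MINUTES_TO_SUPERVISE_ROLLOUT = 30  # assumed: start and watch an automated rollout


def manual_release_minutes(servers: int) -> int:
    """Human time grows linearly with the number of servers."""
    return servers * MINUTES_PER_SERVER_BY_HAND


def automated_release_minutes(servers: int) -> int:
    """Human time stays flat no matter how many servers there are."""
    return MINUTES_TO_SUPERVISE_ROLLOUT


small = 3          # a small service on 3 machines
large = 1000 * 15  # 1000 machines in each of 15 datacenters

for servers in (small, large):
    print(servers, manual_release_minutes(servers), automated_release_minutes(servers))
```

Under these assumptions the hands-on approach costs 15 minutes for the small service and 75,000 minutes (about 1,250 hours) for the large one, while the supervised rollout costs the same 30 minutes either way. Note that for 3 machines the manual route is actually cheaper, which is why the automation only looks essential once a service grows.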
To avoid this fate, the team tasked with managing a service needs to code or it will drown. Therefore, Google places a 50% cap on the aggregate “ops” work for all SREs — tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.
I have been on a very mature team where lots of the operational load has been automated away, and also on teams that haven’t had the luxury of years of development.
It is exciting every time we make something that was previously a point of operational pain automatic.
My favorite example here is from very early in my time at Google. Our team had monitoring that would occasionally detect a server that was having Java garbage collection issues. These were ultimately caused by memory management practices, and the only way our developers could find out where the memory was being consumed was to analyse a heap dump.
So our procedure was to grab a heap dump, file a bug, attach the heap dump, and assign it to the right developer. After 3 of these incidents in the same week, I said, exasperated: “Why the heck am I doing this myself, shouldn’t the computer be able to do this?”
By the end of the week, a coworker who had heard my outburst had written some software that would do the above procedure with a single shell command. A few months later it had expanded in scope to remove the need for humans at all: a service would notice the bad tasks and send the appropriate developer an automated bug with enough data in it to triage and fix the memory problems.
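The shape of that automation can be sketched in a few lines of Python. This is not the tool my coworker wrote: the Task fields, the capture_heap_dump and file_bug helpers, and the file paths are all hypothetical stand-ins, passed in as plain functions so the sketch runs without any real monitoring or bug tracker.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    name: str           # hypothetical task identifier
    owner: str          # hypothetical: developer who should receive the bug
    gc_unhealthy: bool  # hypothetical signal coming from monitoring


def triage_gc_problems(
    tasks: Iterable[Task],
    capture_heap_dump: Callable[[Task], str],
    file_bug: Callable[[str, str, str], str],
) -> List[str]:
    """For each unhealthy task: grab a heap dump, file a bug with it attached, assign it."""
    bug_ids = []
    for task in tasks:
        if not task.gc_unhealthy:
            continue
        dump_path = capture_heap_dump(task)
        title = f"GC problems on {task.name}"
        bug_ids.append(file_bug(title, dump_path, task.owner))
    return bug_ids


# Fake helpers standing in for the real heap dump and bug tracker integrations.
filed = []


def fake_dump(task: Task) -> str:
    return f"/tmp/{task.name}.hprof"


def fake_bug(title: str, attachment: str, assignee: str) -> str:
    filed.append((title, attachment, assignee))
    return f"bug-{len(filed)}"


tasks = [Task("web-1", "dev-a", False), Task("web-2", "dev-b", True)]
bug_ids = triage_gc_problems(tasks, fake_dump, fake_bug)
print(bug_ids, filed)
```

The point of the sketch is the progression: first the loop body becomes one command a human runs, then the loop itself is driven by monitoring so no human is involved.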
Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development. So how do we enforce that threshold? In the first place, we have to measure how SRE time is spent. With that measurement in hand, we ensure that the teams consistently spending less than 50% of their time on development work change their practices. Often this means shifting some of the operations burden back to the development team, or adding staff to the team without assigning that team additional operational responsibilities. Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gleaned from the operations side of running a service.
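As a rough illustration of “measure how SRE time is spent,” here is a minimal Python sketch that totals logged hours per team and flags teams above the 50% ops cap. The entry format, the team names, and the category labels are assumptions of mine; the book does not describe how the measurement is actually done.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

OPS_CAP = 0.5  # the 50% cap on aggregate ops work

# Assumed labels for the ops work the book lists: tickets, on-call, manual tasks.
OPS_CATEGORIES = {"tickets", "on-call", "manual"}

# Each entry is (team, category, hours); this shape is hypothetical.
Entry = Tuple[str, str, float]


def ops_fraction_by_team(entries: Iterable[Entry]) -> Dict[str, float]:
    """Return the share of each team's logged hours that went to ops work."""
    ops = defaultdict(float)
    total = defaultdict(float)
    for team, category, hours in entries:
        total[team] += hours
        if category in OPS_CATEGORIES:
            ops[team] += hours
    return {team: ops[team] / total[team] for team in total if total[team] > 0}


def teams_over_cap(entries: Iterable[Entry]) -> List[str]:
    """Teams whose ops share exceeds the cap and should change their practices."""
    fractions = ops_fraction_by_team(entries)
    return sorted(team for team, share in fractions.items() if share > OPS_CAP)


entries = [
    ("alpha", "on-call", 30.0),
    ("alpha", "development", 10.0),
    ("beta", "tickets", 10.0),
    ("beta", "development", 30.0),
]
print(ops_fraction_by_team(entries))
print(teams_over_cap(entries))
```

In this made-up data, team alpha spends 75% of its time on ops and is flagged, while team beta at 25% is not.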
One of the great aspects of SRE at Google, and something I like to highlight, is that my organisation is separate from our development organisation. We’re peers: you have to trace the reporting chain all the way up to the SVP level before the two meet.
Ben says “shifting some of the operations burden back to the development team.” In practice, if I am an engineer getting paged 4 times a day for the same unfixed bug, I will immediately start the conversation about handing the pager (or just that particular notification) back to the development team.
Our management chain will support us in giving services back, if we can show that we’re providing no additional value and only doing ops work.
We’ve found that Google SRE’s approach to running large-scale systems has many advantages. Because SREs are directly modifying code in their pursuit of making Google’s systems run themselves, SRE teams are characterized by both rapid innovation and a large acceptance of change. Such teams are relatively inexpensive — supporting the same service with an ops-oriented team would require a significantly larger number of people. Instead, the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system. Finally, not only does SRE circumvent the dysfunctionality of the dev/ops split, but this structure also improves our product development teams: easy transfers between product development and SRE teams cross-train the entire group, and improve skills of developers who otherwise may have difficulty learning how to build a million-core distributed system.
I love training our developers on how to run services at Google. We have multiple programs to help teach them how things work: giving individual developers a six-month rotation in an SRE team (with an option to stay at the end), providing specific training classes, holding office hours for questions, and simply helping out day-to-day.
Despite these net gains, the SRE model is characterized by its own distinct set of challenges. One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small. As our discipline is relatively new and unique, not much industry information exists on how to build and manage an SRE team (although hopefully this book will make strides in that direction!). And once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support. For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.
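The error budget example lends itself to a small worked sketch. The excerpt above does not define the term, so the Python below assumes the common request-based reading: the budget is the number of failures the SLO permits over the period, and releases stop once it is spent. The function names and all the numbers are hypothetical.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the period's error budget still unspent (negative if overspent).

    Assumes a request-based SLO, e.g. slo=0.999 means 99.9% of requests must succeed.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO and traffic leave no error budget to measure against")
    return 1.0 - failed_requests / allowed_failures


def releases_frozen(slo: float, total_requests: int, failed_requests: int) -> bool:
    """True once the budget is depleted, which is the point where releases would stop."""
    return error_budget_remaining(slo, total_requests, failed_requests) <= 0


# Hypothetical quarter: 1,000,000 requests against a 99.9% SLO allows about 1,000 failures.
print(error_budget_remaining(0.999, 1_000_000, 250))
print(releases_frozen(0.999, 1_000_000, 250))
print(releases_frozen(0.999, 1_000_000, 1_200))
```

With 250 failures roughly three quarters of the budget is left and releases continue; with 1,200 failures the budget is overspent and the freeze applies. The arithmetic is trivial, which is the point: the hard part is the management support needed to act on the result.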
Two distinct points are raised in this paragraph. The first is hiring.
Hiring SREs is difficult. The skills required to be a good SRE are not unusual, but the combination of the right skills and the right temperament is sometimes hard to find. Once our recruiters notice you might be a good fit for SRE, it’s sometimes quite hard to get them to listen to you when you say you just want to write Android apps or do frontend web development!
I have done a very large number of SRE interviews: I have a little card pinned to my wall from a friendly recruiter congratulating me on reaching my 100-interview mark. I know first-hand that even when we get candidates who have the right skills, those skills may fall short of our strict hiring bar.
The best thing to happen recently is a much bigger push to find candidates from more diverse backgrounds. This has resulted in some excellent hires, and really helps push back the “BOFH Sysadmin” image that some people in our industry project.
I also want to address Ben’s second point: strong management support is required for our unorthodox approaches to managing system stability. This is absolutely crucial. Even a new-hire SRE can walk into a meeting with quite senior engineers on our development side and say “There have been too many release rollbacks, if this continues we will stop supporting your service” and get a great deal of respect and positive response, because everyone knows that the organisation, all the way up to the highest levels, will support what that SRE is saying.
I’ve been that SRE several times, with various outcomes: giving the service back to the developers to maintain, stopping all new feature development for a quarter to work on stability, and having the public launch of a product delayed for a year. In none of these situations did any pressure come from my management chain to override my concerns.
Some parting thoughts from the book, which state the difference between DevOps and SRE better than I could.
DevOps or SRE?
The term “DevOps” emerged in industry in late 2008 and as of this writing (early 2016) is still in a state of flux. Its core principles — involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks — are consistent with many of SRE’s principles and practices. One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.