The Continuously Evolving Nature of SRE
An alternative title for this post might have been, “Why can’t anyone agree on a singular definition for Site Reliability Engineering (aka SRE)?” I’ve spent some of my career in this space and, over time, have become very opinionated about it. The aim of this post is to share some of those beliefs with the hope that you walk away from reading this considering and/or understanding three things:
- Production Engineering as a natural evolution of Site Reliability Engineering (SRE version 2.0)
- Six core beliefs I have about Production Engineering aka The PE Principles
- Production Engineering constantly evolves and will be exactly what it needs to be for the phase of growth that a company is in
We created the problem that Production Engineering is trying to solve.
Twenty years ago we did not have this notion of massive datacenters, in multiple regions spanning the globe, each containing hundreds of thousands of physical machines, with several multiples more of virtual machines and even more containers (literally millions) spread about. We did not have hundreds or thousands of microservices creating these complex software and service dependency chains all running on top of a somewhat (un)reliable network and hardware intertwined with various third-party providers, APIs, and vendors.
We introduced all of this complexity to provide web-scale services to the masses, and now we need a way to manage that complexity without drowning under literal armies of engineers or administrators to do it.
Google was the first company to really start operating at internet scale and so it was only fitting that they created the concept of a new type of engineer to help manage this complexity and ensure reliability while doing it: the Site Reliability Engineer. They then further popularized this concept by publishing The Google SRE book.
Couple this with Google open-sourcing and sharing many of its internal technologies: MapReduce (Hadoop), Bigtable (HBase), Borg (Mesos/Kubernetes), and Chubby (ZooKeeper), just to name a few, and we now have a world where all technology companies want to run their infrastructure exactly like Google.
It is not enough to use the technology; you also need the people who understand how to operate it. This is why Site Reliability Engineers are now proliferating across the industry. To manage this complexity, you need a new type of engineer who specializes and thrives in that type of environment.
Just one problem. Google did a great job of describing what you needed and why, but not necessarily how to do it, who should do it, and when to do it. The processes that you need to support an SRE organization are still a black art that most companies figure out as they go. And what worked well for Google (hundreds of thousands of employees and millions of computers) may not necessarily work well or even be necessary for your environment, from both a technology perspective and a process one.
I think it’s time to evolve the Google SRE Role into a more general Production Engineering Role that is uniform across companies, recognizes the different types of companies and the periods of growth they may be in, adjusts and scales accordingly, and will ultimately be understood in the industry the same way that a Software Engineering Role is.
Principle #1: Production Engineering is a necessary function to manage the complexity of our large-scale production environments and operate them reliably.
Virtual hosts, containers, service-oriented architecture, microservices, cloud infrastructure & services, need I go on? As the environments in which we operate our services have grown larger and more complex, and the business requires that things always work for our customers, we need a new, specific type of role that deals with this reliability and complexity.
Software Engineers or SWEs produce and release features for the business. This is what gives us growth and generates revenue. I’d argue that their primary concern is not reliability but rather end-user functionality delivered and the speed or velocity with which they deliver it.
If SWEs focus on features, then the Production Engineer is a counterbalance to that and delivers reliability itself as a feature. The Production Engineer or PE understands the environment in which production services live and partners with SWEs to ensure that this environment is reliable and to help resolve issues when it is not.
In addition, because these production environments are large-scale distributed systems, the Production Engineer understands, builds, and operates software that helps manage that complexity. It is the only way to get the leverage we need, coupled with efficiency at scale.
Principle #2: Most Production Engineers are grown internally and rarely found in the wild.
There are not enough Production Engineers with relevant experience to go around. Let’s just accept that as truth. You will basically be fighting to hire these candidates against other companies who will often be able to compensate talent at a much higher level than you. Even if you could win them over, there are just not enough to go around. Everyone is fighting over the same people.
Let’s just be honest about that, and instead design our hiring pipeline and internal processes to identify people that both want to learn this discipline and also have some type of background that may make them suitable or interested in it. Then train and build them up internally. We should create what we need.
Principle #3: Production engineering work requires both software engineering and systems engineering skills and prioritizes breadth (generalist) over depth (specialist).
Production Engineers are generally the 5–10% of people at a company who understand the most about how everything works. I’d argue that most SWEs are specialists at a company tasked with delivering specific types of features for the teams and products they serve, whereas the PE understands the infrastructure, software, platforms, products, and services required to wire this all together and, more importantly, how to identify when and where things break.
In order for a PE to truly gain leverage, they must be capable of building and modifying software. They often work with other codebases and systems and need that level of understanding, but to truly operate some of these large-scale distributed systems in production, they need to be able to automate many of the production and operational processes required for a business to run and dive deeply into codebases to solve complex issues when they arise in production.
Principle #4: The PE Role consists of a production engineering competency with three areas: Unix systems, distributed systems, and reliability engineering.
A competency can be described as the knowledge, skills, and abilities in a specific area required to do your job. When we talk about SWEs this usually includes things like architecture & design, coding in a specific language, communication & documentation, etc. All that I’m advocating for here is to add an additional competency for PEs that recognizes that we are specialists that are generalists and to be good at our jobs we need to be able to do everything an SWE can, in addition to understanding Unix, distributed systems, and everything around reliability and quality engineering.
In essence, PEs are being asked to know and do more than just general SWEs to be great at their jobs. Yes, it is unfair so how about we just compensate them more for the trouble :) This leads to the next point …
Principle #5: At a sufficiently senior level, the Software Engineer and Production Engineer roles will merge.
An SWE and a PE may start out focusing on different things (features versus reliability), but at senior enough levels in an organization, the two roles merge. I’m not talking about a Senior Software Engineer; think several levels higher, like a Principal Engineer or Fellow (equivalent to a Director or VP in an org).
This is because the best engineers need to understand how everything works. This is not required at some of the lower engineering levels because many things are abstracted away and those can be terminal levels (you can be a senior engineer forever w/o getting promoted), but as you gain more tenure and experience in an organization you should be capable of doing both software engineering and production engineering work, and you should be great at both.
I would never trust a senior engineering leader in any organization who wasn’t great at both, because this is the new reality of operating large-scale distributed systems and it is not going away.
Principle #6: PE Teams may use a variety of different engagement models that will depend on the needs of the business: centralized team, distributed teams, embedded, consultative, and special projects.
This is one of the biggest challenges for Production Engineering Organizations: accepting the reality that there is not a one-size-fits-all model and that how you operate and engage with the business will morph and change over time. You need to bake this notion of constant change and evolution into your operating model itself, and ensure that your partners understand it and the different types of value that you provide at these different phases.
I think the main factor that dictates the particular operating model for a PE Organization is the number of physical & virtual machines, containers, and services that an organization is supporting, combined with the number of Software Engineers that produce features. As both grow, your model should change as well, and if you try to copy the Google SRE Embed Model from day one, you are destined for pain (lessons learned by me).
- Centralized Team: Generally where you start, a single team that provides infrastructure and operations support for the entire organization. Not much automation exists yet because you’re still figuring out the business model and pivoting constantly so you remain highly flexible.
- Distributed Teams: Moving to a model where you have teams of PEs (7–10 members each) that partner with specific critical products or areas of the business to provide support. This allows you time to provide consulting services, solve common problems across the board, but also engage in process-automation-driven-development where you automate many of the operational processes of the organization.
- Embedded: This is the model most are familiar with, where PEs (SREs) are directly embedded into a product team, usually a couple who report to that team’s manager and are the ones tasked with doing any reliability and operational work. This model is tough because small to medium-sized orgs (anyone that is not Google) usually do not have enough people to distribute to actually have an outsized impact.
The next two models are ad-hoc models. They generally occur when you want to do something time-bound. They are not the way you always want to operate, but they enable you to solve very specific tactical problems.
- Consultative: Think Reliability Consultants. This is the model you use when you want to come in for a few months, solve some very obvious problems, help a partner team standardize on platforms and infrastructure and follow best practices, then get the hell out. Generally, someone will scream that they need help, and you drop in these consultants to help.
- Special Projects: Sometimes PE Teams identify a common set of problems across an organization, but there’s no one on the platform, infrastructure, or common services and libraries side actually working on fixing it. So you assemble a tiger team, solve the problem yourself, then ultimately hand off the work generated to another team to maintain long-term and get back to your day jobs. PE Teams should be happy to create things and then give them away.
That was long! Each principle can probably be its own blog post and I will probably elaborate on them in the future. I’ve presented these concepts in other forms over the years across various companies with input from various people so this is an amalgamation of voices and ideas. However, there is still one big challenge in front of us. We need standardization!
If I mention the term Software Engineer, you have a general notion of what I’m talking about. There are some nuances (frontend, backend, full-stack, etc.), but the general principles are understood. I don’t need to spend time when hiring explaining what a Software Engineer or SWE is; you just know.
That is not true for a Site Reliability Engineer. I would like it to be true for a Production Engineer, and I would love it if the industry got a bit more consistent about not only what this role does but how it operates in an organization, so we don’t all need to keep reinventing the wheel.
Now don’t take this as gospel, but I think it’s a great start to a much longer discussion that will ultimately lead to more consistency across the board, and better-compensated Production Engineers!
One thing that I did not touch on is the notion that a Production Engineering Organization may consist of both Software Engineers (SWEs) and Production Engineers (PEs).
The SWEs in a PE Org basically build software that other engineers use, such as monitoring and observability tools, deployment tools, testing tools, and other abstractions and platforms that help the org gain leverage across engineering.