As I recently wrote on Twitter, I’ve been spending a considerable amount of time lately thinking about the human scalability of “DevOps.” (I use quotes around DevOps because there are varied definitions which I will cover shortly.) I’ve come to the conclusion that while DevOps can work extremely well for small engineering organizations, the practice can lead to considerable human/organizational scaling issues without careful thought and management.
What is DevOps?
The term DevOps means different things to different people. Before I dive into my thinking on the subject, I think it’s important to be clear about what DevOps means to me.
Wikipedia defines DevOps as:
DevOps (a clipped compound of “development” and “operations”) is a software engineering culture and practice that aims at unifying software development (Dev) and software operation (Ops). The main characteristic of the DevOps movement is to strongly advocate automation and monitoring at all steps of software construction, from integration, testing, releasing to deployment and infrastructure management. DevOps aims at shorter development cycles, increased deployment frequency, and more dependable releases, in close alignment with business objectives.
I use a more narrow and slightly different definition of DevOps:
DevOps is the practice of developers being responsible for operating their services in production, 24/7. This includes development using shared infrastructure primitives, testing, on-call, reliability engineering, disaster recovery, defining SLOs, monitoring setup and alarming, debugging and performance analysis, incident root cause analysis, provisioning and deployment, etc.
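To make "defining SLOs" slightly more concrete, here is a minimal sketch of an availability SLO and an error-budget check. All names (`Slo`, `error_budget_remaining`, the checkout service) are hypothetical illustrations, not any particular toolchain's API:

```python
# Hypothetical sketch: expressing an availability SLO and checking how much
# of its error budget remains. Illustrative only; real systems typically
# compute this from time-series monitoring data rather than raw counters.

from dataclasses import dataclass

@dataclass
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

def error_budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is violated)."""
    allowed_failures = (1.0 - slo.target) * total_requests
    if allowed_failures <= 0:
        return 0.0
    return 1.0 - (failed_requests / allowed_failures)

checkout_slo = Slo(name="checkout-availability", target=0.999)

# 1,000,000 requests this month with 400 failures; the budget allows ~1,000.
remaining = error_budget_remaining(checkout_slo, 1_000_000, 400)
print(f"{remaining:.0%} of error budget remaining")
```

The point of an error budget is that it turns "reliability versus features" into a measurable conversation: when the budget is spent, the team slows feature work and pays down reliability debt.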
The distinction between the Wikipedia definition and my definition (a development philosophy versus an operational strategy) is important and is framed by my personal industry experience. Part of the DevOps “movement” is to introduce slow moving “legacy” enterprises to the benefits of modern highly automated infrastructure and development practices. These include things like: loosely coupled services, APIs, and teams; continuous integration; small iterative deployments from master; agile communication and planning; cloud native elastic infrastructure; etc.
For the last 10 years of my career, I have worked at hyper-growth Internet companies including AWS EC2, pre-IPO Twitter, and Lyft. Additionally, primarily due to creating and talking about Envoy, I’ve spent the last two years meeting with and learning about the technical architectures and organizational structures of myriad (primarily hyper-growth) Internet companies. For all of these companies, embracing automation, agile development/planning, and other DevOps “best practices” is a given, as the productivity improvements are well understood. Instead, the challenge for these engineering organizations is how to balance system reliability against the extreme pressures of business growth, personnel growth, and competition (both business and hiring). The rest of this post is based on my personal experience, which I recognize may not be applicable to all situations, especially slower moving traditional enterprises that may be used to deploying their monolithic software once per quarter and are being coaxed into more rapid and agile development practices.
A brief history of operating Internet applications
Over the past approximately thirty years of what I would call the modern Internet era, Internet application development and operation has gone through (in my opinion) three distinct phases.
- During the first phase, Internet applications were built and deployed similarly to how “shrink-wrapped” software was shipped. Three distinct job roles (development, quality assurance, and operations) would collaborate to move applications from development to production over typically extremely long engineering cycles. During this phase, every application was deployed in a dedicated data center or colocation facility, further necessitating operations personnel who were familiar with site-specific network, hardware, and systems administration.
- During the second phase, spearheaded primarily by Amazon and Google in the late 90s and early 00s, Internet applications at fast moving hyper-growth companies started to adopt practices similar to the modern DevOps movement (loosely coupled services, agile development and deployment, automation, etc.). These companies still ran their own (very large) data centers, but due to the scales involved, could also start developing centralized infrastructure teams to tackle common concerns required by all services (networking, monitoring, deployment, provisioning, data storage, caching, physical infrastructure, etc.). Amazon and Google, however, never fully unified the development and operations job roles (Amazon retains the Systems Engineer and Google the Site Reliability Engineer), recognizing the differing skills and interests involved in each.
- During the third, or cloud native, phase, Internet applications are now built from the ground up to rely on hosted elastic architecture, typically provided by one of the “big three” public clouds. Getting product to market as fast as possible has always been the primary goal given the high likelihood of failure, but in the cloud native era the base technology available “out of the box” allows a rate of iteration that dwarfs what came before. The other defining feature of companies that have started in this era has been eschewing the practice of hiring non-software engineer roles. The available infrastructure base is robust enough that, these companies reason (correctly, I would argue), startup headcount dollars are better spent on software developers who can do both engineering and operations (DevOps).
The movement towards not hiring dedicated operations personnel in phase three companies is critically important. Although, clearly, such a company does not need full-time system administrators to manage machines in a colocation facility, the type of person who would have previously filled such a job also typically provided other 20% skills such as system debugging, performance profiling, operational intuition, etc. New companies are being built with a workforce that lacks critical, not easily replaceable, skillsets.
Why does DevOps work well for modern Internet startups?
DevOps, as I have defined it (engineers who develop and operate their services), works extremely well for modern Internet startups for a couple of different reasons:
- In my experience, successful early stage startup engineers are a special breed of engineer. They are risk tolerant, extremely quick learners, comfortable getting things done as fast as possible regardless of the tech debt incurred, can often work in multiple systems and languages, and typically have prior experience with operations and systems administration, or are willing to learn as they go. In short, the typical startup engineer is uniquely suited to being a DevOps engineer, whether they want to call themselves one or not (I prefer not 😉).
- As I described above, modern public clouds provide an incredible infrastructure base to build on. Most basic operational tasks of the past have been automated away, leaving a substrate that is good enough to ship a minimum viable product and see if there is product market fit.
- I’m a firm believer that when developers are forced to be on-call and accountable for the code they write, the quality of the system improves. No one likes to get paged. This feedback loop builds a better product, and as I described in (1), the typical engineer attracted to working on an early stage startup product is perfectly willing to learn and do the operational work. This is especially true given that there is often little repercussion for an early startup product having poor reliability. Reliability needs to be just good enough for the product to find market fit and enter the hyper-growth phase.
What happens when a modern Internet startup undergoes hyper-growth?
Most startups fail. That’s the reality. As such, any early startup that is spending a lot of time creating infrastructure in the image of Google is just wasting time. I always tell people to stick with their monolithic architecture and not worry about anything else until human scalability issues (communication, planning, tight coupling, etc.) necessitate a move towards a more decoupled architecture.
So what happens when a modern (phase three) Internet startup finds success and enters hyper-growth? A couple of interesting things start happening at the same time:
- The rate of personnel growth rapidly increases, causing severe strains on communication and engineering efficiency. I highly recommend reading The Mythical Man-Month (which is still largely relevant almost 50 years after its initial publication) for more information on this topic.
- The above almost always results in a move from a monolithic to microservice architecture as a way to decouple development teams and yield greater communication and engineering efficiency.
- The move from a monolithic to microservice architecture increases system infrastructure complexity by several orders of magnitude. Networking, observability, deployment, library management, security, and hundreds of other concerns that were not difficult previously are now major problems that need to be solved.
- At the same time, hyper-growth means traffic growth and the resultant technical scaling issues, as well as greater repercussions for both complete failure and minor user experience issues.
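As a flavor of the kind of cross-cutting concern every product team suddenly owns once function calls become RPCs, consider retries. The sketch below (hypothetical names, not any real library's API) shows retry with capped exponential backoff and jitter, a pattern that in a monolith simply did not exist as a problem:

```python
# Hypothetical sketch: retrying a transient RPC failure with capped
# exponential backoff and full jitter. In a monolith this was an in-process
# function call; in a microservice architecture every team needs some
# version of this logic (or shared infrastructure that provides it).

import random
import time

def call_with_retries(rpc, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Invoke `rpc` (a zero-arg callable), retrying transient failures."""
    for attempt in range(max_attempts):
        try:
            return rpc()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the failure to the caller.
            # Full jitter: sleep a random amount up to the capped backoff,
            # so synchronized clients don't retry in lockstep.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```

Notably, production-grade versions also need retry budgets, hedging, and circuit breaking, which is precisely why this logic tends to migrate out of application code and into shared infrastructure (for example, a sidecar proxy) rather than being reimplemented by every team.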
Central infrastructure teams
Almost universally following the early startup phase, modern Internet hyper-growth companies end up structuring their engineering organizations similarly. This common structure consists of a central infrastructure team supporting a set of vertical product teams practicing DevOps (whether they call it that or not).
The reason the central infrastructure team is so common is that, as I discussed above, hyper-growth brings an associated set of changes in both people and underlying technology. The reality is that state-of-the-art cloud native technology is still too hard to use if every product engineering team has to individually solve common problems around networking, observability, deployment, provisioning, caching, data storage, etc. As an industry, we are decades away from “serverless” technologies being robust enough to fully support highly reliable, large-scale, and realtime Internet applications in which the entire engineering organization can largely focus on business logic.
Thus, the central infrastructure team was born to solve problems for the larger engineering organization above and beyond what the base cloud native infrastructure primitives provide. Clearly, Google’s infrastructure team is orders of magnitude larger than that of a company like Lyft because Google is solving foundational problems starting at the data center level, while Lyft relies on a substantial number of publicly available primitives. However, the underlying reasons for creating a central infrastructure organization are the same in both cases: abstract as much infrastructure as possible so that application/product developers can focus on business logic.
The fallacy of fungibility
Finally, we arrive at the idea of “fungibility,” which is the crux of the failure of the pure DevOps model when organizations scale beyond a certain number of engineers. Fungibility is the idea that all engineers are created equal and can do all things. Whether stated as an explicit hiring goal (as at least Amazon does and perhaps others), or made obvious by “bootcamp” like hiring practices in which engineers are hired without a team or role in mind, fungibility has become a popular component of modern engineering philosophy over the last 10–15 years at many companies. Why is this?
- As I already described, modern cloud native technology and abstractions allow extremely feature rich applications to be built with increasingly sophisticated infrastructure abstractions. Naturally, some specialist skills such as data center design and operations are no longer required for most companies.
- Over the last 15 years, the industry has focused on the idea that software engineering is the root of all disciplines. For example, Microsoft has phased out the traditional QA engineer and replaced it with the Software Test Engineer, the idea being that manual QA is not efficient and all testing should be automated. Similarly, traditional operations roles have been replaced with site reliability engineering (or similar), the idea being that manual operations is not efficient, and the only way to scale is through software automation. To be clear, I agree with these trends. Automation is a more efficient way to scale.
However, this idea taken to its extreme, as many newer Internet startups have done, has resulted in only generalist software engineers being hired, with the expectation that these (DevOps) engineers can handle development, QA, and operations.
Fungibility and generalist hiring typically work fine for early startups. However, beyond a certain size, the idea that all engineers are swappable becomes almost absurd for the following reasons:
- Generalists versus specialists. More complex applications and architectures require more specialist skills to succeed, whether that be frontend, infrastructure, client, operations, testing, data science, etc. This does not imply that generalists are no longer useful or that generalists cannot learn to become specialists, it just means that a larger engineering organization requires different types of engineers to succeed.
- All engineers do not like doing all things. Some engineers like being generalists. Some like specializing. Some like writing code. Some like debugging. Some like UI. Some like systems. A growing engineering organization that supports specialists has to grapple with the fact that employee happiness sometimes involves working on certain types of problems and not others.
- All engineers are not good at doing all things. Throughout my career, I have met many amazing people. Some of them can start with empty files in an IDE and create an incredible system from scratch, yet these same people have little intuition for how to run reliable systems, how to debug them, how to monitor them, etc. Conversely, I have sat through many infuriating interview loops in which a truly incredible operations engineer, someone who could benefit the overall organization enormously through debugging expertise and innate intuition about running reliable systems, was rejected for not demonstrating “sufficient coding skills.”
Ironically and hypocritically, organizations such as Amazon and Facebook prioritize fungibility in software engineering roles, but clearly value the split (but still overlapping) skillset between development and operations by continuing to offer different career paths for each.
How and at what organization size does pure DevOps break down? What goes wrong?
- Move to microservices. By the time an engineering organization reaches ~75 people, there is almost certainly a central infrastructure team in place starting to build common substrate features required by product teams building microservices.
- Pure DevOps. At the same time, product teams are being told to do DevOps.
- Reliability consultants. At this organization size, the engineers who have gravitated towards working on infrastructure are very likely the same engineers who either have previous operational experience or good intuition in that area. Inevitably, these engineers become de facto SRE/production engineers and help the rest of the organization as consultants while continuing to build the infrastructure required to keep the business running.
- Lack of education. As an industry, we now expect to hire people who can step in and develop and operate Internet services. However, we almost universally do a terrible job of both the new hire and continuing education required to perform this task. How can we expect engineers to have operational intuition when we never teach the skills?
- Support breakdown. As the engineering organization hiring rate continues to ramp, there comes a point at which the central infrastructure team can no longer both continue to build and operate the infrastructure critical to business success, while also maintaining the support burden of helping product teams with operational tasks. The central infrastructure engineers are pulling double duty as organization-wide SRE consultants on top of their existing workload. Everyone understands that education and documentation are critical, but scheduling time to work on them is rarely prioritized.
- Burnout. Worse, the situation previously described creates a human toll and reduces morale across the entire organization. Product engineers feel they are being asked to do things they either don’t want to do or have not been taught to do. Infrastructure engineers begin to burn out under the weight of providing support, knowing that education and documentation are needed but unable to prioritize creating them, all the while keeping existing systems across the company running at high reliability.
At a certain engineering organization size the wheels start falling off the bus and the organization begins to have human scaling issues with a pure DevOps model supported by a central infrastructure team. I would argue this size is dependent on the current maturity of public cloud native technology and as of this writing is somewhere in the low hundreds of total engineers.
Is there a middle ground between the “old way” and the DevOps way?
For companies older than approximately 10 years, the site reliability or production engineering model has become common. Although implementation varies across companies, the idea is to employ engineers who can wholly focus on reliability engineering while not being beholden to product managers. Some of the implementation details are highly relevant, however, including:
- Are SREs on-call by themselves or do software engineers share the on-call burden?
- Are SREs doing actual engineering and automation or are they being required to perform only manual and repetitive tasks such as deployments, recurring page resolution, etc.?
- Are SREs part of a central consulting organization or are they embedded within product teams?
The success of the program and its impact on the overall engineering organization is often dependent on the answers to the above questions. However, I firmly believe that at a certain size, the SRE model is the only effective way to scale an engineering organization to a number of engineers beyond the point at which a pure DevOps model breaks down. In fact, I would argue that successfully bootstrapping an SRE program well in advance of the human scaling limits outlined herein is a critical responsibility of the engineering leadership of a modern hyper-growth Internet company.
What is the right SRE model?
Given the plethora of examples currently implemented in the industry, there is no right answer to this question and all models have their holes and resultant issues. I will outline what I think the sweet spot is based on my observations over the last 10 years:
- Recognize that operations and reliability engineering is a discrete and hugely valuable skillset. Our rush to automate everything and the idea that software engineers are fungible is marginalizing a subset of the engineering workforce that is equally (if not more!) valuable than software engineers. An operations engineer doesn’t have to be comfortable starting from empty source files, just as a software engineer doesn’t have to be comfortable debugging and firefighting during a stressful outage. Operations engineers and software engineers are partners, not interchangeable cogs.
- SREs are not on-call, dashboard, and deploy monkeys. They are software engineers who focus on reliability tasks, not product tasks. An ideal structure requires all engineers to perform basic operational tasks including on-call, deployments, monitoring, etc. I think this is critically important as it helps to avoid class/job stratification between reliability and software engineers and makes software engineers more directly accountable for product quality.
- SREs should be embedded into product teams, while not reporting to the product team engineering manager. This allows the SREs to scrum with their team, gain mutual trust, and still have appropriate checks and balances in place such that a real conversation can take place when attempting to weigh reliability versus features.
- The goal of embedded SREs is to increase the reliability of their products by implementing reliability oriented features and automation, mentoring and educating the rest of the team on operational best practices, and acting as a liaison between product teams and infrastructure teams (feedback on documentation, pain points, needed features, etc.).
A successful SRE program implemented early in the growth phase as outlined above, along with real investment in new hire and continuing education and documentation, can raise the bar of the entire engineering organization while mitigating many of the human scaling issues previously described.
Very few companies reach the hyper-growth stage at which point this post is directly applicable. For many companies, a pure DevOps model built on modern cloud native primitives may be entirely sufficient given the number of engineers involved, the system reliability required, and the product iteration rate the business requires.
For the relatively few companies for which this post does apply, the key takeaways are:
- DevOps style agile development and automation is required for any new technology company that hopes to compete.
- Publicly available cloud native primitives along with a small central infrastructure team can allow an engineering organization to scale to hundreds of engineers before the operational toll due to lack of education and role specificity starts to emerge.
- Getting ahead of the operational human scaling issues requires a real investment in new hire and continuing education, documentation, and the development of an embedded SRE team that can form a bridge between product teams and infrastructure teams.
Modern hyper-growth Internet companies have (in my opinion) an egregiously large amount of burnout, primarily due to the grueling product demands coupled with a lack of investment in operational infrastructure. I believe it is possible for engineering leadership to buck the trend by getting ahead of operations before it becomes a major impediment to organizational stability.
While newer companies might be under the illusion that advancements in cloud native automation are making the traditional operations engineer obsolete, this could not be further from the truth. For the foreseeable future, even while making use of the latest available technology, engineers who specialize in operations and reliability should be recognized and valued for offering critical skillsets that cannot be obtained any other way, and their vital roles should be formally structured into the engineering organization during the early growth phase.