The human scalability of “DevOps”

As I recently wrote on Twitter, I’ve been spending a considerable amount of time lately thinking about the human scalability of “DevOps.” (I use quotes around DevOps because there are varied definitions, which I will cover shortly.) I’ve come to the conclusion that while DevOps can work extremely well for small engineering organizations, the practice can lead to serious human/organizational scaling issues without careful thought and management.

What is DevOps?

The term DevOps means different things to different people. Before I dive into my thinking on the subject, I think it’s important to be clear about what DevOps means to me.

Wikipedia defines DevOps as:

DevOps (a clipped compound of “development” and “operations”) is a software engineering culture and practice that aims at unifying software development (Dev) and software operation (Ops). The main characteristic of the DevOps movement is to strongly advocate automation and monitoring at all steps of software construction, from integration, testing, releasing to deployment and infrastructure management. DevOps aims at shorter development cycles, increased deployment frequency, and more dependable releases, in close alignment with business objectives.

I use a more narrow and slightly different definition of DevOps:

DevOps is the practice of developers being responsible for operating their services in production, 24/7. This includes development using shared infrastructure primitives, testing, on-call, reliability engineering, disaster recovery, defining SLOs, monitoring setup and alarming, debugging and performance analysis, incident root cause analysis, provisioning and deployment, etc.
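To make one of these responsibilities concrete, here is a minimal sketch of the error-budget arithmetic behind “defining SLOs.” The numbers and function names are illustrative assumptions, not any particular company’s tooling:

```python
# Minimal sketch of SLO error-budget arithmetic (names and numbers are
# hypothetical, chosen only for illustration).
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def error_budget_minutes(slo_target: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime allowed over the window while still meeting an availability SLO."""
    return (1.0 - slo_target) * window_minutes

# A 99.9% availability SLO allows ~43.2 minutes of downtime per 30 days;
# 99.99% allows only ~4.3 minutes, which is why monitoring, alarming, and
# deployment safety become real engineering work for the team that owns them.
for target in (0.999, 0.9999):
    print(f"SLO {target:.2%}: {error_budget_minutes(target):.1f} minutes of budget per 30 days")
```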

The distinction between the Wikipedia definition and my definition (a development philosophy versus an operational strategy) is important and is framed by my personal industry experience. Part of the DevOps “movement” is to introduce slow-moving “legacy” enterprises to the benefits of modern highly automated infrastructure and development practices. These include things like: loosely coupled services, APIs, and teams; continuous integration; small iterative deployments from master; agile communication and planning; cloud native elastic infrastructure; etc.

For the last 10 years of my career, I have worked at hyper-growth Internet companies including AWS EC2, pre-IPO Twitter, and Lyft. Additionally, primarily due to creating and talking about Envoy, I’ve spent the last two years meeting with and learning about the technical architectures and organizational structures of a wide range of (primarily hyper-growth) Internet companies. For all of these companies, embracing automation, agile development/planning, and other DevOps “best practices” is a given, as the productivity improvements are well understood. Instead, the challenge for these engineering organizations is how to balance system reliability against the extreme pressures of business growth, personnel growth, and competition (both business and hiring). The rest of this post is based on my personal experience, which I recognize may not be applicable to all situations, especially slower-moving traditional enterprises that may be used to deploying their monolithic software once per quarter and are being coaxed into more rapid and agile development practices.

A brief history of operating Internet applications

Over the past approximately thirty years of what I would call the modern Internet era, Internet application development and operation has gone through (in my opinion) three distinct phases.

  1. During the first phase, Internet applications were built and deployed similarly to how “shrink-wrapped” software was shipped. Three distinct job roles (development, quality assurance, and operations) would collaborate to move applications from development to production over typically extremely long engineering cycles. During this phase, every application was deployed in a dedicated data center or colocation facility, further necessitating operations personnel who were familiar with site-specific network, hardware, and systems administration.

The movement towards not hiring dedicated operations personnel in phase three companies is critically important. Although such a company clearly does not need full-time system administrators to manage machines in a colocation facility, the type of person who would previously have filled such a job also typically provided other “20%” skills such as system debugging, performance profiling, operational intuition, etc. New companies are being built with a workforce that lacks critical, not easily replaceable, skillsets.

Why does DevOps work well for modern Internet startups?

DevOps, as I have defined it (engineers who develop and operate their services), works extremely well for modern Internet startups for a couple of different reasons:

  1. In my experience, successful early stage startup engineers are a special breed of engineer. They are risk tolerant, extremely quick learners, comfortable getting things done as fast as possible regardless of the tech debt incurred, can often work in multiple systems and languages, and typically have prior experience with operations and systems administration, or are willing to learn as they go. In short, the typical startup engineer is uniquely suited to being a DevOps engineer, whether they want to call themselves one or not (I prefer not 😉).

What happens when a modern Internet startup undergoes hyper-growth?

Most startups fail. That’s the reality. As such, any early startup that is spending a lot of time creating infrastructure in the image of Google is just wasting time. I always tell people to stick with their monolithic architecture and not worry about anything else until human scalability issues (communication, planning, tight coupling, etc.) necessitate a move towards a more decoupled architecture.

So what happens when a modern (phase three) Internet startup finds success and enters hyper-growth? A couple of interesting things start happening at the same time:

  1. The rate of personnel growth rapidly increases, causing severe strains on communication and engineering efficiency (see the arithmetic sketch just below). I highly recommend reading The Mythical Man-Month (which is still largely relevant more than 40 years after its initial publication) for more information on this topic.
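To see why the communication strain is superlinear, consider the intercommunication formula popularized by The Mythical Man-Month: n people have n(n − 1)/2 potential pairwise communication channels. A tiny sketch makes the growth obvious:

```python
# Brooks's intercommunication formula: n people have n * (n - 1) / 2
# potential pairwise communication channels.
def channels(n: int) -> int:
    return n * (n - 1) // 2

print(channels(10))   # 45
print(channels(100))  # 4950 -- 10x the engineers, ~110x the communication paths
```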

Central infrastructure teams

Almost universally following the early startup phase, modern Internet hyper-growth companies end up structuring their engineering organizations similarly. This common structure consists of a central infrastructure team supporting a set of vertical product teams practicing DevOps (whether they call it that or not).

The reason the central infrastructure team is so common is that, as I discussed above, hyper-growth brings with it an associated set of changes, both in people and in underlying technology. The reality is that state-of-the-art cloud native technology is still too hard to use if every product engineering team has to individually solve common problems around networking, observability, deployment, provisioning, caching, data storage, etc. As an industry, we are decades away from “serverless” technologies being robust enough to fully support highly reliable, large-scale, and realtime Internet applications in which the entire engineering organization can largely focus on business logic.

Thus, the central infrastructure team was born to solve problems for the larger engineering organization above and beyond what the base cloud native infrastructure primitives provide. Clearly, Google’s infrastructure team is orders of magnitude larger than that of a company like Lyft because Google is solving foundational problems starting at the data center level, while Lyft relies on a substantial number of publicly available primitives. However, the underlying reasons for creating a central infrastructure organization are the same in both cases: abstract as much infrastructure as possible so that application/product developers can focus on business logic.
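As a hypothetical illustration of what that abstraction can look like (the class and names below are invented for this sketch, not any real internal library), a central infrastructure team might publish a thin client that bakes the organization’s conventions for timeouts, retries, and metrics into every service-to-service call:

```python
# Hypothetical sketch of a shared infrastructure primitive: an HTTP client
# wrapper that gives every product team consistent timeouts, retries, and
# metrics without each team re-solving those problems.
import time
import urllib.request


class InternalHTTPClient:
    """Illustrative internal client; all names here are invented."""

    def __init__(self, service_name: str, timeout_s: float = 1.0, retries: int = 2):
        self.service_name = service_name
        self.timeout_s = timeout_s
        self.retries = retries

    def get(self, url: str) -> bytes:
        last_error = None
        for _attempt in range(self.retries + 1):
            start = time.monotonic()
            try:
                with urllib.request.urlopen(url, timeout=self.timeout_s) as resp:
                    body = resp.read()
                self._emit_metric("request_success", time.monotonic() - start)
                return body
            except OSError as e:  # connection errors and socket timeouts
                self._emit_metric("request_failure", time.monotonic() - start)
                last_error = e
        raise last_error

    def _emit_metric(self, name: str, latency_s: float) -> None:
        # A real implementation would emit to the organization's standard
        # metrics pipeline; printing stands in for that in this sketch.
        print(f"{self.service_name}.{name} latency_ms={latency_s * 1000:.1f}")
```

The specific wrapper matters less than the ownership model: the central team, not each product team, defines and evolves these defaults.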

The fallacy of fungibility

Finally, we arrive at the idea of “fungibility,” which is the crux of the failure of the pure DevOps model when organizations scale beyond a certain number of engineers. Fungibility is the idea that all engineers are created equal and can do all things. Whether stated as an explicit hiring goal (as at least Amazon does, and perhaps others do), or made obvious by “bootcamp”-like hiring practices in which engineers are hired without a team or role in mind, fungibility has become a popular component of modern engineering philosophy at many companies over the last 10–15 years. Why is this?

  • As I already described, modern cloud native technology allows extremely feature-rich applications to be built on increasingly sophisticated infrastructure abstractions. Naturally, some specialist skills, such as data center design and operations, are no longer required for most companies.

However, this idea taken to its extreme, as it has been at many newer Internet startups, has resulted in only generalist software engineers being hired, with the expectation that these (DevOps) engineers can handle development, QA, and operations.

Fungibility and generalist hiring typically work fine for early startups. However, beyond a certain size, the idea that all engineers are swappable becomes almost absurd for the following reasons:

  • Generalists versus specialists. More complex applications and architectures require more specialist skills to succeed, whether that be frontend, infrastructure, client, operations, testing, data science, etc. This does not imply that generalists are no longer useful or that generalists cannot learn to become specialists; it just means that a larger engineering organization requires different types of engineers to succeed.

Ironically and hypocritically, organizations such as Amazon and Facebook prioritize fungibility in software engineering roles, yet clearly value the distinct (though overlapping) skillsets of development and operations by continuing to offer different career paths for each.

The breakdown

How and at what organization size does pure DevOps break down? What goes wrong?

  • Move to microservices. By the time an engineering organization reaches ~75 people, there is almost certainly a central infrastructure team in place starting to build common substrate features required by product teams building microservices.

At a certain engineering organization size, the wheels start falling off the bus and the organization begins to have human scaling issues with a pure DevOps model supported by a central infrastructure team. I would argue this size is dependent on the current maturity of public cloud native technology and, as of this writing, is somewhere in the low hundreds of total engineers.

Is there a middle ground between the “old way” and the DevOps way?

For companies older than approximately 10 years, the site reliability or production engineering model has become common. Although implementation varies across companies, the idea is to employ engineers who can wholly focus on reliability engineering while not being beholden to product managers. Some of the implementation details are highly relevant, however, and these include:

  • Are SREs on-call by themselves or do software engineers share the on-call burden?

The success of the program and its impact on the overall engineering organization is often dependent on the answers to the above questions. However, I firmly believe that, at a certain size, the SRE model is the only effective way to scale an engineering organization beyond the point at which a pure DevOps model breaks down. In fact, I would argue that successfully bootstrapping an SRE program well in advance of the human scaling limits outlined herein is a critical responsibility of the engineering leadership of a modern hyper-growth Internet company.

What is the right SRE model?

Given the plethora of examples currently implemented in the industry, there is no single right answer to this question, and all models have their holes and resultant issues. I will outline what I think the sweet spot is, based on my observations over the last 10 years:

  • Recognize that operations and reliability engineering is a discrete and hugely valuable skillset. The rush to automate everything, combined with the idea that software engineers are fungible, is marginalizing a subset of the engineering workforce that is just as valuable as software engineers (if not more so!). An operations engineer doesn’t have to be comfortable with empty source files, just as a software engineer doesn’t have to be comfortable debugging and firefighting during a stressful outage. Operations engineers and software engineers are partners, not interchangeable cogs.

A successful SRE program implemented early in the growth phase as outlined above, along with real investment in new-hire and continuing education and in documentation, can raise the bar of the entire engineering organization while mitigating many of the human scaling issues previously described.

Conclusion

Very few companies reach the hyper-growth stage at which this post becomes directly applicable. For many companies, a pure DevOps model built on modern cloud native primitives may be entirely sufficient given the number of engineers involved, the system reliability required, and the product iteration rate the business demands.

For the relatively few companies for which this post does apply, the key takeaways are:

  • DevOps style agile development and automation is required for any new technology company that hopes to compete.

Modern hyper-growth Internet companies have (in my opinion) an egregious amount of burnout, primarily due to grueling product demands coupled with a lack of investment in operational infrastructure. I believe it is possible for engineering leadership to buck the trend by getting ahead of operations before it becomes a major impediment to organizational stability.

While newer companies might be under the illusion that advancements in cloud native automation are making the traditional operations engineer obsolete, this could not be further from the truth. For the foreseeable future, even while making use of the latest available technology, engineers who specialize in operations and reliability should be recognized and valued for offering critical skillsets that cannot be obtained any other way, and their vital roles should be formally structured into the engineering organization during the early growth phase.