Everyone is not Ops

Yesterday was Sysadmin Appreciation Day. There was a lot of chatter about what the future of Operations will look like, a recurrent theme being that in this day and age, Operations is “everyone’s job” or that “everyone is Ops”.

While I think people who believe this have their hearts in the right place, it’s a somewhat simplistic or opportunistic view. The reality on the ground happens to be more nuanced, and the problems facing most organizations are highly unlikely to be solved by idealistic chants of “everyone is Ops”.

Operations is a shared responsibility

We build systems. We write code to this end. Systems aren’t, however, defined by code alone.

Code is only ever a small component of the system. Systems are primarily conceived to fulfill a business requirement and are defined by, among others, the following characteristics:

 — correctness
 — reliability
 — uptime
 — ease of maintenance
 — observability
 — documentation 
 — extensibility

I agree that the operation of systems is a shared responsibility between those who author the code and those who are tasked with running it.

The more complex a system, the more failure modes it has. “Failure” is usually tested artificially, or in contrived environments that look nothing like a real production environment. Testing can at best only approximate the unpredictability of inputs or the difference in behaviors exhibited by the various components the system relies on.

With testing being a best-effort verification of the behavior of a system, instrumentation can no longer be an afterthought. The people best positioned to decide what instrumentation would come in most handy on a rainy day are the software engineers writing the code. Software engineers are the ones most familiar with the things that will prove to be pivotal when it comes to the performance, reliability and correctness of the service when in production, such as:

 — the assumptions and tradeoffs underpinning the abstractions
 — the concurrency model of the programming language used
 — the bottlenecks and known limitations of the system
 — the behavior of the third-party libraries
 — the performance characteristics of the system as a whole
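
As a rough illustration of what developer-authored instrumentation might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the in-process `METRICS` store, the `instrumented` decorator and the `lookup_user` handler are hypothetical names, not part of any particular metrics library. The point is only that the author of the code knows which path (here, a cache miss) is worth counting.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metrics store; a real service would export
# these numbers to whatever monitoring system it already uses.
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Record call count, error count and cumulative latency under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("lookup_user")
def lookup_user(user_id, cache):
    # Hypothetical handler: the author knows a cache miss is the
    # interesting path, so it gets its own counter.
    if user_id not in cache:
        stats = METRICS["lookup_user"]
        stats["cache_misses"] = stats.get("cache_misses", 0) + 1
        raise KeyError(user_id)
    return cache[user_id]
```

Calling `lookup_user` a few times and then reading `METRICS["lookup_user"]` gives the call count, error count, cumulative latency and cache misses for that one code path, which is the kind of signal someone on call would want on a rainy day.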

That doesn’t mean “everyone is Ops”

Operations is a pretty varied field. I find it rather strange that on the one hand we’re perfectly capable of distinguishing between frontend engineering and API development or iOS development and “data science”, but on the other, when we talk about “Operations” we treat it as if it were a monolithic discipline and treat anyone and everyone working in Operations as “sysadmins” or “DevOps engineers”.

Operations comprises everything from:

 — deploying applications
 — “monitoring”
 — running edge proxies like nginx
 — operating databases like MySQL or MongoDB
 — operating caches like memcached or Varnish
 — running message brokers like Kafka
 — running systems like Zookeeper or Consul
 — running a scheduler like Kubernetes
 — configuring servers with Chef or Puppet
 — managing DNS setups across various hosted providers
 — renewing SSL certs
 — configuring firewalls or subnets or VLANs

The list is endless. It’s risible to even suggest that, outside of extremely small companies, “software engineers” focused on product will be doing all of the above in addition to building the core product. While this isn’t unheard of at early stage startups, it’s hardly something that’s scalable or even ideal. It’s certainly possible for an engineer to write JavaScript code that runs in a browser and write backend infrastructure code and manage a MySQL cluster fronted by memcached and manage AWS infrastructure all at the same time, but it’s unlikely the same engineer can operate in the same capacity once the company reaches a certain scale or once the product gets sufficiently complex that scaling each component of the initial stack is going to require specialized knowledge.

It’s one thing to operate memcached when it’s serving a few hundred requests per second; it’s an entirely different kettle of fish when we’re looking at millions of requests per second. Similarly, operating a sharded MySQL cluster servicing hundreds of thousands of writes per second when one has to also reason about MySQL’s I/O thread performance characteristics or remaster a replica with zero downtime is not something your jack-of-all-trades software engineer can pull off with great aplomb.

Won’t Automation solve that?

Every now and again we hear stories of someone who automated themselves out of a job. Automation on the whole is an enormously good thing that has tremendous benefits. Automation is also a fairly foolproof way in which one can truly scale Operations in a sane manner across teams.

Except, automation is not a silver bullet. When we talk about automation in an Operations context, we talk about how the ideal for an Operations engineer is to “automate everything”. In fact, the primacy of automation is considered so sacrosanct that we rarely talk about the flip side of it.

The problem with automation is the same as the problem with all abstractions: at the end of the day, it’s leaky.

The famous article that posited the law of leaky abstractions states that:

… all abstractions, leak, and the only way to deal with the leaks competently is to learn about how the abstractions work and what they are abstracting. So the abstractions save us time working, but they don’t save us time learning.

Oftentimes, I hear talk about how certain tools will make developers “better at Ops”. While, in general, better tooling is definitely a net positive, automation can be best leveraged when the person using the automation understands the underlying abstractions, whether it’s a complex deployment mechanism that is being abstracted by a Slack command, a database primary failover, or even a routine test of restoring a database backup. Handing a powerful tool that automates routine database maintenance to a developer who has only ever interacted with databases via an ORM is pretty much useless.

And all this means that paradoxically, even as we have higher and higher level programming tools with better and better abstractions, becoming a proficient programmer is getting harder and harder.

Automation isn’t making developers better at Ops. One could argue that it isn’t even making Operations engineers better at Ops. Automation tools make a ton of assumptions and tradeoffs that only become obvious to someone who understands the consequences of these decisions.

The pragmatic middle-ground

So then, where does that leave us?

Luckily, it doesn’t have to be all or nothing. I believe in a solid and pragmatic middle ground.

Here’s what the traditional split of responsibilities (or lack thereof) looked like:

[Figure: The traditional split of responsibilities (or lack thereof) between Dev and Ops]

This isn’t sustainable any longer (not that it ever was). Application developers need to be responsible not just for writing code but also for managing the entire life cycle of an application, ensuring its health, maintainability, observability, ease of debugging and its ultimate graceful demise. This includes being responsible for deployments, rollbacks, monitoring and debugging, in addition to bug fixes and new feature development.

The systems the application interacts with could be other applications maintained by the same team, or applications maintained by other teams, or caches/databases/proxies/message brokers. Depending on how deeply the application in question integrates with each of these aforementioned systems, it’d be ideal if the developer also had a baseline understanding of all the other systems so as to be able to escalate to the right team when the performance of one of these upstreams is adversely affecting the performance or reliability of the application in question.

Here’s what the ideal split of responsibilities looks like:

[Figure: The ideal split of responsibilities between Dev and Ops]

Operations engineers aren’t going to be out of a job. Just as software engineers are the ones ideally suited to instrumenting and being on call for the applications they author, operations engineers are the ones suited to building (or buying) automation and tooling to help them perform their job more reliably and safely.

We’re looking at a future where responsibilities are shared, not usurped. Operations, per se, is not everyone’s job. What is everyone’s job, however, is ensuring a holistic software lifecycle, which is achieved when Dev and Ops work together.

Happy (belated) Sysadmin Appreciation Day!