What is DevOps?

3 progressively broader perspectives

A buzzword

In this article, I wish to thoroughly tackle “DevOps”, a buzzword close to my heart. If you don’t work in web operations, and you haven’t been looking at software job postings of late, you may never have seen it, but if you work on the web, you likely will before long. If you have seen it, you may or may not understand what it means. You can read the Wikipedia article, but I don’t think it’s terribly illuminating, and even if it makes sense to you, it is a partial and fairly unopinionated treatment. I seek here to be both fuller and more opinionated in laying out my personal understanding. I hope also to be a little less confusing.

As divergent definition is part of the buzzword game, I will seek to address some of the common understandings in turn, and to order my discussion in terms of progressively broader and grander visions of the idea. It is my hope that this will make the divergence seem relatively easy to follow. Let’s begin:

Agile, Automated Infrastructure

“Nobody Understands the Cloud! It’s a F#!king Mystery!” (Jason Segel, “Sex Tape”, Summer 2014)

(While the cloud is not fundamentally necessary for this “devops”, it provided an important change of norms. What previously needed to be done by physically moving and touching hardware could now be done by API call. These tools expand on that idea.)

History

  1. John Allspaw’s “10+ Deploys Per Day: Developer and Operations Collaboration at Flickr” presentation at O’Reilly’s 2009 Velocity web performance conference, but before that
  2. Andrew Clay Shafer’s proposed “Agile Infrastructure” session at the Agile2008 Conference (which literally no one but Patrick Debois attended).

And in this word soup, the roots of this understanding are pretty clearly present.

The Problems

  1. Developers’ computers are set up differently from the machines that run the software in production (and also differently from other developers’ machines). This can lead to arbitrary behavioral differences and difficulty in debugging.
  2. Production machines are configured and tuned over time, in a way that is often not well documented and therefore not repeatable for new machines. This makes scaling by adding new machines very hard.
  3. Quickly testing (and iterating on) changes to server configuration is untenable due to a) unknown starting state, b) the inability to readily duplicate a machine so as not to test on live traffic, and c) the lack of tooling for automated comparison of results.

The Solution

By defining and changing the configuration as code (or data), setup and maintenance of machines become eminently repeatable and inherently documented. Further, by treating configuration as code, operations staff can benefit from development tools and practices like revision control systems, code review processes, and community/open source libraries.

On top of laying out an initial state that can be spun up on demand (and tinkered with to test things out), the infrastructure code defines a desired state that can be enforced and converged upon over time. In this way, configuration management systems like those above (and the forerunner CFEngine, whose physicist creator Mark Burgess provided this formulation back in ’98) can act like a computer’s immune system, maintaining homeostasis in the face of configuration drift, regardless of its source.
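To make “desired state” and “convergence” concrete, here is a minimal sketch in Python. It is not how CFEngine or any real configuration management tool is implemented; the Machine class, the setting names, and the converge function are all hypothetical, invented only to show the shape of the idea: the desired state is plain data, and applying it touches only what has drifted.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Hypothetical stand-in for a server: just a name and a dict of settings."""
    name: str
    settings: dict = field(default_factory=dict)


# Desired state, declared as data. Keys and values are invented for illustration.
DESIRED_STATE = {
    "ntp_server": "time.example.com",
    "max_open_files": 65536,
    "web_server": "nginx",
}


def converge(machine, desired):
    """Bring machine.settings in line with desired and return what changed.

    Only keys whose current value differs are touched, so running this a
    second time on the same machine changes nothing.
    """
    changes = []
    for key, want in desired.items():
        have = machine.settings.get(key)
        if have != want:
            machine.settings[key] = want
            changes.append((key, have, want))
    return changes


# A machine that was hand-tuned long ago and has drifted from the declaration.
drifted = Machine("web-01", {
    "ntp_server": "time.example.com",
    "max_open_files": 1024,
    "web_server": "nginx",
})
# A brand new machine with nothing configured yet.
fresh = Machine("web-02")

first_run = converge(drifted, DESIRED_STATE)   # repairs the one drifted setting
second_run = converge(drifted, DESIRED_STATE)  # nothing left to do
converge(fresh, DESIRED_STATE)                 # same declaration builds a new box
```

The same declaration both repairs the old machine and builds the new one, which is the repeatability argument in miniature. Because the declaration lives in a file, it can also go through revision control and code review like any other code.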

These systems can be kind of complicated (or at least hard to master). They are also closely interrelated, and potentially integrated, with more traditional operations concerns like monitoring, alerting, log aggregation, networking, and database administration, as well as with concerns created or invigorated by the rise of cloud computing, such as autoscaling, zone failover, and general partition tolerance. It is therefore not terribly surprising that many companies are looking for experts in these things and calling them “devops” or “DevOps engineers”. Other schools of thought, however, suggest that DevOps should fundamentally not be thought of as a job title, but rather as a powerful idea for the general operation of IT organizations, and indeed any organization with IT concerns. We move on to those schools…

Tearing down the Wall of Confusion;
Making change everyone’s responsibility


You can’t read about DevOps for long anywhere before you come across a book, interestingly a novel, entitled

The Phoenix Project:
A Novel about IT, DevOps, and Helping Your Business Win.

It is by Gene Kim, Kevin Behr, and George Spafford. You can find its cover at left.

If you think about DevOps like the imaginary recruiters or hiring managers mentioned above, you are likely wondering how the heck those tools, those fairly technical ideas, fit in a novel. And it’s a good question.

If you read The Phoenix Project, you will actually not find (that I remember) a single mention of a configuration management tool: not a name drop of one of the big players I mentioned, and not even a general allusion. There is a lot of discussion about “change management”, but not configuration management proper. But The Phoenix Project is a novel about DevOps. It just takes that definition back to its more literal roots.

As I mentioned above, the origins of the word DevOps find their way back, however indirectly, to a talk John Allspaw of Flickr (now of Etsy) gave at Velocity 2009 entitled “10+ Deploys Per Day: Developer and Operations Collaboration at Flickr”. In that title one finds the vision of DevOps here understood: Developer and Operations Collaboration.

The Big Problem

  • There are developers (programmers, coders, engineers) whose job it is to deliver new functionality and fix incorrect behavior of what already exists. In short it is their job to keep things changing.
  • There are operators (sysadmins, ops-guys, SREs) whose job it is to get everything running, fix hardware failures, deal with resource allocation issues, and in general just keep everything working. In short it is their job to keep things stable.

In many organizations, these closely related but cross-purposed responsibilities have been kept wholly separate: different people, different departments.

They often don’t especially get along. On top of the diametric opposition of their goals, there is sometimes a mutual lack of appreciation. The devs think that they are doing the real hard work, writing the code, and that the ops guys are just ITT Tech graduates there to do manual labor and make sure the company doesn’t have to pay Geek Squad. The ops guys, meanwhile, understand that the whole f#!$ing enterprise rests squarely on their shoulders: that without them literally nothing would work, that the devs could write all the fancy, broken software they want and then throw it in the garbage, and that this might be just as good, because at least it wouldn’t set anything on fire.

Perhaps the worst practice of this traditional model is the separation of concerns around deployment: It’s the operators’ problem. The developer’s age-old rallying cry is “works on my machine” and then they throw it over “The Wall of Confusion” to operations. The operations department gets an undocumented, non-working chunk of source code and is expected to deploy it over the weekend to hundreds or thousands of servers while the devs go out for margaritas to celebrate “the project being done”. (In my description here, I err on the side of demonizing the developers because that’s who I’ve always been in this play.)

The Big Solution

To these people, including the authors of The Phoenix Project, devops is a new way of thinking (at least a new way for IT): taking a broader perspective, mapping out the whole process from development through deployment (and ideally back around a feedback loop), and distributing responsibilities across that process. In particular, hallmarks of this sort of devops include:

  1. Developers are expected to think about and feel responsible for the deployment process and maintenance of their applications. They are expected to consider it throughout the development process and have a clear plan mapped out. Operations is relied on for expertise and often ultimate implementation, but the developer has to think about it.
  2. Developers on-call. There is a long-standing tradition of operators being woken by a pager in the middle of the night when the servers go down. They are expected to get things back up, and traditionally they were expected to do so without bothering the developer who made the broken system and who a) actually knows what it does and how it’s supposed to do it, and b) is very possibly at fault for it being broken. Under the devops paradigm, developers are likely to be responsible for fixing it themselves, or at least for being available to the operator fixing it. In practice, this is often handled by a rotation: someone from the team is responsible for the systems of the whole team on a rotating basis.
  3. Operators inform developers of maintenance/problems/etc. Developers have at times been frustrated by operators changing the landscape under their feet without warning, making their already mentally taxing jobs that much harder. This is seen as unacceptable under this paradigm. Constant communication and shared understanding are key.

While there is more to be said about this idea of DevOps, I think this captures the major idea. DevOps is a cultural shift, likely enabled and enhanced by new tools and technologies, that distributes responsibilities and enhances communication throughout the organization. This allows elimination of waste, elimination of bottlenecks, and altogether a more efficient and happier operation.

Two additional important things to note about this understanding:

  1. While this formulation suggests that culture is the paramount concern, it does not generally claim that it is the only concern. Proponents of this formulation refer to C.A.M.S.: Culture, Automation, Measurement, Sharing, as the core concerns of DevOps (in order). Captured in the last is a propensity towards acknowledging failures and generally spreading knowledge both within an organization and often to the public.
  2. Much of the wisdom captured in this understanding is actually not so much new ideas as an importing of older wisdom into the realm of Information Systems. The Phoenix Project itself is a re-imagining of Dr. Eliyahu M. Goldratt’s The Goal, a text that has been found in business school curricula for a very long time. Other often-cited sources of wisdom in this school are the Toyota Production System ideas that underlay the Japanese automobile revolution, and the work of W. Edwards Deming that enabled them. (The latter is a particular darling of John “botchagalupe” Willis, cohost of the DevOps Cafe podcast with Damon Edwards.)

For more on this view, I recommend checking out the blog of IT Revolution Press (publishers of The Phoenix Project), as well as the O’Reilly free pamphlet “Building a DevOps Culture”. But wait, there’s more!

Organizational Learning
& Optimizing for Cycle Time;
Destroy the Pareto Efficient Nash Equilibrium

The General Problem

A popular—and perhaps purposely over-formal—formulation of this state of affairs going around is the discussion of the Pareto efficient Nash equilibrium. This is a (game theory) technical way of describing the abstract situation in which multiple parties sit at an impasse where there is absolutely no incentive for any one party to change their strategy unilaterally (and indeed to do so would definitely be to someone’s detriment). Notwithstanding this fact, if multiple parties (or perhaps all the parties) were to change strategy, a substantial gain might be realized.

To put it another way, in this view, DevOps is a word which has come to occasionally denote any (potential) large-scale recalibration of (technical) organizations’ culture and practices, focused especially on realignment of incentives across functions, to realize gains of efficiency.
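For readers who prefer to see the game-theory language spelled out, here is a toy two-player example in Python. Everything in it is an assumption made for illustration: the players (dev and ops), the two strategies (“silo” and “collaborate”), and above all the payoff numbers, which are invented and not drawn from any study. “Gain” is measured as total payoff across both parties, which is also my assumption. The sketch checks that the status quo is a Nash equilibrium (nobody benefits from changing alone), that it is Pareto efficient in the textbook sense (no other outcome helps one party without hurting the other), and that a joint change still raises the total.

```python
from itertools import product

STRATEGIES = ("silo", "collaborate")

# PAYOFFS[(dev_choice, ops_choice)] = (dev_payoff, ops_payoff)
# All numbers are invented purely for illustration.
PAYOFFS = {
    ("silo", "silo"): (3, 1),
    ("silo", "collaborate"): (0, 0),
    ("collaborate", "silo"): (0, 0),
    ("collaborate", "collaborate"): (2, 5),
}


def is_nash(profile):
    """True if neither player can do better by changing strategy alone."""
    dev, ops = profile
    dev_pay, ops_pay = PAYOFFS[profile]
    dev_ok = all(PAYOFFS[(d, ops)][0] <= dev_pay for d in STRATEGIES)
    ops_ok = all(PAYOFFS[(dev, o)][1] <= ops_pay for o in STRATEGIES)
    return dev_ok and ops_ok


def is_pareto_efficient(profile):
    """True if no other outcome helps one player without hurting the other."""
    mine = PAYOFFS[profile]
    for other in product(STRATEGIES, repeat=2):
        theirs = PAYOFFS[other]
        at_least_as_good = all(t >= m for t, m in zip(theirs, mine))
        strictly_better = any(t > m for t, m in zip(theirs, mine))
        if at_least_as_good and strictly_better:
            return False
    return True


def total(profile):
    """Combined payoff of both players for a given outcome."""
    return sum(PAYOFFS[profile])


status_quo = ("silo", "silo")
joint_move = ("collaborate", "collaborate")

stuck = is_nash(status_quo) and is_pareto_efficient(status_quo)
gain_from_moving_together = total(joint_move) - total(status_quo)
```

With these made-up numbers the status quo is stable, and one party gives something up in the joint move, which is exactly why nobody volunteers to go first even though the organization as a whole comes out ahead.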

The General Solution

  1. Information should be made widely available, understandable, and understood within an organization.
  2. Actions and decisions should be tied to the ultimate mission of an organization, and should not focus on short-term gains at the cost of long-term goals.
  3. Failures should never be hidden (at least internally). Any catastrophe is a critical opportunity to investigate shortcomings and improve. Here particularly the practice of “blameless post-mortems” is highly valued.
  4. Old ideas and decisions should be recorded, explained, and not forgotten. They should also be revisited.
  5. People should be cultivated (as the word culture is derived) along with the organization. This may involve people learning new skillsets and working across multiple competencies.

In general, I think this can be summarized as enabling continuous improvement through continuous learning. In particular, @littleidea points to research about Organizational Learning. Though I have not seen it cited in the DevOps community proper, I also think of Clojure creator Rich Hickey’s thoughts on simplicity and building well-understood systems in his fairly well-known 2011 talk “Simple Made Easy”.

Lean

In Closing: Why I think it’s exciting

  1. The power created by automation technologies, especially on cloud platforms, allows companies to serve more customers without growing their staff in proportion. This can enable organizations to be successful without having to deal (as soon) with the downsides of organizational growth.
  2. Stability and dependability potentially created with such technologies can allow people more time to pursue new ideas and make more improvements.
  3. Collaboration and cross-training in these cultural models can create more and better ideas, and allow the ideas that get implemented to be implemented better and faster.

It is an exciting time in DevOps land. The technologies are getting better all the time. There are lots of cool ones I didn’t even mention above (e.g. CoreOS, OpenStack, Juju, Salt, Deis). Smart people are thinking, learning, and sharing every day. To learn more and stay up to date, check out these and other podcasts, this online book club, this blog, and any of numerous conferences (which usually post videos of talks online after the fact).

There is also a lot going on on Twitter. I’ve linked to a number of interesting people’s accounts throughout (but failed to include many more). If you have any feedback on the article or want to chat, feel free to hit me up at @donaldguy.

Happy DevOpsing!
