The Information Systems industry, as an almost inevitable side-effect of dealing in innovation, suffers from a disease of buzzwordification. As new concepts are developed for discussion, new words are needed. As those ideas become popular or trendy, the definitions of words tend to get away from those who coined them. In the worst case, they can diverge to the point of near-meaninglessness. While this can make for some fun bingo games, it also sows an unfortunate amount of miscommunication and resultant confusion.
In this article, I wish to thoroughly tackle “DevOps”, a buzzword close to my heart. If you don’t work in web operations, and you haven’t been looking at software job postings of late, you may never have seen it, but if you work on the web, you likely will before long. If you have seen it, you may or may not understand what it means. You can read the Wikipedia article, but I don’t think it’s terribly illuminating. And even if it makes sense to you, it is a partial and fairly unopinionated treatment. I seek here to be both fuller and more opinionated in laying out my personal understanding. I hope also to be a little less confusing.
As divergent definition is part of the buzzword game, I will seek to address some of the common understandings in turn, and to order my discussion in terms of progressively broader and grander visions of the idea. It is my hope that this will make the divergence seem relatively easy to follow. Let’s begin:
Agile, Automated Infrastructure
If you see a posting for a “devops engineer”, this is what the recruiter or hiring manager was likely thinking of: An expert in certain “configuration management” tools such as Puppet or Chef and Vagrant (or more recently Ansible and perhaps Docker) that allow for the automated setup and maintenance of servers (and their services). These servers may be found in a traditional data center, but more than likely, you’ll also find attached a more widely known buzzword—one that has inspired a Microsoft catchphrase and an upcoming R-rated comedy—The Cloud!
(While the cloud is not fundamentally necessary for this “devops”, it provided an important change of norms. What previously needed to be done by physically moving and touching hardware could now be done by API call. These tools expand on that idea.)
To the credit of the job-posters above, the earliest history (as recorded by Damon Edwards) is on the side of this definition. At the strictest level, the word “devops” began as a shortened hashtag for discussion of the first DevOpsDays conference, organized by Patrick Debois in Ghent, Belgium in October 2009. This conference, in turn, arose from two major inspirations:
- John Allspaw’s “10+ Deploys Per Day: Developer and Operations Collaboration at Flickr” presentation at O’Reilly’s 2009 Velocity web performance conference, and before that
- Andrew Clay Shafer’s proposed “Agile Infrastructure” session at the Agile2008 Conference (which literally no one but Debois attended)
And in this word soup, the roots of this understanding are pretty clearly present.
By this narrow interpretation of DevOps, the goal is basically to eliminate a few common problems with developing and running software as a service:
- Developers’ computers are set up differently from the machines that run the software in production (and differently from other developers’ machines). This can lead to arbitrary behavioral differences and difficulty in debugging.
- Production machines are configured and tuned over time, in a way that is often not well-documented and therefore not repeatable for new machines. This makes scaling by adding new machines very hard.
- Quickly testing (and iterating on) changes to server configuration is untenable due to a) unknown starting state, b) the inability to readily duplicate so as to not test on live traffic, and c) lack of tooling for automated comparison of results
All of these problems are addressed by the “Infrastructure as Code” approach enabled by the sorts of tools listed above.
By defining and changing the configuration as code (or data), setup and maintenance of machines becomes eminently repeatable and inherently documented. Further, by treating configuration as code, operations staff can benefit from development tools and practices like revision control systems, code review processes, and community/open-source libraries.
On top of laying out an initial state that can be spun up on demand (and tinkered with to test things out), the infrastructure code defines a desired state that can be enforced and converged upon over time. In this way, configuration management systems like those above (and the forerunner CFEngine, whose physicist creator Mark Burgess provided this formulation back in 1998) can act like a computer’s immune system, maintaining homeostasis in the face of configuration drift, regardless of source.
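To make the convergence idea concrete, here is a toy Python sketch of the desired-state model (the resource names and the `converge` helper are invented for illustration; real tools like Puppet or CFEngine are vastly richer):

```python
# A toy illustration of the "desired state" convergence model used by
# configuration management tools. The resource keys below are made up.

def converge(current, desired):
    """Mutate the current state toward the desired state and return it,
    along with a list of the corrective actions taken. Running it again
    on the result takes no actions (idempotence)."""
    actions = []
    for key, want in desired.items():
        have = current.get(key)
        if have != want:
            actions.append(f"set {key}: {have!r} -> {want!r}")
            current[key] = want
    return current, actions

desired = {"nginx.installed": True, "nginx.port": 8080}
state = {"nginx.installed": True, "nginx.port": 80}   # drifted machine

state, actions = converge(state, desired)  # first run corrects the drift
state, again = converge(state, desired)    # second run is a no-op

print(actions)  # ['set nginx.port: 80 -> 8080']
print(again)    # []
```

Because a second run produces no actions, the same code can be scheduled repeatedly, which is exactly the “immune system” behavior: any drift introduced between runs gets corrected on the next pass.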
These systems can be complicated (or at least hard to master). They interrelate closely with (and potentially integrate with) more traditional operations concerns—monitoring, alerting, log aggregation, networking, and database administration—as well as with new concerns created or invigorated by the rise of cloud computing, such as autoscaling, zone failover, and general partition tolerance. It is therefore not terribly surprising that many companies are looking for experts in these things and calling them “devops” or “DevOps engineers”. Different schools of thought, however, suggest that DevOps should fundamentally not be thought of as a job title, but rather as a powerful idea for the general operation of IT organizations, and indeed any organization with IT concerns. We move on to those schools…
Tearing down the Wall of Confusion;
Making change everyone’s responsibility
You can’t read about DevOps for long anywhere before you come across a book, interestingly a novel, entitled The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win.
If you think about DevOps like the imaginary recruiters or hiring managers mentioned above, you are likely wondering how the heck those tools, those fairly technical ideas, fit in a novel. And it’s a good question.
If you read The Phoenix Project, you will actually not find (that I remember) a single mention of a configuration management tool. Not a name-drop of one of the big players I mentioned, nor even a general allusion. There is a lot of discussion of “change management”, but not configuration management proper. Yet The Phoenix Project is a novel about DevOps. It just takes that definition back to its more literal roots.
As I mentioned above, the origins of the word DevOps trace back, however indirectly, to a talk John Allspaw of Flickr (now of Etsy) gave at Velocity 2009 entitled “10+ Deploys Per Day: Developer and Operations Collaboration at Flickr”. In that title one finds the vision of DevOps here understood: Developer and Operations Collaboration.
The Big Problem
In the IT/SaaS field, it can be broadly claimed that there are two types of professionals whose jobs, while perhaps indistinguishable to the outsider, are in the eyes of many diametrically opposed:
- There are developers (programmers, coders, engineers) whose job it is to deliver new functionality and fix incorrect behavior of what already exists. In short it is their job to keep things changing.
- There are operators (sysadmins, ops guys, SREs) whose job it is to get everything running, fix hardware failures, deal with resource allocation issues, and in general just keep everything working. In short it is their job to keep things stable.
In many organizations, these closely related but cross-purposed responsibilities have been kept wholly separate: different people, different departments.
They often don’t especially get along. On top of the diametric opposition of their goals, there is sometimes a mutual lack of appreciation. The devs think that they are doing the real hard work: writing the code, and the ops guys are just ITT Tech graduates there to do manual labor and make sure the company doesn’t have to pay Geek Squad. The ops guys meanwhile understand that the whole f#!$ing enterprise rests squarely on their shoulders — that without them literally nothing would work: that the devs could write all the fancy, broken software they want and then throw it in the garbage, and that might be just as good, because at least it wouldn’t set anything on fire.
Perhaps the worst practice of this traditional model is the separation of concerns around deployment: It’s the operators’ problem. The developer’s age-old rallying cry is “works on my machine” and then they throw it over “The Wall of Confusion” to operations. The operations department gets an undocumented, non-working chunk of source code and is expected to deploy it over the weekend to hundreds or thousands of servers while the devs go out for margaritas to celebrate “the project being done”. (In my description here, I err on the side of demonizing the developers because that’s who I’ve always been in this play.)
The Big Solution
This sort of problem is institutional. It is not one easily solved by more tools alone. With this in mind, it is not hard to understand why many in “the DevOps community” are against “DevOps Engineer” as a job title and a role — it is in essence just an operator expected to also use these new tools (which will solve some old problems, but maybe also create new ones). It is just painting a different veneer on the same overworked, underappreciated single point of failure, in person or department form.
To these people, including the authors of The Phoenix Project, devops is a new way of thinking (at least a new way for IT): taking a broader perspective, mapping out the whole process from development through deployment (and ideally back around a feedback loop), and distributing responsibilities across that process. In particular hallmarks of this sort of devops include:
- Developers are expected to think about and feel responsible for the deployment process and maintenance of their applications. They are expected to consider it throughout the development process and have a clear plan mapped out. Operations is relied on for expertise and often ultimate implementation, but the developer has to think about it.
- Developers on-call. There is a long-standing tradition of operators being woken by a pager in the middle of the night when the servers go down. They are expected to get it back up, and traditionally they were expected to do so without bothering the developer who made the broken system and who a) actually knows what it does and how it’s supposed to do it, and b) is very possibly at fault for it being broken. Under the devops paradigm, developers are likely to be responsible for fixing it themselves, or at least for being available to the operator fixing it. In practice, this is often handled by a rotation — someone from the team is responsible for the systems of the whole team on a rotating basis.
- Operators inform developers of maintenance/problems/etc. Developers have at times been frustrated by operators changing the landscape under their feet without warning, making their already mentally taxing jobs that much harder. This is seen as unacceptable under this paradigm. Constant communication and shared understanding are key.
While there is more to be said about this idea of DevOps, I think this captures the major idea. DevOps is a cultural shift, likely enabled and enhanced by new tools and technologies, that distributes responsibilities and enhances communication throughout the organization. This allows elimination of waste, elimination of bottlenecks, and altogether a more efficient and happier operation.
Two additional important things to note about this understanding:
- While this formulation suggests that culture is the paramount concern, it does not generally claim that it is the only concern. Proponents of this formulation refer to C.A.M.S. (Culture, Automation, Measurement, Sharing) as the core concerns of DevOps, in that order. Captured in the last is a propensity toward acknowledging failures and generally spreading knowledge, both within an organization and often to the public.
- Much of the wisdom captured in this understanding is actually not so much new ideas as the importing of older wisdom into the realm of Information Systems. The Phoenix Project itself is a re-imagining of Dr. Eliyahu M. Goldratt’s The Goal, a text found in business school curricula for a very long time. Other often-cited sources of wisdom in this school are the Toyota Production System ideas that underlay the Japanese automobile revolution, and the work of W. Edwards Deming that enabled them. (The latter is a particular darling of John “botchagalupe” Willis, co-host of the DevOps Cafe podcast with Damon Edwards.)
For more on this view, I recommend checking out the blog of IT Revolution Press (publishers of The Phoenix Project), as well as the O’Reilly free pamphlet “Building a DevOps Culture”. But wait, there’s more!
Optimizing for Cycle Time;
Destroying the Pareto Efficient Nash Equilibrium
While I think there is a pretty clear distinction between the first “recruiter” idea of DevOps and the broad “Phoenix Project” idea of DevOps, I am not so sure there is a clear division between the latter understanding and the one I will present here. To put it broadly, the heading above was focusing on “Dev” and “Ops”. This one is specifically about the generalization beyond (but including) that. Its ideas trace, for me, largely to the continued thought and presentations of Andrew “littleidea” Clay Shafer, whose name appeared above as the historical proposer of the Agile Infrastructure session that helped inspire DevOpsDays. Also, perhaps, to the book (and other work on) Continuous Delivery by Jez Humble and Dave Farley (most of which I have unfortunately not yet managed to read).
The General Problem
The idea of DevOps presented above focuses essentially on making two potentially warring factions, Dev and Ops, collaborate rather than fight. It turns out that a real-world organization has more than two factions, and that greater gains can perhaps be realized by getting everyone on the same page as much as possible. But doing that is potentially very hard.
A popular—and perhaps purposely over-formal—formulation of this state of affairs going around is the discussion of the Pareto efficient Nash equilibrium. This is a (game theory) technical way of describing the abstract situation in which multiple parties sit at an impasse where there is absolutely no incentive for any one party to change their strategy unilaterally (and indeed to do so would definitely be to someone’s detriment). Notwithstanding this fact, if multiple parties (or perhaps all the parties) were to change strategy, a substantial gain might be realized.
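The impasse described above can be sketched as a tiny two-player game. In the sketch below the strategy names and payoff numbers are entirely invented for illustration: the mutual-distrust outcome is individually stable (no one gains by changing alone), yet both sides would gain by moving together.

```python
# A toy "Dev vs. Ops" game (strategies and payoffs invented for
# illustration): each side can "guard" its own silo or "collaborate".
# Guarding is individually safe; joint collaboration pays both sides more.

payoffs = {
    # (dev_strategy, ops_strategy): (dev_payoff, ops_payoff)
    ("guard", "guard"):             (1, 1),
    ("guard", "collaborate"):       (3, 0),
    ("collaborate", "guard"):       (0, 3),
    ("collaborate", "collaborate"): (2, 2),
}

def is_equilibrium(dev, ops):
    """True if neither player can do better by changing strategy alone."""
    d, o = payoffs[(dev, ops)]
    other_dev = "collaborate" if dev == "guard" else "guard"
    other_ops = "collaborate" if ops == "guard" else "guard"
    return (payoffs[(other_dev, ops)][0] <= d and
            payoffs[(dev, other_ops)][1] <= o)

print(is_equilibrium("guard", "guard"))              # True: the impasse is stable
print(is_equilibrium("collaborate", "collaborate"))  # False: each side is tempted to defect
# Yet both sides are better off if they change strategy together:
print(payoffs[("collaborate", "collaborate")] > payoffs[("guard", "guard")])  # True
```

The point of the formal framing is exactly this asymmetry: no amount of unilateral heroics gets an organization out of the (guard, guard) cell; only a coordinated, simultaneous change of strategy does.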
To put it another way, in this view, DevOps is a word which has come to occasionally denote any (potential) large-scale recalibration of (technical) organizations’ culture and practices, focused especially in re-alignment of incentives across functions, to realize gains of efficiency.
The General Solution
For such a broad problem domain, it is hard to suggest particular solutions. And indeed, I think the experts would say that understanding the particulars of your organization’s version of this problem is fundamental to solving your instance of it. That notwithstanding, there are many practical ideals that befit such solutions:
- Information should be made widely available, understandable, and understood within an organization.
- Actions and decisions should be tied to the ultimate mission of an organization, and should not focus on short-term gains at the cost of long-term goals.
- Failures should never be hidden (at least internally). Any catastrophe is a critical opportunity to investigate shortcomings and improve. Here particularly the practice of “blameless post-mortems” is highly valued.
- Old ideas and decisions should be recorded, explained, and not forgotten. They should also be revisited.
- People should be cultivated (as the derivation of the word “culture” suggests) along with the organization. This may involve people learning new skillsets and working across multiple competencies.
In general, I think this can be summarized as enabling continuous improvement through continuous learning. In particular @littleidea points to research about Organizational Learning. Though I have not seen it cited in the DevOps community proper, I also think of Clojure creator Rich Hickey’s thoughts on simplicity and building well-understood systems in his fairly well-known 2011 talk “Simple Made Easy”.
Another thing worth acknowledging explicitly about DevOps as I wrap up is its relationship to “Lean” methodologies and the ideas (associated with Eric Ries) of “the Lean Startup”. Both Lean and the latter formulations of DevOps descend from the ideas of the Toyota Production System. Both emphasize feedback cycles, elimination of bottlenecks, and organizational learning. One of the largest ideas of Lean reflected also in DevOps (or especially in the sometimes near-synonym “Continuous Delivery”) that I did not yet mention explicitly is the idea of optimizing for cycle time: minimizing the time between when someone has an idea and when it has been translated into a full-fledged (or, sometimes in Lean workflows, a facsimile of a) working system from which data may be gathered, things may be learned, and new ideas may be generated. The line between DevOps and Lean thinking can get blurry at times.
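Cycle time in this sense is simple arithmetic: when work finished minus when the idea entered the system, per item of work. A team might track it from a log as minimal as this (the dates below are invented):

```python
# Measuring cycle time from a minimal (invented) log of work items:
# each entry records when the idea was logged and when it was running
# in production.

from datetime import datetime

work_items = [
    ("2014-03-01", "2014-03-15"),
    ("2014-03-02", "2014-03-30"),
    ("2014-03-10", "2014-03-17"),
]

def cycle_times(items):
    """Days from idea to production for each work item."""
    fmt = "%Y-%m-%d"
    return [(datetime.strptime(done, fmt) - datetime.strptime(start, fmt)).days
            for start, done in items]

times = cycle_times(work_items)
print(times)                    # [14, 28, 7]
print(sum(times) / len(times))  # average days from idea to production
```

Optimizing for cycle time means driving that average (and, just as importantly, its outliers) down, which is what makes feedback loops turn faster.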
In Closing: Why I think it’s exciting
At the end of the day, however broad your formulation, the ideas of DevOps come down to making systems better understood and easier to work with. These may be technical systems, these may be people systems; ideally both will be revolutionized and end up better understood. The tooling built around DevOps systems makes being a developer (and hopefully an operator) a more pleasant working experience and a more informative one. I love learning and I’d love for my life to involve tons of it in the process of getting things done. Beyond these general niceties I think there are a few final things to be excited about:
- The power created by automation technologies, especially on cloud platforms, allows companies to serve more customers without growing their staff in proportion. This can enable organizations to be successful without having to deal (as soon) with the downsides of organizational growth.
- Stability and dependability potentially created with such technologies can allow people more time to pursue new ideas and make more improvements.
- Collaboration and cross-training in these cultural models can create more and better ideas, and allow the ideas that get implemented to be implemented better and faster.
It is an exciting time in DevOps land. The technologies are getting better all the time. There are lots of cool ones I didn’t even mention above (e.g. CoreOS, OpenStack, Juju, Salt, Deis). Smart people are thinking, learning, and sharing every day. To learn more and stay up to date, check out these and other podcasts, this online book club, this blog, and any of numerous conferences (who usually post videos of talks online after the fact).
There is also a lot going on on Twitter. I’ve linked to a number of interesting people’s accounts throughout (but failed to include many more). If you have any feedback on the article or want to chat, feel free to hit me up at @donaldguy.