How Do YOU Define “DevOps?”

Chad Robinson
8 min read · Feb 3, 2017

Google Trends Search Term Popularity for “DevOps”, 2004-Present

You would have to have been hiding under a rock to have missed the explosion of interest in the term “DevOps,” and the rapid shift in the IT industry towards its practices and principles. In just the past 5 years, “DevOps Engineer” has gone from a non-existent job title to one of the highest-paying IT roles available. But if the term is so new, where did all these wonderfully talented and experienced DevOps Engineers come from?

The answer, obviously, is that “DevOps” did not suddenly spring into existence. Wikipedia’s etymology of DevOps traces it back to part of the title of the “devopsdays” conference series in Belgium, which itself followed from a presentation given by Andrew Clay Shafer and Patrick Debois at the Agile 2008 conference. However, even by then many companies already had in place what would today be called DevOps teams, tools, and processes.

But how do we frame a conversation around the topic today? If your sysadmins know Python, do you now have a DevOps team? Is your infrastructure “Agile” if you task a team member with implementing Ansible? Should you be hiring “DevOps Engineers” or “Site Reliability Engineers”… or both? What, exactly, is DevOps, anyway?

A little light Googling should convince you that there are as many definitions of the term as there are “experts” on the subject. But before I share my own, let me raise a cautionary flag against over-reliance on generalizations and definitions as actual strategies. I learned an important lesson myself about truisms and axioms as an IT analyst in the late 1990s and early 2000s, and that lesson applies here today.

At that time, “Best Practices” were immensely popular, and part of my job was to poll IT executives for their experiences and anecdotes, identify the patterns they had in common, then produce Best Practices research articles to share back to the community. This was a rewarding and educational process, but also one with a hidden gotcha: it was possible to rely too much on Best Practices, which are just generalizations, and fail to incorporate the needs and characteristics of each organization.

Every company has different needs, and a “Best Practice” could be absolutely wrong in some situations. It takes experience to know when this is happening, and making these determinations is one of any IT executive’s most important duties. The same can happen as we attempt to solidify our definition of DevOps. I encourage you to create your own definition, one that takes into account the specific needs of your organization.

My definition incorporates both tactical and strategic objectives:

The goal of DevOps is to cultivate an IT environment that emphasizes rapid delivery of reliable, high-quality products through process automation, procedurally defined infrastructures, and increased collaboration and communication between Dev and Ops teams.

What does this mean? Here’s how I break it down into actual implementation:

Cultivation, Not Revolution or Replacement

Embracing DevOps should start with a cultural shift, not a rush to build a new team or adopt a specific tool. While faster release cycles and improved uptimes are noble goals, DevOps isn’t really about those specific details. Instead, it is about optimizing the performance of the IT department as a whole by integrating and/or connecting teams that used to be more isolated and, more importantly, sharing concepts, tools, and lessons among them all.

Absent from the definition above is any reference to specific ideas such as containerization or continuous integration. These may be indispensable to a modern DevOps team, but only as enablers. The culture and processes must be established and embraced first. As with the Agile philosophy that “DevOps” partly emerged from, the Kool-Aid must be drunk at all levels, from the most junior team member up to the C-suite.

But you don’t need to take my word for it. Have a look at GitLab’s post-mortem on a 6-hour outage they recently had due to a sequence of human errors, and note a few critical quotes near the conclusion of the writeup:

  1. “The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented”
  2. “Our backups to S3 apparently don’t work either: the bucket is empty”
  3. “So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.”

Note that I am not pointing fingers here. I applaud GitLab’s transparency in sharing these details, and anybody who claims decades of experience in any significant IT role has either seen the same or worse, or is not as experienced as they think. The reason this is such an important story is that GitLab produces a CI/CD product. The very principles we’re promoting here are their business. Clearly, somebody wasn’t drinking the Kool-Aid!

Reliable, High-Quality Products

This one seems so obvious… of course we all want our products to be reliable. But it is easy to get lost in details or distracted by “ghost” numbers in efforts such as these, so I included a reminder that the overall strategic goal should be improving end-user experiences.

In “Jack: Straight From the Gut,” Jack Welch of GE fame noted “If we reduced product delivery times from an average of 16 days to 8 days, for example, we saw it as a 50 percent improvement… Foolishly, we were celebrating.”

Foolishly, because GE’s appliance customers at that time were still experiencing high variance in delivery dates. If your refrigerator dies the week before Thanksgiving, a delivery window of 2–14 days isn’t helpful. What you really want to know is: should you buy a turkey now and risk having it spoil, or wait and risk having them sell out? Six Sigma allowed GE to focus on reducing delivery time variance, rather than the average.
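
To make the distinction concrete, here is a throwaway Python sketch. The delivery-time numbers are invented purely for illustration: halving the average does nothing for the turkey question if the spread stays wide.

```python
# Hypothetical delivery times in days -- invented numbers, purely illustrative.
from statistics import mean, pstdev

schedules = {
    "original":    [2, 30, 4, 28, 6, 26],  # averages ~16 days, enormous spread
    "faster":      [1, 15, 2, 14, 3, 13],  # average halved, but still a 1-15 day window
    "predictable": [7, 8, 8, 9, 8, 8],     # same 8-day average, variance actually reduced
}

for label, days in schedules.items():
    print(f"{label:12s} mean = {mean(days):4.1f} days, spread (std dev) = {pstdev(days):4.1f} days")
```

The second schedule improved exactly the way GE first celebrated; only the third one answers the refrigerator buyer’s question.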

GE’s customers weren’t impressed by average delivery-time improvements, and an IT organization’s “customers” are no different. End users don’t see, know, or care about velocity, patch compliance levels, or provisioning times during scaling events. These are useful internal measurements on a tactical level, but failing to identify actual end-user experiences and work toward improving them will make any new concept a waste of time and money. That doesn’t mean invisible achievements are valueless, just that we must never forget that the overarching goal should be to benefit end users.

Process Automation

I focused on three action items in my definition. Process automation is first not because it is the most important, but because it should be the easiest. This was one of the trends that led us to DevOps in the first place, and there are so many tools available today that the biggest failure would not be choosing the wrong one; it would be failing to act in the first place.

Bear in mind, the goal is not to remove humans from the loop entirely. It is to allow them to act deliberately, thoughtfully, and collaboratively much earlier in the process. A thorough strategy may include any number of tools, including build process automation, “chatops,” automated testing suites, immutable containers, and self-healing systems. But no matter what combination is selected, the purpose is to avoid running database management command-line tools when replication is failing and the service is down… which is how we arrive here:

“… decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com.”

Oops.
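
Even a small amount of automation around the dangerous step changes that story, because the safety check runs every time rather than only when a human remembers it. Here is a minimal, hypothetical Python sketch; the host names and data directory are invented, and this is not GitLab’s actual procedure:

```python
#!/usr/bin/env python3
"""A minimal sketch of a guardrail: codify the check a tired human skips at 3 a.m.

Host names and paths are hypothetical; the point is that the script refuses to
run the destructive step anywhere except the one host where it is safe.
"""
import socket
import subprocess
import sys

PRIMARY_HOSTS = {"db1.cluster.example.com"}       # never wipe data here
EXPECTED_REPLICA = "db2.cluster.example.com"      # the only valid target

def wipe_replica_data_dir(data_dir: str = "/var/lib/postgresql/data") -> None:
    host = socket.getfqdn()
    if host in PRIMARY_HOSTS:
        sys.exit(f"REFUSING: {host} is a primary, not a replica.")
    if host != EXPECTED_REPLICA:
        sys.exit(f"REFUSING: {host} is not the expected replica {EXPECTED_REPLICA}.")
    # Only after both checks pass does anything destructive happen.
    subprocess.run(["rm", "-rf", data_dir], check=True)

if __name__ == "__main__":
    wipe_replica_data_dir()
```

It is not sophisticated, and that is the point: the guardrail exists because someone wrote it calmly, long before the outage.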

Procedurally Defined Infrastructures

System administrators have had “change management” tools for years; if you threw a bunch of shell scripts into CVS, you could claim to have been doing this back in the mid-90s. But attempts to formalize how production infrastructures were defined more often than not ended up as a pile of wiki pages. Documentation-oriented approaches are subject to inaccuracy, incompleteness, and obsolescence, and they are “slow” resources during emergencies because they must be found, read, and interpreted by humans, then manually acted upon.

Procedurally defined infrastructures flip this equation on its head. The act of using an infrastructure definition to create that very infrastructure is important on its own because of the clarity and consistency it adds to the process. But this first step also enables all kinds of downstream benefits. Here are just four of my favorites, with a small sketch of the first one after the list:

  1. During scaling or failure events, automated processes can be used to normalize the production infrastructure against the specification.
  2. It is possible to produce much more realistic Staging, QA, and Dev environments when all teams work from precisely the same definition.
  3. Procedurally defined infrastructures can be subject to the same automated testing as software code.
  4. Modern tools that support procedurally defined infrastructures also make it much easier to implement cross-, hybrid-, and/or private-cloud stacks.
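
As a toy illustration of the first benefit, here is a hypothetical Python sketch of drift detection. The specification format and the observed state are both invented stand-ins for whatever your tooling (Terraform, Ansible, a cloud API) actually manages, but it shows why a machine-readable definition beats a wiki page during an incident:

```python
# A toy sketch of benefit #1: compare live state against the declared spec and
# report (or repair) the drift. The spec format and observed state below are
# hypothetical stand-ins for whatever your real tooling manages.
DESIRED = {
    "web":    {"count": 4, "instance_type": "m5.large"},
    "worker": {"count": 2, "instance_type": "m5.xlarge"},
}

def find_drift(desired: dict, observed: dict) -> list[str]:
    """Return human-readable differences between the spec and reality."""
    drift = []
    for role, spec in desired.items():
        live = observed.get(role, {"count": 0, "instance_type": None})
        if live["count"] != spec["count"]:
            drift.append(f"{role}: want {spec['count']} instances, have {live['count']}")
        if live["instance_type"] != spec["instance_type"]:
            drift.append(f"{role}: want {spec['instance_type']}, have {live['instance_type']}")
    return drift

# In real life this would come from your cloud provider's API.
observed_state = {"web": {"count": 3, "instance_type": "m5.large"},
                  "worker": {"count": 2, "instance_type": "m5.large"}}

for problem in find_drift(DESIRED, observed_state):
    print("DRIFT:", problem)
```

A normalization job would feed those differences back into an automated repair step; a wiki page can only wait to be read.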

Increased Communication and Collaboration

I often like to close my statements with the most important point, and collaboration is certainly that. We have all heard jokes about developers “throwing things over the wall” to sysadmins who can’t code. But the truth is, it is rare today to find a sysadmin who doesn’t know and use several languages professionally. And it is not credible to claim to be a “Full-Stack Engineer” without a mastery of CDNs, WebSocket stacks, and any number of cloud services.

To me, DevOps is more about enabling and embracing an end-to-end view of the entire product lifecycle. Think back to the last “Agile SDLC” diagram you saw. How detailed was its coverage of QA, Staging, and Ops? Did it cover them at all? In my experience, the vast majority of SDLC diagrams and discussions do not, which is a tragedy if you accept my earlier argument that the overall strategic goal should be improving the customer’s experience, because Production is where the customer lives.

Communication and collaboration become the “why” behind any product or process selection in a DevOps effort. Why use Terraform + Ansible to define the production environment for a new product? Because it allows both Dev and Ops teams to collaborate on the specification for that environment in a consistent and meaningful way. Why are immutable containers such an interesting concept in CI/CD workflows? Because they force patch management and system configuration to shift to a collaborative, planned, and structured approach.

At the end of the day, I believe being able to answer “why” questions such as these is the best way to tell when you are “doing DevOps right.” If you can tie each product you’ve selected or process you’ve adopted back to improving collaboration and communication across the team, and ultimately to improving the customer’s overall experience, you are winning the game.
