DevOps is counterintuitive

Darío Blanco Iturriaga
10 min read · May 30, 2018


IT has changed dramatically in the last 10 years. It is now easier than ever to define infrastructure with code, and applying software development processes to traditional operations is clearly the way forward. Plenty of cloud providers offer Google-like platforms with a single click, so we can focus on development. However, even if many companies use the right technology, they are still very far from achieving the desired DevOps culture. Funnily enough, DevOps adoption can usually be compared to motorbike riding.

Technology vs Engineer improvement

Motorbike evolution in the last 50 years has been staggering. Bikes are lighter, faster and handle better; tires are stickier, suspension systems are electronic and ABS is now everywhere. However, motorcycle teachers haven’t seen an improvement in their students: something is stopping them from using this clearly improved bike technology. Many corporations have a similar problem: they introduce state-of-the-art technologies that enable a fast rate of change in their applications while maintaining reliability, but they never fully reach that goal. They basically have the best motorbike for taking curves at high speed but fall off it every time they try.

The BMW HP4 Carbon brings traction control, lightweight wheels, smoother fuelling and electronic engine braking, among others. Yet it requires an experienced rider to unlock its full potential.

Actually, 95% of the motorbike students reached great levels of confidence after only half a day of classroom plus track training, and half of them can be coached to a high degree of technical skill in two days, but only if they ride at 75% of their limit. Regarding DevOps, we can teach engineers to use new technologies quite fast with proper training, while some of them will become proficient after some time if they push their limits. Nonetheless, that’s not good enough.

Survival reactions limit our potential

What happens after 75%? Intuition kicks in, fear appears and panic triggers a survival reaction (SR). Even if the students have understood all the standard riding techniques, those reactions ruin their attempts to reach the goals they have envisioned for themselves, or even make them crash. As in traditional IT, the action that was supposed to prevent a problem creates more problems.

DevOps culture is mainly about bringing features to the customer fast. This is what we define as “change”, and it brings risk every time we provide new value to the users: a very well-known trade-off. As every change is a potential service disruption, our intuition tells us not to change often, because we are afraid something is going to break: this is a very logical way of thinking. In general, DevOps is all about overcoming the fear and challenging seemingly logical assumptions like this one.

Leaning more, not braking, rolling on the gas or looking at the end of the curve aren’t intuitive behaviours

Imagine that introducing change into a system is like cornering with a motorbike: we want to take the curves fast but safely. However, SRs will actually achieve the opposite. They are:

  1. Roll-off the gas: suddenly slowing down the pace
  2. Tighten on bars: being tense, especially when you go fast
  3. Narrowed field of view: less room to make decisions as you don’t look farther down the road
  4. Fixed attention: focusing on a bad spot makes you miss the rest of the turn
  5. Steering in the direction of the fixed attention: ineffective steering due to a wrong focus point
  6. None or ineffective steering: not quick enough, too early or frozen steering
  7. Braking errors: over- and under-braking

None of these SRs work in harmony with the new technologies that we have within our grasp; they need to be defeated.

The complexity behind counter-intuition

There are a lot of riding techniques that define how to overcome the SRs. My favorite one is counter-steering (guiding in an opposing manner), for several reasons.

Firstly, it was discovered after 80 years of motorbike history. Some riders realized that if they moved their handlebar to the left at a certain speed, the motorbike would go to the right. It didn’t make any sense, it was scary, and people were reluctant to mention this technique for fear of being told they were nuts. I think this is pretty similar to what happens with DevOps and Continuous Delivery (CD) into production: it seems nuts, it is scary, but it works. It wasn’t until the 1970s (~10 years after people started to notice this technique) that a group of Honda researchers presented technical papers about how it worked. Now we are trying to explain how CD, a core DevOps concept, works, and we have to fight the same skepticism.

Push to the right at high speed: the handlebar will go to the left and the motorbike will turn to the right

Secondly, the explanation of counter-steering is pretty complex, and thus very difficult to teach and master: the underlying cause is called the gyroscopic effect. The beautiful thing is how we try to simplify the explanation of this technique; we don’t say “turn left and the motorbike will go to the right”, we instead state “push to the right and the motorbike will go to the right”. In other words, we try to make it “intuitive” so riders can start practicing it, which is the key to making it really automatic. With CD we want to deploy often so the system will be more stable, two seemingly conflicting concepts. If we say “very small deployments”, that doesn’t seem so conflicting anymore.

Thirdly, everybody can ride a motorbike without counter-steering. However, if you want to unlock vast room for improvement in every possible situation that requires steering the bike, you really need to learn this technique. In a competitive world, your business wants to bring new features to the customer while maintaining the reliability of its systems. That is a very difficult goal: technology alone doesn’t solve it, and mastering the technique requires overcoming those SRs.

Technology adapted for each case

We want to go fast, but the faster we go, the more difficult it is to turn. In counter-steering, the gyroscopic effect is the force to beat. Therefore, when building a motorbike we will make trade-offs based on how we want to perform that steering.

A supersport motorbike is made for easy steering at high speed: it will easily turn at 250 km/h. It beats the gyroscopic effect by design because the center of mass is very close to the contact patch. However, it is less comfortable for long trips as it requires a sports riding position.

Supersport motorbike

Other motorbikes, like choppers, sacrifice steering performance by using lengthened forks for a cool stretched-out appearance. As the contact patch is far away from the center of mass, it would be very difficult to steer at high speed, and therefore they can’t go fast. In contrast to the supersport bikes, they can be relatively comfortable for long trips on straight highways, because the handlebar places the rider’s hands just slightly below shoulder height.

Old school chopper

DevOps requires a supersport motorbike: we want to steer and go fast, but what we usually find in most companies is an old school chopper. Not only that, the chopper rider is full of SRs. As a result, the enterprise wants to win MotoGP and has more than enough resources to do that, but can’t even compete.

Building a supersport system

We know we want to be fast and reliable, and we now accept the complexity this takes from a technical and cultural point of view. Therefore, we need well-trained riders (engineers) and a machine (system) optimized for those goals.

Open source as an SR killer

Engineering is all about applying very well defined patterns based on a specific use case. Engineers understand the trade-offs and make a thoughtful decision based on their experience. However, they are affected by SRs that can lead to wrong choices.

Every successful open source project will have a process like this, so Jane Doe can quickly contribute without harming quality

Thankfully, we have open source as an example of collaboration and velocity. Open source has always used very well defined and proven software processes that can help to overcome those SRs:

  1. Roll-off the gas: unit testing, code review, and automated CI/CD give you the confidence to keep bringing fast (or even faster) change. Slowing down the pace doesn’t make sense, as the change still has to go through the automated pipeline, and a slower pace usually comes with bigger changes that are less likely to be approved.
  2. Tighten on bars: relax! Trivial and repetitive checks should be done automatically; there is no need to add unnecessary tension by checking them yourself.
  3. Narrowed field of view: the engineer not only develops the change but should test it as well. In addition, some changes will be non-functional (like improving the monitoring).
  4. Fixed attention: focusing on a single type of problem will distract you from the others. For instance, there might be programming framework challenges, but you shouldn’t forget about what your infrastructure is going to look like. Don’t lose the big picture.
  5. Steering in the direction of the fixed attention: manually fixing those problems will make you run off the road. Automation is always the right steering direction, even if our attention is momentarily somewhere else.
  6. None or ineffective steering: small and constant changes will keep your system reliable while maintaining speed. If you suddenly steer a lot (big change), it will most probably fail. If you refuse to steer at all when a fix is needed, you will also crash.
  7. Braking errors: when the unexpected happens, it is ok to brake in order to correct our course. If we have a proper monitoring system, we will know how and where to brake.

Small and frequent changes

All of this is possible with engineers who are used to working “in an open source way” and who know what technologies to use. Automation is achieved by defining manual processes in software. Version controlling everything will fix many communication problems, as engineers understand code, which is the universal language. This enables them to push code fast, as the overhead of checking every change is handled by a robot that gives instant feedback when something is wrong. As a result, infrastructure is defined with code and is completely transparent.
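As a rough illustration of checks being "done by a robot", here is a minimal Python sketch of a pipeline that runs automated checks in order and reports the first failure right away. The function name `run_pipeline` and the check names are hypothetical; a real setup would delegate this to a CI/CD system rather than hand-written code.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (name, function) pair; the function returns True when it passes.
Check = Tuple[str, Callable[[], bool]]


def run_pipeline(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the name of the first one that fails.

    Returns None when every check passes, meaning the change can go ahead.
    """
    for name, check in checks:
        if not check():
            return name  # instant feedback: stop at the first failure
    return None


if __name__ == "__main__":
    # Hypothetical checks standing in for real unit tests, linters, etc.
    checks = [
        ("unit tests", lambda: True),
        ("lint", lambda: True),
        ("infrastructure plan", lambda: False),
    ]
    failed = run_pipeline(checks)
    print("ready to deploy" if failed is None else f"blocked by: {failed}")
```

Because the feedback is automatic and immediate, the engineer has no reason to batch changes up, which keeps each change small.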

Like a supersport motorbike, we want to change often without sacrificing speed

Even if you are the best rider in the world, you will run into bugs that reach production and go unnoticed. This is similar to cornering with a motorbike: sometimes you will find a treacherous curve that doesn’t seem to end. Instead of taking a steep angle, braking and rolling off the gas, you lean a bit more, maintain the speed and never brake. That is the best way to take the curve safely. When something goes wrong, you might have tools to roll the change back, as it is small anyway. Do a post-mortem and figure out why the mistake went unnoticed, blaming the process instead of people. Make sure you improve your monitoring so the problem won’t go unnoticed again.
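The "roll the change back" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `deploy`, `health_check` and `rollback` are hypothetical callables standing in for whatever your deployment tooling and monitoring actually provide.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    health_check: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply a small change and undo it if the system looks unhealthy.

    Returns True when the change stays in production, False when it was
    rolled back.
    """
    deploy()
    if health_check():
        return True
    # The change was small, so undoing it is cheap; investigate afterwards.
    rollback()
    return False


if __name__ == "__main__":
    # Hypothetical in-memory "system" used only to demonstrate the flow.
    state = {"version": 1}
    kept = deploy_with_rollback(
        deploy=lambda: state.update(version=2),
        health_check=lambda: False,  # pretend monitoring reports a problem
        rollback=lambda: state.update(version=1),
    )
    print(kept, state)
```

The post-mortem then happens with the system already back in a known good state.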

Reliability without sacrificing speed

The interesting outcome of our supersport system is that even if each change puts a little uptime at risk because of the high rate of change, the total downtime summed up over time is significantly less than in a change-averse system.

Change averse system

Imagine we want to achieve a 99.9% SLA in a change-averse system, which allows a yearly downtime of 8h45m. Without change, the uptime might at first glance be much better than in the supersport system. However, there are always change vectors like software updates, new features, bugfixes… When that happens it is easy to see downtimes of hours or even days: the change is big, the people involved in the deployment have forgotten how the system works as they have no practice, the outage is detected late as there is usually poor monitoring, and that culture encourages homogeneous teams with conflicting goals. If the downtime is more than 9h, it uses up the SLA budget for the entire year.
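The downtime figure follows directly from the SLA percentage. A minimal Python sketch of the arithmetic, where the function name `downtime_budget` is just an illustrative choice:

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime that an SLA allows over a given period."""
    allowed_fraction = 1 - sla_percent / 100
    return period * allowed_fraction


if __name__ == "__main__":
    # 0.1% of a 365-day year is roughly 8h45m.
    print(downtime_budget(99.9, timedelta(days=365)))
    # 0.1% of a week is roughly 10 minutes.
    print(downtime_budget(99.9, timedelta(weeks=1)))
```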

Supersport system

Take a look now at our supersport system, which introduces small risks more often. As our engineers are used to deploying every day and they follow strict software development processes, it is unlikely that a deployment will fail. If that happens, it should be possible to perform a rollback in a matter of seconds (or even automatically with some state-of-the-art technologies). As the change was small, the rollback is not a big deal, and we can debug what happened in other production-like environments, improving the non-functional systems in the meantime. In addition, the monitoring is quite mature, as we exercise it with each deployment, so a problem is detected almost instantly. An SLA of 99.9% would allow a downtime of 10m per week. Every week without downtime adds to the budget, and with a rollback mechanism it is likely we will achieve the SLA.
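To make the comparison concrete, here is a small Python sketch that adds up yearly downtime for both systems. All the numbers are made up for illustration (one 9h outage for the change-averse system; 5 failed deployments a year, each costing 2 minutes before the rollback completes, for the supersport system); they are assumptions, not measurements.

```python
from datetime import timedelta


def total_downtime(failed_changes: int, downtime_per_failure: timedelta) -> timedelta:
    """Sum the downtime caused by a number of failed changes."""
    return failed_changes * downtime_per_failure


# Yearly downtime allowed by a 99.9% SLA (0.1% of 365 days).
YEARLY_BUDGET = timedelta(days=365) * 0.001

# Hypothetical change-averse system: one big change that goes wrong.
change_averse = total_downtime(1, timedelta(hours=9))

# Hypothetical supersport system: a few small failures, rolled back quickly.
supersport = total_downtime(5, timedelta(minutes=2))

if __name__ == "__main__":
    for name, downtime in [("change-averse", change_averse), ("supersport", supersport)]:
        verdict = "within" if downtime <= YEARLY_BUDGET else "over"
        print(f"{name}: {downtime} of downtime, {verdict} the yearly budget")
```

Under these assumed numbers, the system that fails more often still ends up with far less total downtime, because each failure is small and short.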

Tell me more

If you like motorbikes, A Twist of the Wrist Vol. 2 inspired my DevOps analogy.

Robin Bühler, Jean-Francois Landreau and I have started a list of 100 ideas to bring DevOps into an organization. Feel free to collaborate!

You might want to follow Gregor Hohpe as well; he made all of this possible.
