Sometimes Boring Is Better

https://flic.kr/p/aQqz1F

The recent announcement of Docker Enterprise Edition brought back some bad memories. After having used Docker in production for two years, I honestly don’t have a lot of faith in its overall stability and direction — enterprise or not. CoreOS failed to ship a stable Docker version on multiple occasions despite their efforts to provide a production-grade Container Linux. Throw ecs-agent into the mix, and you’re guaranteed to have a lot of headaches due to broken/blocked cluster updates. Once bitten, twice shy.

To be fair, using a supposedly stable tool or OS does not relieve you from doing your own sanity testing. As engineers know, many problems only manifest themselves at scale in production (which is why verifying things in staging only works to a certain degree). Also, remember that we’re dealing with complex web systems, which are prone to failure.

So yes, we should cut Docker, CoreOS, and Amazon ECS some slack. All of them have their place. All of them are by themselves groundbreaking technologies that, taken together, enable us to do amazing things, like running a company-wide PaaS for production services. No, I don’t blame them. In fact, I’m glad they exist. On the other hand, they’re still good examples for making the following point.

What do Docker, CoreOS, and ECS have in common? All three are relatively new technologies. Some might even call them “bleeding edge” (I won’t). In any case, all three are the opposite of boring — they’re rather hip and shiny. The point of this article is that, when it comes to technology, sometimes boring is actually better.

Over the last couple of months, I’ve read a number of articles on the merits of choosing boring technology. Dan McKinley’s article is without a doubt one of the best pieces on the topic. It’s worth reading from beginning to end, but here are some of my favorite quotes:

The nice thing about boringness (so constrained) is that the capabilities of these things are well understood. But more importantly, their failure modes are well understood. […] But for shiny new technology the magnitude of unknown unknowns is significantly larger, and this is important.

In other words, software that has been around for a decade is well understood and has fewer unknowns. Fewer unknowns mean less operational overhead, which is a good thing.

One of the most worthwhile exercises I recommend here is to consider how you would solve your immediate problem without adding anything new. […] It’s helpful to write down exactly what it is about the current stack that makes solving the problem prohibitively expensive and difficult.

New systems mean new problems, so we should think twice before adding anything new to an otherwise boring and well-understood stack.

set clear expectations about migrating old functionality to the new system. The policy should typically be “we’re committed to migrating,” with a proposed timeline. The intention of this step is to keep wreckage at manageable levels, and to avoid proliferating locally-optimal solutions.

Timeboxing migrations is an excellent idea I probably should have applied a couple of times in the past. As for locally-optimal solutions, like using new technology X in a single place without good reason, John Allspaw had the following to say in an interview about Etsy:

we want to exploit the advantages of having a relatively finite number of well-known tools. […] the advantages of being more optimal do not outweigh the advantages of using the same language [PHP] a lot. […] In the same way, there’s a massive advantage in using a default data store, MySQL.

He points out that if something breaks for some reason, each new tool is another thing an engineer has to understand, making it more difficult for the company to be resilient:

The thing is, when you pull something shiny and new off the shelf, there can be operational overhead. If it breaks and you’re the only one who knows how it works, then it probably wasn’t a great technical choice. […] We want to plan for a world where stuff breaks all the time. And we want to make it so that when things break they matter a lot less, that they’re not critical. That they break and we can fix them and we can adapt and be resilient.

I, too, believe that being a bit more conservative and slowing down the pace would benefit our industry.

Skyliner, the AWS launch platform, is certainly a prime example of this philosophy:

Skyliner doesn’t use registries, scheduling, service discovery, virtualized networking, or any other advanced features. Instead, we use AWS services with proven reliability like S3, Autoscaling, and Elastic Load Balancing — services which have seen almost a decade of continuous use and improvement.
As software developers, we understand the allure of shiny new technologies, but ultimately we decided that we prefer the quiet satisfaction of sleeping through a night while on call. After all, we’re not just building a platform for our customers — we run our applications on Skyliner, too.

So, should we give up and stop using advanced container technologies altogether? Absolutely not. What we need, first and foremost, is a simple, boring container implementation that just works. More generally speaking, what we need are stable building blocks. Everything on top — our production systems — will flourish from there.

Docker and friends aren’t boring yet, but eventually they will be. I’m looking forward to that day.

P.S. This article first appeared on my Production Ready mailing list.

Like what you read? Give Mathias Lafeldt a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.