You Can’t Have a Rollback Button

The internet is a big truck. It’s really hard to drive it backwards.

Dan McKinley
Skyliner
4 min readFeb 28, 2017

--

I’ve worked with deploy systems in the past that have a prominent “rollback” button, or a console incantation with the same effect. The presence of one of these is reassuring, in that you can imagine that if something goes wrong you can quickly get back to safety by undoing your last change.

But the rollback button is a lie. You can’t have a rollback button that’s safe when you’re deploying a running system.

The majestic bison is insouciant when monopolizing the push queue, stuck in a debug loop, to the annoyance of his colleagues.

The old version does not exist

The fundamental problem with rolling back to an old version is that web applications are not self-contained, and therefore they do not have versions. They have a current state. The state consists of the application code and everything that it interacts with. Databases, caches, browsers, and concurrently-running copies of itself.

What they don’t tell you in school is the percentage of your life as a working programmer that will be spent dealing with the “plus” sign.

You can roll back the SHA the webservers are running, but you can’t roll back what they’ve inflicted on everything else in the system. Well, not without a time machine. If you have a time machine, please use the time machine. Otherwise, the remediation has to occur in the direction of the future.

A demonstration

Contriving an example of a fault that can’t be rolled back is trivial. We can do this by starting with a python script that emulates a simple read-through cache:

We can verify that this works fine:

Now let’s consider the case of pushing some bad code over top of it. Here’s an updated version:

That corrupts the cache, and promptly breaks:

At this point, red sirens are going off all over the office and support reps are sprinting in the direction of our desks. So we hit the rollback button, and:

Oh no! It’s still broken! We can’t resolve this problem by rolling back. We’re lucky that in this case, nothing has been made the worse. But that is also a possibility. There’s no guarantee that the path from v1 to v2 and then back to v1 isn’t actively destructive.

A working website can eventually be resurrected by writing some new code to cope with the broken data.

You might dispute the plausibility of a mistake as transparently daft as this. But in my career I’ve carried out conceptually similar acts of cache destruction many times. I’m not saying I’m a great programmer. But then again maybe you aren’t, either.

A sharp knife, whose handle is also a knife

Adding a rollback button is not a neutral design choice. It affects the code that gets pushed. If developers incorrectly believe that their mistakes can be quickly reversed, they will tend to take more foolish risks. It might be hard to talk them out of it.

Mounting a rollback button within easy reach (as opposed to git revert, which you probably have to google) means that it’s more likely to be pressed carelessly in an emergency. Panic buttons are for when you’re panicking.

Practice small corrections

Pushbutton rollback is a bad idea. The only sensible thing to do is change the way we organize our code for deployment.

  • Push “dark” code. You should be deploying code behind a disabled feature flag that will not be invoked. It’s relatively easy to visually inspect an if statement for correctness and check that a flag is disabled.
  • Ramp up invocations of new code. Breaking requests without a quick rollback path is bad. But it’s much worse to break 100% of requests than it is to break 1% of requests. If we ramp up new code gradually, we can often contain the scope of the damage.
  • Maintain off switches. In the event that a complicated remediation is required, we’re in a stronger position if we can disable broken features while we work on them in relative calm.
  • Roll forward. Production pushes will include many commits, all of which need to be evaluated for reversibility when a complete rollback is proposed. Reverting smaller diffs as a roll-forward is more verifiable.

Complete deployment rollbacks are high-G maneuvers. The implications of initiating one given a nontrivial set of changes are impossible to reason about. You may decide that one is called for, but you should do this as a last resort.

Skyliner is an AWS platform for continuous delivery. We’re trying to build a straight jacket that you can wear to stop hitting yourself in the face. It’s free for personal use. Sign up today!

For a limited time we’ll port a Heroku app to AWS for you, for free.

--

--