Fail fast


More than a few times, and especially after starting a new job at a young company, I've experienced reluctance to deploy during peak traffic, and in particular during a traffic spike.

Not too long ago this happened during the launch of a site redesign. I used to think it was better to ship big changes during low-traffic times, but this time I voted to ship. We had tested the changes as thoroughly as our infrastructure allowed, so we deployed. It worked, and it was a great event.

When people want to introduce big changes off-peak, it’s either because the change is a risky design or feature change, or because the change introduces a capacity cost.

I'm not qualified to speak to design or feature changes.

If you're deploying a change that requires more resources, the desire to deploy off-peak arises from the worry that your capacity is not enough for your peak traffic. The natural instinct is to make these changes off-peak with the goal of harming fewer users if things go awry, and then, if things look good, to breathe a sigh of relief.

However, this creates the “Deploy Sunday, Die Monday” risk. You may not detect a problem that only shows up in peak traffic for capacity reasons.
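To make that risk concrete, here is a minimal sketch of the arithmetic. Every number in it (worker count, request rates, per-request cost) is a hypothetical assumption chosen for illustration, not a measurement from any real system, and `utilization` is an illustrative helper rather than part of any library.

```python
def utilization(requests_per_second: float, cost_per_request_ms: float, workers: int) -> float:
    """Fraction of available worker time consumed; above 1.0 means demand exceeds capacity."""
    busy_ms_per_second = requests_per_second * cost_per_request_ms
    available_ms_per_second = workers * 1000.0
    return busy_ms_per_second / available_ms_per_second


# Hypothetical fleet and traffic figures (assumptions for illustration only).
WORKERS = 40
OFF_PEAK_RPS = 200.0
PEAK_RPS = 1000.0
OLD_COST_MS = 30.0  # per-request cost before the change
NEW_COST_MS = 45.0  # per-request cost after the change

off_peak_new = utilization(OFF_PEAK_RPS, NEW_COST_MS, WORKERS)
peak_old = utilization(PEAK_RPS, OLD_COST_MS, WORKERS)
peak_new = utilization(PEAK_RPS, NEW_COST_MS, WORKERS)

print(f"off-peak, new code: {off_peak_new:.3f}")
print(f"peak, old code:     {peak_old:.3f}")
print(f"peak, new code:     {peak_new:.3f}")
```

Under these assumed numbers the new code looks comfortable off-peak and only exceeds capacity at peak, which is exactly the case an off-peak deploy cannot reveal.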

Assuming you've done all the performance testing and capacity planning you can afford, it's better to find out fast if it won't work during peak. It may harm more users per second, for a short period, than an issue you could detect off-peak, but at this point testing in production is unavoidable, and the overall harm is likely to be lower.

In general, it's better to fail fast than to fail slowly: the longer a change has been in production, the less likely your responders are to know which change explains the problem. If you fail fast, more engineers will be available to respond and they will respond more quickly, and the overall error rate will be lower than it would be after letting a new failure mode simmer as traffic grows.
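As a rough illustration of the trade-off, the sketch below compares total failed requests in the two scenarios. The error fraction, request rate, and time-to-mitigate values are all hypothetical assumptions. In particular, the longer diagnosis time for the delayed failure simply encodes the argument above and is not measured data.

```python
def failed_requests(requests_per_second: float, error_fraction: float, seconds_to_mitigate: float) -> float:
    """Total requests that fail before the problem is mitigated."""
    return requests_per_second * error_fraction * seconds_to_mitigate


# Hypothetical figures (assumptions for illustration only).
PEAK_RPS = 1000.0
ERROR_FRACTION = 0.2  # share of requests failing once capacity is exceeded

# Deploy at peak: the failure follows the deploy immediately, so rollback is quick.
FAST_MITIGATION_S = 5 * 60
# Deploy off-peak: the failure appears at the next peak, and responders
# must first work out which earlier change is responsible.
SLOW_MITIGATION_S = 45 * 60

fail_fast = failed_requests(PEAK_RPS, ERROR_FRACTION, FAST_MITIGATION_S)
fail_slow = failed_requests(PEAK_RPS, ERROR_FRACTION, SLOW_MITIGATION_S)

print(f"deploy at peak:  {fail_fast:,.0f} failed requests")
print(f"deploy off-peak: {fail_slow:,.0f} failed requests")
```

The comparison depends entirely on how much longer the delayed failure takes to diagnose. If your team can attribute and roll back a days-old change as quickly as a fresh one, the gap shrinks accordingly.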

I hope this sounds obvious, but I've noticed that people often don’t think along these lines.