You can’t avoid surprises but you can prepare for them

Rudy Winnacker
Operations Engineering
3 min read · Apr 2, 2013

“What’s the worst that could happen?” is a rhetorical question I've heard frequently. I grew up learning that it was a form of reassurance. A way of saying, “Let's do this thing, folks, it's awesome!” or, put differently, “What could go wrong?”

For this reason, the colloquialism has been a useful way of introducing coworkers to the peculiar mindset of the production engineer who carries the pager and operates the servers. The typical pattern is this: a developer comes up with an idea for a wonderful feature, builds it, and is eager to put it into production. If we're lucky, it then goes through a review process, which is assumed to be a matter-of-course rubber stamp. So, as the would-be stamp, I ask, “What's the worst that can happen?” There are smiles and no answers, indicating that this has been received as a rhetorical question. I indulge myself in a few seconds of delay before asking, “No. Seriously, what's the worst that can happen?” I then follow this with something like, “You are changing the interface to the data. So you could delete or update all the data. What would we do if that happens?”

Especially with newer developers, this is generally followed by a blank stare and silence while the possibility of failure is considered. This can contribute to the professional development of the programmer if it helps him step back from his own code and examine it from a distance for weaknesses. Here are some other typical follow-up questions:

  • “You're creating a new user-input vector. Are you filtering that input for cross-site scripting or other bad behavior?”
  • “You're implementing server-side processing to produce a new list of results. How much CPU overhead could be required for large inputs?”
  • “You're changing the way we use the cache. Have you considered the size of your objects, the rate of requests, and the limits set on the system?”
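
Two of these questions can be given a first, rough answer in a few lines. The following is a minimal sketch rather than a complete review checklist: the function names, the traffic figures, and the 1 GiB limit are all hypothetical, and escaping with Python's standard `html.escape` only covers input rendered into HTML text or attribute values, so it is not a full defense against cross-site scripting.

```python
import html


def escape_user_input(text: str) -> str:
    """Escape HTML metacharacters so user input renders as text, not markup.

    Assumption: the value is rendered into an HTML body or attribute value.
    Other contexts (URLs, inline scripts) need different handling.
    """
    return html.escape(text, quote=True)


def cache_fits(object_bytes: int, writes_per_sec: float,
               ttl_sec: float, limit_bytes: int) -> bool:
    """Back-of-envelope check of steady-state cache footprint.

    Assumption: every write creates a distinct object that lives for the
    full TTL, so live objects are roughly write rate times TTL.
    """
    live_objects = writes_per_sec * ttl_sec
    return live_objects * object_bytes <= limit_bytes


if __name__ == "__main__":
    print(escape_user_input("<script>alert('x')</script>"))
    # Hypothetical numbers: 500 writes/s, 5-minute TTL, 1 GiB cache limit.
    print(cache_fits(2 * 1024, 500, 300, 1 << 30))   # small objects
    print(cache_fits(64 * 1024, 500, 300, 1 << 30))  # large objects
```

The point of the second function is not precision. It is that the size of the objects, the rate of requests, and the configured limit get multiplied together before the feature ships rather than after.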

Practically, the best-case scenario is not for the developer to implement his change perfectly. Perfection is unlikely. The best-case scenario is for him to find out what he’s missed by going through this review and, ideally, to discover how to correct it through more testing in a safe environment.

This correction invariably involves creating monitoring systems, if they aren’t already in place, and reporting the resulting measurements so that it is obvious when something has gone wrong and where in the system the failure happened.
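
As an illustration of reporting measurements so that both the "when" and the "where" are obvious, here is a small sketch. The `Measurement` type, the component names, and the 1% error-rate threshold are assumptions made for the example. They do not describe any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    component: str  # where in the system the sample was taken
    errors: int     # failed requests in the sampling window
    requests: int   # total requests in the sampling window


def failing_components(samples, max_error_rate=0.01):
    """Return (component, error_rate) pairs above the threshold, worst first."""
    failing = []
    for s in samples:
        if s.requests == 0:
            continue  # no traffic in this window, so no rate to judge
        rate = s.errors / s.requests
        if rate > max_error_rate:
            failing.append((s.component, rate))
    return sorted(failing, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    window = [
        Measurement("web", 2, 1000),
        Measurement("cache", 50, 1000),
        Measurement("db", 300, 1000),
    ]
    for component, rate in failing_components(window):
        print(f"ALERT {component}: error rate {rate:.1%}")
```

Sorting the worst offender first is a deliberate choice: whoever is carrying the pager sees the most likely location of the failure at the top of the report.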

I’ll admit without pride that there is nothing like taking the site down to teach the importance of thinking through worst-case scenarios. Even if this happens in production after your coworker has deployed his awesome feature, going through this exercise will have helped him stay in good standing by giving him the ability to say he had thought through the question, “What’s the worst that could happen?”

Thanks to: @jeremy, @braindentist, @dpwenger, @arcdoc, @miekow

Rudy Winnacker
Operations Engineering

Operations engineer, formerly with: Twitter, Google, Blogger.