Does this sound familiar?
“Oh, fiddlesticks! There’s a problem in prod! I need to fix it right away!
Good news: I know where the bug is. So I’ll just make this teensy little fix.
No time to run tests… it’s such a small change, I’ll just push it to prod.
And: 3, 2, 1…”
…it seemed like such an inconsequential change, but it actually made things worse. I’ll admit it: I’ve done this many times. I panicked, and using poor judgment, I convinced myself I could force-push. This is how I’ve managed to turn UI bugs into crashloops, and crashloops into data-corruption issues. However bad the problem, I’ve found a way to exacerbate it.
What happened? Maybe my deploy included a commit I didn’t know about. Or maybe some credentials changed. Or the full deploy process included a step I forgot about. Whatever the case, I’ve now gone from an “emergency” to a real emergency. The problem got worse, and I’ve lost credibility. Or maybe I didn’t add any new problems, but my fix just didn’t work. Either way, I’ve wasted precious time and the people around me aren’t happy.
This is why we have tests.
Humans are fallible. Our motivation for writing automated tests is to have high confidence that every deploy is safe and reliable. So today’s successful software teams typically cover their code thoroughly with tests, and use Continuous Integration (CI) systems to run all those tests on each new application version, before that version may be deployed. That’s how we know that our deploys are safe.
Except that the tests take so long. In an emergency, time is money! We can’t afford to twiddle our thumbs while the CI system runs a million regression tests that we’re pretty sure are going to pass anyway.
Can’t we just skip them this one time?
No. This is why we have tests! An emergency deploy is exactly when we need our tests the most. We should run every test for every deploy. Of course, that’s a big “should,” and it’s easier said than done. Fully testing even a small change may take many hours, and every second of downtime can mean lost revenue. How do we find the balance?
Stop skipping tests. Here’s how:
- Establish a new normal. You know, and your stakeholders know, that you can’t literally fix problems instantly. At the very least, a fix takes as long as pushing out a code change. So let’s change the expectations around what’s possible: include testing time in your understanding of “immediate” deployment.
- Make an emergency plan. That said, it may be unacceptably slow to run all of your tests during an emergency. So plan ahead: create a deployment pipeline that has the optimal tradeoff between test coverage and speed. Test this pipeline using known-good code, to be sure the pipeline itself is correct. Don’t rely on some back-door hack to make an emergency push.
- Roll back. Fast. The best fix is the “undo” button. If you have confidence in your rollback procedure, invoke it immediately, and figure out the problem at your leisure. Don’t have confidence in your rollback procedure? Invest in a Continuous Delivery (CD) process that makes it easy to roll back both code and infrastructure. Test the procedure regularly, to be sure it will work when you need it.
- Know what’s in prod, and fix only that. Sometimes you think you know which version is in prod, so you patch that and deploy it. Unfortunately, it turns out you deployed a patch to the wrong version, containing some other stuff that doesn’t work. So it’s important to be able to inspect production systems to get their version number (which could be a semver tag, or, better yet, a commit hash). Then you can check out that exact code and make a precise fix.
- Do a post-mortem. As with all things DevOps, reflection drives continuous improvement. After a failure, or a failed fix, have an open discussion about what went wrong and how to make it better. This discussion is known as a “post-mortem” (alternatively, “retrospective” or “learning review”) and it works best when framed in terms of building understanding, rather than assigning blame. It’s an invaluable opportunity to learn and grow as a team.
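To make the “know what’s in prod” step a little more concrete, here is a minimal Python sketch of a pre-patch check. It assumes you can already get the deployed commit hash from somewhere (a version endpoint, build metadata, a deploy log) and that you are preparing the fix in a local Git checkout. The function names `local_head`, `same_commit`, and `check_patch_base` are made up for illustration and don’t come from any particular tool.

```python
import subprocess


def local_head(repo_dir="."):
    """Return the full commit hash of HEAD in the given Git checkout."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def same_commit(deployed, local):
    """Return True if two commit hashes appear to name the same commit.

    Either hash may be abbreviated, so compare over the shorter length.
    Prefixes under 7 characters are rejected as too ambiguous
    (an arbitrary threshold chosen for this sketch).
    """
    a = deployed.strip().lower()
    b = local.strip().lower()
    n = min(len(a), len(b))
    if n < 7:
        return False
    return a[:n] == b[:n]


def check_patch_base(deployed, repo_dir="."):
    """Raise if the local checkout is not the commit reported by prod."""
    head = local_head(repo_dir)
    if not same_commit(deployed, head):
        raise RuntimeError(
            f"prod reports {deployed}, but this checkout is at {head}; "
            "check out the deployed commit before patching"
        )
    return head
```

Calling `check_patch_base(deployed_hash)` before you start editing would fail loudly if your checkout has drifted from what prod reports, which is one cheap way to avoid patching the wrong version.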
Have you ever skipped the tests, and paid a price for it? (Or not? Did you get lucky?) Tell your tale of woe in the comments!