Did we break production?
How to solve and prevent it, applying urban legends (…wait, urban what?)
I guess it’s not just me, and you feel that cold sweat down your back too just from reading this title. No software developer wants to break production… but the odds are that it will happen to all of us, and not just once.
So this is the story about how we temporarily killed our project.
Sometimes deployments act like printers; did you hear the one about “don’t let your printer know you’re in a hurry”? That’s the same feeling I get when deploying just before bank holidays. This story is about the day before Thanksgiving.
Act one: the explosion.
At CoverWallet we follow an agile development flow.
We develop in a local environment that mimics our production environment as accurately as possible. We create pull requests to be reviewed by our teammates; once a PR is approved (and the feature passes local testing), we move to a beta environment.
In beta we retest, and if everything works as expected we assign the task to the QA team. They will execute regression tests, and check the feature with a fresh set of eyes, unbiased by our own expectations as developers about how the code should behave.
Once approved, we move to production, retest, and QA again checks if everything is running properly.
So… let’s go back to the day before Thanksgiving. Imagine you finished your feature a few days ago (meaning: you tested it locally, moved it to beta, the QA team tested it, and everything looked great). Time to move to production! The deploy finished, all tests passed, everything looks fine… but you navigate to your URL and nothing loads. A totally empty screen. I mean: we don’t even get our 404 error page. Silence on the other side.
…Can I panic now? (Spoiler: NO).
Ok, something failed, how do we fix it?
We’re all in the same boat.
Before starting with it, is your team aware of what’s happening? Is anyone else already working on a fix? Go to your main tech channel and communicate the problem:
Hey team, I’m seeing production is down. Is someone looking into it?
Add a screenshot, the services impacted, direct links to logs/dashboards, and any other information that would make it easier for someone to understand what is happening right away.
1. Narrow down the error.
Check logs (if you don’t track errors in your project, you should get on this. Sentry.io, anyone?). This is what our teammate did, but the error logs weren’t related to the deployed code; probably a cascading failure created by our new feature.
2. Get some eyeballs on it.
Ask for help; the sooner, the better. The moment you realize you can’t quickly locate the error, you need additional teammates checking (we’re talking about production not working at all here!). Within a few minutes, we were a totally improvised team of developers, DevOps, and QAs checking… but the situation was this: our servers were operational, the feature worked in all testing environments, and the logs provided no clues.
3. Keep calm… could you git revert?
This is the strategy: assess the effort of finding and fixing the error immediately. If you believe you can’t do it easily, and you can roll back to restore a working environment, then revert and keep investigating later. Safer and quicker.
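The revert strategy can be sketched in a few git commands. This is a minimal demo in a throwaway repo (in real life you would run the revert on your actual master branch; the repo, file, and commit names here are illustrative only):

```shell
set -e
git init -q demo-revert && cd demo-revert
G="git -c user.email=demo@example.com -c user.name=demo"
$G commit -q --allow-empty -m "last known good release"

# The deployment that broke production:
echo "broken code" > feature.txt
git add feature.txt
$G commit -q -m "deploy: new feature"

# Revert it: history is preserved, and production gets the old behavior back,
# while the broken commit stays available for later investigation.
$G revert --no-edit HEAD

test ! -f feature.txt && echo "production restored"
```

Note that `git revert` adds a new commit undoing the change, rather than rewriting history, so teammates’ clones stay consistent.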
Act two: the fire wasn’t extinguished.
We chose to revert the code and investigate without rushing. But… new problem: once the rollback finished, the project was still dead.
And now… can I panic? (Spoiler: again, NO).
The problem wasn’t coming from our last deployment. Let’s repeat the cycle:
1. Narrow down the error.
Same error in the logs. Once the revert was done, the question came back to the table: does this error make sense now, with the current state of the code?
2. Get some eyeballs on it.
While investigating, we kept discussing what we were testing in our public tech channel, all together. This lets everyone (Product Managers included!) stay aware of the situation, while the team stays focused on the investigation, unbothered by nervous people constantly asking for status updates.
Also, this is a great opportunity for everyone to follow the research and add new proposals in case they have fresh ideas! Never underestimate clear and direct team communication.
3. Keep calm… could you git revert?
Let’s recap: when our teammate warned us that production wasn’t working after their deployment, we reverted it, but it kept failing. We’re now at deployment status -1.
At CoverWallet we use a service-oriented architecture: many interconnected microservices. If we revert too far back, we risk leaving our failing service out of sync with the other, working services.
At this point we analyzed the code again, and our error logs finally made sense: an outdated call to a service had been added in this status -1, and nothing was coming back!
We decided to revert this state too, to a point three deployments back, when this variable wasn’t used: this is status -2.
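A deeper rollback like this can also be done with `git revert`. The sketch below assumes each deployment is one commit on master and reverts the last three as a single rollback commit, landing on the equivalent of “status -2”; the repo and file names are illustrative:

```shell
set -e
git init -q demo-rollback && cd demo-rollback
G="git -c user.email=demo@example.com -c user.name=demo"
$G commit -q --allow-empty -m "known good state"

# Three deployments land on master after the known good state.
for n in 1 2 3; do
  echo "change $n" > "deploy_$n.txt"
  git add "deploy_$n.txt"
  $G commit -q -m "deployment $n"
done

# Revert the three commits newest-first, staged but uncommitted,
# then record them as one rollback commit.
$G revert --no-commit HEAD~3..HEAD
$G commit -q -m "rollback: return to known good state"

test ! -f deploy_1.txt && echo "rolled back"
```

Reverting as one commit keeps history linear and makes it easy to re-apply the deployments later, one by one, once the bad merge is found.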
Act three: the urban legend.
It worked! Our project was running again, so… now it’s time for the urban legend!
Did you know that the Chinese word for crisis is also used to refer to opportunity?
Well, I’m sorry to tell you this is a mistake…
…but it relies on a principle we try to keep as a core value at CoverWallet:
…something failed? Let’s learn from it (and improve our developers and flows)!
Only those who never do, never make mistakes.
Keep in mind this was an investigation to understand what happened and prevent it from occurring again… but hey, errors happen! We’re only human, after all, and we all make mistakes. Even with the best planning, flow, and testing, we fail. Learning from it is the key to improvement.
See the wide context.
We’re currently working on a new project: different teams integrating new features, new technologies, renewing our stack, and making great improvements! It is such an amazing time to work on it, but this also means a lot of people are coding on the same parts of the project.
Then… where did the error happen?
A few features ago, when merging branches, we removed variables that weren’t used anymore. Later, one of them was added back, calling a service that no longer provided it. So why was it added back? Here it comes…
Healthy updates on your projects.
A good practice is to keep your working environment as similar to production as possible. This means updating your code to mirror production at every step:
- We start updating our local master branch (to replicate production). From this state, we create our working branch.
- Once our code is approved and local tests pass, we update our working branch with master, and only then do we merge into the beta branch.
- Once the feature is approved by QA in beta, we update our working branch with master again, then merge it into master and deploy to production.
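The steps above can be sketched end-to-end in a demo repo. Branch names (`master`, `beta`, `feature/my-feature`) are placeholders for your own naming, and the teammate’s commit simulates master moving on while you work:

```shell
set -e
git init -q -b master flow-demo && cd flow-demo
G="git -c user.email=demo@example.com -c user.name=demo"
$G commit -q --allow-empty -m "initial"
git branch beta

# 1. Branch off an up-to-date master.
git checkout -q -b feature/my-feature
echo "new code" > feature.txt && git add feature.txt
$G commit -q -m "feature work"

# Meanwhile, master moves on (a teammate's deployment).
git checkout -q master
echo "teammate change" > other.txt && git add other.txt
$G commit -q -m "teammate deployment"

# 2. Update the feature branch from master BEFORE merging to beta,
#    so conflicts are resolved on the branch, not in beta.
git checkout -q feature/my-feature
$G merge -q master -m "update from master"
git checkout -q beta
$G merge -q feature/my-feature -m "merge to beta"

# 3. Update from master one last time, then merge to master for production.
git checkout -q feature/my-feature
$G merge -q master -m "update from master again"
git checkout -q master
$G merge -q feature/my-feature -m "merge to master"

test -f feature.txt && test -f other.txt && echo "flow OK"
```

In a real setup you would `git pull origin master` instead of simulating the teammate’s commit locally, but the shape of the flow is the same.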
But why so many updates?
Well… CoverWallet has grown a lot in the past few years. We’re now a big group of developers implementing different features from different time zones! By the time your code is tested and ready to merge, your branch will almost certainly differ from what is in production, and you will need to incorporate those changes to keep your teammates’ code working.
Our failure happened because our branch’s variables had gone stale against master. When merging code, if both branches change the same lines, Git can’t resolve the difference on its own; it raises a conflict to prevent silent failures and makes developers resolve it themselves. By resolving a conflict wrongly, we added bad code to the final result. That’s all, folks!
But wait, wasn’t it tested before going live? How did it reach production if it didn’t work…?
Well, take a seat: the truth is that it worked… in its environment. Let’s recap again: we’re at deployment -1, merged to production, no warnings about problems. How? This code was intended to affect only the beta environment; it was beta-tested, and it wasn’t going to be production-tested until we disabled the feature flag that kept it hidden in production. But a bad merge dragged in variables that shouldn’t exist in production, causing the error.
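For readers unfamiliar with feature flags, here is a hypothetical sketch of the pattern: the same code ships to every environment, and a flag decides whether the new path actually runs. The variable and function names are illustrative, not CoverWallet’s real setup:

```shell
# Returns true only when the flag is explicitly switched on.
new_feature_enabled() {
  [ "${ENABLE_NEW_FEATURE:-false}" = "true" ]
}

render_page() {
  if new_feature_enabled; then
    echo "rendering NEW feature"   # beta: flag on, new code path exercised
  else
    echo "rendering old page"      # production: flag off, new code stays dormant
  fi
}

ENABLE_NEW_FEATURE=true  render_page   # beta environment
ENABLE_NEW_FEATURE=false render_page   # production environment
```

The catch from our story: the flag only guards the code path it wraps. A bad merge can still ship broken code outside the guarded branch, and that code runs everywhere, flag or no flag.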
Looking for the flow to improve.
This is why at CoverWallet we run post-mortems, and this is how we learn for the future.
In the Lannisters team (which I’m part of, and where all this happened), we’ve increased our process documentation to fill any knowledge gaps. As we saw before, we’ve grown a lot lately and have teammates coming from different tech stacks, so we tried to broaden the shared context. In brief, this is our strategy:
1. Healthy updates!
Always update your code with the master branch before you start coding and before moving to a testing/production branch. There is no such thing as “updating too much”. Keeping your code updated will help you and your teammates avoid problems: you need to keep not only your own code working, but your colleagues’ too.
If, when merging, you find a conflict you’re not sure how to resolve, ask for help! No one has the context to understand every feature, so there’s no shame in raising a hand. We’ve all needed it!
2. Plan your testing!
Never start coding without understanding how you’ll test your work locally and in beta. If you don’t know how to test it, you don’t know whether you’re introducing bugs, and under no circumstances should you move to test branches before testing it yourself first.
3. Look after your baby!
Never leave a deployment unattended: wait until it’s finished and test to make sure the environment is responding as expected.
4. Ask for help!
Did it fail? Don’t panic! Check the logs and speak about it openly; we have dedicated tech channels to ask for help. We can benefit from this enriching multidisciplinary environment and learn from our teammates!
5. Improve documentation!
We’ll keep facing hard situations and failures in the future, but the key is to learn from them by always improving our docs! Who knows, maybe the documentation other teammates are writing about their last problem will save you from a rough moment in a few months…
…you know, a Lannister always pays his debts, even when related to improving flows ;)
Thanks to Guy Silva, Marcus Venable, and Pascale Abou Moussa.