Software Fail-Safes: How Your Code Can Suck but It Doesn’t Matter
I’ve been surprised over the years by how many software developers don’t get that putting fail-safes in their code is a good thing. Many perceive it as purposely allowing bugs or propping up a sub-par architecture. But I find fail-safes necessary for building a system that is resilient when things fail. Without them you are chasing the perfect system, which will never happen in real life.
For me fail-safes are an admission that my code will suck at some point even if I try my hardest, so I need to put some safety nets in place to catch me when I’m stupid.
Here is a typical scenario that often rubs developers the wrong way:
Manager: We found a bug!
Developer 1: Let’s find the root cause and fix it.
Developer 2: The root cause is that resource A sometimes hangs when we call it on Groundhog Day. It will probably take a lot of time to debug and fix resource A.
Manager: Release now!
Developer 2: Well, even if we fix the root cause, this could fail in other ways with the same result. Why don’t we write a timeout that kills the call to resource A and retries? Then we’re covered in this scenario and in network-failure scenarios too.
Developer 1: Our code and architecture must be perfect!
As a developer it feels sleazy to allow such a glaring imperfection and then do something that just sweeps it under the rug. Should you fix the root cause? Yes, but you can often find fail-safes that cover a multitude of problems, ones you know about and ones you don’t, and that let you keep moving toward a deadline. Coding in the real world means accepting some imperfections in your code, and accepting that your coding and architecture skills may fail you. In the long term, yes, you should fix the root cause in this scenario, but you should also leave the fail-safe in for unexpected errors. In the short term, the fail-safe gives you some breathing room to prioritize.
This scenario could also tip the other way if the failure happened more often. A timeout would still help, but the constant performance penalty might be high enough that you have to fix the root cause. Even then, the timeout would still cover you in unexpected situations.
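Developer 2’s suggestion can be sketched as a small wrapper. This is a minimal sketch, not the post’s actual code: the function name `call_with_timeout_and_retry` and the timing parameters are hypothetical, and a real system would tune them and pick a real timeout mechanism for its platform.

```python
import concurrent.futures
import time

def call_with_timeout_and_retry(fn, timeout=2.0, retries=3, backoff=0.5):
    """Call fn(); if it hangs past `timeout` or raises, retry with backoff.

    This is the fail-safe: it covers the known hang in "resource A" and
    also any other slow or flaky failure mode we haven't met yet.
    """
    last_exc = None
    for attempt in range(retries):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn)
        try:
            result = future.result(timeout=timeout)
            pool.shutdown(wait=False)
            return result
        except concurrent.futures.TimeoutError as exc:
            last_exc = exc
        except Exception as exc:
            last_exc = exc
        # Don't block waiting for a hung worker thread; move on and retry.
        pool.shutdown(wait=False)
        time.sleep(backoff * (attempt + 1))  # simple linear backoff
    raise last_exc
```

Note the trade-off the text describes: the wrapper does not fix the hang, it just bounds the damage, and the abandoned worker thread lingers until the call finally returns or the process exits. That is the imperfection you accept in exchange for shipping.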
Another pattern I often follow is to do only what is absolutely necessary for the user-facing transaction and ship everything else off to a background task. I then consider the transactional code more important and may even test it more than the background code. If background code fails, the time frame for fixing it is much wider and the user often will not notice. Again, many developers consider this an invitation to bad design. However, when everything is important, nothing is important. So isn’t it better to prioritize the most valuable part of the code? In this case the fail-safe is making the transactional code as small as possible so that failures do not affect the user experience as much.
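The split might look something like this. It is a sketch under assumptions: `place_order`, the in-memory `orders` dict, and the email stub are hypothetical stand-ins for real persistence and messaging calls, and a production system would use a real task queue rather than a single worker thread.

```python
import logging
import queue
import threading

tasks = queue.Queue()

def worker():
    """Drain the queue; log background failures instead of surfacing them."""
    while True:
        job = tasks.get()
        try:
            job()
        except Exception:
            # A background failure is logged, not shown to the user;
            # we can fix it on our own schedule.
            logging.exception("background task failed")
        finally:
            tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

# Hypothetical stand-ins for real persistence and email calls.
orders, emails = {}, []

def save_order(order):
    order_id = len(orders) + 1
    orders[order_id] = order
    return order_id

def send_receipt_email(order_id):
    emails.append(order_id)

def place_order(order):
    # Transactional, user-facing part: kept small and tested the hardest.
    order_id = save_order(order)
    # Everything else is deferred; the user never waits on it or sees it fail.
    tasks.put(lambda: send_receipt_email(order_id))
    return order_id
```

The fail-safe here is structural: only `save_order` can ruin the user’s request, so that is where the testing effort goes.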
Next time you find yourself in a situation where the answer is “we need to be better developers”, “test more”, or “we need to fix the root cause”, ask yourself whether there is a fail-safe that will allow you to be bad. Maybe you should only be bad for a while, or maybe, like the Virtual DOM, it is something that allows you to be bad all the time. In the real world we need levers that let us prioritize and give us the flexibility to act badly. You won’t be on your A game 100% of the time, so find ways to add slack.