Programming where Failure is Not an Option
When most programmers hear of systems where “failure is not an option”, they think of space shuttles or self-driving cars. I know nothing about space shuttles, but I suspect that what people really mean is that space shuttle software does its very best not to result in death (and “very best” has a very strong “very”).
I am not a space shuttle programmer, I am a lowly Flash Developer, so I only write IL for security-flaw-ridden-virtual-machine-software-blitters. Though statistically, perhaps I am “better” than space shuttle programmers at programming, because so far my code has a 0% mortality rate.
In my domain, math libraries are probably the systems most likely to be called systems that cannot fail. In games, math libraries cannot fail; well, except, of course, when they can: like when you ask them to successfully divide by zero, or to slerp Euler rotations without gimbal lock. Putting on our meta-analysis hats for a moment, we realize that these things can only be considered “failures” if we accept that using them incorrectly is correct usage.
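To make that concrete, here is a minimal sketch (the function name is hypothetical, and Python stands in for whatever your math library is written in) of a divide whose zero case is part of the contract rather than a surprise:

```python
from typing import Optional

def safe_divide(a: float, b: float) -> Optional[float]:
    """Divide a by b, or return None when b is zero.

    The zero case is not a crash or a silent NaN; it is a documented,
    well-defined failure the caller is forced to think about.
    """
    if b == 0:
        return None
    return a / b

print(safe_divide(10, 2))  # 5.0
print(safe_divide(10, 0))  # None: the contract admits that this can happen
```

The point is not the `None` specifically; it is that the failure case has a name and a place in the signature.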
And that’s where this “failure is not an option” train starts to derail (and it IS a train), because it turns out that using things incorrectly is a very human thing to do. Give me a system and I promise you I can use it incorrectly. Write a system and the only certainty is that someone else will use it differently than you intended. And, to be fair, the Universe seems to do this too, which is why we invented ECC RAM so those freaking cosmic rays don’t affect our PUBG chicken dinners. Thus, when we’re writing, we have to think long and hard about the various failures of the systems we’re writing and the failures of the systems we’re consuming.
“Core systems cannot fail” (you know, like I/O). A core requirement for any kernel. It is very strange, then, that S3 replicates data across AZs, or that proper usage of any I/O interface is to just try writing to it while catching the errors that pop up.
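That “just try it and catch the errors” shape can be sketched literally; the function name and paths here are hypothetical:

```python
import os
import tempfile

def try_write(path: str, line: str) -> bool:
    """Attempt the write and report failure, instead of pretending I/O cannot fail."""
    try:
        with open(path, "a") as f:
            f.write(line + "\n")
        return True
    except OSError:
        return False

# A real, writable file: the write succeeds.
fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)
print(try_write(log_path, "hello"))  # True

# A path that cannot exist (a file used as a directory): the failure is
# caught and reported, not denied.
print(try_write(os.path.join(log_path, "nope.log"), "hello"))  # False
```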
The lesson here is that instead of trying to write software that cannot fail, we should generally be writing software where failure is well-defined. I’m not sure if this is a result or a distillation of Worse is Better. Out with it at the get-go: stop making developers who consume your code stand out in the middle of a field with a witching stick, trying to divine where the failure cases lie or what failure even is. And, for the love of witching sticks, stop denying that failures exist.
Many may now think that I am recommending writing garbage code. How do we balance writing “good simple systems” against the fact that they can fail because of bugs or misuse or maybe (among other options) high energy protons and atomic nuclei?
I recently hung some wallpaper in our upstairs bathroom. The thing about hanging wallpaper is that sheets of wallpaper likely have a different geometry than your wall. This is unfortunately a mathematical truth of isometries (at least, for “non-stretchy” wallpaper). The result is that you get wrinkles here and there. What do you do with these wrinkles then? Well first, you don’t ignore them. Then you group them together to make bigger wrinkles, then you move them to a spot where the wrinkles don’t matter so much, like perhaps behind the vanity, or under a light switch.
This is how we should author software systems.
Instead of stamping out lots of small failure cases (cutting holes in our wallpaper), try orienting the system design around large failure cases that shake the smaller ones out. Then, instead of fixing them, document them. Sometimes (don’t hate me), documentation IS the fix.
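As a sketch of what that might look like in code (the function and its contract are hypothetical), here is one big, documented failure case standing in for many small ones:

```python
def parse_port(text: str) -> int:
    """Parse a TCP port number from a string.

    Known wrinkle, moved behind the vanity and documented: anything that is
    not an integer in 1..65535 raises ValueError. One large, well-defined
    failure case covers all the small ones (empty strings, junk, negatives,
    out-of-range values).
    """
    try:
        port = int(text)
    except ValueError:
        raise ValueError(f"not a port number: {text!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port

print(parse_port("8080"))  # 8080
```

The docstring is doing real work here: it does not fix the wrinkle, it tells you exactly where the wrinkle lives.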
This is the wallpaper method of systems design.