Two types of engineering resiliency

There are two metaphors for resiliency and risk management in traditional engineering: the NASA way and the US Navy way. I’m not entirely sure how accurately they reflect reality, given the story of the Apollo Guidance Computer’s design, but they work well enough to make the point.

  • The NASA way is about provable systems, consistent engineering, repeatable practices, and forbidding system behaviors that impose great risk, or working around them with systematic redundancy. It’s expensive and tedious, and mistakes are fatal.
  • The US Navy way is all about keeping personnel, processes and tools in a shape that helps manage SNAFU in the most efficient way, with the shortest time back to system stability.

When it comes to modern computer engineering and security, we’ve rarely been thorough enough to go full NASA on anything, and we’ve jumped too quickly to the US Navy way for things we’ve barely learnt to cope with. This is the paradox that has kept me intrigued for years. The NASA way makes design mistakes fatal, whereas the US Navy way converts design mistakes into a performance or expenditure penalty of some kind. The reason why computer (and specifically, security) engineering intuitively shifted to the latter is fairly obvious if you look at it through this model:

Fighting against the laws of nature is a far more analytical process than fighting against an unknown, non-deterministic adversary.

We humans carry decisions over from one domain into another by the power of association.

But this has a few drawbacks.

  • Drawback 1. Most of the uncertainty in modern engineering comes from shit engineering, not some magical source of randomness, and the more shit engineering we flood the market with, the less designable infrastructures and defenses become and the more ad hoc they end up being. Cost optimization leads us to more JS programmers, but when we strive for reliability, we end up spending the saved cost on massive online, on-call, holistic security engineering (and ops/infra as well).
  • Drawback 2. We often forget that the US Navy still relies on plenty of advanced and reliable technology constructed the NASA way. The baseline for situational awareness is being able to get good data and to apply reaction tactics using predictable tools with predictable results.

That’s why ad-hoc and reactive defenses are the last mile of defense, and systems should at all times at least attempt to make some components verifiable and deterministic, to balance the uncertainty on their borders with predictability in their core.
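
To make the "predictable core, uncertain border" idea a bit more tangible, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `State` and `Event` names, the lock/unlock scenario, and the `handle_untrusted` helper are my assumptions, not a reference design. The core is a small transition table that can be checked exhaustively; the border is the only place where messy input is allowed to show up, and it either normalizes that input or rejects it.

```python
from enum import Enum


class State(Enum):
    LOCKED = "locked"
    UNLOCKED = "unlocked"


class Event(Enum):
    VALID_TOKEN = "valid_token"
    TIMEOUT = "timeout"


# Deterministic core: a transition table over a small, finite state space.
TRANSITIONS = {
    (State.LOCKED, Event.VALID_TOKEN): State.UNLOCKED,
    (State.LOCKED, Event.TIMEOUT): State.LOCKED,
    (State.UNLOCKED, Event.VALID_TOKEN): State.UNLOCKED,
    (State.UNLOCKED, Event.TIMEOUT): State.LOCKED,
}


def step(state, event):
    """Pure function: the same inputs always produce the same next state."""
    return TRANSITIONS[(state, event)]


def core_is_total():
    """Exhaustively check that every (state, event) pair is handled."""
    return all((s, e) in TRANSITIONS for s in State for e in Event)


def handle_untrusted(state, raw):
    """Border: normalize or reject messy input before it reaches the core."""
    try:
        event = Event(raw.strip().lower())
    except (ValueError, AttributeError):
        # Unknown or malformed input: leave the state untouched and
        # report the rejection so the reactive side can deal with it.
        return state, False
    return step(state, event), True
```

The point of the split is that `core_is_total()` can be run as a test and actually proves something about the core, while all the ad-hoc, reactive handling stays confined to `handle_untrusted`.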