Observer of the world, climber of the rocks. https://aws.amazon.com/evangelists/adrian-hornsby/ Opinions are my own.
… that it is the very goal of chaos experiments to reveal such weaknesses, and that’s exactly right. However, a rollback/revert button that promises a quick return to safety is, strictly speaking, a scam. That isn’t to say we should do away with these safeguards entirely. It just means we should stop implying that all, or even most, actions can be quickly reversed, because that implication may lead engineers to take more risk than necessary.
If the faults that are injected (even at random) are handled transparently and gracefully, they can go unnoticed. You would think this was the goal: for failures not to matter at all when they occur. This masking of failures, however, can breed the very complacency that fault injection is intended (or at least should be intended) to reduce. In other words, when randomly generated or continual fault injection and recovery are happening successfully, care must be taken to keep people aware, in detail, that this is happening: when, how, where, and so on. Otherwise, the injected failures become just another component that adds complexity to the system while still being limited in what they can accomplish (because they are still contrived and therefore insufficient).
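One way to picture that kind of awareness is a fault injector that records every injection, including the ones the system absorbs without anyone noticing. The sketch below is only an illustration under stated assumptions: `AuditedFaultInjector`, `InjectionRecord`, and the `"simulated_error"` fault type are hypothetical names invented for this example, not part of any real chaos engineering tool.

```python
import random
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class InjectionRecord:
    """One injected fault: when, where, how, and whether recovery worked."""
    when: str        # ISO-8601 UTC timestamp of the injection
    where: str       # name of the component the fault was injected into
    how: str         # kind of fault (hypothetical label for this sketch)
    recovered: bool  # True once the fallback path completed


@dataclass
class AuditedFaultInjector:
    """Hypothetical injector that never injects a fault silently."""
    probability: float
    rng: random.Random = field(default_factory=random.Random)
    records: list = field(default_factory=list)

    def call(self, where, operation, fallback):
        # No fault this time: run the real operation untouched.
        if self.rng.random() >= self.probability:
            return operation()

        # Fault injected: record it *before* attempting recovery, so the
        # injection stays visible even if the fallback itself raises.
        record = InjectionRecord(
            when=datetime.now(timezone.utc).isoformat(),
            where=where,
            how="simulated_error",
            recovered=False,
        )
        self.records.append(record)

        result = fallback()
        record.recovered = True
        return result

    def masked_faults(self):
        """Faults that were handled gracefully and so went unnoticed by callers."""
        return [r for r in self.records if r.recovered]
```

With `probability=1.0` every call takes the fallback path, and `masked_faults()` lists each fault the caller never saw. Reviewing that list regularly is one possible way to keep graceful recovery from turning into complacency.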
… all of your reliability concerns, and it never will be. It’s merely one of many approaches used to gain confidence in system correctness (typically in the face of perturbation). Consider it necessary but not sufficient. And by no means is it, or should it be, the only way to…