How to Start with Chaos Engineering Experiments?
The acceptance criteria you should be using for chaos engineering experiments
Chaos engineering experiments can be highly exploratory at the beginning.
You often have little or no idea what to look for beyond the possible unavailability of the system.
Let me give you a hand and provide some acceptance criteria to get you started:
- Downtime: Do you permit downtime during chaos engineering experiments? Perhaps you should, but not for every service. Identify the critical and non-critical services and establish the allowed downtime for each. As a hint, consider treating the services directly tied to revenue generation as critical.
- Service degradation: How long may a service stay degraded, i.e. what is the mean time to repair (MTTR) back to default performance? Split service degradation into two categories: latency and error rate. Account for spikes when the experiment starts, but stipulate a maximum allowed MTTR for your services.
- Data loss: What is your recovery point objective (RPO)? Do you consider data loss acceptable during chaos engineering experiments? Pay particular attention to requests that are in flight. If the RPO is zero, ensure that these in-flight requests are resolved after service restoration, so that data integrity is preserved.
- Incident response: Do you require manual intervention to respond to failures, or are all of the mechanisms automated? Make it clear when a chaos engineering experiment needs manual intervention for services to recover, even after the maximum allowed MTTR has elapsed. Monitoring should be in place to help you recognize whether these failures persist and what is triggering them.
- Repeatable chaos testing: Is the experiment you’re running repeatable? It most likely should be, especially since these experiments are usually manually intensive. Consider performing chaos experiments “on-demand” with automated playbooks that any engineer can trigger, to avoid “siloed engineering” knowledge practices. Furthermore, the experiment results should be published and documented in a…
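
To make the criteria above easier to check after every run, they can be written down as data and compared against what monitoring observed. The following is only a minimal sketch: the class names, field names, and the example thresholds are hypothetical illustrations, not part of any chaos engineering tool, and it assumes your monitoring can report each value in seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AcceptanceCriteria:
    """Per-service limits for one experiment (hypothetical field names)."""
    max_downtime_s: float            # allowed downtime; 0 for critical services
    max_degraded_s: float            # maximum allowed MTTR back to default performance
    max_data_loss_s: float           # recovery point objective (RPO); 0 means no data loss
    allow_manual_intervention: bool  # may an engineer step in for the service to recover?


@dataclass(frozen=True)
class Observation:
    """What monitoring recorded for the same service during the experiment."""
    downtime_s: float
    degraded_s: float          # time spent with latency or error rate above normal
    data_loss_s: float         # window of data that could not be recovered
    manual_intervention: bool  # did an engineer have to step in?


def evaluate(criteria: AcceptanceCriteria, seen: Observation) -> List[str]:
    """Return the names of the violated criteria; an empty list means the run passed."""
    violations = []
    if seen.downtime_s > criteria.max_downtime_s:
        violations.append("downtime")
    if seen.degraded_s > criteria.max_degraded_s:
        violations.append("service degradation")
    if seen.data_loss_s > criteria.max_data_loss_s:
        violations.append("data loss")
    if seen.manual_intervention and not criteria.allow_manual_intervention:
        violations.append("manual intervention")
    return violations


# Example: a revenue-critical service with made-up thresholds.
checkout = AcceptanceCriteria(
    max_downtime_s=0,
    max_degraded_s=120,
    max_data_loss_s=0,
    allow_manual_intervention=False,
)
run = Observation(downtime_s=0, degraded_s=300, data_loss_s=0, manual_intervention=False)
print(evaluate(checkout, run))  # ['service degradation']
```

Keeping the limits in one small structure per service means the same check can run after every on-demand experiment, and the list of violations can go straight into the published results.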