How to Start with Chaos Engineering Experiments?

The acceptance criteria you should be using for chaos engineering experiments

Sérgio Martins
Onfido Product and Tech

House completely consumed by flames.

Chaos engineering experiments can be highly exploratory at the beginning.

You often have little to no idea what to look for, other than possible unavailability of the system.

Let me give you a hand and provide some acceptance criteria to get you started:

  • Downtime: Are you permitting downtime during chaos engineering experiments? Perhaps you should, but not for all services. Identify which services are critical and which are non-critical, and establish the allowed downtime for each. As a hint, you may want to treat the services directly related to revenue generation as critical.
  • Service degradation: How long may a service stay degraded? In other words, what is the acceptable mean time to repair (MTTR) back to default performance? Split service degradation into two categories: latency and error rate. Allow for spikes when the experiment starts, but make sure you stipulate a maximum allowed MTTR for your services.
  • Data loss: What is your recovery point objective (RPO)? Do you consider data loss acceptable during chaos engineering experiments? Pay particular attention to requests that…
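
As a rough illustration, criteria like these could be written down as explicit thresholds and checked after each experiment. The Python sketch below is only one way to do it: every name and number in it (AcceptanceCriteria, ExperimentResult, violations, the 120-second limits) is a hypothetical placeholder, not part of any chaos engineering tool, and the observed values are assumed to come from your own monitoring.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriteria:
    """Hypothetical limits a service must stay within during an experiment."""
    max_downtime_s: float          # allowed downtime (0 for critical services)
    max_latency_recovery_s: float  # max time for latency to return to baseline
    max_error_recovery_s: float    # max time for error rate to return to baseline
    max_data_loss_s: float         # RPO: window of recent data that may be lost


@dataclass(frozen=True)
class ExperimentResult:
    """Values measured during one experiment (assumed to come from monitoring)."""
    downtime_s: float
    latency_recovery_s: float
    error_recovery_s: float
    data_loss_s: float


def violations(criteria: AcceptanceCriteria, result: ExperimentResult) -> list[str]:
    """Return one message per criterion whose observed value exceeds its limit."""
    checks = [
        ("downtime", result.downtime_s, criteria.max_downtime_s),
        ("latency recovery", result.latency_recovery_s, criteria.max_latency_recovery_s),
        ("error rate recovery", result.error_recovery_s, criteria.max_error_recovery_s),
        ("data loss", result.data_loss_s, criteria.max_data_loss_s),
    ]
    return [
        f"{name}: observed {observed}s, allowed {limit}s"
        for name, observed, limit in checks
        if observed > limit
    ]


# Example: a critical service with no downtime or data loss allowed (made-up numbers).
critical = AcceptanceCriteria(
    max_downtime_s=0.0,
    max_latency_recovery_s=120.0,
    max_error_recovery_s=120.0,
    max_data_loss_s=0.0,
)
observed = ExperimentResult(
    downtime_s=0.0,
    latency_recovery_s=95.0,
    error_recovery_s=150.0,
    data_loss_s=0.0,
)
failed = violations(critical, observed)
for message in failed:
    print(message)
```

Keeping latency and error rate as separate recovery limits mirrors the two degradation categories above, and a non-critical service would simply get its own AcceptanceCriteria instance with looser values.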


Hey, my name is Sérgio, and I’m a Senior Software Engineer by trade. Here you’ll find short and straight-to-the-point articles related to my craft, and business