Improving Operational Resiliency through GameDay
Operational resiliency (i.e. system resiliency and recoverability) is the ability to provide continuous service through people, processes and technologies in times of operational disruption. Downtime in an operational system can be very costly, and over time it erodes users' confidence in the system.
An effective way to measure operational resiliency is to simulate lifelike system failures, with calculated risks, directly on production systems. Operational resiliency can then be evaluated based on i) the system's ability to automatically recover from failure, ii) the processes that developers can follow to recover from failure, and iii) the team's readiness to recover from operational failure.
What is GameDay
GameDay is a dedicated day for running chaos engineering experiments on our system with our team.
What is chaos engineering?
According to Wikipedia, chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions. We prefer to define it more simply — chaos engineering is like a vaccine we inject into our bodies to build immunity.
There is a common misconception that Chaos Engineering is just about breaking stuff in production. But it really is not.
Running GameDay brings several benefits.
Key benefits of GameDay
- To test the operational resiliency of our system
- To refine our operations and support processes
- To improve response time when acting on operational issues
We prepare for GameDay with the following 3 concepts in mind.
3 key concepts in chaos engineering
1. Gradually expand blast radius
Experiments need to be carefully scoped so that we do not cause undesirable outcomes that we cannot recover from. This is why we have grouped our experiments into 3 categories.
i. Application
Experiments that will impact only one service and can be easily recovered from.
ii. Network
Experiments that will impact multiple services and will take some time to recover from.
iii. Infrastructure
Experiments that will impact many services and that our system may not be able to recover from.
The main idea is that we want to start with something small, something that can be recovered from easily, build confidence, and then progress to experiments that have bigger impact on our system.
2. Having extensive observability is a must
To measure the impact of the experiment, it is best if we can monitor all our system metrics to observe what is happening during the execution of our chaos experiments. Our team collects 3 different kinds of metrics.
i. Log Aggregation
All our services are container-based and deployed using Kubernetes. Alongside our services, we deploy Fluentd as a DaemonSet so that it runs on every Kubernetes node and forwards our logs to Elasticsearch. The Fluentd DaemonSet lets us collect logs from containerised applications easily.
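A trimmed sketch of such a DaemonSet manifest is shown below. The namespace, labels, Elasticsearch host and image tag are illustrative assumptions, not our exact manifest; the official fluent/fluentd-kubernetes-daemonset images ship with an Elasticsearch output preconfigured via environment variables.

```yaml
# Illustrative Fluentd DaemonSet sketch; names and values are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc"   # assumed service name
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log                  # container logs on the node
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
```

Because it is a DaemonSet, Kubernetes schedules exactly one Fluentd pod per node, which is what makes node-level log collection automatic.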
ii. Application performance
Most of our applications use Spring Boot. We use Micrometer to capture metrics and pipe them to Prometheus.
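In a typical Spring Boot 2.x setup, adding the micrometer-registry-prometheus dependency and exposing the Actuator endpoint is all the wiring needed; a minimal sketch of the relevant property (our actual exposure list may differ):

```properties
# Expose the Prometheus scrape endpoint via Spring Boot Actuator
management.endpoints.web.exposure.include = health,prometheus
```

Prometheus then scrapes the metrics from the application's /actuator/prometheus endpoint.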
iii. System heartbeat
We use Metricbeat to collect and pipe system metrics into Elasticsearch to monitor our infrastructure health. Metricbeat uses “modules” to collect metrics. Each module defines the basic logic for collecting metrics from a specific service (e.g. MongoDB, RabbitMQ or Kafka).
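As an example of the module-based configuration, here is a trimmed metricbeat.yml sketch enabling the system module and shipping to Elasticsearch; the host, metricsets and period are illustrative, not our production values:

```yaml
# Illustrative metricbeat.yml fragment
metricbeat.modules:
- module: system
  metricsets: ["cpu", "memory", "filesystem"]
  period: 10s

output.elasticsearch:
  hosts: ["elasticsearch.logging.svc:9200"]   # assumed host
```

Swapping in a service-specific module (e.g. mongodb or kafka) follows the same pattern.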
3. Leverage emerging chaos engineering tools
Chaos engineering tools have emerged rapidly as the practice has grown in popularity, so there is no need to reinvent the wheel and build everything ourselves. Chaos Monkey, originally built at Netflix, is the classic chaos engineering tool, and we used the Chaos Monkey for Spring Boot (CM4SB) library to kickstart our journey into chaos engineering.
Here are some of the functionalities that CM4SB provides
CM4SB has watchers that look for Spring Boot components to “assault”.
To use the library, import it as a dependency in your Spring Boot project and activate the chaos-monkey profile in the properties file.
spring.profiles.active = chaos-monkey
chaos.monkey.enabled = true
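For reference, pulling in CM4SB is a single Maven dependency; the version below is a placeholder, so check the codecentric release notes for one matching your Spring Boot version:

```xml
<!-- Chaos Monkey for Spring Boot; version is a placeholder -->
<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>2.x.x</version>
</dependency>
```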
Latency Assault
This capability adds latency to our methods, causing the method, and eventually the application, to respond more slowly.
For our experiment, we set the following properties in the application properties file to add a 10–30 second latency on every tenth request to our REST controller.
chaos.monkey.enabled = true
chaos.monkey.watcher.restController = true
chaos.monkey.assaults.level = 10
chaos.monkey.assaults.latencyActive = true
chaos.monkey.assaults.latencyRangeStart = 10000
chaos.monkey.assaults.latencyRangeEnd = 30000
Exception Assault
This capability throws a random Runtime Exception once enabled.
For our experiment, we set the following properties in the application properties file to add a runtime exception on every tenth service request in our application.
chaos.monkey.enabled = true
chaos.monkey.watcher.service = true
chaos.monkey.assaults.level = 10
chaos.monkey.assaults.exceptions-active = true
chaos.monkey.assaults.exception.type = java.lang.RuntimeException
KillApp Assault
This capability shuts the application down on the configured schedule.
For our experiment, we set the following properties in the application properties file to kill the application every minute.
chaos.monkey.enabled = true
chaos.monkey.watcher.service = true
chaos.monkey.assaults.level = 10
chaos.monkey.assaults.kill-application-active = true
chaos.monkey.assaults.kill-application-cron-expression=0 */1 * * * *
After configuring and starting Chaos Monkey, we use our monitoring stack (described above) to monitor the health and performance of our application.
We have now shared our 3 key concepts in chaos engineering. The following sections cover how we conduct our GameDay: our process, and the roles involved.
Our process
To prepare for GameDay, we added an extra step ("Prepare Experiment") to the standard chaos engineering process for our use case.
Step 1: Steady State
How our system behaves under normal circumstances (e.g. response time, ETL processing time, throughput) defines the steady state.
Step 2: Hypothesis
Make a hypothesis on a “what if” question (e.g. What if the network latency in the system increases by five seconds? How would our system behave?).
Set only one variable (what if) for each experiment.
Step 3: Prepare Experiment
Design the experiment and identify key outcomes based on the desired steady state and hypothesis. Prepare all the necessary steps and resources needed beforehand to ensure smooth execution on GameDay. Order the sequence of experiments based on increasing impact on our system.
Step 4: Run Experiment
Execute the experiment.
Step 5: Verify
Verify the outcome of the experiment against the steady state. Any variance from or disruption to the steady state would disprove the hypothesis of the experiment.
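In practice, this verification boils down to comparing a metric observed during the experiment against its steady-state baseline. The sketch below is our own illustration, not part of any tool: the class, method names, thresholds and sample values are all invented for the example, using p95 latency as the steady-state metric.

```java
import java.util.Arrays;

public class SteadyStateCheck {

    // Returns the p95 of the samples, a common steady-state metric.
    static double p95(double[] samples) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(0.95 * sorted.length) - 1;
        return sorted[idx];
    }

    // The hypothesis holds if the observed p95 stays within the
    // tolerated deviation from the steady-state baseline.
    static boolean hypothesisHolds(double baselineP95, double observedP95,
                                   double tolerance) {
        return observedP95 <= baselineP95 * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        double baseline = 200.0;  // steady-state p95 latency in ms (assumed)
        double[] observed = {180, 210, 190, 220, 205};  // during experiment
        System.out.println(hypothesisHolds(baseline, p95(observed), 0.25));
    }
}
```

If the check fails, the variance from the steady state disproves the hypothesis and feeds directly into the "Improve" step.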
Step 6: Improve
If the verification disproves the hypothesis, improve our system to make it more resilient.
Roles and Responsibilities
To ensure the smooth execution of GameDay, we give clear roles and responsibilities to everyone involved.
Template
We came up with a template specifically for our organisation. It helps everyone know what they are supposed to do.
Conclusions
Unit tests and integration tests are not enough to build a resilient system.
Unit tests and integration tests can validate the functionality of a system, but not how it behaves in a production environment. With chaos engineering, we can conduct experiments that simulate events that will happen in production (e.g. network outages).
First GameDay, but it will not be the last
GameDay has given us many insights into how our system behaves in a production environment, and knowing the weaknesses in our system has allowed us to make it better and more resilient. Our team agrees that running GameDay is well worth the effort, so we are looking forward to more GameDays in the future.
Inject chaos regularly, sleep better at night — Grelim