Improving Operational Resiliency through GameDay
Operational resiliency (i.e. system resiliency and recoverability) is the ability to provide continuous service through people, processes and technologies in times of operational disruption. Downtime in an operational system can be very costly, and over time it erodes users' confidence in the system.
An effective way to measure operational resiliency is to simulate lifelike system failures, with calculated risks, directly on production systems. Operational resiliency can then be evaluated based on i) the system's ability to automatically recover from failure, ii) the processes that developers can follow to recover from failure, and iii) the team's readiness to recover from operational failure.
What is GameDay
GameDay is a dedicated day for running chaos engineering experiments on our system with our team.
What is chaos engineering?
According to Wikipedia, chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions. We prefer to define it more simply — chaos engineering is like a vaccine we inject into our bodies to build immunity.
There is a common misconception that Chaos Engineering is just about breaking stuff in production. But it really is not.
Running GameDay brings several benefits.
Key benefits of GameDay
- To test the operational resiliency of our system
- To refine our operations and support processes
- To improve response time when acting on operational issues
We prepare for GameDay with the following 3 concepts in mind.
3 key concepts in chaos engineering
1. Gradually expand blast radius
Experiments need to be carefully scoped so that we do not cause undesirable outcomes that we cannot recover from. This is why we have grouped our experiments into 3 categories.
i. Application
Experiments that will impact only one service and can be easily recovered from.
ii. Network
Experiments that will impact multiple services and will take some time to recover from.
iii. Infrastructure
Experiments that will impact many services and that our system may not be able to recover from.
The main idea is that we want to start with something small, something that can be recovered from easily, build confidence, and then progress to experiments that have bigger impact on our system.
2. Having extensive observability is a must
To measure the impact of the experiment, it is best if we can monitor all our system metrics to observe what is happening during the execution of our chaos experiments. Our team collects 3 different kinds of metrics.
i. Log Aggregation
All our services are container-based and deployed using Kubernetes. Alongside our services, we deploy Fluentd as a DaemonSet so that it runs on every Kubernetes node and forwards our logs to Elasticsearch. The Fluentd DaemonSet lets us collect logs from containerised applications easily.
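A trimmed sketch of such a DaemonSet manifest is shown below. The namespace, labels, Elasticsearch host and image tag are illustrative assumptions, not our exact manifest; the official fluent/fluentd-kubernetes-daemonset images ship with an Elasticsearch output preconfigured via environment variables.

```yaml
# Illustrative Fluentd DaemonSet sketch; names and values are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc"   # assumed service name
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log                  # container logs on the node
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
```

Because it is a DaemonSet, Kubernetes schedules exactly one Fluentd pod per node, which is what makes node-level log collection automatic.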
ii. Application performance
Most of our applications use Spring Boot. We use Micrometer to capture metrics and pipe them to Prometheus.
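In a typical Spring Boot 2.x setup, adding the micrometer-registry-prometheus dependency and exposing the Actuator endpoint is all the wiring needed; a minimal sketch of the relevant property (our actual exposure list may differ):

```properties
# Expose the Prometheus scrape endpoint via Spring Boot Actuator
management.endpoints.web.exposure.include = health,prometheus
```

Prometheus then scrapes the metrics from the application's /actuator/prometheus endpoint.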
iii. System heartbeat
We use Metricbeat to collect and pipe system metrics into Elasticsearch to monitor our infrastructure health. Metricbeat uses “modules” to collect metrics. Each module defines the basic logic for collecting metrics from a specific service (e.g. MongoDB, RabbitMQ or Kafka).
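As an example of the module-based configuration, here is a trimmed metricbeat.yml sketch enabling the system module and shipping to Elasticsearch; the host, metricsets and period are illustrative, not our production values:

```yaml
# Illustrative metricbeat.yml fragment
metricbeat.modules:
- module: system
  metricsets: ["cpu", "memory", "filesystem"]
  period: 10s

output.elasticsearch:
  hosts: ["elasticsearch.logging.svc:9200"]   # assumed host
```

Swapping in a service-specific module (e.g. mongodb or kafka) follows the same pattern.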
3. Leverage emerging chaos engineering tools
Chaos engineering tools have emerged rapidly as the practice has grown in popularity, so there is no need to reinvent the wheel and build everything ourselves. Chaos Monkey, originally built at Netflix, is the classic chaos engineering tool, and we used the Chaos Monkey for Spring Boot (CM4SB) library to kickstart our journey into chaos engineering.
Here are some of the functionalities that CM4SB provides
CM4SB has watchers that look for Spring Boot components to “assault”.
To use the library, import it as a dependency in your Spring Boot project and activate the chaos-monkey profile in the properties file.
spring.profiles.active = chaos-monkey
chaos.monkey.enabled = true
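For reference, pulling in CM4SB is a single Maven dependency; the version below is a placeholder, so check the codecentric release notes for one matching your Spring Boot version:

```xml
<!-- Chaos Monkey for Spring Boot; version is a placeholder -->
<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>2.x.x</version>
</dependency>
```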
Latency Assault
This capability adds latency to our methods, causing the method, and eventually the application, to respond more slowly.
For our experiment, we set the following properties in the application properties file to add a 10–30 second latency on every tenth request to our REST controller.
chaos.monkey.enabled = true
chaos.monkey.watcher.restController = true
chaos.monkey.assaults.level = 10
chaos.monkey.assaults.latencyActive = true
chaos.monkey.assaults.latencyRangeStart = 10000
chaos.monkey.assaults.latencyRangeEnd = 30000
Exception Assault
This capability throws a random Runtime Exception once enabled.
For our experiment, we set the following properties in the application properties file to add a runtime exception on every tenth service request in our application.
chaos.monkey.enabled = true
chaos.monkey.watcher.service = true
chaos.monkey.assaults.level = 10
chaos.monkey.assaults.exceptions-active = true
chaos.monkey.assaults.exception.type = java.lang.RuntimeException
KillApp Assault
This capability shuts the application down on the configured schedule.
For our experiment, we set the following properties in the application properties file to kill the application every minute.
chaos.monkey.enabled = true
chaos.monkey.watcher.service = true
chaos.monkey.assaults.level = 10
chaos.monkey.assaults.kill-application-active = true
chaos.monkey.assaults.kill-application-cron-expression=0 */1 * * * *
After configuring and starting Chaos Monkey, we use our monitoring stack (described above) to monitor the health and performance of our application.
We have now shared our 3 key concepts in chaos engineering. The following sections cover how we conduct our GameDay: our process, and the roles involved.
Our process
To prepare for GameDay, we added an extra step ("Prepare Experiment") to the standard chaos engineering process for our use case.
Step 1: Steady State
How our system behaves under normal circumstances (e.g. response time, ETL processing time, throughput) defines the steady state.
Step 2: Hypothesis
Make a hypothesis on a “what if” question (e.g. What if the network latency in the system increases by five seconds? How would our system behave?).
Set only one variable (what if) for each experiment.
Step 3: Prepare Experiment
Design the experiment and identify key outcomes based on the desired steady state and hypothesis. Prepare all the necessary steps and resources needed beforehand to ensure smooth execution on GameDay. Order the sequence of experiments based on increasing impact on our system.
Step 4: Run Experiment
Execute the experiment.
Step 5: Verify
Verify the outcome of the experiment against the steady state. Any variance from or disruption to the steady state would disprove the hypothesis of the experiment.
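In practice, this verification boils down to comparing a metric observed during the experiment against its steady-state baseline. The sketch below is our own illustration, not part of any tool: the class, method names, thresholds and sample values are all invented for the example, using p95 latency as the steady-state metric.

```java
import java.util.Arrays;

public class SteadyStateCheck {

    // Returns the p95 of the samples, a common steady-state metric.
    static double p95(double[] samples) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(0.95 * sorted.length) - 1;
        return sorted[idx];
    }

    // The hypothesis holds if the observed p95 stays within the
    // tolerated deviation from the steady-state baseline.
    static boolean hypothesisHolds(double baselineP95, double observedP95,
                                   double tolerance) {
        return observedP95 <= baselineP95 * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        double baseline = 200.0;  // steady-state p95 latency in ms (assumed)
        double[] observed = {180, 210, 190, 220, 205};  // during experiment
        System.out.println(hypothesisHolds(baseline, p95(observed), 0.25));
    }
}
```

If the check fails, the variance from the steady state disproves the hypothesis and feeds directly into the "Improve" step.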
Step 6: Improve
If the verification disproves the hypothesis, improve our system to make it more resilient.
Roles and Responsibilities
To ensure the smooth execution of GameDay, we give clear roles and responsibilities to everyone involved.
Template
We came up with a template specifically for our organisation. It helps everyone know what they are supposed to do.
Conclusions
Unit tests and integration tests are not enough to build a resilient system.
Unit tests and integration tests can validate the functionality of a system, but not how it behaves in a production environment. With chaos engineering, we can conduct experiments that simulate events that will happen in production (e.g. network outages).
First GameDay, but it will not be the last
GameDay has given us many insights into how our system behaves in a production environment, and knowing the weaknesses in our system has allowed us to make it better and more resilient. Our team agrees that running GameDay is well worth the effort, so we are looking forward to more GameDays in the future.
Inject chaos regularly, sleep better at night — Grelim