Chaos Testing in Elasticsearch

Oğuzhan Erdem
Published in Trendyol Tech
5 min read · Jan 13, 2022

We, the search team at Trendyol, wanted to stress our production Elasticsearch clusters to find out how the system behaves in chaotic situations.
Let’s discuss what a chaos test is, why we need it, and what we gained from this study.

In the early 2000s, a single web service typically handled all the work in an application. With microservice architecture, application infrastructure has evolved into distributed structures where many services are connected to each other. In this situation, it has become challenging to predict what will happen when an unexpected failure occurs. Such failures are rare, but they can have a devastating effect on the production environment.

Could it be said that more professional, larger, and better-funded teams have fewer failures? When the empirical data in the outage report are scrutinized, the answer is clearly “no”. These needs were the factors that led to the emergence of chaos testing. Accordingly, applying chaos tests on the Elasticsearch side of our project was a must for our team. In this article, we share the importance of chaos testing and our practices.

Outage report

Briefly, chaos engineering is the practice of running experiments on a system to build confidence in its ability to withstand problematic situations that may occur in the production environment.

Expected Benefits Of Chaos Engineering

The expected advantages of chaos testing are:

  • Less downtime
  • Better customer experience
  • Fewer alarms
  • Less burnout for development teams

Chaos Engineering Steps

A chaos test basically consists of two steps:

  1. Define the normal (steady) state of the system: the hypothesis is that the steady state will continue in both the control and experiment groups.
  2. Pseudo-randomly inject faults: kill a container, break the network, etc. Try to disprove the hypothesis by looking for a difference between the control and experiment groups.
Steps of chaos test
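
The two steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `steady_state_holds`, the 10% tolerance, and the latency numbers are made up for the example and do not come from our test setup. The idea is to compare one steady-state metric between the control group and the experiment group and treat a large difference as evidence against the hypothesis.

```python
from statistics import mean


def steady_state_holds(control, experiment, tolerance=0.10):
    """Return True if the experiment mean stays within `tolerance`
    (relative difference) of the control mean.

    `tolerance` is an assumed threshold; pick one that fits your own SLOs.
    """
    if not control or not experiment:
        raise ValueError("both groups need at least one sample")
    baseline = mean(control)
    observed = mean(experiment)
    if baseline == 0:
        return observed == 0
    return abs(observed - baseline) / baseline <= tolerance


# Hypothetical latency samples in milliseconds, invented for illustration.
control_latency_ms = [100, 104, 98, 102]      # no fault injected
experiment_latency_ms = [101, 150, 160, 149]  # fault injected

if steady_state_holds(control_latency_ms, experiment_latency_ms):
    print("Hypothesis survived: steady state held under the injected fault")
else:
    print("Hypothesis disproved: the injected fault changed system behaviour")
```

A disproved hypothesis is a useful outcome here: it points at a weakness that can be fixed before it shows up as an outage.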

Preparation of Chaos Test on Elasticsearch

  • Create the dependent service documentation
  • Review the documentation with SRE (Site Reliability Engineering) or a DBA
  • Use production or create a replica of the production environment
  • Execute the chaos steps
  • Analyse the results of the chaos tests on the real production environment
Example of dependent service document

Relation Type: the type of relation this part of the system has to the service. There are two relation types: direct (per request or cached) and indirect.

Possible Faults: the test cases of the chaos test. The list should not be limited, and it must include the worst-case scenarios.

What is our backup plan?: these plans should be ready before the chaos test, because incidents are to be expected in production.
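
One way to keep such a document machine-checkable is to model each entry as a small record. The sketch below is an assumption about how this could look, not the format we actually used: the class name `DependentService`, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field

VALID_RELATION_TYPES = {"direct", "indirect"}


@dataclass
class DependentService:
    """One row of a dependent service document (hypothetical structure)."""
    name: str
    relation_type: str                       # "direct" or "indirect"
    possible_faults: list = field(default_factory=list)
    backup_plan: str = ""

    def is_ready_for_chaos(self):
        """An entry is ready only if it has a valid relation type,
        at least one fault to test, and a backup plan written down."""
        return (
            self.relation_type in VALID_RELATION_TYPES
            and len(self.possible_faults) > 0
            and self.backup_plan.strip() != ""
        )


# Example entry with invented values.
search_cluster = DependentService(
    name="elasticsearch-search-cluster",
    relation_type="direct",
    possible_faults=["one data node offline", "alias on old index"],
    backup_plan="route traffic to the standby cluster",
)
print(search_cluster.is_ready_for_chaos())
```

Checking `is_ready_for_chaos()` for every entry before the test starts makes it harder to begin an experiment without a backup plan.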

How To Start?

The first step of chaos testing is to decide, together with SRE, which cases are applicable to the Elasticsearch nodes, because they can tell which chaotic situations are possible for the cluster. The next step is to take one of the production Elasticsearch clusters out of live traffic or, to avoid reducing production capacity, create a replica of it. Finally, simulating traffic on the system with performance test tools is crucial to bring the system’s behavior close to production.

The most fundamental point is to apply the chaos test in the production environment. If that is impossible, a replica of it can be used. Remember:

“Only production is production”

Test Cases Of Chaos Test

1. Index and Alias Changes

  • When ES has a single index:
    * Current index with a single, correct alias (Normal State)
    * Current index with two aliases (Injected State)
    * Current index with wrong alias (Injected State)
    * Current index with empty alias (Injected State)
  • When ES has multiple indexes:
    * Alias in the current index (Normal State)
    * Alias in old index (Injected State)
    * Both indexes have the same alias (Injected State)
    * Missing alias in both indexes (Injected State)
  • New index & alias:
    * New index has an alias but the API is not deployed (Normal State)
    * New index has no alias but the API is deployed (Injected State)
    * API is deployed when new index doesn’t exist (Injected State)
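
To tell these alias states apart during a test run, a small classifier can help. The sketch below is a minimal example under assumptions: it expects a dictionary shaped like the response of Elasticsearch’s `GET /_alias` endpoint (index name mapped to an `aliases` object), while the function name, the returned labels, and the index and alias names (`products_v1`, `products_v2`, `products`) are hypothetical.

```python
def classify_alias_state(alias_response, expected_alias, current_index):
    """Map an alias listing to one of the alias test cases.

    alias_response: dict shaped like {"index": {"aliases": {"name": {}}}}
    expected_alias: the alias the API is assumed to query
    current_index:  the index that should carry that alias
    """
    if current_index not in alias_response:
        return "current index missing"
    current_aliases = alias_response[current_index].get("aliases", {})
    # Every index that currently carries the expected alias.
    holders = sorted(
        index for index, body in alias_response.items()
        if expected_alias in body.get("aliases", {})
    )
    if not holders:
        return "wrong alias" if current_aliases else "missing alias"
    if current_index not in holders:
        return "alias on old index"
    if len(holders) > 1:
        return "alias on multiple indexes"
    if len(current_aliases) > 1:
        return "extra alias on current index"
    return "normal"


# Injected state: the alias still points at the old index.
injected = {
    "products_v1": {"aliases": {"products": {}}},
    "products_v2": {"aliases": {}},
}
print(classify_alias_state(injected, "products", "products_v2"))
```

Running the classifier before and after each injection gives a quick confirmation that the cluster is really in the state the test case describes.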

2. Data Faults

  • Index with data (Normal State)
  • Empty index (Injected State)
  • Index with wrong data (Injected State)
  • Index with missing data (Injected State)
  • Index with wrong mapping & setting (Injected State)
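
Some of these data faults can be detected with simple checks. The following sketch covers only the empty index, missing data, and wrong mapping cases; detecting wrong data would need domain-specific validation. It assumes the caller has already fetched the document count and the `properties` section of the index mapping, and the function names, field names, and thresholds are hypothetical.

```python
def find_mapping_drift(expected_types, mapping_properties):
    """Return {field: (expected_type, actual_type)} for every mismatch.

    mapping_properties is assumed to be the "properties" object of an
    index mapping, e.g. {"title": {"type": "text"}}.
    """
    drift = {}
    for field_name, expected in expected_types.items():
        actual = mapping_properties.get(field_name, {}).get("type")
        if actual != expected:
            drift[field_name] = (expected, actual)
    return drift


def classify_data_state(doc_count, expected_min_docs, drift):
    """Map basic index facts to one of the data fault cases."""
    if doc_count == 0:
        return "empty index"
    if drift:
        return "wrong mapping"
    if doc_count < expected_min_docs:
        return "missing data"
    return "normal"


# Invented example: "price" was indexed as text instead of float.
expected = {"title": "text", "price": "float"}
actual_properties = {"title": {"type": "text"}, "price": {"type": "text"}}

drift = find_mapping_drift(expected, actual_properties)
print(drift)
print(classify_data_state(doc_count=1200, expected_min_docs=1000, drift=drift))
```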

3. Node And Cluster Changes

  • Normal state of cluster (Normal State)
  • One cluster is down (Injected State)
  • One data node is offline (Injected State)
  • All data nodes are offline (Injected State)
  • One master node is offline (Injected State)
  • All master nodes are offline (Injected State)
  • ES downtime (Injected State)
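
For the node and cluster cases, the observed state can be summarized from a cluster health snapshot. This is a rough sketch, assuming a dictionary with the `status` and `number_of_data_nodes` fields that Elasticsearch’s `GET /_cluster/health` response contains, and `None` when the cluster did not answer at all. The function name, the labels, and the expected node count are hypothetical, and master node failures are not distinguished here.

```python
def describe_cluster_state(health, expected_data_nodes):
    """Summarize a cluster health snapshot as a coarse label.

    health: dict with "status" and "number_of_data_nodes",
            or None if the health request failed.
    expected_data_nodes: data node count in the normal state (assumed known).
    """
    if health is None:
        return "cluster unreachable"
    data_nodes = health.get("number_of_data_nodes", 0)
    if data_nodes == 0:
        return "all data nodes offline"
    if data_nodes < expected_data_nodes:
        return "some data nodes offline"
    if health.get("status") != "green":
        return "degraded"
    return "normal"


# Invented snapshot: one of three data nodes has been stopped.
snapshot = {"status": "yellow", "number_of_data_nodes": 2}
print(describe_cluster_state(snapshot, expected_data_nodes=3))
```

Recording this label next to the latency and error metrics of each experiment makes it easier to connect an injected fault to the behaviour it caused.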

Conclusions

Taking all of this into consideration, chaos testing gave us fundamental information about how our Elasticsearch clusters behave in chaotic situations. Emergency backup plans can be made according to the chaos test results. Moreover, it increases the reliability of the system by detecting problems before they turn into outages in the production environment. In my opinion, failures can occur even in very reliable systems. For this reason, building a backup plan by experiencing all kinds of system behavior through chaos tests is useful for finding a solution when chaotic situations occur.

Gizem Saruhan gave me the idea to write this article, and we wrote it together. I am very grateful to Gizem for her contribution and patience. You can reach the wonderful articles written by Gizem from this link.

Thank you for reading.
Cheers!

References

Outage Report
Test Faster and Smarter by Testing in Production
Testing in production: Yes, you can (and should)
Software testing Blog — Awesome Testing: TestOps #2 — Testing in Production
Every Release is a Production Test
Testing in Production: rethinking the conventional deployment pipeline
Scientist: Measure Twice, Cut Over Once
Move Fast and Fix Things
Principles of Chaos Engineering
Breaking Things on Purpose
Chaos Engineering 101 — Production Ready
The Discipline of Chaos Engineering
A Primer on Automating Chaos
The Limitations of Chaos Engineering — Production Ready