☁️ Chaos Engineering for GeminiDB (for MySQL) on Huawei Cloud

Hakan GÜVEZ · Published in Huawei Developers · 6 min read · Dec 26, 2023

Introduction

Hello all! In this article I will walk through a practical chaos engineering example for GeminiDB for MySQL on Huawei Cloud.

Chaos engineering is the process of testing a distributed computing system to ensure that it can withstand unexpected disruptions. It relies on concepts underlying chaos theory, which focuses on random and unpredictable behavior.

Some examples of such system vulnerabilities:

  • When a service crashes, the configuration file generated for its replacement service is written incorrectly.
  • Requests are sent continuously because timeout values are not set properly (a sketch of the safer pattern follows this list).
  • An outage occurs when a subcomponent the system depends on receives excessive data traffic.
  • A failure triggered by a single error cascades across the entire system.
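
For instance, the "requests sent continuously" case above usually comes from a client retrying forever with no timeout. Below is a minimal Python sketch of the safer pattern: an explicit timeout plus a bounded retry budget. The URL and parameter values are placeholders, not part of any real service.

```python
import time
import requests  # third-party HTTP client

HEALTH_URL = "https://example.invalid/health"  # placeholder endpoint

def call_with_bounded_retries(url, timeout_s=2.0, max_retries=3, backoff_s=1.0):
    """Call an endpoint with an explicit timeout and a bounded retry budget,
    so a slow or dead dependency cannot trigger an endless request storm."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=timeout_s)
            resp.raise_for_status()
            return resp
        except requests.RequestException as err:
            last_error = err
            time.sleep(backoff_s * attempt)  # simple linear backoff between attempts
    raise RuntimeError(f"gave up after {max_retries} attempts: {last_error}")
```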

It is necessary to identify the most significant of these issues before they affect users of live applications. Therefore, to build confidence that the system can withstand unforeseen errors, we need a method for managing any chaotic situation that may arise, no matter how distributed or complex the environment is.

How Did Chaos Engineering Come About?

When Netflix moved its services to AWS in 2010, there was no suitable tool for testing failures that might occur in the live environment. The goal of building such a tool was to move away from a development model that assumed no errors would ever occur, toward one that accepted crashes as inevitable and trained developers to handle them. For this purpose, Netflix developed the tool called Chaos Monkey.

A chaos experiment typically follows four steps:

1- Define steady state: before running chaos tests, define what a healthy system looks like. For instance, for a web application, the health-check endpoint should return a 200 success response.

2- Introduce chaos: simulate a failure, such as a network bottleneck, a full disk, or an application crash.

3- Verify the steady state: check whether the system still behaves as defined in step 1. Also verify that the corresponding alerts were triggered via email, SMS, Slack message, etc.

4- Roll back the chaos: the most crucial step, especially when running in production, is to roll back or stop the chaos we introduced and ensure that the system returns to normal.
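
Putting the four steps together, a chaos experiment can be expressed as a small script. The sketch below is only an illustration: check_steady_state, inject_failure, and rollback_failure are hypothetical hooks you would wire to your own environment (the health-check endpoint from step 1 and your fault-injection tooling in step 2); the URL is a placeholder.

```python
import time
import requests

HEALTH_URL = "https://example.invalid/health"  # placeholder health-check endpoint

def check_steady_state():
    """Step 1: the system is 'steady' if the health check returns HTTP 200."""
    try:
        return requests.get(HEALTH_URL, timeout=3).status_code == 200
    except requests.RequestException:
        return False

def inject_failure():
    """Step 2: introduce chaos (network bottleneck, disk fill, process kill, ...).
    Left as a stub; wire it to your fault-injection tooling."""
    print("injecting failure ...")

def rollback_failure():
    """Step 4: stop the chaos and let the system return to normal."""
    print("rolling back failure ...")

def run_experiment(observe_seconds=60):
    assert check_steady_state(), "system is not steady; aborting experiment"
    inject_failure()
    try:
        time.sleep(observe_seconds)  # Step 3: observe behaviour, check that alerts fired
        print("steady state after chaos:", check_steady_state())
    finally:
        rollback_failure()           # Step 4: always roll back, even if the check fails

if __name__ == "__main__":
    run_experiment()
```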

Eight Fallacies Of Distributed Computing

A concise list of problems often encountered in distributed computing projects, by Peter Deutsch:

1- The network is reliable.

2- Latency is zero.

3- Bandwidth is infinite.

4- The network is secure.

5- Topology doesn’t change.

6- There is one administrator.

7- Transport cost is zero.

8- The network is homogeneous.

How to Start Chaos Engineering Experiments?

Breaking down the Known-Unknown Matrix

👍 Known knowns: Information or knowledge that we are aware of and have evidence for.

🤔 Known unknowns: Information/knowledge gaps or risks that we are aware of.

🕸️ Unknown knowns: Information or knowledge that we are unaware of or are biased towards, also known as tacit or biased knowledge.

🕵️ Unknown unknowns: Information or knowledge gaps or risks that we are unaware of.

“Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns — the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tends to be the difficult one.” (Donald Rumsfeld)

Let’s move on to the practical example

Let’s talk about how to choose experiments for the GeminiDB for MySQL database. In this example, let’s say there is a cluster of 100 GeminiDB for MySQL servers, and each server holds multiple sharded partitions.

In one of the regions, we have a primary database server with two replicas, and replication is carried out semi-synchronously. In the other region, there is a standby copy of the primary server along with two standby replicas:

👍 Known knowns:

  • We know that when a replica is shut down, it will be removed from the cluster. We are also aware that the removed replica will be re-cloned from the primary server and added back to the cluster.

🤔 Known unknowns:

  • From the log files, we know whether a cloning operation completed successfully or failed. However, when a failure occurs, we do not know the average weekly time it takes for the clone to be created and returned to the cluster.
  • We know that we will receive an alert 5 minutes after a replica is shut down and removed from the cluster. However, we are not sure that this 5-minute threshold is really the right value for catching errors efficiently.

🕸️ Unknown knowns:

  • We do not know how long, on average, it will take to clone two replicas and add them back to the cluster when we shut both down at the same time during working hours on a busy Monday. However, we know that the standby server and replicas in the other region will keep working without any problems.

🕵️ Unknown unknowns:

  • We don’t know what will happen when we shut down the entire cluster in Region A. Also, since we have never run this scenario, we do not know whether the standby region will be able to clone itself back to the first region.

We can create chaos experiments to run in the following order:

👍Known Knowns: One of the replicas is shut down and measurement begins. We measure the time from the moment of shutdown through removing the closed replica, starting the cloning process, completing the cloning process, and adding the clone back to the cluster. Before starting this experiment, the number of replicas is increased from two to three, and only a limited number of replicas is ever selected for shutdown, so the count of healthy replicas never drops to 0. The average time taken to recover from a replica shutdown is reported, and the experiment is broken down by day and hour, taking busy working hours into account.
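
A minimal sketch of how this measurement could be automated, assuming a hypothetical shutdown_replica fault hook and a replica_status query (neither is a real GeminiDB API; both are stubs for illustration):

```python
import time

def shutdown_replica(replica_id):
    """Hypothetical fault hook: power off or kill the chosen replica."""
    print(f"shutting down {replica_id} ...")

def replica_status(replica_id):
    """Hypothetical status query; should return 'in_cluster' once the
    replica has been re-cloned from the primary and added back."""
    return "in_cluster"  # stubbed so the sketch runs standalone

def measure_recovery(replica_id, poll_interval_s=10, timeout_s=3600):
    """Measure the time from shutdown until the replica rejoins the cluster."""
    started = time.monotonic()
    shutdown_replica(replica_id)
    while time.monotonic() - started < timeout_s:
        if replica_status(replica_id) == "in_cluster":
            return time.monotonic() - started
        time.sleep(poll_interval_s)
    raise TimeoutError(f"{replica_id} did not rejoin within {timeout_s} s")

print("recovery took", round(measure_recovery("gemini-replica-02"), 1), "seconds")
```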

🤔Known Unknowns: We try to answer the known-unknowns questions using the result data from the known-knowns experiment. This way, the average time from a replica shutdown to the clone being added back to the cluster becomes known on a weekly basis. We will also see whether sending an alert 5 minutes after the shutdown is an appropriate threshold value.
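
Assuming the recovery times from the previous experiment are logged with timestamps, the weekly averages and a sanity check of the 5-minute alert threshold can be computed with plain Python; the sample data below is made up for illustration:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

ALERT_THRESHOLD_S = 5 * 60  # current alert threshold: 5 minutes

# Hypothetical measurements: (time of the shutdown, recovery time in seconds).
measurements = [
    (datetime(2023, 12, 4, 10, 15), 420),
    (datetime(2023, 12, 5, 14, 30), 380),
    (datetime(2023, 12, 11, 9, 45), 610),
]

by_week = defaultdict(list)
for ts, recovery_s in measurements:
    by_week[ts.isocalendar().week].append(recovery_s)

for week, values in sorted(by_week.items()):
    avg = mean(values)
    verdict = "above" if avg > ALERT_THRESHOLD_S else "within"
    print(f"week {week}: average recovery {avg:.0f} s ({verdict} the 5-minute threshold)")
```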

🕸️Unknown Knowns: Before starting this experiment, the number of replicas is increased from three to four. Both original replicas in the cluster are then shut down at the same time. In these experiments, run every Monday during working hours over several months, we measure the time it takes for the two replicas to be cloned and put back in place. This experiment can surface many previously unknown problems, such as the main cluster being unable to handle the load incurred during cloning and backup, and the results can drive better replica management.
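
The only mechanical difference from the single-replica experiment is that the two shutdowns must happen at (almost) the same moment; a thread pool is enough for that. The measure_recovery function is the hypothetical sketch from the known-knowns experiment, stubbed here so the snippet runs standalone:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_recovery(replica_id):
    """Stub standing in for the measure_recovery sketch shown earlier."""
    time.sleep(0.1)
    return 0.1

replicas_to_kill = ["gemini-replica-01", "gemini-replica-02"]

# Shut both replicas down at (almost) the same time and measure each recovery.
with ThreadPoolExecutor(max_workers=len(replicas_to_kill)) as pool:
    recovery_times = list(pool.map(measure_recovery, replicas_to_kill))

for replica, seconds in zip(replicas_to_kill, recovery_times):
    print(f"{replica} rejoined the cluster after {seconds:.1f} s")
```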

🕵️Unknown unknowns: Shutting down the entire cluster (the primary server and its two replicas) will require solid engineering work. This failure may occur unexpectedly in the real environment; however, since we are not yet ready to handle it, the necessary engineering work should be prioritized before running such a chaos experiment.

Conclusion

This article showed how chaos engineering can be applied to GeminiDB for MySQL and explained the Known-Unknown Matrix.

If you have any thoughts or suggestions, please feel free to comment, or you can reach me at guvezhakan@gmail.com; I will try to get back to you as soon as I can.

You can reach me through LinkedIn too.

Hit the clap button 👏👏👏 or share it ✍ if you like the post.
