Causing Chaos

Matthew Moon
Just Eat Takeaway-tech
8 min read · Nov 14, 2022

At Just Eat we’re always looking at ways to improve. Chaos engineering is something that has been around for a while and here at Just Eat we’ve been looking into how we can incorporate it into our ecosystem. But why do Chaos Engineering? What are the benefits?

Chaos Engineering can be defined as the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions.

Chaos Engineering 101 — Chaos Engineering is not about causing Chaos!

At the end of the day all systems are user-centric and what users want is reliability and high feature velocity. Getting the balance right between these two things is what builds trust and confidence in your product and helps with customer retention and ultimately building a successful business.

So how does Chaos Engineering help to deliver the dream of achieving this elusive perfect balance of reliability and features? It uncovers “Dark Debt”.

Dark debt is found in complex systems and the anomalies it generates are complex system failures. Dark debt is not recognisable at the time of creation. Its impact is not to foil development but to generate anomalies. It arises from the unforeseen interactions of hardware or software with other parts of the framework. There is no specific countermeasure that can be used against dark debt because it is invisible until an anomaly reveals its presence.¹

Dark Debt can be found in every part of a system (see table below) and is analogous to Dark Matter. Dark matter has detectable effects on the world but cannot be seen or detected directly. Matter that can be seen and measured directly accounts for only about 15% of the mass of the universe; the remaining 85% is dark matter.

Parts of the system where Dark Debt can be found.

What must be understood about modern software systems is that they are extremely complex and in most cases chaotic. The Cynefin diagram, which is used within the Cynefin decision-making framework, illustrates this point. For most systems of some complexity which operate in a microservices environment it is desirable to reach the Complex region. If you reach the Complicated region you can consider yourself either outrageously successful or delusional. We must realise that as the size and complexity of a system increases, its rate of failure also increases, because new Dark Debt is unavoidably added to the system. This means there is an ever-growing invisible force pushing you towards Chaos.

Cynefin diagram

Something which should be understood about most humans is that we like to introduce complexity into a system. There are a multitude of reasons for this, including misunderstanding the problem and exploration. Others may suggest more sinister reasons such as ego and job protection. Although these are all potential reasons for overly complex systems, the main reason we tend to see over-engineered systems is that we have an innate desire to create and solve puzzles. This can be exacerbated when we are bored with the business problems and want to make our work more interesting!

How does Chaos Engineering Help us Make Sense of this Chaos?

The key thing about chaos engineering is that it is an opportunity to learn in a more controlled environment. Typically most companies are reactive to Dark Debt. Chaos Engineering allows you to become proactive, so that you uncover these issues before your customers do!

Let’s consider the following learning loops:

This is a typical learning loop for most companies. The learning normally happens after an outage has occurred, or does it? In most cases an outage causes panic and often a knee-jerk reaction which doesn’t always lead to the best environment to learn and ultimately to the optimal solution to a problem.

This is the learning loop created by Chaos Engineering. When we do Game Days or automated chaos experiments we are doing so in a controlled environment. People will still be on edge, and you want that to be the case: you want them to feel the pain that Dark Debt creates, as this will give them “skin in the game” and will ultimately make them care about the issues which surface. However, people can also feel somewhat protected by the Chaos Engineering bubble, which should lead them to better solutions to the problems they face.

Game Days

Game days are where we put all of our theory into practice. They don’t actually require any tooling, just bravery! However there are a number of prerequisites.

How do you know your system is normal?

Without this, you cannot effectively do chaos engineering. You must be able to tell that your system is normal, as this is how you are able to come up with a “steady-state hypothesis”.
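To make the idea of "normal" a little more concrete, here is a minimal sketch in Python. It assumes you can pull a handful of recent samples for each metric from your monitoring. The metric names, the bounds and the `is_steady` helper are all hypothetical and chosen purely for illustration; your own steady state will depend on your system.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SteadyStateBound:
    """Acceptable range for one metric while the system is 'normal'."""
    metric: str
    lower: float
    upper: float


def is_steady(samples: dict[str, list[float]], bounds: list[SteadyStateBound]) -> bool:
    """Return True if the mean of every bounded metric lies within its range."""
    for bound in bounds:
        values = samples.get(bound.metric)
        if not values:
            return False  # no data means we cannot claim the system is normal
        if not bound.lower <= mean(values) <= bound.upper:
            return False
    return True


# Hypothetical bounds and samples, for illustration only.
bounds = [
    SteadyStateBound("error_rate_percent", 0.0, 1.0),
    SteadyStateBound("p95_latency_ms", 0.0, 300.0),
]
baseline = {
    "error_rate_percent": [0.2, 0.4, 0.3],
    "p95_latency_ms": [180.0, 210.0, 195.0],
}
during_experiment = {
    "error_rate_percent": [0.3, 4.8, 6.1],
    "p95_latency_ms": [200.0, 950.0, 1200.0],
}

print(is_steady(baseline, bounds))           # baseline sits inside both ranges
print(is_steady(during_experiment, bounds))  # both metrics drift outside their ranges
```

Checking the same bounds before, during and after an experiment gives you a simple yes/no answer to "did the system stay normal?".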

Things which can be used to measure this are:

Hypothesis Sourcing

Hypothesis sourcing does exactly what it says on the tin: it sources our hypotheses. Ideally this is done during a team whiteboard session where every idea of how the system could fail is put on the table. This generates what is known as your “Hypothesis Backlog”.

An example of a hypothesis is as follows:

“I believe the system will survive if the database slows down”

It is useful to quantify the value of these hypotheses using “Ilities”. Here is a list which can be used (adding more which are unique to your team is advisable):

  • Availability
  • Durability
  • Maintainability
  • Profitability
  • Recoverability
  • Security
  • Visibility

Based on the results of your whiteboard session you should then have a prioritised list of hypotheses which will give you the basis for your game days.
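One way to turn the whiteboard output into a prioritised backlog is sketched below. The scoring scheme (0 to 5 per "ility", summed) is an assumption made for this example rather than a prescribed method. The second hypothesis and all of the scores are made up.

```python
from dataclasses import dataclass, field

ILITIES = (
    "availability", "durability", "maintainability", "profitability",
    "recoverability", "security", "visibility",
)


@dataclass
class Hypothesis:
    statement: str
    # Assumed scale: 0 (no impact) to 5 (severe impact if the hypothesis is wrong).
    scores: dict[str, int] = field(default_factory=dict)

    def value(self) -> int:
        """Sum the scores, rejecting any 'ility' the team has not agreed on."""
        unknown = set(self.scores) - set(ILITIES)
        if unknown:
            raise ValueError(f"unknown ilities: {sorted(unknown)}")
        return sum(self.scores.values())


def prioritise(backlog: list[Hypothesis]) -> list[Hypothesis]:
    """Order the backlog so the most valuable hypotheses come first."""
    return sorted(backlog, key=lambda h: h.value(), reverse=True)


# Illustrative backlog; the scores are invented for the example.
backlog = [
    Hypothesis(
        "I believe we will be alerted if the cache becomes unavailable",
        {"visibility": 4, "recoverability": 2},
    ),
    Hypothesis(
        "I believe the system will survive if the database slows down",
        {"availability": 5, "profitability": 4, "visibility": 2},
    ),
]

for hypothesis in prioritise(backlog):
    print(hypothesis.value(), hypothesis.statement)
```

If your team adds its own "ilities", extend `ILITIES` so that typos in the scores are still caught.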

Running a Game Day

Once you have your hypothesis in hand and you know how to simulate the failures, you are almost ready to go. Below are a few other roles and points you will need to decide upon.

The Game Day Runner

This person is in charge of coordinating the day. They are in charge of injecting the turbulence into the system, recording the experiments and the participants’ behaviour in the face of this turbulence.

Game Day Observers

These are to the Game Day Runner what Robin is to Batman. They essentially have the honour of helping the Game Day Runner to record what happens on the game day!

Choosing a Time-Period

Game days aren’t usually a whole-day event. It is recommended that you run a game day for a maximum of 3 hours, because they can be labour intensive and people get bored or fatigued when they last much longer.

Choosing an Environment

There are a number of schools of thought on this matter. Some recommend that you should run game days against production. This is generally uncomfortable for the majority of companies, who have a lot less experience of chaos engineering. Therefore, others recommend running game days against a non-production environment and setting it up in a way which minimises the “Blast Radius”.

Blast Radius

The Blast Radius can be thought of in terms of an atomic bomb being dropped. It may have a small footprint but the explosion which occurs on impact has a much larger effect on the surrounding area and the devastation caused by the fallout can trigger a number of different side-effects. Therefore you want to try and minimise your blast radius. For example, don’t go and simulate taking down a whole region of AWS on your first game day! This will only convince skeptics that chaos engineering really does cause chaos!
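As a rough illustration of keeping the blast radius small, the sketch below caps how many instances an experiment is allowed to touch. The `pick_targets` helper, the instance names and the 20% cap are all hypothetical. Treat it as a starting point under those assumptions, not a recommendation for any particular tool.

```python
import math
import random
from typing import Optional


def pick_targets(instances: list[str], max_fraction: float,
                 seed: Optional[int] = None) -> list[str]:
    """Choose a random subset of instances, limited to max_fraction of the fleet."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    if not instances:
        return []
    # Round down to respect the cap, but always pick at least one target
    # so that the experiment can still run on a very small fleet.
    count = max(1, math.floor(len(instances) * max_fraction))
    # A seeded generator makes the selection repeatable between runs.
    return random.Random(seed).sample(instances, count)


# Hypothetical non-production fleet.
instances = [f"staging-api-{i}" for i in range(10)]
targets = pick_targets(instances, max_fraction=0.2, seed=1)
print(targets)  # two of the ten instances
```

Starting with a small fraction and raising it over successive game days lets confidence grow alongside the blast radius.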

Choose Your Style

  • Dungeons and Dragons: The day is run in such a way that all those involved other than the game day runner have no idea of what is going to happen.
  • Informed: Everyone is informed about what is going to happen, but maybe they are unaware of when it will happen.

Safety Net

Always have a safety net. Always know how to roll back your experiments. Always have a domain expert on hand who can help with any issues which arise and determine how to clean up any side-effects of your experimentation.
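If your experiments are scripted, one way to make sure the rollback is never forgotten is to tie it to the injection itself. The sketch below uses a Python context manager so the rollback runs even when the experiment is aborted part-way through. The `chaos_experiment` name and the in-memory `state` dictionary are stand-ins invented for this example. In a real setup the two callables would talk to your own tooling.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def chaos_experiment(inject: Callable[[], None],
                     rollback: Callable[[], None]) -> Iterator[None]:
    """Inject turbulence, then always roll it back, even if something fails."""
    try:
        # inject() sits inside the try block so a half-applied injection
        # is still rolled back.
        inject()
        yield
    finally:
        rollback()


# Stand-in for a real system: a dictionary holding the simulated DB latency.
state = {"db_latency_ms": 5}


def add_latency() -> None:
    state["db_latency_ms"] = 2000


def remove_latency() -> None:
    state["db_latency_ms"] = 5


try:
    with chaos_experiment(add_latency, remove_latency):
        print("during:", state["db_latency_ms"])
        raise RuntimeError("game day runner aborted the experiment")
except RuntimeError:
    pass

print("after:", state["db_latency_ms"])  # latency is back to its original value
```

This only automates the mechanical part of the safety net. It is not a substitute for the domain expert.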

Who to Invite?

Everyone, from the CEO down, who should have an interest in the system being experimented on. You need people to feel the pain! When people feel pain, they have a great interest in relieving it!

Take Key Members Out!

Most systems have people who naturally become key members of the team, whether because they are the longest-serving team member or the most knowledgeable on a certain subject or technology. It can therefore be a good idea to remove them from the game day to see how the rest of the team acts in the face of turbulence.

Faking Failure

Some devious game day runners can cause chaos without lifting a finger. They do this by telling everyone that there is a problem with the system. This causes people to go looking for problems, which can lead to the unearthing of some pre-existing issues unknown to the team.

Safety Net

As was mentioned earlier, the aim of a chaos day is not to cause chaos. Therefore we need to have a safety net. This usually takes the shape of a domain expert, someone with a lot of experience dealing with issues in the system, who is on hand to help if serious issues occur.

Post-Chaos Day Actions

Fix Problems Unearthed

Whether these are technical or process based we need to ensure that the lessons we learn on chaos days are not in vain. Therefore there should be a clear plan of how we can fix the issues uncovered. This can be done through scheduled work, training or another medium.

Automation

The main thing you want to do is create automated tests which run alongside your usual unit/functional tests. These tests simulate the turbulent conditions created during your chaos day and ensure that the failures you uncovered no longer occur. It is a good idea to add them to your deployment pipeline. There are a number of tools on the market to do this, but ultimately it is about ensuring that the tests are consistent, repeatable, provide good feedback and offer value to your business.
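To give a flavour of what such a test might look like, here is a small Python sketch built around the earlier example hypothesis, “I believe the system will survive if the database slows down”. The `fetch_menu` function, the timeout values and the cached fallback are all hypothetical. The slow database is simulated with a sleep rather than with any particular chaos tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def fetch_menu(query_db: Callable[[], list[str]], timeout_s: float,
               fallback: list[str]) -> list[str]:
    """Return the menu from the database, or the fallback if the query is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(query_db).result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller while the slow query finishes in the background.
        pool.shutdown(wait=False)


def test_survives_slow_database() -> None:
    def slow_query() -> list[str]:
        time.sleep(0.5)  # injected turbulence: the database slows down
        return ["fresh menu"]

    assert fetch_menu(slow_query, 0.05, ["cached menu"]) == ["cached menu"]


def test_uses_database_when_healthy() -> None:
    assert fetch_menu(lambda: ["fresh menu"], 1.0, ["cached menu"]) == ["fresh menu"]


if __name__ == "__main__":
    test_survives_slow_database()
    test_uses_database_when_healthy()
    print("ok")
```

Because the turbulence is injected through a plain function argument, the same test gives the same result on every run, which is what you want from something that sits in a deployment pipeline.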

Benefits

Ultimately, unearthing so-called “Dark Debt” before your customers are affected will save your business money in the long run. The happiest customers are the ones who receive the service they signed up for.

¹ The STELLA report: https://snafucatchers.github.io/

