Build System Confidence with Chaos Engineering and GitOps

Discover and Respond to System Weaknesses with the Chaos Toolkit and Weaveworks

Have you ever thought about what would happen if the connection between your application and your database drops?

Well, maybe you have, since it is such a simple scenario. Yet have you actually tried pulling that link away in production, while your application was running? Maybe not.

This is the type of question that Chaos Engineering helps you answer. You ask a question about your system's behaviour under certain conditions, then safely try it out live so that you can, together with your team, see whether there is a real weakness and learn what the right response should be.

This is what the following video showcases. Our basic web application pulls data out of a relational database and renders it to the user. Our assumption is that the application should continue rendering the appropriate results to users even if the database link goes away.

How can we build that knowledge? By using the Chaos Toolkit to declare and run our experiment, while relying on Weaveworks to observe our system as this happens and to automate the deployment of our response to the discovered weakness.

Full end-to-end demo

Our System

Our system lives inside a Kubernetes cluster, including the database. We use Zalando's Patroni-based operator to manage the PostgreSQL lifecycle. Even if you run your database outside the Kubernetes cluster, the link between application and database is a weakness worth investigating through Chaos Engineering.

Topology of our application, courtesy of Weave Scope

Notice that our application is connected to one of the database pods, but our system runs two instances, a leader and a follower. Patroni supervises them for us.

Our Hypothesis

The link we are concerned about is the one between frontend-app and frontend-db-0. Do we harm our users if this link goes down?

Our steady state, meaning the way our system looks when it behaves normally, is that our application keeps serving users as it should, whatever the condition of the rest of the system.

Our hypothesis is therefore that removing the database link from the application will not impact our users.

Below is the excerpt from the Chaos Toolkit experiment declaring the steady-state hypothesis: the frontend-app deployment must be available and healthy, and the application endpoint must respond with an HTTP 200:

    "steady-state-hypothesis": {
"title": "Services are all available and healthy",
"probes": [
{
"type": "probe",
"name": "application-should-be-alive-and-healthy",
"tolerance": true,
"provider": {
"type": "python",
"module": "chaosk8s.probes",
"func": "microservice_available_and_healthy",
"arguments": {
"name": "frontend-app",
"ns": "default"
}
}
},
{
"type": "probe",
"name": "application-must-respond",
"tolerance": 200,
"provider": {
"type": "http",
"verify_tls": false,
"url": "https://app.cosmos.foo/"
}
}
]
}
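
Each probe is simply a call into the chaosk8s Python extension or a plain HTTP request, and the tolerance is the value the probe's result is compared against: the Kubernetes probe must return true, and the HTTP probe must answer with a 200 status code. As a rough hand-run equivalent of those two checks, here is a small Python sketch, assuming the chaostoolkit-kubernetes extension and requests are installed and your kubeconfig points at the demo cluster:

    # Rough, hand-run equivalent of the two steady-state probes above.
    import requests
    from chaosk8s.probes import microservice_available_and_healthy

    # Tolerance true: the frontend-app deployment must be available and healthy.
    healthy = microservice_available_and_healthy(name="frontend-app", ns="default")

    # Tolerance 200: the application endpoint must answer with an HTTP 200.
    status = requests.get("https://app.cosmos.foo/", verify=False).status_code

    print(healthy, status)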

Our Experiment

Our chaos experiment should declare the steady state we just described, as well as the actions we are going to take to test the hypothesis.

There are various ways we could perform this test: removing the Kubernetes service so that the application can no longer resolve the database location, or simply killing the database pod. The former may not help us here, because the application has already established a long-lived connection to the database; removing the service would only affect new connections, which we do not control due to the framework we are using. So we fall back to terminating the database pod the application is connected to.

This experiment is declared using the Chaos Open API experiment format and can be found here.

Below is the action that terminates the database pod the application is connected to. It targets the Patroni leader through the spilo-role=master label selector, restricted to pods matching the frontend-db name pattern.

    {
        "type": "action",
        "name": "terminate-db-master",
        "provider": {
            "type": "python",
            "module": "chaosk8s.pod.actions",
            "func": "terminate_pods",
            "arguments": {
                "label_selector": "spilo-role=master",
                "name_pattern": "frontend-db-[0-9]$",
                "ns": "default"
            }
        },
        "pauses": {
            "after": 2
        }
    }
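
As with the probes, the action's provider is an ordinary Python function from the chaostoolkit-kubernetes extension. If you want to check that the label selector and name pattern really single out the Patroni leader before unleashing the experiment, you can call the same function by hand against a disposable cluster. A sketch using the same arguments as above (do not run this against a cluster you care about):

    # Hand-run version of the terminate-db-master action, useful for verifying
    # the selector against a throwaway cluster. Same arguments as the experiment.
    from chaosk8s.pod.actions import terminate_pods

    terminate_pods(
        label_selector="spilo-role=master",  # Patroni/Spilo labels the leader pod
        name_pattern="frontend-db-[0-9]$",   # only our application's database pods
        ns="default",
    )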

Running the Experiment

Once you have all the pieces in place, you can run the experiment by pointing the Chaos Toolkit's chaos run command at your experiment file.

Using its Slack integration, the Chaos Toolkit notifies the team that an experiment is running:

Notice how the steady-state hypothesis was met before the experiment's method ran, but no longer held once the real-world conditions had been varied. Indeed, the application broke down and our users suffered.

Responding to the Weakness

The experiment showed us that we do have a weakness in our system. At that stage, you should gather the team together and start considering your options. Is this rare enough that you are willing to take the risk? Is there an operational way of reducing that risk? Can the code be changed to deal with such failures?

Whatever you decide, the discussion itself is important: it builds the team's confidence in the system.

In this demonstration we fix the code so that, when the connection drops, it retries and connects to the other database instance: a user may see a small slowdown but still gets the page they asked for.
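
The actual patch lives in the demo repository linked at the end of this post; conceptually, it amounts to catching the broken connection and retrying against the surviving instance. Below is a minimal sketch of that idea, assuming a psycopg2-based data layer; the host names, credentials and timings are illustrative, not the demo's real configuration:

    # Illustrative retry/failover sketch; hosts, credentials and timings are
    # assumptions, not the demo's actual configuration.
    import psycopg2

    DB_HOSTS = ["frontend-db-0.frontend-db", "frontend-db-1.frontend-db"]

    def get_connection():
        """Try each database instance in turn and return the first live connection."""
        last_error = None
        for host in DB_HOSTS:
            try:
                return psycopg2.connect(
                    host=host,
                    dbname="frontend",
                    user="frontend",
                    password="secret",
                    connect_timeout=2,
                )
            except psycopg2.OperationalError as error:
                last_error = error  # this instance is gone, try the next one
        raise last_error

    def fetch_rows(query):
        """Run a query, reconnecting to the surviving instance if the link drops."""
        try:
            with get_connection() as conn, conn.cursor() as cursor:
                cursor.execute(query)
                return cursor.fetchall()
        except psycopg2.OperationalError:
            # The connection died mid-request: retry once against whichever
            # instance is still up, at the cost of a small slowdown for the user.
            with get_connection() as conn, conn.cursor() as cursor:
                cursor.execute(query)
                return cursor.fetchall()

Because Patroni supervises the cluster and promotes the follower shortly after the leader disappears, retrying against the other instance is enough to keep serving pages while the system heals.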

GitOps Automation FTW

Now that a fix is in place, you can release it and, once it has been built, let Weave Cloud pick it up and update your Kubernetes cluster with the new version, following this process:

The GitOps workflow, from Weaveworks

In our demo we use Travis CI and Docker Hub.

Once Weave Cloud detects that a new version of our application has been built and released (the CI side), it updates the Kubernetes manifest and applies it back to our cluster (the CD side). On the Weave Cloud dashboard you will see a message similar to this one:

Profit!

Once your application has been redeployed, you can run the experiment again and see whether the hypothesis now holds even after the database pod is terminated.

Rinse and Repeat

Chaos Engineering should be an ongoing activity, automated whenever possible, so that your confidence in your system stays high continuously.

The code for this demo is open-source and available here.

Please join the open-source Chaos Toolkit community to discuss and learn how to improve your systems through Chaos Engineering.