Chaos Engineering and Mesos

Roberto Veral del Pozo
Datio

Modern applications are distributed by default. They are composed of a bunch of services that have to communicate with each other (a back-end and a database, for example). Distributed computing is hard, and when these applications are deployed in the cloud it’s even harder. In fact, the number of variables that can produce an outage of these applications is growing very fast. These applications usually depend on public cloud services like Amazon or Google, and those services can suffer outages, instances can die, and other infrastructure-related issues can appear. When working in the cloud, resilience (fault tolerance) must be in the DNA of an application. This means that if some failure happens (and keep in mind that it will happen) the application should remain available with very little impact on its customers. But how can you be sure that your application is resilient? How do you know that your Amazon S3-backed service will survive a total outage of an entire Amazon S3 region? (It seems unlikely, but it happened in March 2017.)

With these considerations in mind, Netflix engineers confronted these challenges while moving their platform to AWS and gave us the solution: the principles of Chaos Engineering.

The manifesto defines Chaos Engineering as:

The discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Essentially, Chaos Engineering is about gaining confidence in an application’s resiliency by injecting “once in a blue moon” failures (as Bruce Wong of Netflix describes them) and measuring the impact these failures have on the application’s performance.

This controlled failure injection helps to discover the weaknesses of the application in a controlled environment before they show up on their own, so it’s possible to learn about them, fix them and thus increase the application’s availability. Even if you aren’t able to fix a problem, your team will gain knowledge about how to deal with it, so if it appears some day at 3:00 AM the team will know how to solve it in a pretty straightforward way.

For Netflix, every failure and system outage is an opportunity to learn about the application’s behavior, how to fix it and how to improve its design. For example, during a total Amazon ELB region outage in December 2012, they discovered that they needed a multi-region active-active replica to survive this kind of outage, something that hadn’t been considered when Netflix’s architecture was designed.

It’s recommended to drive these experiments in a production environment, because if you want to see the real impact you need real traffic. The bad news about running a chaos experiment in production is that it can cause an entire application outage. For this reason, Netflix launches its experiments during office hours, so if an experiment causes an outage the engineering team can solve it as soon as possible.

The phases of a chaos experiment, as defined in the manifesto, are the following:

  1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
  2. Hypothesize that this steady state will continue in both the control group and the experimental group.
  3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
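
As a rough illustration of step 4, the steady-state check can be as simple as comparing an error-rate metric between the two groups. Here is a minimal Python sketch; the metric source and the tolerance are hypothetical, and in a real experiment error_rate would query your monitoring system:

import random

def error_rate(group: str) -> float:
    """Fraction of failed requests for a group. Placeholder: in a real experiment
    this would query your monitoring system (Prometheus, Atlas, ...)."""
    return random.uniform(0.0, 0.02)  # dummy value so the sketch runs end to end

TOLERANCE = 0.01  # hypothetical: tolerate at most a 1-point difference in error rate

def steady_state_holds() -> bool:
    """The hypothesis holds if the experimental group behaves like the control group."""
    return abs(error_rate("experimental") - error_rate("control")) <= TOLERANCE

print("hypothesis holds" if steady_state_holds() else "hypothesis disproved")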

It can be briefly summarized as: break things, see what happens and act according to the outcome. There are a lot of experiments you can run: stop server instances, simulate network partitions, add network latency, revoke connectivity to a database… Be creative (who would have imagined that a Dyn DNS outage could happen, making half of the internet’s services unavailable?). Keep in mind that some experiments should be simulated (this is why it’s called controlled failure injection). For example, if you want to see what happens when a service can’t access a database table, don’t drop the table in production; simply revoke the permissions to access the table.

If a chaos experiment is successful, you shouldn’t stop running it. As the application evolves and new features are added, it’s important to repeat the test often to catch regressions and to make sure that the new features follow the same resiliency principles as the rest of the application. That is where having these experiments automated can do the trick (take a look at the Simian Army).

A quick chaos experiment example

As a quick example of how chaos engineering works: in a previous post we talked about the split-brain problem in Akka Cluster and how to avoid it, but we hadn’t tested the solution. To do so, we can run a chaos experiment that causes a network partition between the instances of the cluster and see whether we end up with two clusters or everything works as expected. To simulate the network partition we can, for instance, add firewall rules to the affected instances that block the connections between them.
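
A minimal sketch of that simulation, run locally (as root) on one of the affected instances. The peer IP addresses are hypothetical:

import subprocess

PEERS = ["10.0.0.12", "10.0.0.13"]  # hypothetical IPs of the other cluster nodes

def set_partition(enabled: bool) -> None:
    """Add (or remove) firewall rules that drop all traffic to and from the peers."""
    action = "-A" if enabled else "-D"  # append the rules to break, delete them to heal
    for peer in PEERS:
        subprocess.run(["iptables", action, "INPUT", "-s", peer, "-j", "DROP"], check=True)
        subprocess.run(["iptables", action, "OUTPUT", "-d", peer, "-j", "DROP"], check=True)

set_partition(True)   # the node can no longer talk to its peers
input("Partition active. Check the cluster state, then press Enter to heal it...")
set_partition(False)  # remove the rules and let the cluster recover

While the partition is active, you can check whether the downing strategy behaves as expected and the cluster doesn’t split in two.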

Applying chaos engineering in Mesos

Apache Mesos is a resource manager that lets you schedule applications and services across the whole datacenter from a single point, using APIs (much as an OS kernel does with the processor cores of a single machine). When you use Mesos as a service orchestrator, you usually use Marathon to abstract away some of Mesos’ complexity and manage those services. Marathon is a Mesos scheduler that manages services and applications, allowing you to scale them and adding self-healing (if an instance of a service is killed, it launches a new one). It provides its own API to manage the deployed services. An instance of a service corresponds to a task running in Mesos, so the natural way to inject chaos into an application is to kill some of its services’ tasks and check the application’s availability.

With this in mind, driving a chaos experiment in Marathon is pretty easy (it’s always easier to destroy than to build). Taking a look at the Marathon REST API, we can see that there are endpoints to list all the services and their associated tasks:

GET /v2/apps?embed=apps.tasks // for all the services running in Marathon
GET /v2/apps/{appId}?embed=app.tasks // for a concrete service
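
For example, a small Python sketch that fetches the IDs of all running tasks (the Marathon address is hypothetical):

import requests

MARATHON = "http://marathon.example.com:8080"  # hypothetical Marathon address

# List every service together with its tasks and collect the task IDs.
response = requests.get(MARATHON + "/v2/apps", params={"embed": "apps.tasks"})
response.raise_for_status()
task_ids = [task["id"] for app in response.json()["apps"] for task in app.get("tasks", [])]
print(task_ids)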

From this query you can get the IDs of the running tasks of the services deployed in Marathon. Then simply select the ones you want to kill (either randomly or with another algorithm of your choice, such as using Mesos attributes to find the tasks running in a specific rack and killing them, simulating a total rack outage) and call the following endpoint to kill them:

POST /v2/tasks/delete { "ids": [ "taskId1", "taskId2", ...  ] }
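
Putting the two calls together, a minimal Python sketch of the whole experiment could look like this (again, the Marathon address and the number of victims are hypothetical):

import random
import requests

MARATHON = "http://marathon.example.com:8080"  # hypothetical Marathon address

# Same query as before: gather the IDs of every running task.
apps = requests.get(MARATHON + "/v2/apps", params={"embed": "apps.tasks"}).json()["apps"]
task_ids = [task["id"] for app in apps for task in app.get("tasks", [])]

# Pick a couple of random victims and kill them; Marathon's self-healing
# should relaunch them while the application stays available.
victims = random.sample(task_ids, k=min(2, len(task_ids)))
requests.post(MARATHON + "/v2/tasks/delete", json={"ids": victims}).raise_for_status()
print("Killed tasks:", victims)

After killing the tasks, check your steady-state metric to see whether the application kept serving traffic while Marathon relaunched the instances.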

Conclusion

Chaos Engineering is a useful way to test and improve an application’s availability and resiliency, but it has to be done carefully, because it runs in production and the experiments can cause a system outage. However, the benefits of driving these kinds of experiments are worth the risk.

Maybe if Amazon had run these experiments, they would have discovered that the red status icon in the dashboard for the S3 US-East-1 region was stored in the same region it was supposed to report on…

Links and resources

https://arxiv.org/pdf/1702.05843.pdf

Originally published at www.datio.com on April 17, 2017.
