Chaos engineering with Otoroshi and the Snow Monkey

Mathieu ANCELIN
Published in OSS by MAIF
5 min read · Jan 29, 2019

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

That’s the first thing you can read on the website of Chaos Engineering, https://principlesofchaos.org/. It’s a community built around principles stated by Netflix engineers. While overseeing Netflix’s migration to the cloud in 2011, Greg Orzell had the idea to address the lack of adequate resilience testing by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered inevitable, driving developers to consider built-in resilience an obligation rather than an option. By regularly “killing” random instances of a software service, it was possible to test a redundant architecture and verify that a server failure did not noticeably impact customers.

Based on those principles, Netflix built the Simian Army, a collection of tools that constantly test their production system in order to improve its resiliency.

Everything fails all the time
Werner Vogels, CTO of Amazon

Here at MAIF, we thought it was a good idea to use chaos engineering principles to improve the quality of our apps and services, but also the responsiveness of the product teams in case of production problems. Unfortunately, we cannot really shut down service instances just like that. I mean, we could, but as we are using a PaaS (Clever Cloud), it is not possible to kill a single instance of a service; we can only stop the whole service. The Clever Cloud team manages the monitoring of the instances, load balancing between them, restarting of crashed instances, etc., so killing instances is not a viable model for us (it would only be useful to test what happens when a service goes entirely down). As we can’t operate at the VM/container/instance level, the next thing common to all our services is HTTP.

Otoroshi, the guardian of the shrine

As you may already know, we use Otoroshi a lot, both as an HTTP reverse proxy and an API management tool. During the summer of 2018, we developed a dedicated tool inside Otoroshi that could cause global malfunctions in HTTP communications on our platform. The Snow Monkey was born.

The Snow Monkey

The Snow Monkey is a nickname for the Japanese macaque, well known for taking baths in hot springs when winter comes.

Japanese macaques at the Jigokudani hot spring in Nagano have become notable for their winter visits to the spa.

The Snow Monkey in Otoroshi is able to generate four types of faults (for now) on HTTP requests/responses (a rough sketch of the idea is given below):

  • Artificial long request body
  • Artificial long response body
  • Latency injection
  • Unexpected responses injection
The settings for possible faults
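
To make those fault types more concrete, here is a minimal, hypothetical sketch in Python of what injecting such faults into an HTTP exchange could look like. This is not Otoroshi’s actual code; the ratio, padding size, status codes and handler shape are made up for illustration.

```python
import random
import time

# Hypothetical illustration values, not Otoroshi settings.
FAULT_RATIO = 0.2              # fraction of requests impacted by a fault
EXTRA_LATENCY_MS = 2000        # latency injection
PADDING_BYTES = 512_000        # artificial long request/response bodies
BAD_STATUSES = [502, 503]      # unexpected responses injection


def with_snow_monkey(forward):
    """Wrap an upstream handler and randomly inject one of the four faults."""
    def handler(request):
        if random.random() >= FAULT_RATIO:
            return forward(request)  # most requests pass through untouched
        fault = random.choice(["latency", "long_request", "long_response", "unexpected"])
        if fault == "latency":
            time.sleep(EXTRA_LATENCY_MS / 1000.0)
            return forward(request)
        if fault == "long_request":
            request["body"] = request.get("body", b"") + b"\0" * PADDING_BYTES
            return forward(request)
        if fault == "long_response":
            response = forward(request)
            response["body"] += b"\0" * PADDING_BYTES
            return response
        # unexpected responses injection
        return {"status": random.choice(BAD_STATUSES), "body": b"snow monkey was here"}
    return handler


# Demo with a dummy upstream service.
@with_snow_monkey
def upstream(request):
    return {"status": 200, "body": b"hello"}

print(upstream({"path": "/", "body": b""}))
```

The real Snow Monkey does this at the reverse proxy level, so services and their clients see the faults exactly as they would see genuine network problems.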

For each fault, it is possible to write code, either client side or server side, to handle the failure correctly and avoid crashing. The idea here is to enable faults and spot issues so you can fix them. Each fault is activated for a percentage of requests (we don’t want to fail everything, but we could). It is possible to manually activate these faults on any service, or you can just let the monkey choose the services for you ;)
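
As an illustration, here is a minimal sketch of client-side handling for these faults, written in Python with the `requests` library. The URL, retry count and timeout are placeholders, not values recommended by Otoroshi.

```python
import requests

SERVICE_URL = "https://api.example.com/resource"  # placeholder URL


def resilient_get(url, retries=3, timeout=2.0):
    """Call a service defensively: bound latency with a timeout,
    retry on unexpected responses, and fail explicitly otherwise."""
    last_error = None
    for _ in range(retries):
        try:
            response = requests.get(url, timeout=timeout)  # guards against latency injection
            if response.status_code == 200:
                return response.json()
            # unexpected responses injection (e.g. 502/503): retry
            last_error = RuntimeError(f"unexpected status {response.status_code}")
        except requests.RequestException as exc:  # timeouts, connection errors
            last_error = exc
    raise last_error


if __name__ == "__main__":
    try:
        print(resilient_get(SERVICE_URL))
    except Exception as exc:
        # all attempts failed: take a fallback path instead of crashing
        print(f"service unavailable, degrading gracefully: {exc}")
```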

The main feature of the Snow Monkey is the scheduling of outages. In the main settings screen of the Snow Monkey, you can choose which services will be impacted, for how long, etc.

Snow Monkey page

In this view you can choose how many outages will happen during each working period (preferably during business hours, so someone can act if something goes really wrong). You can impact either one service per service group (1 × nbrOfOutages) or all the services in a service group (services × nbrOfOutages). The Snow Monkey will try to spread the outages over the whole working period to avoid the “everything is broken” issue (a rough sketch of this spreading idea is given below). You can also select how long outages will last and which service groups will be impacted. Finally, you can use the dry mode, where an outage only generates an event (that you can display in Slack or wherever you want) but no actual faults. It is quite handy during the transition period.

Snow Monkey settings
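
The spreading idea can be pictured with a small sketch like the one below. This only illustrates the principle of distributing outages across a working period; it is not the actual Otoroshi scheduling code, and the hours, counts and durations are arbitrary.

```python
import random
from datetime import datetime, timedelta

# Arbitrary illustration values, not Otoroshi defaults.
WORK_START = datetime(2019, 1, 29, 9, 0)   # start of the working period
WORK_END = datetime(2019, 1, 29, 18, 0)    # end of the working period
OUTAGES_PER_PERIOD = 3
OUTAGE_DURATION = timedelta(minutes=10)


def spread_outages(start, end, count, duration):
    """Split the working period into equal slots and schedule one outage
    at a random point inside each slot, so they never all pile up at once."""
    slot = (end - start) / count
    outages = []
    for i in range(count):
        slot_start = start + i * slot
        offset = random.uniform(0, (slot - duration).total_seconds())
        outage_start = slot_start + timedelta(seconds=offset)
        outages.append((outage_start, outage_start + duration))
    return outages


for begin, finish in spread_outages(WORK_START, WORK_END, OUTAGES_PER_PERIOD, OUTAGE_DURATION):
    print(f"outage from {begin:%H:%M} to {finish:%H:%M}")
```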

At the bottom of the screen, you have the list of current outages.

List of current outages

To be meaningful, the Snow Monkey has to be enabled on production services, every day, all the time. This is the only way to benefit from such a methodology.

If you have a critical user-facing web application, it is possible to tell the Snow Monkey to avoid creating outages for it (using the “Include user facing app.” toggle and the “User facing app.” toggle in the service page itself) while still creating outages for the rest of the services in the service group.

With the Snow Monkey, we made our system globally more resilient and safer. The product teams are more aware of production issues and more responsive as we roll out the Snow Monkey on various production services. But it’s only the beginning of the journey: we hope to add new kinds of faults to the Snow Monkey very quickly and to use it everywhere. If you have a nice idea for a new kind of fault, do not hesitate to create an issue on the Otoroshi GitHub repo ;)
