Chaos Pygmy Marmoset

Rosemary Wang
6 min read · Dec 4, 2017


If you’ve never seen a pygmy marmoset, feast your eyes on this minute, adorable creature.

Pygmy Marmoset!

The pygmy marmoset is known to be the smallest monkey in the world. So what do I mean when I mention “Chaos Pygmy Marmoset”?

A few key events happened in the past week that triggered this post, including a presentation I gave.

After I presented, quite a few people were interested in learning more about the process, so I've outlined it here for posterity.

The Origin Story

It was almost four years ago that I first heard about Netflix’s Simian Army, which included the Chaos Monkey. Netflix created Chaos Monkey to randomly take down virtual machines and processes, allowing teams to test their application under failure and build resiliency. When I brought this up to my teammates, they were horrified at the idea of terminating machines and processes at random (even though I wasn’t recommending it in production). I dropped the idea of executing a full Chaos Monkey. My joke to some of the developers was that our infrastructure and applications weren’t ready for even a Chaos Capuchin Monkey. Instead, I drafted a formulation of Chaos Pygmy Marmoset, a much more controlled version of Chaos Monkey that could help train new operators and build better systems without terminating anything at random. I realized,

what if I documented all of the bugs I ran into during the development process and mocked those for operators, sort of like a quiz?

Over time, as Chaos Monkey grew into a larger field of Chaos Engineering, I felt that Chaos Pygmy Marmoset could finally be classified. Even more importantly, I realized that it could be a stepping stone for operators to gain confidence and understanding about Chaos Engineering.

So what is it?

I use Chaos Pygmy Marmoset to describe time-boxed exercises based on known problems, administered to operations teams (or even developers!). Whenever we develop or engineer an application or platform, we’re in a constant process of discovering bugs and fixing them. When we know the application or platform well, it’s pretty intuitive where to look, even if the answer is some obscure log message hidden in a directory. We gain this tacit knowledge through experience: it requires technical expertise but can only be built up by the act of doing. Learning to ride a bicycle is a great example.

Chaos Monkey at its core exposes engineers to failure, providing a framework for developing and improving resiliency.

Chaos Pygmy Marmoset exposes teams to the possibilities of failure, building a common foundation for developing and improving recovery.

Most organizations don’t have the appetite to run Chaos Monkey, so they won’t have structured, random failures to help developers build resiliency. Chaos Pygmy Marmoset takes the subset of known failures or problems, builds that knowledge within the team supporting the application or platform, and facilitates a conversation about building resiliency against those specific failures or problems.

How do you prepare?

When you start the development process, document each bug or defect you come across that causes a fatal error upstream (or prevents your application from running). Some examples might be:

  • System plugin failure, preventing application from running
  • Incorrect configuration
  • No network connectivity to database
  • Health check failures
  • A dependency does not exist
  • And more!

I try to format each one like a bug report, since that’s what it is, after all. I include:

  • Issue Description
  • Steps to Reproduce
  • Expected Result
  • Actual Result
  • Time to Resolution

These bug reports should go into your repository of “known problems”. Each report also has a time to resolution, which lets you estimate how long someone on your team with no prior awareness of the problem might take to debug and fix it.
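
For concreteness, here is a minimal sketch of what one of these entries might look like if you kept them as structured data. The field names and the example bug are my own illustration, not a prescribed format:

```python
from dataclasses import dataclass


@dataclass
class KnownProblem:
    """One entry in the repository of known problems."""
    issue_description: str
    steps_to_reproduce: list[str]
    expected_result: str
    actual_result: str
    time_to_resolution_minutes: int  # how long it took me to debug and fix it


# Hypothetical example entry for an "incorrect configuration" bug.
bad_db_host = KnownProblem(
    issue_description="Application returns 500s on every request after a deploy",
    steps_to_reproduce=[
        "Set the database host in the config file to a hostname that does not resolve",
        "Restart the application",
        "Send any request to the application",
    ],
    expected_result="The application serves requests normally",
    actual_result="Every request fails with a connection timeout",
    time_to_resolution_minutes=25,
)
```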

The Chaos Pygmy Marmoset Workflow

Below is the workflow to do a Chaos Pygmy Marmoset exercise.


Schedule Some Time

I schedule some time with a subset (or all!) of the team members who are going to support my application. If you have an operations team involved, you’ll need to get buy-in from your operations team manager. It’s usually not a hard sell, since the session is a great way for operators to develop their skills and for you to discover problems with your operations setup. If your operators are remote, you can use a conference line and ask everyone to make sure they are ready to share their screen.

Choose your Bugs

I try to pick bugs whose time to resolution fits into the time slot. I usually add about 10–20 minutes as a buffer, since the operators may not be fully familiar with the process. For each exercise, I work out how to reproduce the bug using the “Steps to Reproduce” section of its report. I try to write a script for each bug (a.k.a. exercise) so I don’t have to sit there manually issuing commands. Once I’ve compiled the list of bugs and the scripts to reproduce them, I’m ready to gather the team.
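
As an illustration, here is a rough sketch of what one of those reproduction scripts might look like for the “incorrect configuration” bug above. Everything in it is hypothetical (the config path, the application name, and the use of systemd), so treat it as a pattern rather than something to run verbatim:

```python
#!/usr/bin/env python3
"""Hypothetical reproduction script for the "incorrect configuration" bug.

Assumes the application reads /etc/myapp/config.ini and is managed by
systemd; adjust the path and restart command for your own setup.
"""
import configparser
import shutil
import subprocess

CONFIG_PATH = "/etc/myapp/config.ini"  # hypothetical config location


def break_database_host() -> None:
    # Keep a backup so the change can be reverted after the exercise.
    shutil.copy(CONFIG_PATH, CONFIG_PATH + ".bak")

    config = configparser.ConfigParser()
    config.read(CONFIG_PATH)
    config["database"]["host"] = "db.invalid"  # hostname that will never resolve

    with open(CONFIG_PATH, "w") as f:
        config.write(f)

    # Restart the application so it picks up the broken configuration.
    subprocess.run(["systemctl", "restart", "myapp"], check=True)


if __name__ == "__main__":
    break_database_host()
```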

Initiate Marmoset on Bug

When the session starts, I level-set with the participants on a few ground rules:

  • A participant must “drive” the troubleshooting from their laptop. This is the cue for them to nominate someone.
  • I can provide hints but the team must debug to the best of their ability.
  • This exercise is time-boxed.

At this point, I choose my first bug and run the script associated with the bug, initiating the first Chaos Pygmy Marmoset exercise.
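
If it helps to see the mechanics, here is a small sketch of how the facilitator’s side of this step might look, with a hypothetical catalogue of reproduction scripts and a simple time box. The script paths, bug names, and the 45-minute limit are all assumptions of mine:

```python
import subprocess
import time

# Hypothetical mapping of exercises to their reproduction scripts.
EXERCISES = {
    "incorrect-configuration": "./repro/break_database_host.py",
    "missing-dependency": "./repro/remove_plugin.py",
}

TIME_BOX_MINUTES = 45  # time to resolution from the bug report, plus a buffer


def initiate_marmoset(bug: str) -> None:
    """Trigger the bug, then count down the time box."""
    subprocess.run(["python3", EXERCISES[bug]], check=True)
    print(f"Exercise '{bug}' started. The team has {TIME_BOX_MINUTES} minutes.")
    deadline = time.monotonic() + TIME_BOX_MINUTES * 60
    while time.monotonic() < deadline:
        time.sleep(60)  # facilitate, give hints, and take notes in the meantime
    print("Time is up. Move to the retrospective.")


initiate_marmoset("incorrect-configuration")
```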

Alert Fires (or Mock a User Report)

When the script triggers the bug, I usually play the role of the user who sends a report to operations that something has gone wrong. Sometimes the bug has an alert associated with it, so I let the alert fire as if it were in production. If there is confusion over how to interpret the alert, I make a note to clarify the alert later. For my mock user report, I try to be ambiguous, since most user-reported problems don’t come with full descriptions.

Ops Debug

The entire team is responsible for figuring out the root cause of the problem and directing the driver of the troubleshooting session. During the session, you’ll get a lot of questions and sometimes even frustration. If a team has been running in circles for more than five minutes, I usually give them a subtle hint about where to look and what they might want to consider. Keep in mind, however, that debugging is an art: everyone has their own way of intuitively figuring out what happened, and it can be subjective.

The key objective during the debug is to allow operators and new developers to gain familiarity with the system, look for problems, and recognize system nuances.

When the team fixes the problem or time has run out, we do a short retrospective on the problem and its solution.

Retrospective

As the developer or engineer on the project, I want to know what went well and what can be improved. Some specific questions I like to ask are:

  • Was the alert clear in context?
  • Are the log messages helpful?
  • What was a pain point during troubleshooting?

These questions are usually a good way to gauge the operations team’s confidence with the system. If there is more time after the retrospective, I’ll ask another participant to drive the next bug, and I unleash the next pygmy marmoset.

Why Bother with Chaos Pygmy Marmoset?

Most of the time, we operationalize an application or a platform, but we never get a sense of whether it’s ready for operational use. I see this as another step in a pipeline, a stage to determine operational readiness.

We need to start thinking about promoting operations between environments, the same way we promote applications.

During Chaos Pygmy Marmoset, if the alert causes confusion, I know that the alert is not ready for production use. If the alert is clear and the team can effectively use it to triage, then I know I can promote it to production. Furthermore, if a particularly complex feature is being released and the team struggles during the exercise, I know that the feature may not be ready from a support perspective. With all of this talk about Chaos Engineering and site reliability, we need to figure out some way of challenging our teams and getting the failure experience we need — Chaos Pygmy Marmoset might be an intermediate way of accomplishing it.

