Chaos In Manhattan — ML For Chaos?

Matthew Cloney
SingleStone
Apr 1, 2019 · 3 min read

On Thursday, March 28, my colleague Ryan Shriver and I attended the 4th Chaos Community Day at Work-Bench in New York City. The event was hosted by Casey Rosenthal, formerly of Netflix and one of the major players in the space known as Chaos Engineering, which is described as “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Or said another way, “fixing stuff in production.”

The invite-only event was filled with 30-minute presentations delivered by the “who’s who” in the space, including Nora Jones (also formerly of Netflix), Charity Majors (CEO of Honeycomb), and Kris Nova (of VMware). We had the opportunity to do a lightning talk: a five-minute segment posing the question, “Is there a place for intelligent chaos agents?”

Our talk looked at parallels between chaos engineering and machine learning, in particular the concepts of dropout in neural networks and reinforcement learning.

Dropout

In my mind, the thing in deep learning that most closely models chaos engineering is the concept of dropout, introduced in 2014 by Geoffrey Hinton, one of the great minds in the space, and his collaborators. There’s a problem in supervised machine learning called “overfitting,” where your algorithm essentially memorizes the training data instead of learning patterns that generalize. When this happens, you get great results on your training data, but not on your test data, which is what really matters.

Enter dropout. The concept is simple, and a little mad, like chaos engineering. For a layer of a deep network, dropout means that on every training pass through the layer, a certain percentage of its neurons are disabled, or “dropped out.” The neurons to drop are picked randomly each time. Like chaos engineering, this randomness ends up making the system stronger: it improves the predictive performance of the model. Dropout has been shown to improve results on both image and speech recognition, as well as in other problem domains.
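The idea fits in a few lines of code. Here’s a minimal sketch of the common “inverted dropout” variant using NumPy — the function name and shapes are just for illustration, not from any particular framework:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Inverted dropout: randomly zero a fraction of neurons during training.

    At inference time (training=False) the layer is left untouched, because
    surviving activations were already scaled up during training.
    """
    if not training or drop_prob == 0.0:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    # Each neuron survives with probability (1 - drop_prob), chosen anew
    # on every forward pass -- the "chaos" is re-randomized each time.
    keep_mask = rng.random(activations.shape) >= drop_prob
    # Scale survivors by 1 / (1 - drop_prob) so the layer's expected
    # output stays the same despite the missing neurons.
    return activations * keep_mask / (1.0 - drop_prob)

layer_output = np.ones((2, 8))   # toy layer of 8 neurons, batch of 2
dropped = dropout(layer_output, drop_prob=0.5)
```

Because the survivors are scaled up during training, nothing special is needed at inference time: the layer is simply used as-is.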

Intelligent Chaos Agents

Is there a place for intelligent chaos agents, and what might that look like?

One option might be reinforcement learning, which has four components:

  1. An environment
  2. An agent
  3. An action performed by the agent
  4. Finally, an observation of what that action did and the assignment of a reward

The concepts of environments and agents have close parallels in chaos engineering. The definition of an action would largely remain unchanged. An action might be “spike a CPU,” “add latency to a request,” “route all traffic into a black hole,” or other common actions used in chaos engineering.

Then there’s the observation and reward, where I believe the most work is required. Again, an observation evaluates the result of the action on the environment, and a positive or negative “reward” is assigned. The goal is to maximize the total reward, akin to a video game where you want to maximize the final score.
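To make the mapping concrete, here’s a deliberately toy, bandit-style sketch of such a loop. Everything in it is hypothetical: the action names, the fake environment, and the reward scheme (+1 when an action exposes a weakness, −1 when the system shrugs it off) are assumptions for illustration, not an existing tool:

```python
import random

# Hypothetical chaos actions; names are illustrative only.
ACTIONS = ["spike_cpu", "add_latency", "blackhole_traffic"]

class FakeEnvironment:
    """Stand-in for a real system under test."""
    def step(self, action):
        # Toy model: the system tolerates CPU spikes well, latency less so.
        survival = {"spike_cpu": 0.9, "add_latency": 0.6,
                    "blackhole_traffic": 0.3}
        healthy = random.random() < survival[action]
        # Observation -> reward: +1 if the action exposed a failure,
        # -1 if the system stayed healthy.
        return 1.0 if not healthy else -1.0

def run(episodes=500, epsilon=0.1):
    env = FakeEnvironment()
    q = {a: 0.0 for a in ACTIONS}    # estimated reward per action
    counts = {a: 0 for a in ACTIONS}
    for _ in range(episodes):
        # Epsilon-greedy: usually pick the action that has found the
        # most weaknesses so far, occasionally explore a random one.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(q, key=q.get)
        reward = env.step(action)
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]  # running mean
    return q
```

In a real setting, the hard part is exactly what the post identifies: `FakeEnvironment.step` would have to observe a live system and translate health signals into a reward, and a naive “reward for breakage” scheme would need careful guardrails.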

But this opens up lots of questions:

  • How would the reward be calculated?
  • What would the rewards look like?
  • What might the observations entail?
  • Might reinforcement learning be an appropriate vehicle for this?

We’re just starting to explore some of these ideas, and we’re curious about whether there is a place for machine learning in chaos engineering…or not. If you have thoughts or insights, please share them. We’d love to hear from you!
