Is anyone doing in-process Chaos Monkey?

Jared Pochtar
Pagedraw
Mar 23, 2017 · 4 min read

Netflix runs a program called “Chaos Monkey” that makes servers, processes, and requests fail at random, even in production, to force their systems to reliably handle the failures inherent in distributed systems. In a distributed system, things will sometimes fail at random, so robust error handling is required. Chaos Monkey itself does not make Netflix’s systems more robust; strictly speaking, it makes them more fragile. However, it reminds engineers that failure is inevitable and that they must design for it. Consequently, Netflix’s engineers build robust systems, and their site is very reliable.

I propose making functions throw exceptions at random. Even though single-process systems do not usually suffer the same random failures as distributed systems, they are just as prone to programmer mistakes. We try to minimize these mistakes with static tools like type checking and dynamic tools like test suites, but bugs inevitably slip through. Without robust in-process error handling, such mistakes can easily crash programs and wreak far more damage than they strictly need to.

By making functions throw errors at random, programmers will be forced to robustly handle unforeseen logic errors.
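As a sketch of what this could look like, here is a hypothetical `chaotic` decorator (the name, the `failure_rate` knob, and `ChaosError` are all my invention, not an existing library) that makes any wrapped function raise at random:

```python
import functools
import random

class ChaosError(Exception):
    """Raised by chaos-wrapped functions to simulate a random failure."""

def chaotic(failure_rate=0.001):
    """Make the wrapped function fail at random.

    `failure_rate` is the probability that any given call raises a
    ChaosError instead of running the real function body.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ChaosError(f"chaos: simulated failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaotic(failure_rate=0.5)  # absurdly high rate, for demonstration only
def fetch_user(user_id):
    return {"id": user_id}
```

With something like this enabled, every caller of `fetch_user` has to decide what to do when it fails, exactly as it would for a flaky RPC.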

[Image] Above: ./build/production/spongebob

Robustly handling random failures would encourage programmers to keep assertions enabled in production. Asserting invariants in production, and robustly handling the violations, would let incorrect states be detected and, hopefully, walked out of. Incorrect states are even worse than crashes, because they can lead to data loss and corruption. Even a small improvement in detecting and escaping incorrect states could yield huge gains in software reliability. Of course, we’d still want to strip assertions from performance-critical paths in production.
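To make the “detect and walk out of” idea concrete, here is a toy sketch (the `serve` loop, the balance invariant, and the rollback strategy are all illustrative assumptions, not a recommended design): a production assertion catches an incorrect state, and the top-level loop restores a known-good state instead of crashing the process.

```python
import logging

def handle_withdrawal(state, amount):
    state["balance"] -= amount
    # Invariant kept enabled in production: balances never go negative.
    assert state["balance"] >= 0, "balance went negative"

def serve(state, withdrawals):
    """Process requests, walking out of incorrect states instead of
    letting one bad request crash the whole process."""
    for amount in withdrawals:
        try:
            handle_withdrawal(state, amount)
        except AssertionError:
            logging.exception("invariant violated; rolling back withdrawal")
            state["balance"] += amount  # restore the known-good state
    return state
```

Here the bad request is detected, logged, and undone; every other request is still served.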

Similarly, functions could “time out” if they take an unusually long time. These semantics address in-process bugs like accidental infinite loops or poor performance. In a distributed system, such behavior would simply cause RPC timeouts, which the caller would have to handle as a random failure. For all the reasons described above, timing out long-running functions in-process should help programmers handle these bugs gracefully. It may even make sense to enforce such timeouts all the time in production.
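A rough approximation of in-process timeouts, again purely a sketch: run the call on a helper thread and give up waiting after a deadline. (Python cannot forcibly kill the helper thread, so a real version of this idea would need runtime-level support to actually reclaim the work; the decorator name and timings below are illustrative.)

```python
import concurrent.futures
import time

def with_timeout(seconds):
    """Fail the wrapped call with a TimeoutError if it runs too long."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = pool.submit(fn, *args, **kwargs)
            try:
                # Raises concurrent.futures.TimeoutError past the deadline.
                return future.result(timeout=seconds)
            finally:
                pool.shutdown(wait=False)
        return wrapper
    return decorator

@with_timeout(0.05)
def slow_report():
    time.sleep(0.5)  # stands in for an accidentally slow code path
    return "done"
```

Calling `slow_report()` raises a `TimeoutError` after 50 ms, which the caller then handles like any other random failure.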

Addendum: monoliths vs microservices


Microservices are overhyped. Hacker News is aware of them, and fatigued by them. In many cases, a monolith on Heroku suffices. But there are real engineering benefits to microservices that shouldn’t be overlooked. Some good points comparing monoliths and microservices, from here and here:

  • A bug in any module of a monolith can potentially bring down the entire process. Moreover, since all instances of the application are identical, that bug will impact the availability of the entire application.
  • Distributed systems are expected to fail randomly, so programmers make robust microservices by designing to gracefully handle failure from the beginning.
  • Monoliths can be difficult to scale when different parts have conflicting resource needs. Microservices enable each service to scale independently.
  • Monoliths can’t mix languages. You can’t call libraries written in a different language, or, worse, you’re forced to adopt the language a particular library was written in even though you generally dislike it.
  • Continuous integration wait times for a monolith are longer, because of the large size of the app to be tested and corresponding large number of tests.
  • Similarly, deploying microservices is faster than monoliths, because with microservices you can rebuild and deploy a single service.
  • Monolithic applications can evolve to where no single developer understands the entirety of the application.
  • The size of a monolith can make it slower to restart.

I think all of these points are true today. I believe we can address these issues for monoliths. Some of these fixes involve innovative scheduling, others resource isolation. Most require changes to underlying languages and runtimes, and would be cool research. Some are just good engineering practice. But microservices have real issues because distributed systems are hard, and it would be ideal if we could bring their benefits to monoliths.

Bringing Chaos Monkey into single-process apps gives monoliths one of those benefits.

By making in-monolith function calls as likely to fail as remote service calls, programmers will design for failure in monoliths as well as they do in distributed microservice architectures.

If you liked this, keep your eyes peeled for Pagedraw!
