How Netflix uses chaos in its favor

How many times have you come home excited to watch a movie on Netflix?
You arrive tired from work or school, and there it is: Netflix, ready to serve you and help you forget the problems you had today.

I am sure many of us can relate to that situation: Netflix is the buddy that doesn’t let us down, like mate for us Argentines.
We can think this way because, to be honest, the company provides a great service. Think about how rarely you have received a 404 or 503 error, or had to reload the website because the connection was lost or reset. The truth is that if you did run into any of these problems, the cause was probably your internet provider and not Netflix itself.

How does Netflix manage to be online virtually all the time?
Netflix decided to migrate from big datacenters to the cloud: less powerful servers, but many more of them, spread across the world.

Not every server does the same job, but they are all deployed under the principle of redundancy: if a node goes offline, another one can perform the same functions, possibly at a lower speed. And because Netflix uses Amazon Web Services, it does not even need to lose performance; it can simply scale dynamically based on demand. A failure would then imply only a higher cost, not lower performance.
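To make the redundancy idea concrete, here is a minimal sketch in Python. It is only an illustration, not how Netflix actually routes traffic: the `Node` class, the `serve_with_failover` function and the node names are all hypothetical. A request is tried against each replica in turn until one of them answers.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one server in a redundant group."""
    name: str
    online: bool = True

    def handle(self, request: str) -> str:
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return f"{self.name} served {request}"


def serve_with_failover(nodes: list[Node], request: str) -> str:
    """Try each redundant node in order and return the first successful response."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue  # this node is down, fall through to the next replica
    raise RuntimeError("all replicas are offline")


nodes = [Node("node-a"), Node("node-b"), Node("node-c")]
nodes[0].online = False  # simulate one server going down
print(serve_with_failover(nodes, "GET /title/42"))  # node-b served GET /title/42
```

The client never notices that node-a is gone. It only sees an error if every replica is offline at the same time, which is exactly the situation redundancy is meant to make unlikely.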

You would think that a server going down in production is never desirable: in the worst case, you lose the ability to provide a service to your customers, and in the best case, your costs go up.

Why on earth, then, would Netflix encourage shutting down production servers at random? For chaos’ sake, of course.

Netflix embraces chaos
In 2011, the company publicly described a tool it had developed in-house: “Chaos Monkey” (in fact, part of a whole army). Yury Izrailevsky and Ariel Tseitlin describe it as “a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact”. Basically, it shuts down AWS instances at random.

Now, knowing that the monkey can take a server offline at any time, everyone involved in the software production cycle has to design with that possibility in mind. Netflix cannot wait for real failures to find out whether its system is fault tolerant, so it introduces the very faults it wants to survive, but in a controlled way, like a vaccine.
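The idea of injecting faults on purpose can be sketched with a toy simulation. This is not Netflix's Chaos Monkey and it does not touch AWS: `InstancePool`, `chaos_round`, the pool size and the `min_healthy` threshold are all assumptions made up for the example. Each round kills one random instance, checks that the service can still answer, and then lets a stand-in for autoscaling replace the failed instance.

```python
import random


class InstancePool:
    """Hypothetical in-memory model of a group of identical instances."""

    def __init__(self, size: int, min_healthy: int) -> None:
        self.instances = {f"i-{n:03d}": True for n in range(size)}
        self.min_healthy = min_healthy  # instances needed to keep serving

    def healthy(self) -> list[str]:
        return [name for name, up in self.instances.items() if up]

    def terminate(self, name: str) -> None:
        self.instances[name] = False

    def replace_failed(self) -> None:
        """Stand-in for autoscaling: bring every failed instance back."""
        for name in self.instances:
            self.instances[name] = True

    def can_serve(self) -> bool:
        return len(self.healthy()) >= self.min_healthy


def chaos_round(pool: InstancePool, rng: random.Random) -> str:
    """Terminate one randomly chosen healthy instance and return its name."""
    victim = rng.choice(pool.healthy())
    pool.terminate(victim)
    return victim


rng = random.Random(7)  # fixed seed so the run is repeatable
pool = InstancePool(size=5, min_healthy=3)
for _ in range(10):
    victim = chaos_round(pool, rng)
    assert pool.can_serve(), f"outage after losing {victim}"
    pool.replace_failed()
print("service stayed up through 10 injected failures")
```

If the pool were sized with no spare capacity (for example `size=3, min_healthy=3`), the assertion would fail on the very first round. That is the point of the exercise: the weakness shows up during a controlled test instead of during a real outage.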

In this case, Netflix uses a tool that lets it inject these faults in real time, in production, so that you can keep watching an episode of Stranger Things or House of Cards even while a massive outage is going on (it has happened before).

Failure is not an option, it’s a fact
As Cloudflare describes it, failure is going to happen: it can be a network problem, a server issue, and so on. The important thing is not the fault itself but how your system would respond if it happened now, not some day in the future. Robust, fault-tolerant systems with high-availability requirements seem to be heading toward cloud platforms, accepting that faults exist and acting proactively on that.