Why You Should Occasionally Kill Your Data Stack

Three inspiring chaos experiments to help any data team make their stacks more resilient

Sven Balnojan
Geek Culture


Picture by Sven Balnojan.

Ever randomly deleted instances, running containers, or databases inside your production system?

Sometime around 2008, a crazy, or rather brilliant, engineer at Netflix started randomly deleting EC2 instances belonging to engineering teams in the production system.

Why on God’s earth would he do that? It turns out he was teaching teams how to set up EC2 instances properly so they stay reliable. Since instances do fail from time to time for no apparent reason, there is an easy way to make sure they never “truly” fail. That solution is called an “auto-scaling group”, and it will automatically (hence the name) spin up a new instance if the old one breaks.
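The idea that engineer applied fits in a few lines: pick a running instance at random, kill it, and see whether the team’s setup recovers on its own. A minimal sketch of that logic, assuming a plain list of instance IDs and a `terminate` callback (both hypothetical; a real version would call the cloud provider’s API instead):

```python
import random

def chaos_monkey(instance_ids, terminate, rng=random):
    """Pick one running instance at random and terminate it.

    instance_ids: list of instance identifiers (hypothetical format)
    terminate:    callback that actually kills the instance
    Returns the ID of the terminated instance, or None if there was nothing to kill.
    """
    if not instance_ids:
        return None  # nothing running, nothing to break
    victim = rng.choice(instance_ids)
    terminate(victim)
    return victim

# Usage with a stand-in terminate function that just records the victim:
killed = []
victim = chaos_monkey(["i-abc123", "i-def456", "i-789ghi"], killed.append)
```

If the team behind those instances has an auto-scaling group in place, the “victim” is replaced automatically and nobody gets paged; if not, the gap becomes painfully visible, which is exactly the point.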

Lo and behold, Netflix took this practice rather seriously and basically invented the field of “chaos engineering”, in which faults are deliberately injected into a system to test and improve its robustness. Data teams, with their monolithic software stacks, are particularly prone to large-scale unexpected failures, and as such can learn a lot from chaos engineering.

While there are plenty of resources on chaos engineering for software engineers, there isn’t much for data engineers. So…
