Why You Should Occasionally Kill Your Data Stack

Three inspiring chaos experiments to help any data team make their stacks more resilient

Sven Balnojan
Geek Culture


Picture by Sven Balnojan.

Ever randomly deleted instances, running containers, or databases inside your production system?

Sometime around 2008, a crazy, or rather brilliant, engineer at Netflix started randomly deleting EC2 instances belonging to engineering teams in the production system.

Why on God’s earth would he do that? It turns out he was teaching teams how to set up EC2 instances properly so they stay reliable. Since instances do fail from time to time for no apparent reason, there is an easy way to make sure they never “truly” fail. That solution is called an “auto-scaling group”, and it will automatically (hence the name) spin up a new instance if the old one breaks.
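The idea that engineer applied fits in a few lines: pick a running instance at random, kill it, and see whether the team’s setup recovers on its own. A minimal sketch of that logic, assuming a plain list of instance IDs and a `terminate` callback (both hypothetical; a real version would call the cloud provider’s API instead):

```python
import random

def chaos_monkey(instance_ids, terminate, rng=random):
    """Pick one running instance at random and terminate it.

    instance_ids: list of instance identifiers (hypothetical format)
    terminate:    callback that actually kills the instance
    Returns the ID of the terminated instance, or None if there was nothing to kill.
    """
    if not instance_ids:
        return None  # nothing running, nothing to break
    victim = rng.choice(instance_ids)
    terminate(victim)
    return victim

# Usage with a stand-in terminate function that just records the victim:
killed = []
victim = chaos_monkey(["i-abc123", "i-def456", "i-789ghi"], killed.append)
```

If the team behind those instances has an auto-scaling group in place, the “victim” is replaced automatically and nobody gets paged; if not, the gap becomes painfully visible, which is exactly the point.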

Lo and behold, Netflix took this practice rather seriously and basically invented the field of “chaos engineering”, in which faults are deliberately injected into a system to test and improve its robustness. Data teams, with their monolithic software stacks, are particularly prone to large-scale unexpected failures, and as such can learn a lot from chaos engineering.

While there are plenty of resources on chaos engineering for software engineers, there isn’t much for data engineers. So…
