4 Real-World Scenarios That Read Like Chaos Engineering Experiments

Understanding chaos engineering through everyday experiences

Richard Heffron

Published in

Capital One Tech

May 19, 2020


(scene)

You love your Mom, right? She’s the best. She brought you into this world, changed your diapers, helped you with homework, taught you how to cook, clean, and essentially take care of yourself. She’s awesome. And what did she ask from you? Not much. Maybe you could call every once in a while. It’d be nice to hear from you. Maybe get her something nice for Mother’s Day. You know, to show you care.

But in this scene, you, unfortunately, have waited until the last minute to get her that perfect gift. Sure, there’s a lot happening right now, but still. Be better. At least you found her that gaudy — I mean tasteful — necklace she wanted. In fact, you’ve secured the last one left in the online retailer’s stock. You can already picture her opening the box and beaming at you with pride and delight. This time, finally, you were the good sibling. You move the necklace into your checkout cart, you’re ready to pay, and… BOOM! The online retailer’s server crashes. Toast. Doesn’t come back up for hours. And when you’re finally able to log back in, the necklace is gone! Now you have to go with Plan B, like flowers or a fruit basket.

Sigh.

In fairness, it didn’t have to be this way. Sure, you could have been an overachiever and had the necklace lined up weeks ago, but by the same token, that online retailer should have been better prepared for the increased Mother’s Day traffic.

(/scene)

***

Chaos engineering experiments would have been the perfect way to stress test the server’s ability to handle a nation of children waiting until the last minute to get their mom a gift.

There’s no shortage of articles out there explaining chaos engineering, so I won’t dive too deeply into the origins. In a nutshell, chaos engineering allows you to experiment on a software system to see how it will react to various failures and unexpected disruptions such as:

An EC2 instance suddenly stopping
An unexpected CPU or memory spike
Unexpected network latency
An ECS container getting deregistered
A Kubernetes node getting deleted

Chaos engineering experiments are a safe and practical way of testing a software system’s resiliency — its ability to tolerate these types of failures while still ensuring adequate quality of service.

Instead of the same old boring chaos engineering overview, I thought it would be fun to lay out some tech experiments and provide corresponding real-world scenarios.

1. Chaos Engineering Experiments — Load Balancers

In the tech world, load balancers distribute incoming network traffic across a group of backend servers. They route requests to ensure they’re handled with maximum speed and efficiency. If a server goes down, the load balancer adjusts — routing and distributing traffic to the other servers.

With chaos engineering, you can test your load balancer’s settings to see if they’re optimal for reducing outages. You can run an experiment where you deregister a target from your load balancer’s target group and observe what happens. Will traffic still be routed and distributed efficiently, or will it crash the system, preventing our procrastinating protagonist from buying his mom a timely gift?
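To make the deregistration experiment concrete, here is a minimal sketch in Python. It is a toy simulation, not a real load balancer API: the ToyLoadBalancer class, the round-robin policy, and the server names are all assumptions made up for illustration. It sends traffic, pulls one target out of rotation, and compares how requests were spread before and after.

```python
from collections import Counter
from itertools import count


class ToyLoadBalancer:
    """Round-robin balancer over a list of target names (hypothetical toy model)."""

    def __init__(self, targets):
        self.targets = list(targets)
        self._ticket = count()

    def deregister(self, target):
        # The chaos step: pull one target out of rotation.
        self.targets.remove(target)

    def route(self):
        if not self.targets:
            raise RuntimeError("no targets left to serve traffic")
        # Round-robin: hand each request to the next target in turn.
        return self.targets[next(self._ticket) % len(self.targets)]


def run_experiment(requests_before=300, requests_after=300):
    """Send traffic, deregister one target, send more traffic, and report both spreads."""
    lb = ToyLoadBalancer(["server-a", "server-b", "server-c"])
    before = Counter(lb.route() for _ in range(requests_before))
    lb.deregister("server-b")
    after = Counter(lb.route() for _ in range(requests_after))
    return before, after


if __name__ == "__main__":
    before, after = run_experiment()
    print("before:", dict(before))
    print("after: ", dict(after))
```

In this toy version the two remaining servers split the load evenly. In a real experiment, the interesting part is whether the survivors can actually absorb that extra share.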

In the real world, I like to think of this experiment like supermarket checkout lines. You’ve got your standard lines with a cashier, your 10-Items Or Less lines, and then your self-checkout station. But what happens when people don’t respect the 10-item limit or bring their week’s worth of groceries through the self-checkout?

Chaos, that’s what.


Your load balancer would be like a shift manager making sure that enough registers are up and running, reminding people to respect the 10-item limit rule, and directing longer lines to other registers to even out the wait times. If it does its job correctly, no small children will be lying down on the floor, waiting out the interminable lines.

2. Chaos Engineering Experiments — Security Groups

Security groups are essentially virtual firewalls. Their rules control the inbound traffic that’s allowed to reach your instances and the outbound traffic that’s allowed to leave them. They protect your resources by ensuring they’re only exposed to trusted resources and IP addresses. “Never trust/always verify” is a core principle of a well-managed security approach.

A great chaos engineering experiment is to swap out the security groups for a specified load balancer. What happens if a random security group sets the rules? Will non-trusted traffic still pass through? Ruh-roh. Remember, the whole point of these experiments is to find issues before they become production problems.
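Here is a rough sketch of what that swap could look like, modeled in plain Python rather than against any cloud provider's API. The Rule class, the two example groups, and the probe addresses are hypothetical. The one real-world assumption baked in is that security groups deny by default and allow traffic only when a rule matches. The script checks which untrusted probes get through under each group.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    cidr: str  # source range allowed in
    port: int  # destination port allowed


def is_allowed(rules, source_ip, port):
    """Deny by default; allow only if some rule matches both source and port."""
    addr = ip_address(source_ip)
    return any(port == r.port and addr in ip_network(r.cidr) for r in rules)


# Hypothetical groups: the one you meant to attach, and the "random" one swapped in.
TRUSTED_RANGE = ip_network("10.0.0.0/16")
INTENDED_GROUP = [Rule("10.0.0.0/16", 443)]
SWAPPED_GROUP = [Rule("0.0.0.0/0", 443), Rule("0.0.0.0/0", 22)]

PROBES = [
    ("10.0.4.7", 443),     # trusted internal client
    ("203.0.113.9", 443),  # outside address hitting HTTPS
    ("203.0.113.9", 22),   # outside address knocking on SSH
]


def leaked_probes(rules):
    """Return the probes from outside the trusted range that the rules let through."""
    return [
        (ip, port)
        for ip, port in PROBES
        if ip_address(ip) not in TRUSTED_RANGE and is_allowed(rules, ip, port)
    ]


if __name__ == "__main__":
    print("intended group leaks:", leaked_probes(INTENDED_GROUP))
    print("swapped group leaks: ", leaked_probes(SWAPPED_GROUP))
```

An empty list for the intended group and a non-empty one for the swapped group is exactly the kind of gap this experiment is meant to surface.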

In the real world, I think of this experiment as TSA at the airport. The agents are almost always professional and courteous, and they do the best they can. But what would happen if you swapped these trained professionals for random folks off the street? You could end up seated next to a passenger who snuck their carnivorous house pet through security.

orange and black tiger emerging from darkness — photo created by subinpumsom — www.freepik.com

Tigers may be all the rage right now, but I don’t want to sit next to one on a cross-country flight. They never respect armrest etiquette.

A trained TSA agent isn’t going to let an animal that violates the agency’s guidelines onto the flight, let alone an apex predator. When your chaos engineering experiments expose similar security group gaps, you can work to mitigate them. Be cool cats and kittens and make sure your security groups are doing their jobs, too.

3. Chaos Engineering Experiments — CPU Spikes

Sometimes your local machine is going to run slowly, like when you miss your morning cup of coffee. There can be any number of reasons for the lag, but prolonged slowness often points to a CPU hog: a process stuck somewhere that keeps other programs from running properly. Maybe you opened up that phishing link against your better judgement (attractive singles in my area want to meet me?!), or maybe your bored kids borrowed your laptop and downloaded every game, show, and movie they could find. Whatever happened, your machine is taking F…O…R…E…V…E…R…to load.

You can run a chaos engineering experiment to force a CPU spike and see how well different apps on your local machine function under the stress. You can even customize the spike percentages to reflect varying degrees of spikiness. It’s a great way to test your system’s resiliency and find your thresholds for handling volume. You can find out the breaking point between acceptable performance and seriously considering taking your machine to a witch doctor to exorcise the demons inside.
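If you want a feel for what a customizable spike looks like, here is a small sketch using only Python's standard library. The burn_cpu function and its parameters are hypothetical names, not part of any chaos engineering tool. It alternates a busy loop with short sleeps so that a single core sits at roughly the requested percentage. Treat the percentage as approximate: the operating system's scheduler gets a vote, and only one core is loaded.

```python
import time


def burn_cpu(percent, duration_s, slice_s=0.1):
    """Keep one core roughly `percent` busy for about `duration_s` seconds.

    Returns the total seconds spent spinning in the busy loop.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    busy_per_slice = slice_s * percent / 100
    deadline = time.perf_counter() + duration_s
    total_busy = 0.0
    while time.perf_counter() < deadline:
        slice_start = time.perf_counter()
        # Spin until this slice's busy budget is used up.
        while time.perf_counter() - slice_start < busy_per_slice:
            pass
        total_busy += time.perf_counter() - slice_start
        # Rest for the remainder of the slice so the average lands near `percent`.
        time.sleep(max(0.0, slice_s - (time.perf_counter() - slice_start)))
    return total_busy


if __name__ == "__main__":
    for level in (25, 75):
        start = time.perf_counter()
        busy = burn_cpu(level, duration_s=0.5)
        wall = time.perf_counter() - start
        print(f"target {level}% -> measured about {100 * busy / wall:.0f}% of one core")
```

Run it while using your other apps, raise the percentage step by step, and note where things start to feel sluggish. That is your threshold.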

I like to think of CPU spike experiments as being the beginning of a new month when hundreds of new shows and movies hit your favorite streaming services all at once. You want to watch them all, and then you just kind of freak out because there are so many options to choose from. Better Call Saul! Killing Eve! The Good Place! Ozark!

Aaaaggggghhhhh!!!!! Can you handle the binge-watching overload, or will your dishes, laundry, and unpaid bills pile up past the point of no return?

4. Chaos Engineering Experiments — Drain Nodes

Okay, so draining nodes sounds kind of gross, but don’t worry: you don’t have to get a dermatologist involved. All this really means is that you’re evicting pods from a node in Kubernetes. Draining first marks the node as unschedulable so that no new pods land on it, and then evicts any pods currently running there.

In Kubernetes terms, pods represent running processes on the nodes in the cluster. You always want to allow them to terminate gracefully when they’re no longer needed — giving you a chance to clean them up.

You can run chaos engineering experiments to identify either specific or random nodes to drain. You can also set parameters, like how many seconds you want to wait for the nodes to drain or how many random nodes should be affected. So what will happen if your nodes drain without you cleaning them up first? The container orchestration struggle is real.
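The drain experiment can be sketched as a toy cluster in Python. Nothing here talks to a real Kubernetes API: the Node class, the capacity numbers, and the pod names are hypothetical, and the wait-time parameter is left out to keep things short. The sketch cordons a node, evicts its pods, and tries to place each one on the least loaded node that still has room, reporting any pod that ends up with nowhere to go.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity: int
    pods: list = field(default_factory=list)
    schedulable: bool = True


def drain(cluster, target):
    """Cordon `target`, evict its pods, and reschedule them elsewhere.

    Returns the names of pods that could not be placed on any other node.
    """
    target.schedulable = False  # cordon: no new pods land here
    evicted, target.pods = target.pods, []
    stranded = []
    for pod in evicted:
        candidates = [n for n in cluster if n.schedulable and len(n.pods) < n.capacity]
        if candidates:
            # Place the pod on the least loaded node that still has room.
            min(candidates, key=lambda n: len(n.pods)).pods.append(pod)
        else:
            stranded.append(pod)
    return stranded


def drain_random_nodes(cluster, how_many, seed=None):
    """Drain `how_many` randomly chosen nodes and collect every stranded pod."""
    rng = random.Random(seed)
    stranded = []
    for node in rng.sample(cluster, how_many):
        stranded += drain(cluster, node)
    return stranded


def build_cluster():
    return [
        Node("node-1", 3, ["web-1", "web-2"]),
        Node("node-2", 3, ["api-1"]),
        Node("node-3", 3, ["db-1", "cache-1"]),
    ]


if __name__ == "__main__":
    cluster = build_cluster()
    print("drain node-1, stranded:", drain(cluster, cluster[0]))
    print("drain node-2, stranded:", drain(cluster, cluster[1]))
```

Draining one node is uneventful here, but draining a second leaves pods with nowhere to run. That is the kind of capacity gap you would rather find in an experiment than in production.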

I think of this experiment like a terrible roommate. You know, the one who’s always late on rent (but never late to go out for drinks), eats your food, clogs your toilet, and never does dishes. She seemed so fun and refreshing at first, but now she’s ruining your life.

While you’re definitely ready to move on from this roommate, you want to do it in a way that won’t cause any new problems. A graceful eviction here means providing notice, and following all the legal guidelines and best practices for kicking someone out of your flat. You don’t need any more drama in your life. That’s true for containers, too — drain your nodes, and keep your containers running in harmony.

***

While we’ve all experienced quite enough chaos in our daily lives these days, injecting some chaos into your software development in a careful and controlled environment is just good practice. Figure out where the stressors and inflection points await, and then work to mitigate them before they turn into production incidents. Your moms will be so proud of your initiative and preparation. And isn’t that really the best gift you can give them? Still, maybe get that necklace a little earlier next year, just in case.

DISCLOSURE STATEMENT: © 2020 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.