Building resilient services at Prime Video with chaos engineering

Originally published at https://aws.amazon.com on August 18, 2020 by Varun Jewalikar and Adrian Hornsby

Adrian Hornsby
The Cloud Architect
12 min read · Aug 18, 2020


Large-scale distributed software systems are composed of several individual subsystems, such as CDNs, load balancers, and databases, and their interactions. These interactions sometimes have unpredictable outcomes caused by unforeseen turbulent events (for example, a network failure). These events can lead to system-wide failures.

Chaos engineering is the discipline of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent events. It requires adopting practices that proactively identify interactions in distributed systems and the failures they can cause, and then implementing and validating countermeasures against those failures. The key to chaos engineering is injecting failure in a controlled manner.
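To make "injecting failure in a controlled manner" concrete, here is a minimal Python sketch. It is not the approach used at Prime Video, and every name in it (`FaultInjector`, `InjectedFault`, `fetch_title`, `fetch_title_with_fallback`) is hypothetical. It assumes a fault can be modeled as an exception raised in place of a dependency call, with a bounded failure rate and a kill switch keeping the experiment under control.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


class FaultInjector:
    """Wraps a call and fails a configurable fraction of invocations."""

    def __init__(self, failure_rate, enabled=True, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        # Kill switch: set to False to stop injecting faults immediately.
        self.enabled = enabled
        # A seeded generator lets the same experiment be replayed.
        self._rng = random.Random(seed)

    def call(self, func, *args, **kwargs):
        # Inject a failure only while the experiment is enabled.
        if self.enabled and self._rng.random() < self.failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)


def fetch_title(video_id):
    # Hypothetical stand-in for a call to a downstream dependency.
    return f"title-for-{video_id}"


def fetch_title_with_fallback(injector, video_id):
    # Countermeasure under test: degrade gracefully instead of failing.
    try:
        return injector.call(fetch_title, video_id)
    except InjectedFault:
        return "title-unavailable"
```

With `FaultInjector(failure_rate=0.1, seed=42)`, roughly one call in ten would hit the fallback path. Setting `enabled = False` ends the experiment without redeploying, which is the kind of control that keeps a chaos experiment from becoming an outage.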

In this post, we present a simple approach for fault injection in systems utilizing Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS), and its integration with a load-testing suite to validate the countermeasures put in place to prevent dependency and resource exhaustion failures. A typical chaos experiment could be generating baseline…

