Large-scale distributed software systems are composed of several individual sub-systems-such as CDNs, load balancers, and databases-and their interactions. These interactions sometimes have unpredictable outcomes caused by unforeseen turbulent events (for example, a network failure). These events can lead to system-wide failures.
Chaos engineering is the discipline of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent events. Chaos engineering requires adopting practices to identify interactions in distributed systems and related failures proactively, and also needs implementing and validating countermeasures. The key to chaos engineering is injecting failure in a controlled manner.
In this post, we present a simple approach for fault injection in systems utilizing Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS), and its integration with a load-testing suite to validate the countermeasures put in place to prevent dependency and resource exhaustion failures. A typical chaos experiment could be generating baseline load (traffic) against the system, adding latency to all network calls to the underlying database, and then validating timeouts and retries. We will explain how to inject such failure (addition of latency to database calls), why validating countermeasures (timeouts and retries) under load is essential, and how to execute it in an Amazon EC2-based system. …
First of all, I would like to thank everyone in the AWS Community in Australia and New Zealand and the AWS Heroes who have helped put this event together.
A little reminder that if you plan to share your day with others on social media, please use the hashtag:
A few years ago, I did a talk called ten lessons from ten years on AWS.
That was at the Community Day in Bangalore, India. Back then, there wasn’t any COVID virus, so I traveled there — and I loved it. …
I’d like to express my gratitude to my colleague and friend Arni Birgisson for his valuable feedback.
Since I published my blog series Towards Operational Excellence, I received a relatively large amount of feedback. But one question, in particular, stood out.
“Can you share an incident postmortem template?”
In this blog post, I will share an example incident postmortem template, which I hope will help you get started. I will also share some DOs and DON’Ts that I have seen work across a wide variety of customers — both internally in Amazon, and externally.
A postmortem is a process where a team reflects on a problem — for example, an unexpected loss of redundancy, or perhaps a failed software deployment — and documents what the problem was and how to avoid it in the future. …