Happy Chaotic times at the LEGO Group
Learn how we, at the Data Engineering and Management department, apply chaos engineering techniques to discover and overcome turbulent conditions in our cloud-based systems.
Before we start disrupting the whole service… what exactly is chaos engineering?
Every company out there wants to provide its users with reliable systems. However, many factors can affect this reliability, so we need to establish evidence of how resilient our systems are when unexpected, yet inevitable, conditions occur.
Chaos engineering [1] is a scientific discipline whose goal is to surface evidence of a system's weaknesses before they become critical, i.e. before they negatively impact our users and/or customers. Through experimentation, we gain useful insights into our systems' behavior under different types of turbulent or extreme conditions.
In today's world, many development teams rely on unit and integration testing. However, both are usually applied under ideal circumstances, not in production environments. On top of that, any complex system accumulates dark debt: instabilities that threaten its reliability, which developers have no real way of identifying in advance or predicting how they will behave at any point in time. Chaos engineering provides a way of exploring this dark debt, to find out whether your prior assumptions about the system's resiliency hold in the real world.
One last thing before we dig into our experiments, and it is extremely important to remember: dark debt can be present anywhere in your system, from the platform hosting it and the code itself, to the people, practices and processes used to deliver it.
Dark Debt… here we come!
At the time of writing this post, we have implemented and applied three chaos experiments to ensure our systems’ reliability:
- DDoS attack against our AWS API Gateways.
- Termination of random ECS Fargate tasks.
- Various sorts of chaos injection into AWS Lambda functions.
These experiments have been used to identify possible flaws and improvement points in one of our projects: an AWS-based service that uses machine learning algorithms to moderate images, text and video files, ensuring that only child-friendly content is published in our social media (SoMe) application, LEGO® Life.
To implement and run all the experiments detailed in this blog post, we use the Chaos Toolkit API.
Experiment 1: DDoS attack on API Gateway
In this first experiment, we want to ensure that the service we provide is always reachable by our internal customers, independently of demand and workload (auto-scaling), but also in the case of illegitimate use (i.e. an attack by unauthorized users).
To ensure this, we first define a steady-state hypothesis: what we believe the system should look like under normal conditions. In this experiment, we define it as follows: under any circumstances, our AI Moderation Service's API Gateway can receive and correctly process calls from our customers.
To verify this, we can check our CloudWatch metrics and see that our API Gateway is responding as expected: it is being invoked without raising any 4XX or 5XX errors.
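In the Chaos Toolkit experiment file, such a hypothesis could be declared roughly as follows. This is a simplified sketch, and the health-check URL is a placeholder rather than our actual endpoint:

```json
"steady-state-hypothesis": {
  "title": "The AI Moderation Service API Gateway answers correctly",
  "probes": [
    {
      "type": "probe",
      "name": "api-gateway-responds-with-200",
      "tolerance": 200,
      "provider": {
        "type": "http",
        "url": "https://example.execute-api.eu-west-1.amazonaws.com/prod/health",
        "method": "GET",
        "timeout": 3
      }
    }
  ]
}
```

When the tolerance is an integer, the Chaos Toolkit compares it against the HTTP status code returned by the probe.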
Now, onto the fun part — breaking stuff, or at least trying to :)
Using the Chaos Toolkit, we define the experiment.json shown below. The experiment starts by verifying that the system is running normally (the steady-state hypothesis), followed by three locust attacks with the following setup:
- The first locust attack runs for 2 minutes with up to 10 users.
- The second locust attack runs for 3 minutes with up to 50 users.
- The third, and last, locust attack runs for 4 minutes with up to 100 users.
All these users effectively perform the same operation: invoking our API Gateway either with a malformed payload or with an unauthorized token.
The experiment takes around 10 minutes to run, and finishes by verifying the steady-state hypothesis again to validate that the service is still operational. In the production environment, the team validates that the service is running by simply looking at the flow of invocations.
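The method section of such an experiment.json could look roughly like the sketch below, where each attack is launched through the Chaos Toolkit process provider and a recent Locust CLI. The locustfile name and host URL are placeholders:

```json
"method": [
  {
    "type": "action",
    "name": "locust-attack-2min-10-users",
    "provider": {
      "type": "process",
      "path": "locust",
      "arguments": "-f unauthorized_calls.py --headless -u 10 -r 5 --run-time 2m --host https://example.execute-api.eu-west-1.amazonaws.com"
    }
  },
  {
    "type": "action",
    "name": "locust-attack-3min-50-users",
    "provider": {
      "type": "process",
      "path": "locust",
      "arguments": "-f unauthorized_calls.py --headless -u 50 -r 10 --run-time 3m --host https://example.execute-api.eu-west-1.amazonaws.com"
    }
  },
  {
    "type": "action",
    "name": "locust-attack-4min-100-users",
    "provider": {
      "type": "process",
      "path": "locust",
      "arguments": "-f unauthorized_calls.py --headless -u 100 -r 20 --run-time 4m --host https://example.execute-api.eu-west-1.amazonaws.com"
    }
  }
]
```

Combined with the steady-state hypothesis shown earlier (which the Chaos Toolkit verifies both before and after the method), the whole experiment is executed with `chaos run experiment.json`.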
The results of running this experiment in our production environment are reflected in the CloudWatch metrics below, with more than 2,000 unlawful invocations of our API Gateway:
Needless to say, this experiment provided us with great insights, and taught us, before it was too late, to set up CloudWatch metric alarms so we can react to such attacks. At the current stage, we are not aware of any solution provided by AWS to defend API Gateways from attacks of this kind.
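As an illustration, such an alarm could be created with a few lines of boto3. The alarm name, API name and SNS topic ARN below are placeholders, not our actual setup:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Raise an alarm when the API Gateway returns more than 100 4XX errors
# within a 5-minute window, and notify the team through an SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="ai-moderation-api-4xx-spike",
    Namespace="AWS/ApiGateway",
    MetricName="4XXError",
    Dimensions=[{"Name": "ApiName", "Value": "ai-moderation-service"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:chaos-alerts"],
)
```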
On a fun note, a bug was also discovered (yes, a bug, even after unit and integration testing!) that was causing a 5XX error where a 401 should have been returned.
The final step is to check the CloudWatch metrics after 30 minutes have passed. As we can see in the image below, the AI Moderation Service is back to normal, and handling further customer requests.
Experiment 2: Random failure injection in ECS
In our second experiment, we tackle the core component of the AI Moderation Service: the machine learning Fargate cluster. To put things into perspective, several tasks run in this cluster, each hosting a specific machine learning algorithm. One of the tasks could, for example, receive an image, process it, and identify whether any people appear in it, in which case the task should return a "person detected" response.
As an additional non-functional requirement, the AI Moderation Service is supposed to return a moderation response to the client in less than 5 minutes.
The steady-state hypothesis for this experiment extends the previous one (the service should always be available) with the requirement that all ML Fargate tasks are always running to provide value to our customers.
The chaos experiment is shown below. The first action stops a random ECS task using the Chaos Toolkit AWS plugin. The experiment then waits 150 seconds before setting the desiredTaskCount of an ECS service to 0, which effectively means that this ML service will not be operational. Next, the experiment waits for 1 hour, and finally checks that all tasks are up and running as normal.
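A condensed sketch of this experiment's method, assuming the ECS actions and probes exposed by the chaostoolkit-aws plugin (the cluster and service names are placeholders, and the exact function names may vary between plugin versions):

```json
"method": [
  {
    "type": "action",
    "name": "stop-a-random-ml-task",
    "provider": {
      "type": "python",
      "module": "chaosaws.ecs.actions",
      "func": "stop_task",
      "arguments": {
        "cluster": "ml-fargate-cluster",
        "service": "person-detection",
        "reason": "Chaos testing"
      }
    },
    "pauses": { "after": 150 }
  },
  {
    "type": "action",
    "name": "scale-ml-service-down-to-zero",
    "provider": {
      "type": "python",
      "module": "chaosaws.ecs.actions",
      "func": "update_desired_count",
      "arguments": {
        "cluster": "ml-fargate-cluster",
        "service": "person-detection",
        "desired_count": 0
      }
    },
    "pauses": { "after": 3600 }
  },
  {
    "type": "probe",
    "name": "all-ml-tasks-are-running-again",
    "tolerance": true,
    "provider": {
      "type": "python",
      "module": "chaosaws.ecs.probes",
      "func": "are_all_desired_tasks_running",
      "arguments": {
        "cluster": "ml-fargate-cluster",
        "service": "person-detection"
      }
    }
  }
]
```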
The results from this experiment once again provided us with valuable knowledge, which we have since applied to our pipeline:
- When the task was stopped, the ECS service in charge of it instantly relaunched a new one; the only delay was the time taken to fetch the container and run it.
- When the desiredTaskCount was set to 0, the alarm for that ECS service was raised, which triggered the auto-scaling policy attached to it and automatically returned the desiredTaskCount to 1.
With these findings in mind, we decided to integrate an extra component that ensures all inputs are processed within a 3-minute window. If an ECS task fails, the input content simply skips that ML Fargate task and continues down the pipeline with the other responses instead. This ensures our customers always receive a response within the 5-minute requirement set by our internal customers.
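A rough sketch of that idea, assuming each ML task's verdict arrives as a future-like object collected by the pipeline (all names here are hypothetical):

```python
import concurrent.futures

# Hypothetical deadline: every input must be processed within 3 minutes,
# leaving comfortable headroom for the 5-minute response requirement.
RESPONSE_DEADLINE_SECONDS = 180


def collect_moderation_results(futures_by_model: dict) -> dict:
    """Gather the responses that arrived in time and skip the rest."""
    done, _ = concurrent.futures.wait(
        futures_by_model.values(), timeout=RESPONSE_DEADLINE_SECONDS
    )
    results = {}
    for model_name, future in futures_by_model.items():
        if future in done and future.exception() is None:
            results[model_name] = future.result()
        else:
            # The ECS task failed or timed out: skip it and continue the
            # pipeline with the responses from the other models.
            results[model_name] = {"status": "skipped"}
    return results
```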
Experiment 3: Random failure injection in AWS Lambda
The third and final experiment targets the components of our service that run on AWS Lambda. In the AI Moderation Service, Lambda functions are widely used, as they provide excellent functionality at a very low cost and can easily be deployed using tools like AWS SAM.
For this, we implemented a custom Python package containing a function wrapper that is attached to every single Lambda handler in our service. At each Lambda execution, the wrapper validates the given chaos configuration and applies it (if relevant). The general idea and implementation were developed by Adrian Hornsby in his repository, and a simplified sketch of the wrapper is shown further below.
Once the wrapper is added to our functions, the experiment can be executed. First, we verify that the initial chaos configuration is empty. Then, the chaos engineer sets the configuration parameters they would like to apply. This configuration can include:
- Add a delay to the Lambda execution.
- Force the Lambda function to return a different status code (e.g. 400 instead of 200).
- Force an exception to be raised during the Lambda execution.
The chaos configuration can include none, one, or all of the above in the same call. Each of these events then occurs on a given invocation with the probability specified for the experiment.
After the experiment is finished, the chaos configuration is emptied again.
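Our package is internal, but a minimal sketch of such a wrapper, loosely inspired by Adrian Hornsby's implementation, could look like the snippet below. Reading the configuration from an environment variable is a simplification, and all field names are our own assumptions:

```python
import json
import os
import random
import time
from functools import wraps


def _load_chaos_config() -> dict:
    """Load the chaos configuration; an empty config means 'no chaos'."""
    # Simplification: a real setup would more likely fetch this from a
    # parameter store so it can be changed without redeploying.
    return json.loads(os.environ.get("CHAOS_CONFIG", "{}"))


def inject_chaos(handler):
    """Wrap a Lambda handler and apply the configured chaos, if any."""

    @wraps(handler)
    def wrapper(event, context):
        config = _load_chaos_config()
        if not config or random.random() > config.get("probability", 0.0):
            return handler(event, context)

        # 1. Add a delay to the Lambda execution.
        time.sleep(config.get("delay_seconds", 0))

        # 2. Optionally raise an exception instead of running the handler.
        if config.get("exception_msg"):
            raise Exception(config["exception_msg"])

        response = handler(event, context)

        # 3. Optionally force a different status code on the response.
        if config.get("status_code"):
            response["statusCode"] = config["status_code"]
        return response

    return wrapper


# Every Lambda handler in the service is decorated with the wrapper.
@inject_chaos
def lambda_handler(event, context):
    return {"statusCode": 200, "body": "moderated"}
```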
This last experiment raised many errors when we added a 1-second delay to the duration of our Lambda functions. Of course, we could just increase their timeouts, but we also opted to include better retry functionality to handle the rare occasions on which these events occur. Other improvements driven by this experiment were, once again, additional CloudWatch metric alarms, together with the integration of a dead-letter queue that provides us with more details on what actually went wrong.
This is, for now, the chaos setup we have implemented and actively use in our department. As future work, we want to extend the number and reach of our experiments, as well as research and implement new chaos techniques for resiliency testing of machine learning algorithms and processes.
Do you apply chaos engineering in your work? Maybe the same/similar experiments as the ones here? We’d love to connect and hear your feedback on how this is helping your team/department/company!
About me
Hi! My name is Francesc, and I’m an applied machine learning engineer working in the Data Engineering and Management Department under the Digital Technology area at the LEGO Group.
Our mission is to pioneer and deliver the capabilities to enable and automate data-driven insights and data products, such as the moderation service shown in this post, recommendation engines for www.LEGO.com, and many more.
References
[1]: R. Miles, Learning Chaos Engineering. O’Reilly Media Inc., August 2019. ISBN: 9781492051008.