Exploring Spring Boot resiliency on AWS EKS

The power of great discoverers at your fingertips

Over the past few months, Russ Miles and I have been exploring various facets of Chaos Engineering, from the principles of the practice to concrete examples such as improving your operational workflow by introducing controlled perturbations into your system. This summer, we will be introducing exciting new features that we look forward to seeing our users play with.

Until then, I’d like to continue on the trail of examples I went through previously. This time, though, we’ll run them against AWS EKS, the managed Kubernetes offering from AWS, using a Spring Boot application. Both solutions are fantastic: Spring Boot is extremely common as an application framework, and while EKS is fairly new on the managed Kubernetes scene, there is no doubt it will rapidly become a key player.

Setting the Scene

I wrote two Chaos Toolkit experiments to showcase two aspects of resiliency you would want to explore via Chaos Engineering.

The first one looks at how the application performs when latency is introduced between two services communicating over HTTP. This is achieved through the excellent Chaos Monkey for Spring Boot, via the corresponding Chaos Toolkit driver.

The second experiment explores how your system reacts when a whole AWS EC2 instance goes down, removing a Kubernetes worker node from the cluster along with it. This is achieved through the AWS API via the Chaos Toolkit driver for AWS. For good measure, we also use the driver for Kubernetes to probe the system along the way.

The system consists of two Spring Boot microservices conversing over HTTP: the frontend service simply calls the backend service to compute something based on the user’s input.

Those two services live on Kubernetes in an EKS-managed cluster. For this story, we use the awesome eksctl, sponsored by Weaveworks, to create our cluster easily.

The entire code (application, manifests…) can be found here.

What’s the impact of latency between two services?

As mentioned above, the system is basic and consists of two Spring Boot applications: frontend and backend. Whenever a user hits the frontend, it makes a call to the backend and returns the result to the user.

However, one aspect of this system is critical (from a business point of view as well as a technical one): the response from the backend must come back in under one second. So, while we feel confident the system should consistently be that fast, there is a risk it may be slower from time to time. Chaos Engineering is a perfect way of exploring what could happen in that case.

Our experiment is fairly basic to start with but it shows the idea behind the Chaos Toolkit and our flow.

As you can see, we start by talking to the frontend application and expect it to respond. Then we enable the Spring Chaos Monkey, embedded in the backend application itself, and ask it to add some latency to its network exchanges.

"method": [
{
"name": "enable_chaosmonkey",
"type": "action",
"provider": {
"func": "enable_chaosmonkey",
"module": "chaosspring.actions",
"type": "python",
"arguments": {
"base_url": "${base_url}/backend/actuator"
}
}
},
{
"name": "configure_assaults",
"type": "action",
"provider": {
"func": "change_assaults_configuration",
"module": "chaosspring.actions",
"type": "python",
"arguments": {
"base_url": "${base_url}/backend/actuator",
"assaults_configuration": {
"level": 1,
"latencyRangeStart": 10000,
"latencyRangeEnd": 10000,
"latencyActive": true,
"exceptionsActive": false,
"killApplicationActive": false,
"restartApplicationActive": false
}
}
}
}
],
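
Both activities reference a base_url variable, which the Chaos Toolkit substitutes from the experiment’s configuration block. As a minimal sketch, such a block could pull the value from an environment variable (the variable name below is an assumption for illustration, not taken from the actual experiment):

"configuration": {
    "base_url": {
        "type": "env",
        "key": "APP_BASE_URL"
    }
},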

Finally, we simply call the frontend again, which in this case tells us it went over budget and failed to meet the one-second tolerance we had set up for it. The way we do this here is by setting a timeout on the call from the frontend to the backend.

"steady-state-hypothesis": {
"title": "We can multiply two numbers under a second",
"probes": [
{
"name": "app-must-respond",
"type": "probe",
"tolerance": {
"type": "regex",
"pattern": "^[0-9]*$",
"target": "body"
},
"provider": {
"type": "http",
"url": "${base_url}/multiply?a=6&b=7"
}
}
]
}

The tolerance validates that the response is simply a number. It succeeds when we first call the frontend, before we introduce latency. Once the latency is in place, the slow backend triggers the timeout on the frontend’s call, so the frontend returns an error message instead, which doesn’t pass the tolerance validation.

Through this simple(istic) application-level experiment, we are made aware of the consequences of a slow backend response. Obviously, in a richer microservices system, this could have dramatic ripple effects, or even cause cascading failures that are difficult to debug after they’ve hit our users. Better to trigger those conditions ourselves and observe their impact.
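
To leave the backend in a clean state once the experiment completes, the assaults can be switched off again from the experiment’s rollbacks section. Here is a minimal sketch, assuming the chaosspring driver exposes a disable_chaosmonkey action mirroring the enable_chaosmonkey one used above:

"rollbacks": [
    {
        "name": "disable_chaosmonkey",
        "type": "action",
        "provider": {
            "func": "disable_chaosmonkey",
            "module": "chaosspring.actions",
            "type": "python",
            "arguments": {
                "base_url": "${base_url}/backend/actuator"
            }
        }
    }
]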

Can we sustain the loss of an EKS node?

The previous experiment targeted our application, but we can obviously learn just as much from underneath it, by exploring degraded conditions in our infrastructure.

For instance, do we know whether our service remains available during the loss of a node? Again, Chaos Engineering gives you the tools to explore such a scenario and get familiar with its potentially dire consequences.

We use the same system as above but, this time, our experiment hits the AWS infrastructure itself by stopping an EC2 instance running one of our EKS worker nodes. Obviously, this means a reduction in capacity, but does it mean a loss of availability?

During the experiment’s method, we stop an EC2 instance from the EKS worker pool at random.
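
This is done with the chaosaws driver’s stop_instance action, which picks a random instance when no instance id is given but an availability zone is. Here is a minimal sketch of what that activity could look like; the availability zone value is an assumption for illustration:

{
    "name": "stop-random-worker-node",
    "type": "action",
    "provider": {
        "func": "stop_instance",
        "module": "chaosaws.ec2.actions",
        "type": "python",
        "arguments": {
            "az": "eu-west-1a"
        }
    }
}

Depending on the driver version, a filters argument can also be passed to narrow the candidates down to the cluster’s worker nodes only.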

Then, we replay our hypothesis that the service should remain in good shape. Lucky us! Even though we lost a node, we were able to keep the application available. Looking at the Chaos Toolkit logs, we can see:

[2018-07-09 15:22:58 INFO] Action: count_backend_pods
[2018-07-09 15:22:58 DEBUG] Found 1 pods matching label
[2018-07-09 15:22:58 INFO] Action: count_frontend_pods
[2018-07-09 15:22:59 DEBUG] Found 1 pods matching label 'app=frontend-app'
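
The count_backend_pods and count_frontend_pods entries above rely on the Kubernetes driver’s count_pods probe. A minimal sketch of such an activity, assuming the pods carry the app=frontend-app label shown in the log output, could look like this:

{
    "name": "count_frontend_pods",
    "type": "probe",
    "provider": {
        "func": "count_pods",
        "module": "chaosk8s.pod.probes",
        "type": "python",
        "arguments": {
            "label_selector": "app=frontend-app"
        }
    }
}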

Notice, however, that with this sort of experiment your results may vary and your experiment may fail. We are running a single instance of our application, so if the stopped node is the one where it runs, Kubernetes will reschedule it onto another node, which takes time. Still, the approach remains the same, and hopefully you get the idea of the learning loop here. It also highlights the need to run chaos experiments continuously, since your system is not static. The Chaos Toolkit is all about automation, so that’s quite handy.

Continue exploring!

I hope these two experiments give you a further feel for the Chaos Toolkit flow and how it can improve your understanding of your system. Richer experiments can be created, shared, and collaborated on with your team for a healthy dose of familiarity with adverse conditions.

Please, feel free to join us on the Chaos Toolkit Slack workspace. We would love your feedback to make the toolkit an even more delightful tool that really enables automated chaos engineering for everyone!