Bridging the Gap — Kubernetes on AWS & HTTP 5xx

Eytan Avisror
keikoproj
Dec 2, 2019 · 5 min read

Kubernetes is awesome, no doubt, but sometimes the lack of awareness between cloud resources and Kubernetes can lead applications to experience HTTP 504/502 errors for infrastructure-related reasons.

As part of building the Modern SaaS developer platform at Intuit, we faced exactly that — and we hope to share the solution we implemented across our fleet of 160+ Kubernetes clusters, which are running 5xx-free today.

Why is this even happening?

Before trying to solve this problem, we should first look at the why.

Outside of application-related 5xx, why exactly were our applications getting 502/504 errors for no obvious reason?

  1. Node-level termination events

There are many things that can cause nodes to be terminated: autoscaling events such as “AZRebalance”, manual changes to an autoscaling group’s desired capacity, a failing EC2 health check, or any component you may have developed that requests termination of an EC2 instance/Kubernetes node.

The problem starts with this: AWS Autoscaling doesn’t know anything about the underlying Kubernetes nodes and resources, and as a result can terminate instances without first trying to drain node objects.

Beyond draining the node of any pods that may be running, there may also be a need to drain ELB/ALB targets: if you are using kube-proxy to forward traffic between nodes along with alb-ingress-controller-managed target groups, you still need to de-register those instances from all target groups and classic ELBs before you proceed with the EC2 termination.

According to AWS documentation, failing to drain an ALB target before terminating the EC2 instance almost always results in some dropped requests.
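To make the deregistration step concrete, here is a minimal sketch (Python/boto3, purely illustrative; the instance ID is a placeholder) of removing a node from every target group it is registered in and waiting for connection draining to finish:

```python
# Sketch: deregister an instance from all ALB/NLB target groups and wait for
# deregistration (connection draining) to finish before terminating the EC2
# instance. Assumes boto3 credentials/region are configured; IDs are placeholders.
# Classic ELBs would be handled separately via the "elb" client's
# deregister_instances_from_load_balancer call.
import boto3

elbv2 = boto3.client("elbv2")
instance_id = "i-0123456789abcdef0"  # hypothetical node being terminated

paginator = elbv2.get_paginator("describe_target_groups")
for page in paginator.paginate():
    for tg in page["TargetGroups"]:
        arn = tg["TargetGroupArn"]
        health = elbv2.describe_target_health(TargetGroupArn=arn)
        registered = {t["Target"]["Id"] for t in health["TargetHealthDescriptions"]}
        if instance_id not in registered:
            continue

        # Trigger deregistration; the target group's deregistration delay
        # gives in-flight requests time to complete.
        elbv2.deregister_targets(TargetGroupArn=arn,
                                 Targets=[{"Id": instance_id}])

        # Block until the target is fully deregistered.
        waiter = elbv2.get_waiter("target_deregistered")
        waiter.wait(TargetGroupArn=arn, Targets=[{"Id": instance_id}])
```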

2. Pod-level termination events

Once a pod is terminated due to eviction, scale-down, or even a new deployment rolling out, it goes through a termination process which is captured here in great detail. In essence, when a pod is terminated, a SIGTERM (15) is sent to the main process of each container and, in parallel, new TCP connections stop being established; once the grace period expires, a SIGKILL (9) is sent to actually kill the running processes and the pod deletion continues. The problem is that in the window between SIGTERM and SIGKILL, it is up to the application to gracefully drain and close the already-established connections, and when we fail to do this we can expect 502/504 errors for any pod termination that occurs. Pod churn is generally much more frequent than node churn, so this may be the first thing you should fix.
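As a minimal illustration of the “graceful” part, an application can trap SIGTERM, stop accepting new connections, and let in-flight requests finish before exiting. The sketch below uses Python’s standard-library HTTP server purely as an example; a real service would rely on its framework’s shutdown hooks and an appropriate terminationGracePeriodSeconds:

```python
# Sketch: trap SIGTERM so in-flight requests can finish before the process
# exits; purely illustrative, most frameworks have a built-in equivalent.
import signal
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

server = HTTPServer(("0.0.0.0", 8080), Handler)

def handle_sigterm(signum, frame):
    # Stop accepting new connections and let in-flight requests complete.
    # shutdown() must run in another thread: it blocks until the
    # serve_forever() loop below has finished its current requests.
    threading.Thread(target=server.shutdown).start()

signal.signal(signal.SIGTERM, handle_sigterm)
server.serve_forever()  # returns once shutdown() is called
```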

In this story we will mainly focus on node-level termination events and how we bridged the gap between AWS and Kubernetes to solve this, but it’s important that you are aware of the pod-level failures mentioned above.

The inherent race condition of asynchronous EC2 terminations

So now that we know the two main reasons why node terminations cause 504/502 errors, what can we do about it? Initially, we tried to take matters into our own hands…

"What if we create some script under /etc/rc0.d/ that causes the node to drain before termination happens?"

OR

"What if we use AWS SSM to trigger some action right before the termination happens?"

Not only did we ask these questions, we actually tried those techniques, and they all failed miserably. Whatever we did, EC2 termination was asynchronous, and everything we tried to run was caught in an inherent race condition. Sometimes the script ran successfully before the termination happened, and other times it didn’t… until we found out about… LIFECYCLE HOOKS!!

Lifecycle hooks give us an extremely powerful ability: synchronous EC2 terminations. That means we can hold the termination event until we decide it’s appropriate to let the EC2 instance go silently into the void.
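For context, attaching such a hook to a scaling group is a single API call. A hedged boto3 sketch (all names and ARNs below are placeholders):

```python
# Sketch: attach a termination lifecycle hook to an Auto Scaling group so that
# terminating instances are held in Terminating:Wait until the hook is
# completed or the heartbeat timeout expires. Names/ARNs are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_lifecycle_hook(
    AutoScalingGroupName="my-node-group",              # hypothetical ASG name
    LifecycleHookName="drain-before-terminate",
    LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
    # Hook notifications are published to SQS for a consumer to act on.
    NotificationTargetARN="arn:aws:sqs:us-west-2:111122223333:lifecycle-queue",
    RoleARN="arn:aws:iam::111122223333:role/lifecycle-hook-role",
    HeartbeatTimeout=300,       # seconds the instance is held per heartbeat
    DefaultResult="CONTINUE",   # proceed with termination if the hook times out
)
```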

This meant we could now bridge the gap between a ‘blind’ autoscaling group and Kubernetes, and have awareness of terminating instances within the cluster! We found a few projects that use lifecycle-hooks to drain nodes, but they were pretty old and unmaintained, and didn’t address the second use case of draining load balancers.

Keiko Lifecycle-Manager

We decided to design and build a cluster service that intercepts such hooks from the scaling group via SQS and processes them: it drains the node, then drains the load balancer targets, all while sending heartbeats to extend the hook and prevent the instance from being terminated prematurely.
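Conceptually, the flow looks roughly like the sketch below. This is not the actual lifecycle-manager code (the real project is written in Go); the queue URL, the drain helpers, and the node-name lookup are placeholders:

```python
# Conceptual sketch of the flow: receive a lifecycle hook event from SQS,
# heartbeat the hook, drain the node and its targets, then complete the hook.
import json
import subprocess
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/111122223333/lifecycle-queue"

def drain_node(node_name):
    # Evict pods from the node (respecting PodDisruptionBudgets).
    subprocess.run(["kubectl", "drain", node_name, "--ignore-daemonsets"],
                   check=True)

def drain_targets(instance_id):
    # Deregister the instance from ALB/ELB target groups and wait
    # (see the earlier deregistration sketch).
    pass

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        if event.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
            continue

        hook_args = dict(
            LifecycleHookName=event["LifecycleHookName"],
            AutoScalingGroupName=event["AutoScalingGroupName"],
            InstanceId=event["EC2InstanceId"],
        )
        # Extend the wait period so the hook doesn't time out mid-drain.
        autoscaling.record_lifecycle_action_heartbeat(**hook_args)

        drain_node(node_name_for(event["EC2InstanceId"]))  # hypothetical lookup
        drain_targets(event["EC2InstanceId"])

        # Release the instance for termination.
        autoscaling.complete_lifecycle_action(LifecycleActionResult="CONTINUE",
                                              **hook_args)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```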

Furthermore, we open-sourced lifecycle-manager so anyone in the community can use it to solve these issues in their Kubernetes infrastructure running on AWS.

Lifecycle-manager is designed to sustain a high load of concurrent termination events; we use it on very large clusters with hundreds of nodes and dozens of target groups.

Proving it works

The interesting use case we wanted to verify was that the 5xx we were seeing were caused by ALB targets not being drained, and that lifecycle-manager eliminates this problem. This was a pretty bad problem because it meant one tenant’s scaling activities, such as scaling down or rebalancing, were causing 504s for other tenants.

The reason this was happening is that kube-proxy was still holding connections from the ALBs to the nodes/pods, and because those targets were not being drained properly, in-flight requests were occasionally dropped.

Test Scenario — A sample application runs on dedicated nodes and sustains roughly 200 TPS of traffic, while a separate scaling group (whose nodes are running nothing) has its nodes randomly terminated via the terminate-instance-in-auto-scaling-group API.

  • The red line indicates instances of 504 errors
  • The orange line indicates terminating EC2 instances
  • The blue line indicates HTTP requests
Test results without lifecycle-manager (5xx visible)
Results with lifecycle-manager DISABLED:

Total Requests: 482941 (+/- 150 TPS x 45 minutes)
ELB HTTP 504: 137 (0.028%)
Test results with lifecycle-manager (no 5xx)
Results with lifecycle-manager ENABLED:

Total Requests: 482823 (+/- 150 TPS x 45 minutes)
ELB HTTP 504: 0

Conclusions

  1. Always use terminate-instance-in-auto-scaling-group instead of terminate-instances (see the sketch after this list).
  2. Make sure your app knows how to shut down gracefully when it gets a SIGTERM.
  3. Use lifecycle-hooks to drain nodes.
  4. Use lifecycle-hooks to drain ALB/ELB targets.
  5. Lifecycle-Manager is awesome! You should star the repo.
  6. Enjoy a cold drink while your apps are running 5xx-free on Kubernetes.
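For point 1, the difference is simply which API you call to terminate the instance; a boto3 sketch with a placeholder instance ID:

```python
# Sketch: terminate an instance through the Auto Scaling API so that any
# configured lifecycle hooks (and thus draining) run first, instead of the
# raw EC2 terminate-instances call. The instance ID is a placeholder.
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.terminate_instance_in_auto_scaling_group(
    InstanceId="i-0123456789abcdef0",
    ShouldDecrementDesiredCapacity=False,  # let the ASG replace the node
)
```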
