Artwork by Oliver Lake

Kubernetes ELB Timeout Blues

John Murphy
Redbubble

--

We’ve been running Kubernetes in production on AWS for nearly a year and a half here at Redbubble, and while things have mostly been humming along just peachy (albeit with the occasional curveball), we _have_ had one long-running bugbear. The crux of the problem was this: any kubectl command that required a long-lived connection to the Kubernetes API server would have that connection closed prematurely. The commands that suffered from this behaviour were:

  • kubectl logs --follow
  • kubectl rollout status
  • kubectl exec

All of these would work for roughly a minute or so before the connection dropped and you were left feeling a little sad. We realised we had this problem quite early on, but it was mostly regarded as an irritant rather than one of our most pressing engineering problems. Our initial investigation suggested that the ELB was ending the connection prematurely. Along the way we also stumbled across [this thread](https://github.com/kubernetes/kubernetes/issues/15702), whose last comment popped up right in the middle of our investigation:

> This is usually the load balancer timeout in front of an apiserver. Check the timeout from the ELB.

So we increased the ELB timeout and left it alone.
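
For context, a Classic ELB’s idle timeout defaults to 60 seconds, which lines up neatly with connections dying after roughly a minute. As a rough sketch of what that tweak looks like with the AWS CLI (the load balancer name below is a placeholder, not our actual setup):

```sh
# Check the current idle timeout on the Classic ELB fronting the apiserver.
# "api-k8s-example" is a placeholder name.
aws elb describe-load-balancer-attributes \
  --load-balancer-name api-k8s-example \
  --query 'LoadBalancerAttributes.ConnectionSettings.IdleTimeout'

# Raise the idle timeout (Classic ELBs allow up to 4000 seconds).
aws elb modify-load-balancer-attributes \
  --load-balancer-name api-k8s-example \
  --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":3600}}'
```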

This didn’t stop the problem from occurring, it just reduced the rate at which our engineers ran into it. Later last year AWS announced cross-zone capability for Network Load Balancers (NLBs) and relabelled ELBs as Classic Load Balancers. NLBs, operating at a lower level of the network stack, don’t implement the same keep-alive logic. We thought swapping out the ELBs for NLBs might just do the trick to make this problem go away once and for all.

As an aside, we briefly considered using Application Load Balancers (ALBs) instead, but quickly realised that the need to terminate SSL at the ALB was going to cause us problems (the Kubernetes API server likes to do that itself).
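
How you provision the replacement depends on how your cluster is built, but the swap essentially amounts to an NLB with a TCP listener on 443 that forwards straight through to the API servers, leaving TLS termination to the apiserver as noted in the aside above. A rough sketch with the AWS CLI, where every name, ID, and ARN is a placeholder:

```sh
# Create an internal NLB and a TCP target group for the apiserver instances.
# All names, subnet/VPC/instance IDs, and ARNs here are placeholders.
aws elbv2 create-load-balancer \
  --name api-k8s-example \
  --type network \
  --scheme internal \
  --subnets subnet-0123456789abcdef0

aws elbv2 create-target-group \
  --name k8s-apiserver \
  --protocol TCP \
  --port 443 \
  --target-type instance \
  --vpc-id vpc-0123456789abcdef0

# Register the controller instances, then forward TCP 443 straight through.
aws elbv2 register-targets \
  --target-group-arn <target-group-arn> \
  --targets Id=i-0123456789abcdef0

aws elbv2 create-listener \
  --load-balancer-arn <nlb-arn> \
  --protocol TCP \
  --port 443 \
  --default-actions Type=forward,TargetGroupArn=<target-group-arn>
```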

As it turns out, our hunch that NLBs might do the trick was correct! Getting there, though, was slightly trickier than we first envisioned. Getting the NLBs up and running was reasonably trivial; the bit that caught us by surprise was the fact that an NLB must reside in the same subnet as the EC2 instances it targets. This is in the documentation; we were just really bad at reading that day and kept missing it. Once we shifted the controllers into the same subnet as our NLBs, we were in business and all of our connection issues evaporated!
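
If you want to sanity-check that placement yourself, something along these lines with the AWS CLI will do it (the load balancer name and tag filter are placeholders for whatever your NLB and controllers are actually called):

```sh
# Which subnets/AZs is the NLB attached to?
aws elbv2 describe-load-balancers \
  --names api-k8s-example \
  --query 'LoadBalancers[0].AvailabilityZones[].[ZoneName,SubnetId]'

# Which subnets/AZs are the controller instances actually in?
aws ec2 describe-instances \
  --filters Name=tag:Name,Values=controller-* \
  --query 'Reservations[].Instances[].[InstanceId,SubnetId,Placement.AvailabilityZone]'
```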
