Ingress load balancing issues on Google’s GKE
Usually my posts here are about some thing I think I might have figured out and want to share. Today’s post is about a thing I’m pretty sure I haven’t figured out and want to share. I want to talk about a problem we’ve been wrestling with over the last couple of weeks; one which we can suggest a potential fix for but do not yet know the root cause of. In short, if you are running certain types of services behind a GCE class ingress on GKE you might be getting traffic even when your pods are unready, as during a deployment for example. Before I get into the details here is the discovery story. If you just want the tl;dr and recommendations jump to the end.
[Update 4/17/2019 — Google’s internal testing confirmed that this is a problem with the front end holding open connections to kube-proxy and reusing them. Because the netfilter NAT table rules only apply to new connections this effectively short-circuited kubernetes’ internal service load balancing and directed all traffic to a given node/nodeport to the same pod. Google also confirmed that removing the keep-alive header from the server response is a work-around, and we’ve confirmed this internally. If you need the keep-alive header then the next best choice is to move to container native load balancing with a VPC-native cluster, since this takes the nodeport hop right out of the equation. Unfortunately that means building a new cluster if yours is not already VPC-native. So that is the solution… if you’re still interested in the story read on!]
Over the last couple of months we’ve been prepping one of our most critical services for migration to GKE. This service consists partly of an http daemon that handles long poll requests from our javascript client, and runs on 90 GCE instances. These instances handle approximately 15k requests per second at peak load. Because many of these requests are long polls with a timeout of 30 seconds we need the ability to gracefully shut down instances of this service. To accomplish this we have a command we can send the service that causes it to take itself out of rotation, wait 60 seconds for all existing long polls to complete, and then exit.
To replicate this in kubernetes we took advantage of three important and useful features: readiness probes, lifecycle hooks, and the termination grace period for pods. Readiness probes let you define a command the kubelet calls to determine if the pod should receive traffic, in our case an http endpoint. Lifecycle hooks specify a command to be executed in the container before the kubelet sends it SIGTERM on delete. This is how we send our shutdown command. The termination grace period sets the maximum amount of time the kubelet will wait for the lifecycle hook to complete before sending SIGTERM.
All of this seemed to work quite well in staging, but alas we all know the deal there. After we spun up some production pods behind a GCE class ingress and began to migrate traffic to them Nick MacInnis, lead engineer on this service, began to see something odd: during a deployment unready pods were still receiving traffic. This is a bad thing, as it means we will have outstanding connections when the pod is finally deleted and our clients will see a flurry of broken connection errors. Testing confirmed his observations. Not only were the pods getting traffic, but they were getting new connections. In fact we could simply curl the load balancer endpoint in a loop and get traffic to unready pods.
The first question we asked ourselves was: are the pods actually unready? When a pod fails its readiness probe (either by returning something other than 200 to an http probe, returning 1 from a command probe, or by exiting if no probe is defined) it should be marked as unready by the master, its endpoint should be marked unready by the endpoint controller, the endpoint should be removed from the service’s endpoints list, and the kube-proxy on each node should remove that endpoint from the chain in the NAT table that distributes traffic to pods.
Our testing confirmed that all these things were happening just as described: the pod was marked as unready, the endpoints controller marked the endpoint unready, the service removed it. Because traffic to a service from an ingress (i.e. through a nodeport) is SNAT’d onto the pod network on the receiving node it was easy to see what nodes traffic was coming from. When we identified and connected to a source node we could see that the pod’s endpoint was removed from the NAT table as soon as kube-proxy got the update, which was typically 5–10 seconds after the pod itself was marked unready.
And yet it was getting traffic. We opened up a case with Google Cloud support and continued testing. I should note here that we have always had good experiences with Google support. They care about the service and they’re quick to dive into even complex problems. Over the past couple of years they have helped us solve a lot of issues, but they were as perplexed by this one as we were. Internally we focused our efforts on reproducing the issue in a more controlled environment. In our view there was nothing special about this http server or the ingress so if this was a thing it should be a thing for any similarly-configured GKE service.
We began by creating a simple server using python3 and flask, with three endpoints:
from flask import Flask
from flask import Responseapp = Flask("ready-test")READY = True@app.route('/', methods=['GET'])
def handle_root():
resp = Response("hello to you sir or madam")
return resp@app.route('/ready/', methods=['GET'])
def handle_ready():
global READY
if READY:
return "ready"
else:
return Response("unready", status=503)@app.route('/unready/', methods=['GET'])
def handle_unready():
global READY
if READY:
READY = False
return "server is unready"
else:
READY = True
return "server is ready"
The root path is called by the test client and just returns some junk and http 200. The /ready/ path is intended to be called from the readiness probe and returns either 200 or 503 based on the state of a global variable. The /unready/ path just toggles the state of that variable and is the means for us to control the ready state of the pod.
We then added a deployment, service and ingress:
kind: Deployment
apiVersion: apps/v1
metadata:
name: ready-test-http
labels:
app: ready-test
spec:
selector:
matchLabels:
deployment: ready-test-http
template:
metadata:
labels:
app: ready-test
deployment: ready-test-http
spec:
containers:
- name: ready-test
image: ready-test
ports:
- containerPort: 8080
name: http
readinessProbe:
httpGet:
path: /ready/
port: 8080
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 2
---kind: Service
apiVersion: v1
metadata:
name: ready-test
labels:
app: ready-test
spec:
type: NodePort
selector:
deployment: ready-test-http
ports:
- name: http
port: 80
targetPort: http
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: ready-test
labels:
app: ready-test
annotations:
kubernetes.io/ingress.class: "gce"
spec:
rules:
- http:
paths:
- path: /
backend:
serviceName: ready-test
servicePort: 80
- path: /ready/
backend:
serviceName: ready-test
servicePort: 80
- path: /unready/
backend:
serviceName: ready-test
servicePort: 80
The specific timings on the readiness probe were chosen to replicate what we do in our production service, the rest is pretty much straight-up boilerplate nothing-to-see-here yaml. We deployed, confirmed we could toggle the ready state, started up a loop to curl the root path and followed the pod log. We then toggled the ready state, saw the pod marked as unready and about 5 or so seconds later saw the NAT tables on the nodes get updated… and boom the traffic from the load balancer stopped. The only thing we could see coming in were the readiness probes from the kubelet and the GCLB front end.
We went back and took a closer look at the production service. Something was obviously different. After poking around we realized that the twisted http server we’re using sets the “Connection: keep-alive” header by default, whereas our test app did not. We added code to do that, which changed the implementation of the root path handler to this:
@app.route('/', methods=['GET'])
def handle_root():
resp = Response("hello to you sir or madam")
resp.headers['Connection'] = 'Keep-Alive'
resp.headers['Keep-Alive'] = 'timeout=30'
return resp
The “Keep-alive: timeout=30” header is something I added in testing. I’m not sure it’s relevant but I’ll leave it in there since it was present for the next test we ran. We redeployed the test service, repeated the above steps… and continued to see requests to / after the pod became unready. Let’s take a closer look at that test.
In this screen you can see that the test pod is ready, and we are only receiving requests to /ready/ from the kubelet (on 10.0.30.1) and the GCLB (the requests coming from node IPs on the 10.130.0.0/20 network). We then started up a curl with the following command:
$ while true; do curl -m 1 http://xxx.xxx.xxx.xxx/; done
The purpose of the “-m 1” argument is so that curl will not wait on the 30 second GCLB timeout but will continue to try to send requests at least once per second.
In this screen the pod is still ready and you can see the requests to / coming in from the curl client. Note that because of the previously mentioned SNATing these also appear to be coming from the nodes. We then toggled the pod state to unready.
Here you can see the pod is unready, we are returning 503 to the readiness probes, and yet we are still getting requests to the / endpoint from our curl client. Note that even though we only have a single pod here, and that pod is unready, the client is not seeing errors when these requests get through.
In this output from the curl command you can see it’s timing out at the requested 1000 ms most of the time, but every now and then it gets a response. We let this test run for several minutes and continued to see sporadic requests. The test client is running around 1 rps in the worst case so you can easily imagine the impact this could have if you scale up to 15k rps.
We did several more rounds of testing and confirmed to our own satisfaction that when we returned the keep-alive header in the server response some requests would get through, when we removed that header things worked as expected and traffic stopped as soon as all the nodes’ NAT tables were updated.
So what’s the tl;dr here? Sometimes unready pods get requests. We’re still working on this issue with support and the GKE team is investigating. When we learn what the root cause is I’ll follow up here. We have our own theories about connections getting held open and reused somewhere in the chain (mostly spurred by the fact that the NAT table rules only apply to new connections), but really it’s speculation and we don’t know. Load balancing through an ingress is complicated, with two separate levels of health checking and round-robin connection balancing going on. If it were a problem with kube-proxy we should be able to see it. So far it doesn’t seem like that’s the case, and we don’t have the same level of visibility on the GCLBs. We’ll have to leave that to the Google team.
In the meantime if you run an http service on GKE behind a GCE class ingress and your service can become unready while remaining live (i.e. can pass the liveness probe and fail the readiness probe) and your server is returning the keep-alive header you may want to take a close look at what’s happening during deployments. We see nothing to suggest that this is tied to specifics of our application or environment and it would be surprising if it is not happening elsewhere too. We also definitely don’t know that removing the keep-alive header is a complete fix.
Thanks to Nick MacInnis and Kyle Owens at Olark for their help digging into this issue, and to Yahir at Google Cloud Support for all his assistance with it to date.