Expedia Group Technology — Software

Disable Traffic to a Kubernetes Application During an Incident

How to recover when replicas are overwhelmed?

Ori Rawlings
Expedia Group Technology

“Vehicle in Road at Golden Hour”, by Pixabay, is licensed under CC0

Have you ever been on-call during an incident where your application was overwhelmed by requests? Maybe some replicas of the application failed and began restarting, but the remaining replicas could not sustain the traffic alone? Maybe that led to a prolonged impairment, where you just couldn’t get 100% of replicas available at the same time?

Kubernetes generally does a pretty good job here. Each replica of your application runs as a Pod, and Kubernetes uses Services to load balance requests across the replicas. Kubernetes also continuously probes each replica for readiness and liveness: if a readiness probe fails, the replica is taken out of load balancer rotation; if a liveness probe fails, the container is restarted. This lets us recover automatically when only a few replicas fail at the same time.
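
To make this concrete, here is a rough sketch of what readiness and liveness probes might look like in a container spec. The /healthz path, port 8080, and timing values are illustrative assumptions, not taken from the application discussed below.

readinessProbe:
  httpGet:
    path: /healthz        # assumed health endpoint
    port: 8080            # assumed container port
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3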

Sometimes, the conditions that caused the initial failure become more likely once fewer replicas are serving traffic. The failure then cascades from replica to replica, because we never return to a state where 100% of capacity is in rotation, and the impairment drags on. In these scenarios, the most direct way to recover is to disable all traffic, let 100% of replicas stabilize, and only then re-enable traffic. Unfortunately, this is not something Kubernetes will do automatically.

So if we want to disable all traffic to an application in Kubernetes, how do we do it?

A Kubernetes application is typically exposed through a Service, which acts as a virtual load balancer. The Service uses something called a selector to discover the replicas of your application: the selector queries Kubernetes for pods that carry certain labels, and the Service routes requests to every pod that matches.

For example, consider an application, arbitrarily called quack, which has the following Service definition.

$ kubectl get service quack -o yaml
apiVersion: v1
kind: Service
metadata:
  name: quack
spec:
  ...
  selector:
    app: quack
    release: quack-prod
  ...

This selector looks for all pods with two labels: app=quack and release=quack-prod. We can query for pods that match these labels to see which pods the service will send requests to. In this case, the service selector matches the single quack pod in production.

$ kubectl get pods --selector app=quack,release=quack-prod
NAME                     READY   STATUS    RESTARTS   AGE
quack-6685676c68-qvgj5   1/1     Running   0          2d5h

To disable traffic, we simply edit the Service’s selector so that it no longer matches any pods. With no matching pods in rotation, all requests will fail (typically with something like an HTTP 404 response, depending on what sits in front of the Service). We can do this by adding an extra label to the selector that no pod carries (e.g. offline: please).

$ kubectl edit service quack
apiVersion: v1
kind: Service
metadata:
  name: quack
spec:
  ...
  selector:
    app: quack
    release: quack-prod
    offline: please
  ...
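
If you prefer a non-interactive alternative to kubectl edit, a strategic merge patch can add the same label in one line. This is a sketch against the quack Service from the example above.

$ kubectl patch service quack -p '{"spec":{"selector":{"offline":"please"}}}'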

We can confirm that no pods will match the new selector query.

$ kubectl get pods --selector app=quack,release=quack-prod,offline=please
No resources found.

We can also confirm that the Service doesn’t have any Endpoints in rotation. Endpoints are resources in the Kubernetes API that record the pod addresses a Service is currently routing traffic to.

$ kubectl get endpoints quack
NAME    ENDPOINTS   AGE
quack   <none>      254d
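
While traffic is disabled, one way to tell when every replica has stabilized is to watch the pods until they all report Ready. The labels are the same quack labels used above.

$ kubectl get pods --selector app=quack,release=quack-prod --watch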

Once all pods have recovered, we simply edit the Service again, removing the extra offline label, to resume routing traffic to the application replicas.

$ kubectl edit service quack
apiVersion: v1
kind: Service
metadata:
  name: quack
spec:
  ...
  selector:
    app: quack
    release: quack-prod
  ...
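
As with disabling traffic, this can be done non-interactively. In a strategic merge patch, setting a key to null removes it, so the following sketch drops the extra offline label from the quack Service.

$ kubectl patch service quack -p '{"spec":{"selector":{"offline":null}}}'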

We can confirm that the endpoints, once again, include our application pod(s).

$ kubectl get endpoints quack
NAME    ENDPOINTS                           AGE
quack   10.3.57.199:8081,10.3.57.199:8080   254d

Learn more about technology at Expedia Group.
