Introducing TargetGroupController — A way to more efficiently run your Kubernetes services

Houseparty
Oct 4, 2018

Ted Hahn, DevOps at Houseparty

Here at Houseparty, we use AWS ALBs for load balancing, SSL fronting, and more. We find them to be an effective and reliable tool for managing our traffic and routing it to services. What we did not find them effective for was health checking. Bad health checking causes users to see HTTP 500s when they shouldn’t: sometimes we already know that backends are unhealthy, particularly around scaling events, yet traffic still reaches them.

A primer on how Health Checking works

Health checking is the process of making requests to your service or backend to ensure that it is still up. This helps with problem detection and remediation, and is often used to handle services that are slow to start: instances are not added to the serving pool until they begin passing health checks. Kubernetes uses both: LivenessProbes to tell whether pods have become unresponsive and automatically restart them, and ReadinessProbes to tell whether pods are ready to serve traffic.

TargetGroupController — Shim between Kubernetes Services and AWS ALB/ELBv2

TargetGroupController is a simple Kubernetes-style controller that solves a simple problem. It translates between Kubernetes Service/Endpoints objects and AWS TargetGroups. For a given Kubernetes Service, it watches the Service’s ready endpoints and reconciles the list of IPs registered in the TargetGroup to match them.
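The reconciliation step comes down to a set difference. The following is a minimal sketch of that idea, not the controller's actual code: the function name and the sample IPs are hypothetical. In a real controller the two result sets would presumably be passed to the ELBv2 RegisterTargets and DeregisterTargets API calls.

```python
from typing import Iterable, Set, Tuple


def plan_target_changes(
    ready_endpoint_ips: Iterable[str],
    registered_target_ips: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (to_register, to_deregister) so the TargetGroup matches the Service."""
    desired = set(ready_endpoint_ips)
    current = set(registered_target_ips)
    to_register = desired - current    # ready pods the TargetGroup does not know about yet
    to_deregister = current - desired  # targets whose pods are gone or no longer ready
    return to_register, to_deregister


# Hypothetical snapshot: pod 10.0.1.7 stopped being ready, pod 10.0.2.9 just became ready.
ready = ["10.0.1.5", "10.0.2.9"]
registered = ["10.0.1.5", "10.0.1.7"]
to_register, to_deregister = plan_target_changes(ready, registered)
print("register:", sorted(to_register))
print("deregister:", sorted(to_deregister))
```

Computing a plan first and applying it afterwards keeps the comparison logic free of AWS calls, which makes it easy to test in isolation.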

Problem statement

The built-in Kubernetes Service controller can create Amazon ELBs. The routing for these ELBs, however, presents a problem. The Service creates a NodePort, and then registers each node as a backend to the ELB. The health checks that the ELB performs check the health of the *nodes*, not of the underlying *pods*. There may be multiple connections to each node, each with a different backing pod. These pods do not correspond in any way to the node: you can connect to any node, including ones that aren’t running any pods associated with your service. Since it is the nodes that are health checked, a single connection that happens to land on a healthy pod can be taken as proof of health for many other connections, even if those are attached to different, unhealthy pods.
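To make the failure mode concrete, here is a rough model rather than a description of kube-proxy internals. It assumes every connection arriving on the NodePort is forwarded to one of the Service's pods chosen uniformly at random, whichever node it arrived on, and that unhealthy pods are still receiving traffic. The function names and the numbers are illustrative only.

```python
from fractions import Fraction


def node_check_pass_probability(healthy_pods: int, total_pods: int,
                                consecutive_checks: int = 1) -> Fraction:
    """Chance that a node passes `consecutive_checks` health checks in a row.

    Assumption: each check is forwarded to a uniformly random pod, so the
    result says nothing about the node the check was sent to.
    """
    return Fraction(healthy_pods, total_pods) ** consecutive_checks


def user_error_rate(healthy_pods: int, total_pods: int) -> Fraction:
    """Fraction of user requests that land on an unhealthy pod under the same assumption."""
    return 1 - Fraction(healthy_pods, total_pods)


# Illustrative numbers: 4 pods, 1 of them unhealthy, 2 consecutive checks required.
print(node_check_pass_probability(3, 4, consecutive_checks=2))  # 9/16
print(user_error_rate(3, 4))                                    # 1/4
```

Under these assumptions every node has the same chance of passing, so marking a node unhealthy does nothing to steer traffic away from the bad pod.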

To rectify this, we need to improve routing so that the ALB is aware of the state of each individual pod. We do this by using an IP-type TargetGroup. Caveat: this requires that the pods be routable within the VPC. You can’t use Calico or other network models that do not expose pod IPs outside the cluster.

Using the TargetGroupController (TGC):

  1. Create an ALB, and then create a TargetGroup of type ‘ip’. Note down its ARN.
  2. Create an IAM user with the template in aws/policy.json, and put its credentials into the secret aws-targetgroupcontroller.
  3. Create the ServiceAccount, Service, and Deployment in the k8s directory, substituting in your service name and the TargetGroup’s ARN.
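Step 1 can also be scripted. The sketch below assumes boto3 is used; the helper names, the `/healthz` health check path, and the sample values are hypothetical and not part of the TargetGroupController repository. `create_target_group` and its `TargetType` parameter belong to the ELBv2 API.

```python
from typing import Any, Dict


def target_group_params(name: str, vpc_id: str, port: int,
                        health_path: str = "/healthz") -> Dict[str, Any]:
    """Build the request for an IP-type TargetGroup (health_path is an assumed default)."""
    return {
        "Name": name,
        "Protocol": "HTTP",
        "Port": port,
        "VpcId": vpc_id,
        "TargetType": "ip",  # register pod IPs directly instead of instances
        "HealthCheckPath": health_path,
    }


def create_ip_target_group(elbv2_client: Any, **kwargs: Any) -> str:
    """Create the TargetGroup and return its ARN, which step 3 needs."""
    response = elbv2_client.create_target_group(**target_group_params(**kwargs))
    return response["TargetGroups"][0]["TargetGroupArn"]


# Usage (requires boto3 and AWS credentials; the values are placeholders):
#   import boto3
#   arn = create_ip_target_group(boto3.client("elbv2"),
#                                name="my-service", vpc_id="vpc-0123abcd", port=8080)
```

The client is passed in as an argument so the helper can be exercised with a stub before it is pointed at a real AWS account.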

Running the TargetGroupController:

TGC implements a Prometheus monitoring endpoint. Example rules are in the prometheus directory.

Building:

TargetGroupController is built and pushed by Bazel. All that’s needed is to run:

bazel run :push_dockerimage "--embed_label=$(git rev-parse HEAD)"

See the BUILD file for other targets.
