Kubernetes Tips and Tricks — Health Checks

Marshall Todt
Hitachi Solutions Braintrust
Nov 17, 2020

As an Orchestration Engine, Kubernetes can seem like absolute magic at times. The ability of this system to keep your services up and available with minimal human intervention is just short of miraculous for those of us who have struggled through maintaining production systems. However, the magic definitely has some sleight of hand behind the scenes, and two of the key tricks in the bag are Readiness Probes and Liveness Probes (there are also Startup Probes, which I will touch on briefly).

Probes

In the Kubernetes world, a probe is, more or less, a periodic call to some endpoint within a container, combined with tracking of whether those calls succeed or fail. When enough consecutive calls fail, the probe can trigger some action. When the probe starts succeeding again after a failure, it can trigger another. The three kinds of probes that Kubernetes defines are Readiness Probes, Liveness Probes, and Startup Probes.

Readiness Probe

A Readiness Probe is used to determine whether a container is in a state to accept traffic. When this probe is successful, the load balancer allows traffic into the container. When it fails, traffic to the container is halted. As a real-life example, I had a service that was responsible for generating large documents on the fly for users. This process was extremely processor- and memory-intensive, and could very quickly make the service non-responsive. The last thing we wanted was for this service's containers to keep receiving traffic while it was under heavy use. We configured a readiness probe with a fairly short interval (15 seconds) and a very low failure threshold (two failures). This meant that should the service become overloaded, all traffic to it would be shut off within roughly 15–30 seconds. While that seems like a long window, with round-robin requests and the frequency of use of the service, it was more than sufficient to shut down traffic until the service was no longer under heavy use.

This probe will only cause the container to be removed from the load balancer; it will never cause the container to be restarted by itself.
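For illustration, here is a minimal sketch of a readiness probe along the lines of the configuration described above. It sits at the container level of the pod spec, just like the livenessProbe examples later in this article; the /ready path and port 8080 are assumptions for the sketch, not the original service's values.

readinessProbe:
  httpGet:
    path: /ready          # hypothetical readiness endpoint
    port: 8080
  periodSeconds: 15       # check every 15 seconds
  failureThreshold: 2     # two consecutive failures stop traffic to this container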

Liveness Probe

The Liveness Probe is designed to determine whether the container is healthy or needs to be restarted. When it fails, the container is flagged to be restarted. Even if you never expect your containers to get into a state where they need to be restarted, you should always define a Liveness Probe with a general health check on your dependencies and anything else important. To go back to my previous example, the service that generated those documents also had memory problems. It leaked memory slowly enough that it needed to be restarted about once a week to remain healthy. The health check endpoint that the Liveness Probe pointed at ran several checks. It first checked memory usage, and if memory exceeded a threshold, it failed the Liveness Probe. Additionally, after one week of uptime, the container would deliberately fail the Liveness Probe so that even the slow memory leaks were resolved. Finally, we checked that the database was accessible.

When the Liveness Probe fails, the container will restart. These probes are vital to keeping the service in a healthy state.
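As a rough sketch, the Kubernetes side of that setup is just an ordinary liveness probe pointed at the health endpoint; the memory, uptime, and database checks all live inside the application's handler. The path, port, and timings here are assumptions for illustration, not the original values.

livenessProbe:
  httpGet:
    path: /healthz        # app handler checks memory usage, uptime, and database access
    port: 8080
  periodSeconds: 30       # check every 30 seconds
  failureThreshold: 3     # restart after three consecutive failures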

Startup Probe

There’s a third, special kind of probe called a Startup Probe. This probe runs when the container first starts, and gives it a maximum startup time before Kubernetes even bothers with Liveness and Readiness checks. Sometimes services have long spin-up times. That isn’t ideal, and should probably be remedied, but until it can be, you can use Startup Probes to ensure services have enough time to get running before the Liveness and Readiness Probes begin running.

Similar to the stories above, I had a service that, on a new deployment, had to validate some database values and crunch them into a table if there was a discrepancy. At the time, I did not know about Startup Probes, so instead I used the initialDelaySeconds variable on my probes to make them wait before even trying to call the health checks. However, this meant that if the container failed to start, I was still waiting out the full length of that delay plus however many failed probes the threshold allowed, potentially wasting several minutes on a bad start.
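As a sketch of that workaround, the probe simply waited out the expected startup window before its first check; the delay and other values here are made up for illustration.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 300   # wait out the expected startup time before the first probe
  periodSeconds: 10
  failureThreshold: 3        # a failed start still burns the delay plus three probe periods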

Additionally, a Startup Probe will kill the container if it does not start within the probe's time limit. This is very handy for providing a failure state on the initial load of the container, and a way to handle a container that doesn't even start (which is a significantly different error condition from a running container failing).

So How Do I Do It?

Probes are defined within a pod’s .yaml file at the container level. So, if I have a .yaml like this:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 3

It is telling Kubernetes that for the pod liveness-http, I have a container, liveness, that has a livenessProbe at /healthz on port 8080. After 3 seconds of life, Kubernetes will start probing this container and will continue to probe it every 3 seconds. Note that this probe does not set a failureThreshold, which lets you control how many times the probe can fail before the container is considered to be in a bad state. By default this is set to 3, so after 3 seconds of life, Kubernetes would start trying to hit the /healthz endpoint. If it could not get a successful response from that endpoint three consecutive times, it would flag the container as not alive and attempt to restart it. All of this means that your container could be in a failed state for roughly 6–9 seconds before the failure was noticed and steps were taken to respond to it.
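If you wanted that threshold spelled out rather than relying on the default, the livenessProbe stanza would look like this (same values as above, with failureThreshold added explicitly):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 3
  failureThreshold: 3   # shown explicitly here; 3 is also the default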

If we wanted a startup probe as well, we could modify it to add the startup probe like so:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/liveness
    args:
    - /server
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 3
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10

This configuration tries to hit the same endpoint, /healthz, from the moment the container starts, retrying every 10 seconds. That gives the application up to 300 seconds (30 failures × 10 seconds) to finish starting; if it still hasn't succeeded by then, the container is killed and restarted. The Liveness Probe (and any Readiness Probe) doesn't start running until the Startup Probe has succeeded.

Readiness probes follow a similar structure, so I won’t go into detail there. There are a ton of different variables you can use in your .yaml files to modify the behavior of these probes to get exactly what you want. The details of all of those variables are available in the Kubernetes Documentation. Additionally, there are some great Kubernetes Resources explaining the full pod lifecycle that are worth a read for understanding how this all works together to maintain the desired state of services.
