Utilizing Kubernetes Liveness and Readiness Probes to Automatically Recover From Failure
With a few lines of YAML, you can turn your Kubernetes pods into self-healing wonders. The right combination of liveness and readiness probes used with Kubernetes deployments can:
- Enable zero-downtime deploys
- Prevent deployment of broken images
- Ensure that failed containers are automatically restarted
If you’re not already familiar with liveness and readiness probes, the Kubernetes docs have a great introduction to them. Like most things in Kubernetes, the syntax involves some simple YAML. Both share the same syntax, and can be as simple as:
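A minimal sketch, assuming a container that serves a hypothetical /healthz route on port 8080 (neither name comes from our own configuration). Swapping the key for livenessProbe gives a liveness probe of the same shape:

```yaml
# Fragment of a container spec; the route and port are hypothetical.
readinessProbe:
  httpGet:
    path: /healthz   # assumed health route
    port: 8080       # assumed container port
```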
Updating deployments without readiness probes can result in downtime as old pods are replaced by new pods. If the new pods are misconfigured or somehow broken, that downtime extends until you detect the problem and roll back.
With readiness probes, Kubernetes will not send traffic to a pod until the probe is successful. When updating a deployment, it will also leave old replica(s) running until probes have been successful on the new replica(s). That means that if your new pods are broken in some way, they’ll never see traffic; your old pods will continue to serve all traffic for the deployment.
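How many old replicas stay up during a rollout is governed by the deployment's rolling update strategy, where maxUnavailable and maxSurge both default to 25%. As a rough sketch, assuming you want no ready pod removed before its replacement has passed its readiness probe (the replica count and values are illustrative, not taken from our setup):

```yaml
# Fragment of a Deployment spec; values are illustrative assumptions.
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired number of ready pods
      maxSurge: 1         # bring up one extra pod at a time during the rollout
```

With these settings a broken image stalls the rollout at the first new pod, while the old replicas keep serving.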
Unfortunately this isn’t as useful after the deployment is complete and new replicas have replaced your old replicas. If something happens to cause your readiness probes to fail, the replicas will no longer be considered “ready”, and traffic will no longer be sent to them. If that happens to all your replicas, you could end up with a situation where users can’t even see your application’s error page.
One of the best ways to deal with this is to change the behavior of the route your readiness probe is checking. When we first started using readiness probes for our API, they pinged a route that not only ensured our API was running, but also that it was connected to all the databases it needed to be. Unfortunately this meant that if a database became unavailable, so did our API. Failing readiness probes would take all of our replicas out of circulation.
We’ve since changed to a different route that simply ensures that the service is running and responding to requests. This means that even if it is in an error state, it will still receive traffic and be able to respond with appropriate error messages.
Update: As Andy Hume pointed out in a response, this is less than ideal if a dependency only fails for a subset of pods. If error handling is all we’re removing dependency checks for, it may make more sense to move that further up the stack.
In many cases, it makes sense to complement readiness probes with liveness probes. Despite the similarities, they function independently. While readiness probes take a more passive approach, a failing liveness probe causes Kubernetes to restart the container*.
Here’s what this might look like in a real life failure scenario. Let’s say our API encounters a fatal exception when processing a request.
- Readiness probe fails.
- Kubernetes stops routing traffic to the pod.
- Liveness probe fails.
- Kubernetes restarts the failed container*.
- Readiness probe succeeds.
- Kubernetes starts routing traffic to the pod again.
Pretty incredible, right? This is the kind of automated healing that makes Kubernetes incredible to work with. All of the above is made possible with just a few lines of YAML.
Probes for HTTP Services
The most noticeable impact these probes can have will be on HTTP services. The livenessProbe and readinessProbe attributes should be placed at the same level as the image attribute for each container in a deployment. For one of our services, the probe configuration looks something like:
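The exact configuration is specific to each service. As a rough sketch, assuming a hypothetical example-api image that serves a /healthz route on port 8080, with timing values chosen only to fall within the ranges discussed in this post:

```yaml
# Sketch only: the names, image, port, and route are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
      - name: api
        image: registry.example.com/example-api:1.0.0
        ports:
        - containerPort: 8080
        # Probes sit at the same level as the image attribute.
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5    # assumed: short, so traffic arrives soon after startup
          timeoutSeconds: 1
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30   # assumed: safely longer than typical startup time
          timeoutSeconds: 1
          periodSeconds: 15
```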
The attributes are pretty straightforward here. The key ones to pay attention to are:
initialDelaySeconds: How long to wait before sending a probe after a container starts. For liveness probes this should be safely longer than the time your app usually takes to start up. Without that, you could get stuck in a reboot loop. On the other hand, this value can be lower for readiness probes as you’ll likely want traffic to reach your new containers as soon as they’re ready.
timeoutSeconds: How long a request can take to respond before it’s considered a failure. For us, 1 second is more than sufficient.
periodSeconds: How often a probe will be sent. The value you set here depends on finding a balance between sending too many probes to your service or going too long without detecting a failure. In most cases we settle for a value between 10 and 20 seconds here.
Probes for Background Services
Although readiness and liveness probes are remarkably straightforward for services that expose an HTTP endpoint, it takes a bit more effort to probe background services. Here’s an example of probes we’re currently using for one of our background services:
Both the readiness probe and the liveness probe run the same exec command: find alive.txt -mmin -1
Our background service touches alive.txt every 15 seconds, and the probes test to ensure that file has been modified within 1 minute. This ensures that not only is the service running, it’s continuing to function as expected.
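The heartbeat-file check described above could be wired up roughly as follows. This is a sketch under assumptions rather than our exact configuration: it assumes /bin/sh and a find that supports -mmin are available in the image, that alive.txt lives in the container's working directory, and the timing values are illustrative. One detail worth knowing is that GNU find exits with status 0 even when no file matches the -mmin test, so the sketch wraps the call in test -n, which fails the probe when find prints nothing:

```yaml
# Fragment of a container spec; timing values are illustrative assumptions.
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # Succeeds only if alive.txt was modified less than 1 minute ago.
    - 'test -n "$(find alive.txt -mmin -1)"'
  periodSeconds: 15
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - 'test -n "$(find alive.txt -mmin -1)"'
  initialDelaySeconds: 30   # assumed: give the service time to write its first heartbeat
  periodSeconds: 15
```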
The exec option for probes is incredibly powerful. We’re using a fairly simple command in the above example, but the sky’s the limit here. Each service is unique and this allows for a flexible way to ensure that things remain functional.
When to use Readiness and Liveness Probes
Despite how great readiness and liveness probes can be, they’re not always necessary. When updating deployments, Kubernetes will already wait for a replacement pod to start running before removing the old pod. Additionally, if a pod stops running, Kubernetes will automatically try to restart it. Where these probes prove their worth is the time between when a pod starts running and when your service actually starts functioning. Kubernetes already knows if your container is running; probes let it know if your container is functioning.
One of our services is a Rails app running on Passenger. The time between when a pod starts running and the service is actually responding to requests can be significant (5–10 seconds). Without a readiness probe, we’d end up with at least that much downtime every time we updated that deployment.
Additionally, we have a number of services that can fail in ways that don’t always result in the container crashing. These failures aren’t particularly common, but when they happen, it’s nice to have a liveness** probe around to catch the issue and restart the container.
At Spire, readiness and liveness probes have proven their worth several times over. They’re yet another feature that shows how powerful Kubernetes can be.
We’ve made our share of mistakes along the way; properly configured readiness and liveness probes can be a lifesaver when those mistakes happen.
*Updated from pod to container here. As comments pointed out, probes run directly against containers and therefore restart the containers they’re running against.
**Updated from readiness to liveness here, thanks to the comments that caught this mistake in the original version.