Kubernetes liveness and readiness probes with Spring Boot

Fernando González Cortés
aiincube-engineering
11 min read · May 12, 2020

One of the great features of Kubernetes is that it can make your setup adapt to the current situation in real time. For example, it can automatically scale the number of pods, it can retry jobs until success is achieved, it can detect if a container is not working properly and avoid sending traffic to it, etc.

In this post we will focus on the last example: how can Kubernetes detect if a container is working properly and what can it do if it is not. We will run a Spring Boot application in the container and we will also see how it can communicate the failures to Kubernetes.

Disclaimer: at the time of writing, the necessary Spring Boot functionality has not yet been released. We will use a development version, a SNAPSHOT of 2.3.0.

The aim is to get something like what is shown in the next screencast:

What we see in the beginning is Kubernetes load balancing the requests between two replicas. At some point one of the replicas gets disabled; Kubernetes notices it and sends all the traffic to the other replica. After 4 seconds, the disabled replica becomes ready again and Kubernetes resumes routing traffic to it.

In the rest of the post I will show how this is done. We will test the three following configurations:

  1. No application-Kubernetes communication. Kubernetes does not notice the service readiness/liveness.
  2. Probes on the API endpoint. Kubernetes notices the service liveness but not its readiness.
  3. Spring Boot Actuator + Kubernetes probes. Kubernetes notices both the service readiness and liveness and reacts accordingly.

Our example application

The code used in this post can be downloaded from this GitHub repository which contains tags for the different configurations that we are going to try. You can take a look at the code under the “configuration-1” tag:

git clone git@github.com:parknav/blog-kubernetes-boot-readylive.git
cd blog-kubernetes-boot-readylive/
git checkout configuration-1

We are using a Spring Boot web application that just outputs an identifier at the /id endpoint. Note that there are two ifs, which we will use to make the application fail artificially. The returned id is created at startup and does not change.
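A minimal sketch of the controller, to make the discussion concrete (this is not the exact repository code; in particular, the real application generates the id with the human-readable-ids library mentioned below, while a plain UUID stands in here):

import java.util.UUID;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class IdController {

    // Created once at startup; it does not change for the lifetime of the instance.
    private final String id = UUID.randomUUID().toString();

    private volatile boolean temporarilyBroken = false;
    private volatile boolean broken = false;

    @GetMapping("/id")
    public String getId() {
        if (broken) {
            // Permanent failure: the request fails with an exception (500).
            throw new IllegalStateException("Service broken");
        }
        if (temporarilyBroken) {
            // Temporary failure: the endpoint still answers with a 200, but no id.
            return "Service temporarily disabled";
        }
        return id;
    }
}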

We use https://github.com/kkuegler/human-readable-ids-java in order to generate human-readable ids.

In addition to this endpoint we will have a couple of endpoints that make the system artificially fail (a sketch of both handlers follows the list):

  • /disable: breaks the /id endpoint for 4 seconds. When the temporarilyBroken flag is set to true, the /id endpoint does not return the id but an error message instead.
  • /break: makes all subsequent requests fail. When the broken flag is set to true, the /id endpoint fails with an exception.
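
Continuing the sketch above, the two handlers could look like this (the mappings, method names and the 4-second re-enable scheduling are illustrative; the return messages match the test output shown later):

import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// ... inside the same IdController class sketched above ...

    @GetMapping("/disable")
    public String disable() {
        temporarilyBroken = true;
        // Re-enable the /id endpoint after 4 seconds.
        Executors.newSingleThreadScheduledExecutor()
                 .schedule(() -> temporarilyBroken = false, 4, TimeUnit.SECONDS);
        return "Disabled 4s";
    }

    @GetMapping("/break")
    public String breakService() {
        broken = true;
        return "Done. All broken now.";
    }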

Deploying in Minikube

We will use Minikube as our Kubernetes cluster. Let's start it:

$ minikube start

Now we need to create a container image for our application. At Ai Incube we use Jib a lot. Jib configuration is quite straightforward. In build.gradle:

jib {
    to {
        image = 'readylive'
        tags = ['v1']
    }
}

Normally we would just execute gradle jib, but in this case we want to communicate with the Docker daemon running in the Minikube instance. For that we need to execute:

$ eval $(minikube docker-env)

Afterwards we can invoke:

$ gradle jibDockerBuild

This will make the image available in the Minikube cluster.

Kubernetes configuration

Kubernetes configuration can be found in the k8s.yaml file. It consists of a Deployment with two replicas and a Service load-balancing the requests. The fact that we have two replicas will help us see how Kubernetes routes traffic when one replica is not ready:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: readylive
  labels:
    app: readylive
spec:
  replicas: 2
  selector:
    matchLabels:
      app: readylive
  template:
    metadata:
      labels:
        app: readylive
    spec:
      containers:
      - name: readylive
        image: readylive:v1
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: readylive
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: readylive
  type: LoadBalancer

We apply the configuration to Minikube with this command:

$ kubectl apply -f k8s.yaml

After which we should be able to see a couple of pods running:

$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
readylive-594bcdcc9d-t5twr   1/1     Running   0          6s
readylive-594bcdcc9d-w7x77   1/1     Running   0          6s

Testing

The test.sh script will help us to test the application. It performs the following steps:

  1. Query the /id endpoint for 2 seconds, with a 0.1 s pause between requests.
  2. Send one request to either /disable or /break (depending on the parameter passed).
  3. Query the /id endpoint for 20 seconds, with a 0.1 s pause between requests.

For example, if we call ./test.sh disable we will get this output (the output of the tests is often trimmed to avoid too much redundant information):

$ ./test.sh disable
ordinary-wolverine-60 <- Both services getting traffic
cold-dolphin-42
ordinary-wolverine-60
ordinary-wolverine-60
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
request to disable <- ordinary-wolverine gets disabled
Disabled 4s
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled      <- Kubernetes keeps sending traffic
Service temporarily disabled         to the disabled replica
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
Service temporarily disabled
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
cold-dolphin-42
Service temporarily disabled
cold-dolphin-42
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
Service temporarily disabled
ordinary-wolverine-60             <- The replica got enabled again.
ordinary-wolverine-60                All back to normal.
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
ordinary-wolverine-60
cold-dolphin-42
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42

If we call ./test.sh break, that is, in the case of a persistent failure, the result is similar, except that the failing replica will not recover:

$ ./test.sh break
ordinary-wolverine-60 <- Both services getting traffic
cold-dolphin-42
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42

request to break <- cold-dolphin breaks
Done. All broken now.

ordinary-wolverine-60             <- Requests routed to the broken
{"error": ..., "status": 500,...}    replica get a 500 status code
{"error": ..., "status": 500,...}
ordinary-wolverine-60
ordinary-wolverine-60
ordinary-wolverine-60
ordinary-wolverine-60
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
ordinary-wolverine-60
{"error": ..., "status": 500,...}
ordinary-wolverine-60
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
ordinary-wolverine-60
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
ordinary-wolverine-60
ordinary-wolverine-60

In these examples Kubernetes didn’t react to the fact that a service was temporarily or permanently malfunctioning. How can we make Kubernetes notice?

Liveness and readiness probes

The Kubernetes official documentation explains very well what liveness and readiness probes are for:

  • The kubelet uses liveness probes to know when to restart a container.
  • The kubelet uses readiness probes to know when a container is ready to accept traffic.

Essentially, liveness and readiness probes tell Kubernetes when to restart or send traffic to a container.

The code under the configuration-2 tag sets up these probes, which consists of adding the readinessProbe and livenessProbe sections to the Kubernetes manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: readylive
  [...]
spec:
  [...]
  template:
    [...]
    spec:
      containers:
      - name: readylive
        image: readylive:v1
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /id
            port: 8080
          failureThreshold: 1
          periodSeconds: 1
        livenessProbe:
          httpGet:
            path: /id
            port: 8080
          failureThreshold: 1
          initialDelaySeconds: 8
          periodSeconds: 1

If a probe gets a single failure (failureThreshold is 1), it will trigger an action:

  • The readiness probe will make Kubernetes stop routing traffic to the concerned replica.
  • The liveness probe will make Kubernetes restart the container.

The readiness probe will query the /id endpoint every second, and the container will be marked ready while the probe succeeds and not ready otherwise.

The case of the liveness probe is a bit different. We need to add an initial delay of 8 seconds (initialDelaySeconds) before starting the probes, in order to let the Spring Boot application start. Otherwise the container will enter the following crash loop:

  1. Container is created
  2. 1 second later the liveness probe checks the /id endpoint, which is not ready yet.
  3. Liveness probe fails and the container is restarted.
  4. Go to 1

If we run the tests with these two probes we will see the liveness probe working well and restarting the application when the /id endpoint fails. However, when we disable a replica no probe will fail (/id still returns 200 status codes) and Kubernetes will go on sending traffic to that replica.

We can redeploy this manifest with kubectl apply -f k8s.yaml and execute the previous tests.

$ ./test.sh disable
great-dingo-86 <- Both services getting traffic
tricky-rattlesnake-22
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
great-dingo-86
great-dingo-86
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
great-dingo-86

request to disable <- great-dingo gets disabled
Disabled 4s

Service temporarily disabled      <- Kubernetes still sends traffic to
tricky-rattlesnake-22                the disabled replica because the
tricky-rattlesnake-22                status code is 200
tricky-rattlesnake-22
Service temporarily disabled
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
Service temporarily disabled
[...]
Service temporarily disabled
tricky-rattlesnake-22
Service temporarily disabled
Service temporarily disabled
tricky-rattlesnake-22
tricky-rattlesnake-22
Service temporarily disabled
tricky-rattlesnake-22
tricky-rattlesnake-22
Service temporarily disabled
great-dingo-86 <- great-dingo is back
great-dingo-86
great-dingo-86
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
great-dingo-86
great-dingo-86

Hitting the /break endpoint will work as expected, because it makes the /id endpoint fail, which in turn causes the liveness probe to fail and triggers a restart of the container.

$ ./test.sh break                 <- Both services getting traffic
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
great-dingo-86
great-dingo-86
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
request to break <- great-dingo breaks.
Done. All broken now.
{"error": ..., "status": 500,...}
curl: (7) Failed to connect       <- Kubernetes restarts great-dingo around here
tricky-rattlesnake-22             <- All traffic routed to tricky-
tricky-rattlesnake-22                rattlesnake while the other
tricky-rattlesnake-22                replica restarts
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
[...] <- It takes a while to restart
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tasty-badger-96                   <- Finally the container finishes
tasty-badger-96                      restarting and both replicas
tricky-rattlesnake-22                get traffic
tasty-badger-96
tasty-badger-96
tricky-rattlesnake-22
tasty-badger-96
tricky-rattlesnake-22
tricky-rattlesnake-22

How could we prevent Kubernetes from routing traffic to the replica during those 4 seconds? We could try to make the endpoint fail when we disable it, so that it returns a 500 status code and makes the readiness probe fail. However, that would also make the liveness probe fail and cause a restart of the container.

Indeed, the problem is that we have one endpoint doing many things: serving API requests, indicating readiness and indicating liveness. In real life, though, we may want to avoid probing the API endpoint because it may have undesirable side effects, it may be expensive, or we may not want the probe requests in our logs.

We therefore need an API endpoint and two separate endpoints for the probes. Spring Boot Actuator provides the latter out of the box.

Using Actuators

Among other things, Spring Boot Actuator installs a /health endpoint, which shows whether the system is working properly or not. Adding the dependency in build.gradle is enough to get the /health endpoint working:

dependencies {
    [...]
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    [...]
}

Note that if you run the application in a non-Kubernetes environment you will just get a single piece of information:

$ curl http://localhost:8080/actuator/health
{"status":"UP"}

But when the same endpoint is reached in Minikube we get some more output:

$ gradle jibDockerBuild
$ kubectl delete pod -lapp=readylive
$ curl "$(minikube service --url=true readylive)/actuator/health"
{"status":"UP","groups":["liveness","readiness"]}

And we have a couple of new endpoints:

$ curl "$(minikube service --url=true readylive)/actuator/health/readiness"
{"status":"UP"}
$ curl "$(minikube service --url=true readylive)/actuator/health/liveness"
{"status":"UP"}

We can tell Kubernetes to use these endpoints in our probes by simply replacing the paths used in the manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: readylive
  [...]
spec:
  [...]
  template:
    [...]
    spec:
      containers:
      - name: readylive
        image: readylive:v1
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          failureThreshold: 1
          periodSeconds: 1
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          failureThreshold: 1
          initialDelaySeconds: 8
          periodSeconds: 1

We also have to update the state reported by those health endpoints when /disable and /break are called. We will use the AvailabilityChangeEvent.publish method to communicate the readiness and liveness changes to Spring.
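
Roughly, the change amounts to publishing an availability event wherever we flip one of the flags. Here is a sketch of the updated controller (again illustrative rather than the exact repository code):

import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.LivenessState;
import org.springframework.boot.availability.ReadinessState;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class IdController {

    private final ApplicationEventPublisher eventPublisher;

    // Stand-in for the human-readable id generated at startup.
    private final String id = UUID.randomUUID().toString();

    private volatile boolean temporarilyBroken = false;
    private volatile boolean broken = false;

    public IdController(ApplicationEventPublisher eventPublisher) {
        this.eventPublisher = eventPublisher;
    }

    @GetMapping("/id")
    public String getId() {
        if (broken) {
            throw new IllegalStateException("Service broken");
        }
        if (temporarilyBroken) {
            return "Service temporarily disabled";
        }
        return id;
    }

    @GetMapping("/disable")
    public String disable() {
        temporarilyBroken = true;
        // Readiness goes down: the readiness probe starts failing and
        // Kubernetes stops routing traffic to this replica.
        AvailabilityChangeEvent.publish(eventPublisher, this, ReadinessState.REFUSING_TRAFFIC);
        Executors.newSingleThreadScheduledExecutor().schedule(() -> {
            temporarilyBroken = false;
            AvailabilityChangeEvent.publish(eventPublisher, this, ReadinessState.ACCEPTING_TRAFFIC);
        }, 4, TimeUnit.SECONDS);
        return "Disabled 4s";
    }

    @GetMapping("/break")
    public String breakService() {
        broken = true;
        // Liveness goes down: the liveness probe starts failing and
        // Kubernetes restarts the container.
        AvailabilityChangeEvent.publish(eventPublisher, this, LivenessState.BROKEN);
        return "Done. All broken now.";
    }
}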

These changes are applied in the repository under the configuration-3 tag. With them, the tests will behave as we want:

$ ./test.sh disable
stupid-cat-31 <- Both services getting traffic
stupid-cat-31
ugly-eel-70
stupid-cat-31
ugly-eel-70
stupid-cat-31
stupid-cat-31
stupid-cat-31
stupid-cat-31
stupid-cat-31
ugly-eel-70
stupid-cat-31
ugly-eel-70
ugly-eel-70
stupid-cat-31
stupid-cat-31

request to disable <- stupid-cat gets disabled
Disabled 4s

Service temporarily disabled
Service temporarily disabled
Service temporarily disabled
ugly-eel-70                       <- Readiness probe fails and
ugly-eel-70                          Kubernetes stops sending traffic
ugly-eel-70                          to stupid-cat
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
[...]
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
stupid-cat-31                     <- Readiness probe works again and
ugly-eel-70                          traffic is sent to both replicas
ugly-eel-70
stupid-cat-31
stupid-cat-31
ugly-eel-70
ugly-eel-70
ugly-eel-70
stupid-cat-31

$ ./test.sh break                 <- Both services getting traffic
stupid-cat-31
ugly-eel-70
ugly-eel-70
stupid-cat-31
stupid-cat-31
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70

request to break <- stupid-cat breaks
Done. All broken now.

{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
ugly-eel-70
{"error": ..., "status": 500,...}
ugly-eel-70
ugly-eel-70                       <- It seems that here Kubernetes restarted
curl: (7) Failed to connect          the service but still routed a request to it
ugly-eel-70                       <- All traffic routed to ugly-eel
ugly-eel-70                          while the other replica
ugly-eel-70                          restarts
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
[...] <- It takes a while
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
cold-badger-56                    <- Finally the container finishes
ugly-eel-70                          restarting and both replicas
cold-badger-56                       get traffic
ugly-eel-70
cold-badger-56
cold-badger-56
ugly-eel-70
cold-badger-56
cold-badger-56
ugly-eel-70
cold-badger-56
ugly-eel-70
