Kubernetes liveness and readiness probes with Spring Boot

Fernando González Cortés
aiincube-engineering
11 min read · May 12, 2020

One of the great features of Kubernetes is that it can make your setup adapt to the current situation in real time. For example, it can automatically scale the number of pods, it can retry jobs until success is achieved, it can detect if a container is not working properly and avoid sending traffic to it, etc.

In this post we will focus on the last example: how can Kubernetes detect if a container is working properly and what can it do if it is not. We will run a Spring Boot application in the container and we will also see how it can communicate the failures to Kubernetes.

Disclaimer: at the time of writing, the necessary Spring Boot functionality has not yet been released. We will use a development version, a SNAPSHOT of 2.3.0.

The aim is to get something like what is shown in the next screencast:

What we see in the beginning is Kubernetes load balancing the requests between two replicas. At some point one of the replicas gets disabled; Kubernetes notices it and sends all the traffic to the other replica. After 4 seconds, the disabled replica becomes ready again and Kubernetes resumes routing traffic to it.

In the rest of the post I will show how this is done. We will test the three following configurations:

  1. No application-Kubernetes communication. Kubernetes does not notice the service readiness/liveness.
  2. Probes on the API endpoint. Kubernetes notices the service liveness but not its readiness.
  3. Spring Boot Actuator + Kubernetes probes. Kubernetes notices both the service readiness and liveness and reacts accordingly.

Our example application

The code used in this post can be downloaded from this GitHub repository which contains tags for the different configurations that we are going to try. You can take a look at the code under the “configuration-1” tag:

git clone git@github.com:parknav/blog-kubernetes-boot-readylive.git
cd blog-kubernetes-boot-readylive/
git checkout configuration-1

We are using a Spring Boot web application that just outputs an identifier at the /id endpoint. Note that there are two ifs, which we will use to make the application fail artificially. The returned id is created at startup and does not change.
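A minimal sketch of the controller, to make the discussion concrete (this is not the exact repository code; in particular, the real application generates the id with the human-readable-ids library mentioned below, while a plain UUID stands in here):

import java.util.UUID;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class IdController {

    // Created once at startup; it does not change for the lifetime of the instance.
    private final String id = UUID.randomUUID().toString();

    private volatile boolean temporarilyBroken = false;
    private volatile boolean broken = false;

    @GetMapping("/id")
    public String getId() {
        if (broken) {
            // Permanent failure: the request fails with an exception (500).
            throw new IllegalStateException("Service broken");
        }
        if (temporarilyBroken) {
            // Temporary failure: the endpoint still answers with a 200, but no id.
            return "Service temporarily disabled";
        }
        return id;
    }
}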

We use https://github.com/kkuegler/human-readable-ids-java in order to generate human-readable ids.

In addition to this endpoint we will have a couple of endpoints that make the system artificially fail (a sketch of both handlers follows the list):

  • /disable: breaks the /id endpoint for 4 seconds. When the temporarilyBroken flag is set to true, the /id endpoint does not return the id but an error message instead.
  • /break: makes all subsequent requests fail. When the broken flag is set to true, the /id endpoint fails with an exception.
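
Continuing the sketch above, the two handlers could look like this (the mappings, method names and the 4-second re-enable scheduling are illustrative; the return messages match the test output shown later):

import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// ... inside the same IdController class sketched above ...

    @GetMapping("/disable")
    public String disable() {
        temporarilyBroken = true;
        // Re-enable the /id endpoint after 4 seconds.
        Executors.newSingleThreadScheduledExecutor()
                 .schedule(() -> temporarilyBroken = false, 4, TimeUnit.SECONDS);
        return "Disabled 4s";
    }

    @GetMapping("/break")
    public String breakService() {
        broken = true;
        return "Done. All broken now.";
    }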

Deploying in Minikube

We will use Minikube as our Kubernetes cluster. Let's start it:

$ minikube start

Now we need to create a container image for our application. At Ai Incube we use Jib a lot. Jib configuration is quite straightforward. In build.gradle:

jib {
    to {
        image = 'readylive'
        tags = ['v1']
    }
}

Normally we would just execute gradle jib, but in this case we want to communicate with the Docker daemon running in the Minikube instance. For that we need to execute:

$ eval $(minikube docker-env)

Afterwards we can invoke:

$ gradle jibDockerBuild

This will make the image available in the Minikube cluster.

Kubernetes configuration

Kubernetes configuration can be found in the k8s.yaml file. It consists of a Deployment with two replicas and a Service load-balancing the requests. The fact that we have two replicas will help us see how Kubernetes routes traffic when one replica is not ready:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: readylive
  labels:
    app: readylive
spec:
  replicas: 2
  selector:
    matchLabels:
      app: readylive
  template:
    metadata:
      labels:
        app: readylive
    spec:
      containers:
      - name: readylive
        image: readylive:v1
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: readylive
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: readylive
  type: LoadBalancer

We apply the configuration to Minikube with this command:

$ kubectl apply -f k8s.yaml

After which we should be able to see a couple of pods running:

$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
readylive-594bcdcc9d-t5twr   1/1     Running   0          6s
readylive-594bcdcc9d-w7x77   1/1     Running   0          6s

Testing

The test.sh script will help us to test the application. It performs the following steps:

  1. Query the /id endpoint for 2 seconds, with a 0.1 s pause between requests.
  2. Send one request to either /disable or /break (depending on the parameter passed).
  3. Query the /id endpoint for 20 seconds, with a 0.1 s pause between requests.

For example, if we call ./test.sh disable we will get this output (the output of the tests is often trimmed to avoid too much redundant information):

$ ./test.sh disable
ordinary-wolverine-60 <- Both services getting traffic
cold-dolphin-42
ordinary-wolverine-60
ordinary-wolverine-60
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
request to disable <- ordinary-wolverine gets disabled
Disabled 4s
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled      <- Kubernetes keeps sending traffic
Service temporarily disabled         to the disabled replica
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
Service temporarily disabled
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
cold-dolphin-42
Service temporarily disabled
cold-dolphin-42
Service temporarily disabled
cold-dolphin-42
cold-dolphin-42
Service temporarily disabled
Service temporarily disabled
ordinary-wolverine-60             <- The replica got enabled again.
ordinary-wolverine-60                All back to normal.
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
ordinary-wolverine-60
cold-dolphin-42
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42

If we call ./test.sh break, that is, in the case of a persistent failure, the result is similar, except that the failing replica will not recover:

$ ./test.sh break
ordinary-wolverine-60 <- Both services getting traffic
cold-dolphin-42
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42
ordinary-wolverine-60
cold-dolphin-42
cold-dolphin-42
cold-dolphin-42

request to break <- cold-dolphin breaks
Done. All broken now.

ordinary-wolverine-60             <- Requests routed to the broken
{"error": ..., "status": 500,...}    replica get a 500 status code
{"error": ..., "status": 500,...}
ordinary-wolverine-60
ordinary-wolverine-60
ordinary-wolverine-60
ordinary-wolverine-60
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
ordinary-wolverine-60
{"error": ..., "status": 500,...}
ordinary-wolverine-60
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
ordinary-wolverine-60
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
ordinary-wolverine-60
ordinary-wolverine-60

In these examples Kubernetes didn’t react to the fact that a service was temporarily or permanently malfunctioning. How can we make Kubernetes notice?

Liveness and readiness probes

The Kubernetes official documentation explains very well what liveness and readiness probes are for:

  • The kubelet uses liveness probes to know when to restart a container.
  • The kubelet uses readiness probes to know when a container is ready to accept traffic.

Essentially, liveness and readiness probes tell Kubernetes when to restart or send traffic to a container.

The code under the configuration-2 tag sets up these probes, which consists of adding the readinessProbe and livenessProbe sections to the Kubernetes manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: readylive
  [...]
spec:
  [...]
  template:
    [...]
    spec:
      containers:
      - name: readylive
        image: readylive:v1
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /id
            port: 8080
          failureThreshold: 1
          periodSeconds: 1
        livenessProbe:
          httpGet:
            path: /id
            port: 8080
          failureThreshold: 1
          initialDelaySeconds: 8
          periodSeconds: 1

If a probe gets a single failure (failureThreshold is 1), it will trigger an action:

  • The readiness probe will make Kubernetes stop routing traffic to the concerned replica.
  • The liveness probe will make Kubernetes restart the container.

The readiness probe will query the /id endpoint every second, and the container will be marked ready while the probe succeeds and not ready otherwise.

The case of the liveness probe is a bit different. We need to add an initial delay of 8 seconds (initialDelaySeconds) before starting the probes, in order to let the Spring Boot application start. Otherwise the container will enter the following crash loop:

  1. Container is created
  2. 1 second later the liveness probe checks the /id endpoint, which is not ready yet.
  3. Liveness probe fails and the container is restarted.
  4. Go to 1

If we run the tests with these two probes we will see the liveness probe working well and restarting the application when the /id endpoint fails. However, when we disable a replica no probe will fail (/id still returns 200 status codes) and Kubernetes will go on sending traffic to that replica.

We can redeploy this manifest with kubectl apply -f k8s.yaml and execute the previous tests.

$ ./test.sh disable
great-dingo-86 <- Both services getting traffic
tricky-rattlesnake-22
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
great-dingo-86
great-dingo-86
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
great-dingo-86

request to disable <- great-dingo gets disabled
Disabled 4s

Service temporarily disabled      <- Kubernetes still sends traffic to
tricky-rattlesnake-22                the disabled replica because the
tricky-rattlesnake-22                status code is 200
tricky-rattlesnake-22
Service temporarily disabled
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
Service temporarily disabled
[...]
Service temporarily disabled
tricky-rattlesnake-22
Service temporarily disabled
Service temporarily disabled
tricky-rattlesnake-22
tricky-rattlesnake-22
Service temporarily disabled
tricky-rattlesnake-22
tricky-rattlesnake-22
Service temporarily disabled
great-dingo-86 <- great-dingo is back
great-dingo-86
great-dingo-86
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
great-dingo-86
great-dingo-86

Hitting the /break endpoint will work as expected, because it makes the /id endpoint fail, which in turn causes the liveness probe to fail and triggers a restart of the container.

$ ./test.sh break                 <- Both services getting traffic
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
great-dingo-86
great-dingo-86
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
tricky-rattlesnake-22
great-dingo-86
request to break <- great-dingo breaks.
Done. All broken now.
{"error": ..., "status": 500,...}
curl: (7) Failed to connect       <- Kubernetes restarts great-dingo around here
tricky-rattlesnake-22             <- All traffic routed to tricky-
tricky-rattlesnake-22                rattlesnake while the other
tricky-rattlesnake-22                replica restarts
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
[...] <- It takes a while to restart
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tricky-rattlesnake-22
tasty-badger-96                   <- Finally the container finishes
tasty-badger-96                      restarting and both replicas
tricky-rattlesnake-22                get traffic
tasty-badger-96
tasty-badger-96
tricky-rattlesnake-22
tasty-badger-96
tricky-rattlesnake-22
tricky-rattlesnake-22

How could we prevent Kubernetes from routing traffic to the replica during those 4 seconds? We could try to make the endpoint fail when we disable it, so that it returns a 500 status code and makes the readiness probe fail. However, that would also make the liveness probe fail and cause a restart of the container.

Indeed, the problem is that we have one endpoint doing many things: serving API requests, indicating readiness and indicating liveness. In real life, though, we may want to avoid probing the API endpoint because it may have undesirable side effects, it may be expensive, or we may not want the probe requests in our logs.

We therefore need an API endpoint and two separate endpoints for the probes. Spring Boot Actuator provides the latter out of the box.

Using Actuators

Among other things, Spring Boot Actuator installs a /health endpoint, which shows whether the system is working properly or not. Adding the dependency in build.gradle is enough to get the /health endpoint working:

dependencies {
    [...]
    implementation 'org.springframework.boot:spring-boot-starter-actuator'
    [...]
}

Note that if you run the application in a non-Kubernetes environment you will just get a single piece of information:

$ curl http://localhost:8080/actuator/health
{"status":"UP"}

But when the same endpoint is reached in Minikube we get some more output:

$ gradle jibDockerBuild
$ kubectl delete pod -lapp=readylive
$ curl "$(minikube service --url=true readylive)/actuator/health"
{"status":"UP","groups":["liveness","readiness"]}

And we have a couple of new endpoints:

$ curl "$(minikube service --url=true readylive)/actuator/health/readiness"
{"status":"UP"}
$ curl "$(minikube service --url=true readylive)/actuator/health/liveness"
{"status":"UP"}

We can tell Kubernetes to use these endpoints in our probes by simply replacing the paths used in the manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: readylive
  [...]
spec:
  [...]
  template:
    [...]
    spec:
      containers:
      - name: readylive
        image: readylive:v1
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          failureThreshold: 1
          periodSeconds: 1
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          failureThreshold: 1
          initialDelaySeconds: 8
          periodSeconds: 1

We also have to update the state reported by those health endpoints when /disable and /break are called. We will use the AvailabilityChangeEvent.publish method to communicate the readiness and liveness changes to Spring.
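
Roughly, the change amounts to publishing an availability event wherever we flip one of the flags. Here is a sketch of the updated controller (again illustrative rather than the exact repository code):

import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.LivenessState;
import org.springframework.boot.availability.ReadinessState;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class IdController {

    private final ApplicationEventPublisher eventPublisher;

    // Stand-in for the human-readable id generated at startup.
    private final String id = UUID.randomUUID().toString();

    private volatile boolean temporarilyBroken = false;
    private volatile boolean broken = false;

    public IdController(ApplicationEventPublisher eventPublisher) {
        this.eventPublisher = eventPublisher;
    }

    @GetMapping("/id")
    public String getId() {
        if (broken) {
            throw new IllegalStateException("Service broken");
        }
        if (temporarilyBroken) {
            return "Service temporarily disabled";
        }
        return id;
    }

    @GetMapping("/disable")
    public String disable() {
        temporarilyBroken = true;
        // Readiness goes down: the readiness probe starts failing and
        // Kubernetes stops routing traffic to this replica.
        AvailabilityChangeEvent.publish(eventPublisher, this, ReadinessState.REFUSING_TRAFFIC);
        Executors.newSingleThreadScheduledExecutor().schedule(() -> {
            temporarilyBroken = false;
            AvailabilityChangeEvent.publish(eventPublisher, this, ReadinessState.ACCEPTING_TRAFFIC);
        }, 4, TimeUnit.SECONDS);
        return "Disabled 4s";
    }

    @GetMapping("/break")
    public String breakService() {
        broken = true;
        // Liveness goes down: the liveness probe starts failing and
        // Kubernetes restarts the container.
        AvailabilityChangeEvent.publish(eventPublisher, this, LivenessState.BROKEN);
        return "Done. All broken now.";
    }
}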

These changes are applied in the repository under the configuration-3 tag. With them, the tests will behave as we want:

$ ./test.sh disable
stupid-cat-31 <- Both services getting traffic
stupid-cat-31
ugly-eel-70
stupid-cat-31
ugly-eel-70
stupid-cat-31
stupid-cat-31
stupid-cat-31
stupid-cat-31
stupid-cat-31
ugly-eel-70
stupid-cat-31
ugly-eel-70
ugly-eel-70
stupid-cat-31
stupid-cat-31

request to disable <- stupid-cat gets disabled
Disabled 4s

Service temporarily disabled
Service temporarily disabled
Service temporarily disabled
ugly-eel-70                       <- Readiness probe fails and
ugly-eel-70                          Kubernetes stops sending traffic
ugly-eel-70                          to stupid-cat
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
[...]
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
stupid-cat-31                     <- Readiness probe works again and
ugly-eel-70                          traffic is sent to both replicas
ugly-eel-70
stupid-cat-31
stupid-cat-31
ugly-eel-70
ugly-eel-70
ugly-eel-70
stupid-cat-31

$ ./test.sh break                 <- Both services getting traffic
stupid-cat-31
ugly-eel-70
ugly-eel-70
stupid-cat-31
stupid-cat-31
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70

request to break <- stupid-cat breaks
Done. All broken now.

{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
{"error": ..., "status": 500,...}
ugly-eel-70
{"error": ..., "status": 500,...}
ugly-eel-70
ugly-eel-70                       <- It seems that here Kubernetes restarted
curl: (7) Failed to connect          the service but still routed a request to it
ugly-eel-70                       <- All traffic routed to ugly-eel
ugly-eel-70                          while the other replica
ugly-eel-70                          restarts
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
[...] <- It takes a while
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
ugly-eel-70
cold-badger-56                    <- Finally the container finishes
ugly-eel-70                          restarting and both replicas
cold-badger-56                       get traffic
ugly-eel-70
cold-badger-56
cold-badger-56
ugly-eel-70
cold-badger-56
cold-badger-56
ugly-eel-70
cold-badger-56
ugly-eel-70
