K8S probes done wrong

Julio Renner
6 min read · Mar 25, 2024

Probes in K8S are health checks that let the system know if an instance of your application is working as expected. Based on the result of the check, K8S can intervene and take certain actions. However, improper use of probes can cause more harm than good.

Three probes may be configured in a K8S application:

  • Startup Probe: Indicates whether the application within the container has started.
  • Readiness Probe: Determines whether a container is ready to start accepting traffic.
  • Liveness Probe: Determines whether a container is running properly.

The probes are configured at the container level. This means that in a Pod with two containers, each container can have its own set of probes. By default, probes have a failureThreshold of 3, meaning a check must fail three consecutive times before any action is taken.
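As a sketch of how this looks in a Pod spec (the image, endpoint paths, port, and timing values below are hypothetical, not from any real application):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app            # hypothetical name
spec:
  containers:
    - name: web
      image: demo/web:1.0   # hypothetical image
      readinessProbe:
        httpGet:
          path: /readyz     # assumed endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 3 # the default: 3 consecutive failures before action
      livenessProbe:
        httpGet:
          path: /healthz    # assumed endpoint, distinct from readiness
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
```

Note that each probe lives under its container entry, which is why two containers in the same Pod can have entirely different checks.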

Startup Probe

If configured, a startup probe suspends liveness and readiness checks until successful, allowing sufficient time for the application to initiate before the other probes kick off. If the startup probe fails above a defined threshold, the kubelet kills the container, and the container is subjected to its restart policy. If the startup probe is not configured, its state is set to Success by default.

When configured, the Startup Probe flow will look like the following:

Startup Probe Flow

Configuring initialDelaySeconds for the readiness and liveness probes can achieve a similar result; however, startup probes are more precise, since you do not have to guess how long the application takes to start. For more details on probe configuration, including initialDelaySeconds, check the K8S documentation.
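A startup probe sketch could look like the following (the endpoint, port, and thresholds are assumed values for illustration):

```yaml
startupProbe:
  httpGet:
    path: /healthz       # assumed endpoint
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 30 * 10s = 300s for the app to start
```

While this probe has not yet succeeded, the kubelet suspends the liveness and readiness checks; if it fails 30 times in a row, the container is killed and handled according to its restart policy.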

Readiness Probe

A readiness probe is used to know when a container is ready to accept traffic. When a Pod fails the readiness probe, it is considered not ready and is removed from service load balancers, meaning that traffic is no longer forwarded to it. It is important to clarify that a container is never killed due to readiness probe failures.

The Readiness Probe Flow image describes the actions taken by K8S according to the probe result:

Readiness Probe Flow

Now, the image below demonstrates the Readiness Probe in action. The scenario is the following:

  1. The probe is successful and the Service forwards requests to the application.
  2. The probe fails, but the failure count is still within the accepted threshold. Requests are still forwarded to the instance and the probe is retried.
  3. The probe fails above the accepted threshold. The instance no longer receives requests.
  4. Once the probe succeeds, the instance starts receiving traffic again.
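The timing of the steps above is governed by the probe's fields. A sketch with assumed values (endpoint and port are hypothetical):

```yaml
readinessProbe:
  httpGet:
    path: /readyz        # assumed endpoint
    port: 8080
  periodSeconds: 5       # retry interval between steps 2 and 3
  failureThreshold: 3    # step 3: removed from endpoints after 3 consecutive failures
  successThreshold: 1    # step 4: a single success puts the Pod back in rotation
```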

Liveness Probe

The liveness probe indicates whether the container is running properly. If the liveness probe fails, the container is killed and subjected to its restart policy. If the process in the container is able to crash on its own whenever it encounters an issue or becomes unhealthy, a liveness probe may not be needed. A typical use case for a liveness probe is a deadlock, where the container is running but unable to respond to requests.
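A liveness probe sketch for the deadlock case (the endpoint, port, and timing are assumptions):

```yaml
livenessProbe:
  httpGet:
    path: /healthz       # assumed endpoint; should check only in-process health
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 1      # a deadlocked process won't answer, so the probe times out
  failureThreshold: 3    # container is restarted after 3 consecutive failures
```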

Liveness Probe Flow

Let's check two different scenarios, focused on liveness and readiness probes.

Scenario 1:

  • Application A is a web server that also consumes messages from a Kafka topic.
  • The Readiness probe of Application A takes into account the state of the Kafka connection.
  • Every time that a new consumer connects to a Kafka Topic, the consumer group is rebalanced, making the connection check temporarily unhealthy. A new consumer in this scenario is a new instance of Application A.

Since Application A is both a web server and a Kafka consumer, every time the Kafka connection is unhealthy the web server portion of the application becomes unavailable. Each scaling activity (up or down) triggers a new Kafka rebalance, the connection becomes unhealthy, and the application is unable to receive requests, causing downtime. The connection eventually recovers, but by then customers have already been impacted.

In the image below you can see that:

  1. There are two instances of the application Ready and Receiving requests.
  2. A new instance is added by K8S due to a scaling activity.
  3. The existing instances' Readiness Probes start failing due to the Kafka consumer group rebalance. Since the failures haven't gone above the threshold yet, requests are still forwarded to them.
  4. All instances become NOT Ready and K8S stops forwarding traffic to the application.
  5. Eventually, the Kafka consumer group rebalance finishes and the Readiness Probe succeeds; traffic is then forwarded to the application again.

One additional and relevant piece of information for this hypothetical scenario 😜: the decision to monitor the Kafka connection state was made consciously, to solve frequent issues during Kafka consumer group rebalances, where deadlocks were encountered and the application had to be restarted to recover. However, that decision was intended for the liveness probe, not the readiness probe. Another mistake made here is that the same endpoint was reused for both readiness and liveness probes, meaning the check was propagated to the readiness probe unintentionally.
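The misconfiguration described above might look like this (endpoint and port are hypothetical). Both probes point at the same dependency-aware endpoint, so a Kafka rebalance fails both checks at once:

```yaml
readinessProbe:
  httpGet:
    path: /health   # same endpoint reused: it also checks the Kafka connection
    port: 8080
livenessProbe:
  httpGet:
    path: /health   # a failing dependency now triggers readiness removal AND restarts
    port: 8080
```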

Scenario 2:

  • Application B is a web server that communicates with an external dependency through RabbitMQ.
  • The liveness probe of Application B considers the queue size of RabbitMQ in its check. If above a threshold, it fails.

Once the queue size goes above the threshold for any reason (let's say there is indeed an issue in the consumption of the messages), the application enters a state of consecutive restarts. At some points, all instances could be unavailable simultaneously, leading to sporadic downtime and performance degradation. The liveness probe just made the existing issue worse.

In the image below you can see the instability in the application caused by improper usage of the liveness probe:

Lessons learned

  1. Understand properly how probes behave before setting them up. They can make your problems worse.
  2. Avoid checking application dependencies in Liveness/Readiness probes. Instead, have specific monitoring for them.
  3. Be careful when reusing the same endpoint for multiple probes. Liveness and Readiness probes behave differently, and developers changing the endpoint may not realize they are affecting both at once.

To identify what to check in each endpoint, ask yourself the following questions:

  • Startup Probe: What do I need to check to confirm the application initialization is finished?
  • Readiness Probe: What do I need to check to confirm the application is ready to respond to requests?
  • Liveness Probe: What do I need to check to confirm that the application needs to be restarted?
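One way to keep the answers to these questions separate is to back each probe with its own endpoint and its own check logic. A minimal sketch in Python, using only the standard library (the endpoint names, port, and check conditions are hypothetical):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

# In-process state the checks rely on (hypothetical flags for illustration).
startup_done = threading.Event()   # set once initialization finishes
healthy = threading.Event()        # cleared if the process deadlocks/hangs
healthy.set()

def check_ready() -> bool:
    """Readiness: can we serve requests right now?"""
    return startup_done.is_set() and healthy.is_set()

def check_alive() -> bool:
    """Liveness: does the process need a restart?
    Deliberately ignores external dependencies (Kafka, RabbitMQ, ...)."""
    return healthy.is_set()

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        checks = {"/readyz": check_ready, "/healthz": check_alive}
        check = checks.get(self.path)
        if check is None:
            self.send_response(404)
        elif check():
            self.send_response(200)
        else:
            self.send_response(503)
        self.end_headers()

    def log_message(self, *args):  # keep probe traffic out of the logs
        pass

if __name__ == "__main__":
    startup_done.set()  # pretend initialization finished
    HTTPServer(("", 8080), ProbeHandler).serve_forever()
```

With this split, a dependency outage can take the instance out of rotation (readiness) without ever triggering a restart (liveness), avoiding both scenarios described above.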

Hope it helps; feedback is always welcome :) I'm also curious to hear about additional scenarios in which probes caused more harm than good.

References

  1. https://www.padok.fr/en/blog/kubernetes-probes#Liveness_probe
  2. https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
  3. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
  4. https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html
  5. https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-setting-up-health-checks-with-readiness-and-liveness-probes
