Ensuring System Integrity: The Right Path to Health Checks in Microservice Architecture

Have you ever wondered about the right approach for health checks in your application? Let’s explore an interesting perspective!

Akash Jaiswal
Cloud Native Daily
Jun 21, 2023

Imagine a microservice architecture with multiple services running in Kubernetes and having dependencies on each other.

🏢 In this oversimplified example, the services are directly dependent on one another, rather than being indirectly connected via a message bus or broker. While it may not be ideal, it’s not uncommon.

Let’s dive into the scenario:

📡 Each service has a liveness probe that verifies its connection to the services it depends on. 💥 Now, picture a temporary blip in the network connection between “Service X” and the database, causing a 30-second connectivity loss before it’s restored.

Here’s the twist:

⚠️ Within those 30 seconds, the liveness probe of Service X fails, and Kubernetes restarts it. 🔄 While Service X is restarting, Service Y cannot connect to it, so Service Y’s liveness probe fails and it is restarted too. The pattern continues, causing cascading failures across services, even though most of them don’t depend on the component that actually failed (the database).
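To make the chain reaction concrete, here is a minimal sketch of the scenario above. The service names and the dependency map are hypothetical, and the model is deliberately crude: it ignores timing and probe thresholds and simply assumes that a service whose liveness probe checks an unavailable dependency gets restarted, which makes it unavailable to its own callers.

```python
# Hypothetical dependency map: each service lists what its liveness probe checks.
DEPENDS_ON = {
    "service-x": ["database"],
    "service-y": ["service-x"],
    "service-z": ["service-y"],
}


def simulate_cascade(depends_on, initially_down):
    """Return the services restarted, in order, when every liveness probe
    fails as soon as one of its direct dependencies is unavailable."""
    unavailable = set(initially_down)
    restarted = []
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service in unavailable:
                continue
            if any(dep in unavailable for dep in deps):
                # The probe fails, Kubernetes restarts the service, and the
                # restart makes it unavailable to the services that call it.
                unavailable.add(service)
                restarted.append(service)
                changed = True
    return restarted


print(simulate_cascade(DEPENDS_ON, {"database"}))
# prints ['service-x', 'service-y', 'service-z']
```

Only service-x talks to the database, yet all three services end up restarted.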

🎯 There are two different approaches to designing health probes:

1️⃣ Smart probes aim to verify the application’s correct functionality, its ability to handle requests, and connect to dependencies like databases or message queues.

2️⃣ Dumb health checks indicate only whether the application has crashed. They focus on basic requirements, such as responding to an HTTP request, without checking dependency connections.
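The difference between the two styles can be sketched in a few lines. The function names and the shape of the dependency checks are my own assumptions for illustration; they are not taken from any particular framework.

```python
from typing import Callable, Dict


def dumb_check() -> bool:
    """Passes as long as the process is able to run this function at all."""
    return True


def smart_check(dependency_checks: Dict[str, Callable[[], bool]]) -> bool:
    """Passes only if every dependency check succeeds.

    `dependency_checks` maps a hypothetical dependency name to a callable
    that returns True when that dependency is reachable.
    """
    for check in dependency_checks.values():
        try:
            if not check():
                return False
        except Exception:
            # A check that raises counts as a failed dependency.
            return False
    return True
```

With a database outage, `smart_check({"database": lambda: False})` fails while `dumb_check()` keeps passing, which is exactly the behavior that matters when the probe result decides whether the app gets restarted.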

💡 Striking the right balance:

Here’s the approach I prefer:

Dumb liveness checks: Focus on determining whether the application is alive. Think of it as a “restart me now” flag. A condition belongs in the liveness probe only if restarting the app can fix it. For example, if Kestrel (the ASP.NET Core web server) can handle requests, the health check should pass.
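The Kestrel example is .NET-specific, but the idea carries over to any stack. As a rough sketch using only Python’s standard library: the endpoint answers 200 whenever the web server can serve a request and touches no dependency. The `/healthz` path, the port, and the `make_server` helper are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class LivenessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":  # hypothetical probe path
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the request log


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Build the server; call .serve_forever() on it from your entry point."""
    return HTTPServer((host, port), LivenessHandler)
```

If the process is deadlocked or has crashed, the request gets no answer and the probe fails. If the database is down, the probe still passes, because a restart would not bring the database back.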

Smart startup checks: During startup, perform due diligence for the application. Validate database or message bus connections and ensure the app’s configuration is valid. Startup is the best time for these checks, as configuration errors are common during deployment in Kubernetes.
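A startup check along these lines might look like the sketch below. The setting names, the retry counts, and the use of a plain TCP connection as the “can I reach the database” test are all assumptions; a real app would more likely open a connection with its database driver.

```python
import socket
import time
from typing import Callable, List, Mapping, Sequence

# Hypothetical setting names, used for illustration only.
REQUIRED_SETTINGS = ("DATABASE_HOST", "DATABASE_PORT")


def validate_config(settings: Mapping[str, str],
                    required: Sequence[str] = REQUIRED_SETTINGS) -> List[str]:
    """Return the names of required settings that are missing or empty."""
    return [name for name in required if not settings.get(name)]


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_startup_checks(settings: Mapping[str, str],
                       connect: Callable[[str, int], bool] = can_connect,
                       attempts: int = 5,
                       delay: float = 1.0) -> bool:
    """Fail fast on bad configuration, then retry the dependency check."""
    missing = validate_config(settings)
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    host, port = settings["DATABASE_HOST"], int(settings["DATABASE_PORT"])
    for attempt in range(attempts):
        if connect(host, port):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the dependency time to come up
    return False
```

The app would run `run_startup_checks(os.environ)` once while booting and let its startup probe pass only after it returns True. A configuration mistake then surfaces at deployment time rather than as a restart loop later.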

🚦 Regarding readiness checks, it’s a bit more complex. In most cases, I struggle to find scenarios where the application is alive and handling requests (as checked by liveness probes), has completed startup checks (as verified by startup probes), but shouldn’t receive traffic (as indicated by readiness probes). One possible situation could be an overloaded app that needs time to process requests, but it’s not something I’ve encountered often.

⚠️ Checking dependencies in readiness probes can lead to cascading failures. Likewise, taking apps out of circulation based on CPU utilization or RPS can be fragile. Moreover, readiness probes run throughout the application’s lifetime, so they shouldn’t add unnecessary load to the app itself.
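If you do keep a readiness probe, one way to make it cheap is to have it read an in-memory flag that the app sets itself, so the probe does no I/O and calls no dependency. This is a sketch of that idea, not a recommendation from any framework; the `ReadinessState` class and its method names are hypothetical.

```python
import threading


class ReadinessState:
    """In-memory readiness flag; reading it does no I/O."""

    def __init__(self) -> None:
        self._ready = threading.Event()  # starts out "not ready"

    def mark_ready(self) -> None:
        self._ready.set()

    def mark_not_ready(self) -> None:
        self._ready.clear()

    def status_code(self) -> int:
        """HTTP status a readiness endpoint could return."""
        return 200 if self._ready.is_set() else 503
```

The app decides when to flip the flag, for example calling `mark_not_ready()` when it begins a graceful shutdown. The probe itself costs one flag read per call.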

+-------------------+
|  Kubernetes Pod   |
+-------------------+
|                   |
|    Containers     |
|    +---------+    |
|    |         |    |
|    |   App   |    |
|    |         |    |
|    +---------+    |
|                   |
+-------------------+
          |
          |
          v
+-------------------+
|   Health Checks   |
+-------------------+
|                   |
|  Liveness Probe   |
|  +-------------+  |
|  |             |  |
|  |     App     |  |
|  |             |  |
|  +-------------+  |
|                   |
|  Readiness Probe  |
|  +-------------+  |
|  |             |  |
|  |     App     |  |
|  |             |  |
|  +-------------+  |
|                   |
|   Startup Probe   |
|  +-------------+  |
|  |             |  |
|  |     App     |  |
|  |             |  |
|  +-------------+  |
|                   |
+-------------------+

In a nutshell:

  1. Liveness Probes: Keep it simple and focus on determining if your application is alive and can handle requests. This helps detect crashes and trigger necessary restarts.
  2. Startup Probes: Take a smarter approach during initialization. Verify essential dependencies like database connections and configuration validity. Thorough checks at startup prevent configuration errors from causing issues later on.
  3. Readiness Probes: Here’s the tricky part. These run throughout the application’s lifetime and decide whether it receives traffic, so keep them cheap. Avoid dependency checks to prevent problems from spreading. The need for readiness checks varies, so I’d love to hear your experiences!

By using simple liveness checks and comprehensive startup probes, we strike a balance between failure detection and preventing cascading issues in microservice architectures.

🤝 I’m curious to hear your thoughts on readiness checks beyond what we have discussed above!
