Service Probes

Zlatin Stanimirov
DraftKings Engineering
Sep 16, 2024

Overview

In a distributed system, an application should serve traffic only when it is ready to process it. Trying to serve traffic before initialization has completed will introduce errors and timeouts. Failing to detect when a service is down also leads to a degraded service offering.

At DraftKings, the user experience is a top priority. We heavily use microservices, as their elastic and distributed nature allows us to scale and deploy services independently, leading to a more responsive, consistent, and immersive user experience. We have thousands of instances of microservices, which we host mostly on Kubernetes. To verify the state of an application, Kubernetes uses probes.

Types of probes

Kubernetes has three probes:

  1. Liveness Probe: Determines if a container is running. If the liveness probe fails, the kubelet service kills the container, and the container is subjected to its restart policy.
  2. Readiness Probe: Determines if a container is ready to start accepting traffic. When the readiness probe succeeds, the container will start to receive traffic.
  3. Startup Probe: Determines whether the application within the container has started. This is useful for slow-starting containers, so that Kubernetes does not kill them just because they take longer to start. Startup probes are typically used when the container relies on a slow-starting runtime; some machine learning libraries, for example, must load gigabytes of data into memory before they can initialize. The readiness probe, on the other hand, signifies that all resources owned by the service have been successfully initialized and connected.

In practice, the startup probe is used less frequently because the readiness probe can usually cover its role, whereas the other two are important for any service and, in many cases, a must-have.

A probe may be implemented in the following ways:

  • HTTP GET: Performs an HTTP GET request against the container's IP address on a specified port and path. The container is considered healthy if the response has a status code between 200 and 399.
  • Exec: Executes a specified command inside the container. The container is considered healthy if the command exits with a status code of 0.
  • TCP Socket: Tries to open a TCP connection to the specified port of the container. The container is considered healthy if the port is accessible.

HTTP GET is the most commonly used health check by service discovery and inventory tools (like Consul).
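
As a rough sketch of the syntax, the three mechanisms look like this inside a container spec. The paths, port, command, and file name are illustrative, and any probe type can use any of the mechanisms; they are combined here only to show them side by side:

livenessProbe:
  httpGet:            # HTTP GET: healthy if the status code is between 200 and 399
    path: /healthz
    port: 8080
readinessProbe:
  exec:               # Exec: healthy if the command exits with status code 0
    command:
    - cat
    - /tmp/ready
startupProbe:
  tcpSocket:          # TCP Socket: healthy if the port accepts a connection
    port: 8080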

Startup Probes

Startup probes are used when the service needs to perform specific initialization of its own. In the world of data science, for example, a machine learning model must be loaded into memory; many models are gigabytes in size and take time to bootstrap. This is where a startup probe comes in handy. Once the startup probe succeeds, control transitions to the liveness and readiness probes.

If a service starts up in a couple of seconds, it may not need a startup probe.
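
As an illustration, a startup probe for such a model-loading service might look like the sketch below. The endpoint and the numbers are assumptions, chosen so the container gets failureThreshold × periodSeconds = 30 × 10 = 300 seconds to finish loading before Kubernetes kills it:

startupProbe:
  httpGet:
    path: /startup       # assumed endpoint that returns 200 once the model is in memory
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 × 10s = up to 5 minutes allowed for startup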

Readiness Probes

A service is ready when it can successfully process requests. This means establishing all the connections to resources that the service owns and requires for its successful operation.

There is a debate about dependencies the service does not own, such as other services it calls: should those also be up before the readiness probe reports success?

Naively, it may seem that all downstream dependencies must be up, but this makes the readiness probe very fragile.

Those direct dependencies have dependencies of their own, which may also be down; suddenly, if any one service goes down, it takes a large section of the system down with it.

In rare instances, a service might rely on a downstream dependency for its core functionality, and then it may make sense to include that dependency in the probe as well. The general guideline, however, is that connections must be established successfully to all the resources the service owns, and that downstream dependencies should be ignored. Alternatively, a circuit breaker may be used; the topic is debated, and there is no consensus.

If a startup probe is not used, any required data must also be preloaded before the readiness probe reports success. If the service needs less volatile response times, caches and in-memory state should also be initialized as part of readiness.
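
As a sketch of that guideline, a readiness probe would point at an endpoint that checks only what the service owns. The endpoint name and the listed resources are assumptions for illustration:

# The /ready endpoint (assumed for this sketch) returns success only when the
# resources the service owns are initialized: the database connection pool is
# established and caches / in-memory state are preloaded.
# Downstream services are deliberately not checked here.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3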

Liveness Probes

A liveness probe should fail only when the sole remaining resolution is to restart the service, that is, when the service is in a non-recoverable state.

Liveness probes are more straightforward to implement, as non-recoverable problems can be expressed as simply as "Can the service return HTTP 200 and successfully answer a request?" For non-web interfaces, the service often proactively updates a file, and the probe checks its last "updated at" timestamp.
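
For the file-based approach, the check can be an exec probe that compares the heartbeat file's age against a threshold. This is a minimal sketch assuming a Linux container image and an application that touches /tmp/heartbeat periodically; the path and the 60-second threshold are illustrative:

livenessProbe:
  exec:
    command:
    - sh
    - -c
    # succeed only if /tmp/heartbeat was modified within the last 60 seconds
    - 'test $(( $(date +%s) - $(stat -c %Y /tmp/heartbeat) )) -lt 60'
  periodSeconds: 30
  failureThreshold: 3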

If a dependency is down, the service should remain up as long as it can still serve some of its requests.

If the service's database goes down, the better customer experience is most often to precisely communicate to the client what the issue is. In some use cases, clients can compensate for dependencies going down temporarily.

Example for all 3 probes

Let us suppose there is a service that performs complicated computations. Some of the results are stored in a database, and computations are requested through an HTTP interface.

Such a service would have the following startup sequence:

  1. The process is created
  2. A very large file must be loaded into memory, as it holds the weights used for the computations
  3. The interface of the service is HTTP, so it must spin up an HTTP daemon that will accept requests for calculations
  4. The database connection is opened, which finalizes the initialization

The startup probe will succeed after the weights file is loaded into memory.

The readiness probe will succeed after the database connection is established. Until then, any requests that use the database would fail or be slowed down; the readiness probe prevents this by keeping traffic away until the service is ready.

The liveness probe will test whether the HTTP daemon is alive. Even if the database connection is temporarily lost, some requests can still be served successfully.

Let us suppose the following Kubernetes configuration:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: example-container
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 3
      timeoutSeconds: 1
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 3
    startupProbe:
      httpGet:
        path: /startup
        port: 80
      initialDelaySeconds: 0
      periodSeconds: 15
      timeoutSeconds: 5
      failureThreshold: 10

There are a few things that can be noticed:

  • The startup probe is the most relaxed one, as it is checked only until it first registers a success. As mentioned previously, it is used for slow-starting processes. Once the check has passed, the startup probe won't be tested again.
  • The readiness probe still allows a fairly high number of retries, but the expectation is that it will answer with success or failure quickly, as the service is already running and is only waiting for connections to its resources to be established.
  • The strictest probe is the liveness probe. When the liveness probe fails, the service most likely won't recover, and the only way to restore it is to recycle the pod. Since networks can be volatile, some retries are still allowed, but its checks are the strictest of the three Kubernetes probes. The time budgets these settings imply are worked out below.
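
To put numbers on these settings: the startup probe allows up to failureThreshold × periodSeconds = 10 × 15 = 150 seconds for the service to finish starting before the pod is killed; the readiness probe waits 10 seconds and then withdraws traffic after 3 consecutive failures 5 seconds apart, roughly 15 seconds; the liveness probe waits 5 seconds and then restarts the container after 3 consecutive failures 3 seconds apart, roughly 9 seconds (each individual check timing out after 1 second).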

Probes are not a replacement for monitoring!

Although probes can be used in more advanced ways—for example, returning objects that have diagnostics or chaining several checks into a single probe—they aren't a replacement for monitoring.

Probes exist with a specific intent and scope. Overcomplicating probes or adding responsibilities beyond that scope brings the danger of a probe not working correctly in some cases and returning false positives or false negatives. Both of these scenarios are not only undesirable; they are exactly the problems probes are supposed to prevent.

Keep probes fast and simple; elaborate metrics and views belong in monitoring.

Pitfalls

Every probe serves a different purpose, so reusing the same check mechanism for more than one probe is a red flag and may be cause for concern. Is the probe really verifying the correct thing, or are there cases in which it may return a false positive?

Make sure the different probes are maintained as the service or its dependencies change. When there is time pressure, it may be tempting to overlook probes, as they are not directly user-facing functionality. Keeping probes simple reduces the cost of building and maintaining them.

Be aware of the resources the probes use. Probes should be lightweight and not eat precious resources.

Liveness probes mean that the application is alive, but they do not guarantee that it will work as expected. They are not a replacement for monitoring and alerting.

Combining a few probes could check different aspects of a service and provide more accurate information about its current state. A service may be alive, but it may not be ready to process requests yet.

Probe timings can be tricky because competing interests must be balanced. Setting them too short risks falsely marking services as unhealthy due to brief slowdowns. On the other hand, long intervals might leave users waiting for a service that's actually unable to handle requests. Determining the correct timeouts is a per-service decision that requires testing and considering the business use case.

Restart loops can happen when probes are misconfigured. If the liveness probe begins checking before the service has finished initializing, the pod is restarted; the restarted pod again fails its liveness check before it is ready, triggering another restart, and the cycle keeps repeating. This is another reason why the initial delay before probes begin and the interval between checks are important.
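
One way to break such a loop is to let a startup probe hold off the liveness probe until initialization has finished; Kubernetes only begins liveness and readiness checks after the startup probe succeeds. A minimal sketch, assuming the service exposes a /healthz endpoint and with illustrative timing values:

startupProbe:
  httpGet:
    path: /healthz
    port: 80
  periodSeconds: 10
  failureThreshold: 30   # up to 30 × 10s = 300 seconds allowed for initialization
livenessProbe:
  httpGet:
    path: /healthz
    port: 80
  periodSeconds: 3
  failureThreshold: 3    # only evaluated once the startup probe has succeeded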

Probes should not flood the application. If a probe takes longer to execute than the interval at which it is run, server resources end up allocated to the probes instead of the business flows.
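
As a sketch of that rule of thumb, with illustrative values, keep the probe's timeout well below its period so one check completes before the next one is due:

readinessProbe:
  httpGet:
    path: /ready
    port: 80
  periodSeconds: 10   # how often the check runs
  timeoutSeconds: 2   # well below the period, so checks do not pile up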

Once implemented, liveness and readiness probes do not change often. The first time they cause a bad customer experience, they usually become a hot topic. Taking a proactive approach and having them as part of a service's earliest version helps improve the customer experience.

When a probe tests the connection to a resource, it is important to use the existing connection pool rather than open a new connection. Opening a new connection may lead to false negatives: for example, there may be no free connection slots left, even though all already-established connections work as expected. The connections the probes use should be the same ones used by the business logic.

Correctly setting up liveness, readiness, and startup probes can make services more robust, leading to a better customer experience and, likely, more business.

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!
