DevTestOps: CI/Testing: Application Level Readiness Detection to Remove Docker Initialization Race Conditions

Why Layer 4 Initialization Detection isn’t Sufficient

dm03514
Dm03514 Tech Blog
Aug 28, 2018


Testing against dependencies (specifically Docker containers) in Continuous Integration (CI) introduces a need for inter-process synchronization in order to detect, and wait for, service readiness. Without this synchronization, test suites and builds become flaky and expensive. This article aims to cover why application level readiness detection is necessary within the context of CI, and how to reliably accomplish it. Finally, this article introduces wait-for, a utility to reliably detect readiness at the application level for common services.

Problem

With the rise of docker-compose, integration and service tests can easily codify and manage service dependencies. Starting dependent services for testing requires test suites to be able to recognize when those dependencies are ready before beginning test execution. Running services in separate processes is a concurrent operation and introduces a potential race condition between the time a service is started and the time it is ready to perform work. In order to reduce flakiness and remove this race condition, reliable application level readiness checks are required.

Let’s illustrate with an example: suppose we have a service with integration tests. These tests exercise the system and its dependencies (in this case Redis and Postgres). When the integration tests are executed, the code makes actual calls to Redis and Postgres. Because of this, Docker Compose is being used to provide network isolation between build jobs:

# docker-compose.yml
version: '3'
services:
  redis:
    image: "redis:alpine"
  postgres:
    image: postgres:9.6
    ports:
      - 5432:5432

To continue the hypothetical scenario, the integration tests are executed as part of the CI pipeline below:

The integration tests require that Postgres and Redis be initialized and ready to perform work. Without synchronization this can result in errors due to tests being executed before services are fully initialized. The graph below visualizes this, showing an unsynchronized execution of the integration tests: Redis and Postgres are started, and the integration tests are executed before they are fully initialized, resulting in errors.

This is an example of a race condition.

A race condition or race hazard is the behavior of an electronics, software, or other system where the output is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when events do not happen in the order the programmer intended.

In this case the integration test is blindly assuming that Redis and Postgres will be ready by the time it begins execution. What’s missing is a step which detects when services are ready and it’s safe to begin executing the integration tests.

There are a couple of examples of this missing step found in the wild, a few of which are covered below.

Commonly Found Solutions

Timeout (anti-pattern)

Timeouts can be effective at avoiding synchronization errors in a majority of cases, but they are ultimately not solutions to synchronization issues, only avoidance. They are compelling because of how trivial they are to implement.

Timeouts often materialize as sleep calls. Because of how easy timeouts are to implement, and because of their relative effectiveness, their use is ubiquitous in CI. Drone.io, a popular open source Docker-based continuous delivery platform, even recommends timeouts in its official documentation:

http://docs.drone.io/services/#initialization
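
As a sketch, the timeout approach usually looks something like the following hypothetical build steps, where the 15 second value is a guess rather than something derived from the services:

- docker-compose up -d
- sleep 15
- ./execute-integration-tests.sh

The sleep hopes, rather than verifies, that 15 seconds is enough for every dependency on every machine.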

Timeouts have many issues; the biggest is that the root race condition isn’t being addressed. Timeouts are a timing hack with no explicit synchronization; any success with a timeout is perceived, not enforced. It’s very hard to choose a timeout that will never result in a race: because developer machines are usually sized differently than CI servers, a timeout that seems healthy locally can result in intermittent failures on a build machine. Finally, in order for timeouts to succeed in the majority of cases, higher-than-necessary values need to be chosen, resulting in wasted time.

Implicit Timeouts

Another variant of the timeout is the implicit timeout. Suppose we had the build pipeline image from above implemented as build steps in a CI tool:

- ./static-analysis.sh
- ./execute-unit-tests.sh
- docker-compose up -d
- ./execute-integration-tests.sh

The implicit timeout appears to work because there is some step between starting inter-process dependencies and executing integration tests which gives those dependencies enough time to fully initialize:

- ./static-analysis.sh
- ./execute-unit-tests.sh
- docker-compose up -d
- pip install -r integration.requirements.txt
- ./execute-integration-tests.sh

In this case we’re performing some network IO (installing requirements) which, in the majority of cases, may take long enough to give the perception that everything is initialized by the time the integration tests are executed. Implicit timeouts have all the same downsides as explicit timeouts.

wait-for-it.sh (nc -z)

wait-for-it.sh is the official recommendation by Docker. wait-for-it checks to see whether a port is open, without sending any data (using nc -z):

-z      Specifies that nc should just scan for listening daemons, without sending any data to them.  It is an error to use this option in conjunction with the -l option.

If a service having a socket open and bound to a port were synonymous with it being truly “ready” and able to perform work, then wait-for-it would be sufficient; but I have still encountered races for a number of popular services when using wait-for-it. wait-for-it polls for readiness in a sustainable way and yields to, or invokes, a subcommand when the target is ready. Its only issue is that it checks at the TCP level and is unable to check service/protocol specific readiness. wait-for-it is a Layer 4 (TCP/socket) way to detect initialization.
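
At its core, this style of check reduces to polling a port until something is listening. A minimal sketch of a Layer 4 check, using Postgres’s default port 5432 as an example:

# Layer 4 check: succeeds as soon as a listener is bound to the port,
# regardless of whether the service can actually perform work yet
until nc -z localhost 5432; do
  sleep 1
done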

Some services require Layer 7 (Application) aware support in order to determine when they are fully initialized. I recently had to test against HashiCorp’s Vault. Vault would start and begin listening very quickly. wait-for-it.sh would detect it as listening and continue, but Vault was not yet fully initialized! Tests dependent on Vault would fail because the application wasn’t fully initialized, even though it was listening successfully on a socket. Detecting Vault initialization requires doing something like:

EXPECTED="{\"initialized\":true,\"sealed\":false,\"standby\":false}"
VAULT_STATUS=$(curl -s http://127.0.0.1:8200/v1/sys/health | jq -r -j -S -c '{initialized, sealed, standby}')
until [ "$VAULT_STATUS" = "$EXPECTED" ]; do
VAULT_STATUS=$(curl -s http://127.0.0.1:8200/v1/sys/health | jq -r -j -S -c '{initialized, sealed, standby}')
echo "$SCRIPT polling vault status, expected: $EXPECTED received: $VAULT_STATUS"
sleep 1
done

To faithfully detect when Vault is fully initialized a Layer 7 solution is necessary.

One-off/Ad-hoc Scripts

Once again, the Docker documentation identifies this as a potential solution. This is the solution I find myself implementing as well. These scripts are small and simple, and often end up as little bits of glue code in repos that allow services to defensively wait until their dependencies are ready. The following snippet is from the Docker documentation and detects when Postgres is ready:

#!/bin/bash
# wait-for-postgres.sh

set -e

host="$1"
shift
cmd="$@"

until PGPASSWORD=$POSTGRES_PASSWORD psql -h "$host" -U "postgres" -c '\q'; do
  >&2 echo "Postgres is unavailable - sleeping"
  sleep 1
done

>&2 echo "Postgres is up - executing command"
exec $cmd

As you can see, the check required to detect whether Postgres is ready needs to be deeper than just checking that Postgres is listening on port 5432.
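
Using a script like this means wrapping the real command, which only runs once Postgres actually answers. For example (the host name and test command here are hypothetical):

# poll the "postgres" host until psql succeeds, then exec the test suite
POSTGRES_PASSWORD=mysecret ./wait-for-postgres.sh postgres ./execute-integration-tests.sh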

Ideally, any solution would be correct, widely available, well tested, and available for our target architectures. Having a shared solution would enable uniformity across projects. Unfortunately, one-off or ad-hoc scripts don’t achieve this. While they accomplish the job, they encourage duplication, diverging patterns, and even varying levels of correctness. For example, I once wrote one of these for the Vertica database that checked readiness based on filesystem state; it only worked on my machine, and only succeeded because of timing. When a more experienced Vertica operator saw it, they were able to modify it to reliably detect readiness. The problem was that this script had been shared and duplicated across many projects, so other projects were not able to easily opt in to the improvement (i.e. by bumping a version identifier).

Not having a centralized solution limits the ability to make changes that many teams and individuals can benefit from.

A note on resiliency

Both the Drone and Docker documentation suggest relying on the backoff/retry resiliency engineering technique. The Drone documentation suggests:

you may need to wait a few seconds or implement a backoff

The Docker documentation similarly suggests:

The problem of waiting for a database (for example) to be ready is really just a subset of a much larger problem of distributed systems. In production, your database could become unavailable or move hosts at any time. Your application needs to be resilient to these types of failures.

To handle this, design your application to attempt to re-establish a connection to the database after a failure. If the application retries the connection, it can eventually connect to the database.

Retries enable a service to handle intermittent failures and continue to run despite a dependency not being available. I feel this is generally a bad pattern for testing because it implicitly handles the error and violates single responsibility: using this technique, the service code is also responsible for implicitly determining when its dependencies are available.

I DO think these techniques are absolutely necessary and work well outside the context of a test suite, i.e. when services and their dependencies are brought up in production. The Docker documentation quoted above applies to bringing up a service and its dependencies, which is a different concern from synchronizing a test suite with those dependencies.
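
For completeness, a minimal sketch of what the backoff/retry technique might look like in application or glue code, reusing the psql check from above (the delay values are illustrative):

# retry with exponential backoff instead of failing on the first refused connection
delay=1
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "$host" -U "postgres" -c '\q'; do
  >&2 echo "connection failed - retrying in ${delay}s"
  sleep "$delay"
  delay=$((delay * 2))
done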

Solution

In order to address the issues above, a Layer 7 (application) aware solution is required. I’ve created wait-for to address this. wait-for is a toolkit which detects when commonly used services are fully initialized. It adopts application or protocol specific detection, i.e. the way wait-for detects MySQL initialization is distinct from the way it detects Postgres or Cassandra or Redis readiness. It aims to be both simple and correct (service/protocol specific).

Using it is as simple as getting the binary and executing the application specific subcommand with the required arguments:

export WAIT_FOR_POSTGRES_CONNECTION_STRING=postgresql://root:root@localhost/test?sslmode=disable

- ./static-analysis.sh
- ./execute-unit-tests.sh
- docker-compose up -d
- ./wait-for redis -h localhost:6379
- ./wait-for postgres
- ./execute-integration-tests.sh

The above brings up the integration dependencies, polls until Redis is ready, and then polls until Postgres is ready. Only after the dependencies are fully initialized do the integration tests execute, removing the race condition.

wait-for polls a target until the target is fully initialized or a timeout is reached (similar to wait-for-it). While wait-for is currently usable, the project is not yet at a stable release. Any and all contributions are extremely welcome.

I hope this article illustrates why synchronization is necessary for CI tests which rely on inter-process dependencies, some common current solutions and their downsides, and how wait-for addresses those solutions’ pitfalls.

As always I appreciate your time and would love to hear any feedback. Thank you.
