We’re all searching for something… Photo by Alex Franzelin

Docker Swarm: Flat File Engine Discovery

Jeff Nickoloff
Published in On Docker
Mar 28, 2016

The other day I was looking at Docker Swarm documentation and rediscovered that you can provide flat file lists instead of key-value store URLs. The research for Evaluating Container Platforms at Scale was fresh in my mind and the idea of eliminating the primary scaling bottleneck was tempting. Too tempting to let go without a test.

From my experience there are a fair number of use cases where engines in a cluster will come up in a predetermined IP space. Expanding on that, I found myself wondering how a Swarm manager would perform if I provided a full subnet of IP address/port pairs and let it inspect each one for a running Docker engine. Traditional wisdom led me to guess this would either beat up the unsuspecting network or the CPU on the machine. I wanted to see the data before making that kind of assertion, so I designed a performance test.

If you’re learning Docker and find this article useful you might consider picking up a copy of my new book, Docker in Action.

Flat File Discovery and Membership

Cluster membership is not static. Engines come and go, and Swarm needs to adjust to those changes quickly in order to maintain a healthy cluster. I verified this behavior by creating one such cluster in an AWS VPC with randomly assigned IP addresses. Given a file containing all possible IP addresses in a /24 network, the Swarm manager was able to discover the running engines. I then terminated some of those instances and let the autoscaling group redeploy new engines. Those new engines came up with new IP addresses (already included in the candidate list), and Swarm discovered them automatically. Everything worked as you would hope.

ubuntu@swarm-manager:~$ docker -H tcp://localhost:3376 \
  info | grep Status | sort | uniq -c
   5 └ Status: Healthy
4091 └ Status: Pending

<kills four instances>

ubuntu@swarm-manager:~$ docker -H tcp://localhost:3376 \
  info | grep Status | sort | uniq -c
   1 └ Status: Healthy
4095 └ Status: Pending

<waits a few minutes>

ubuntu@swarm-manager:~$ docker -H tcp://localhost:3376 \
  info | grep Status | sort | uniq -c
   3 └ Status: Healthy
4093 └ Status: Pending

<waits a few more minutes>

ubuntu@swarm-manager:~$ docker -H tcp://localhost:3376 \
  info | grep Status | sort | uniq -c
   5 └ Status: Healthy
4091 └ Status: Pending

Resource Impact as the Candidate IP Address Space Grows

This approach is only feasible as a general, reusable pattern when the candidate IP space is broadly defined and simple to maintain. If a user creates a topology such that all engines will run on IP addresses in 10.0.128.0/24, then the candidate address pool can remain the same as engines come and go, or as the cluster scales up and down. I designed a simple scenario and measurement tool to determine how this strategy performs as the size of the subnet increases.

The test used real AWS instances and a VPC to create a small cluster in an overwhelmingly large address space. It then launched Swarm managers with candidate IP address pools of increasing magnitude and measured the impact on CPU and network usage at each magnitude.

The test uses eight subnets of different sizes. Each subnet was expanded into IP address and port pairs in a separate file (a sketch of how such a file might be generated follows the list). Those files and their contents are defined as follows:

slash-17-list.txt contains the block 10.0.128.0/17 ~32767 addresses
slash-18-list.txt contains the block 10.0.128.0/18 ~16383 addresses
slash-19-list.txt contains the block 10.0.128.0/19 ~8191 addresses
slash-20-list.txt contains the block 10.0.128.0/20 ~4095 addresses
slash-21-list.txt contains the block 10.0.128.0/21 ~2047 addresses
slash-22-list.txt contains the block 10.0.128.0/22 ~1023 addresses
slash-23-list.txt contains the block 10.0.128.0/23 ~511 addresses
slash-24-list.txt contains the block 10.0.128.0/24 ~255 addresses
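
The repository linked at the end of this post contains the actual lists. As a minimal sketch, a file like slash-24-list.txt could be generated along these lines (assuming engines listen on the default port 2375; the port and expansion are illustrative):

# Minimal sketch: expand 10.0.128.0/24 into the <ip>:<port> lines that
# Swarm's file:// discovery backend reads, one candidate per line.
# Assumes engines listen on the default 2375; adjust as needed.
for host in $(seq 1 255); do
  echo "10.0.128.${host}:2375"
done > slash-24-list.txt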

This test uses expanded CIDR blocks, but there is no reason the range needs to be so simple. You could combine peer blocks or choose arbitrary addresses, but doing so trades ease of maintenance for flexibility.

The test was conducted in AWS using this CloudFormation template, with the following layout:

I tested the stack above using an m3.medium for the Swarm manager and t2.micros for the live engines. This test does not run any containers on those engines, but it does require some target engines for discovery.

The harness framework is a set of simple shell functions with the “test” function at the core:

# $1 is the size of the CIDR bitmask
# The candidate IP list is injected into the container via a volume.
# Use --net host for several reasons
# Use --restart always just in case the manager tips over
test() {
  docker $OPTS run \
    -d --name manager \
    --restart always \
    --net host \
    -v "$(pwd)"/slash-$1-list.txt:/tmp/cluster.txt \
    swarm:latest \
    manage -H tcp://0.0.0.0:3376 \
    --strategy spread \
    file:///tmp/cluster.txt
}
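
The OPTS variable is not defined in the article; as a hypothetical usage example, it is assumed here to point the docker client at the local engine:

# Hypothetical invocation; OPTS is assumed to target the local engine.
OPTS="-H unix:///var/run/docker.sock"
test 24   # start a manager against the /24 candidate list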

Each phase of the test is instrumented and conducted using the following function:

# $1 is the size of the CIDR bitmask
phase() {
  echo Starting phase $1
  date
  TS=$(date +%s)
  echo $(date +%s) Starting phase $1 >> event.log
  # Gather baseline numbers
  nstat > $1-before-nstat-$TS.log
  ss -s > $1-before-ss-s-$TS.log
  netstat -i > $1-before-netstat-i-$TS.log
  ip -s link > $1-before-ip-s-link-$TS.log
  # Start the test
  test $1
  # The test is running, let it soak for a bit
  echo Soaking at /$1…
  echo $(date +%s) Soaking phase $1 >> event.log
  # sleep for 5 minutes (data points are 5 min)
  sleep 300
  echo Gathering network deltas and cluster health.
  date
  TS=$(date +%s)
  health > $1-final-health-$TS.log
  nstat > $1-after-nstat-$TS.log
  ss -s > $1-after-ss-s-$TS.log
  netstat -i > $1-after-netstat-i-$TS.log
  ip -s link > $1-after-ip-s-link-$TS.log
  echo $(date +%s) Cleaning up phase $1 >> event.log
  # Stop the test and clean up the container
  clean
  # Get to the next data point at near rest
  echo Calming down from /$1…
  sleep 300
}
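
The health and clean helpers are not reproduced here (the repository contains the real versions), but minimal sketches might look like the following, with an illustrative loop driving the phases from the smallest candidate pool to the largest:

# Minimal sketches of the helpers phase() relies on; the published harness may differ.
# health assumes the manager listens on 3376, as in the examples above.
health() {
  docker -H tcp://localhost:3376 info
}
clean() {
  docker $OPTS rm -f manager
}

# Illustrative driver: walk the candidate pools from /24 up to /17.
for mask in 24 23 22 21 20 19 18 17; do
  phase $mask
done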

Before diving into the results from nstat, netstat, ss, and ip, take a look at the CloudWatch metrics. The summary and breakouts below were measured on the instance running the Swarm managers.

All phases of the test were performed on the same EC2 instance, with 5 minutes of rest between phases (the low points). The phases were not timed to start on exact data-point boundaries, so the data straddles those points (meaning averages and sums will never be at peak or totally at rest).

The default CloudWatch instance dashboard uses 5-minute data points and averages.

Looking at the summary, you can quickly see that the CPU and network-out graphs show growth over time. The status check graphs also indicate that there were some failures. These three deserve a closer look.

When I drilled into the outbound network traffic, I examined the sum of all reported data during each 5-minute interval. The reason is that we care about the amount of data written to the network, not the average of the data points reported during the interval. The sum is a tangible number that is a reasonable proxy for network usage.

The graph is somewhat misleading because of the non-zero valleys; those are an artifact of the data-point straddle. Even so, adding any two adjacent data points together never breaches 20 MB of data written to the network. Modern machines have gigabit or 10-gigabit network interfaces, so relative to other types of network traffic, 20 MB per 5 minutes seems like light usage. I'll need to dig into the network statistics to learn more.
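
For a rough sense of scale (a back-of-the-envelope calculation, not a measured figure), 20 MB written over a 5-minute window is a tiny fraction of even a 1 Gbit/s link:

# 20 MB over 300 seconds, expressed as a fraction of a 1 Gbit/s link
echo "scale=6; (20*8)/(300*1000)" | bc   # ~0.000533, roughly 0.05% of capacity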

When examining CPU utilization I thought it more important to look at maximum data points rather than averages. Doing so reveals how hard the machine had to work at any point in the 5-minute window.

CPU utilization grew exponentially, in direct correlation with the candidate IP pool doubling at each step. Since the largest tests pegged the CPU at 100%, it is clear that CPU is the limiting factor. But it is important to note that the service remained healthy until it evaluated candidate IP pools with more than ~16,383 entries. It wasn't until the pool reached ~8,191 entries that the CPU even hit peaks of 40%.

The last CloudWatch metric to examine is the 1-minute sum of failed status checks. EC2 uses a set of common machine- and network-level status checks to determine instance health. With the exception of the data point at 2pm, this graph indicates a strong correlation between 100% CPU utilization and failed status checks. Few should be surprised by that revelation.

Analyzing nstat, netstat, ss, and ip output

All of the logs from my test are available for inspection, but the most extreme case should reveal the network impact most clearly. For that reason this section focuses on the test of the /17 network. It turns out that while each of these tools emits interesting data, nstat offers the highest fidelity.

The output from nstat tells us what one might expect. Performing active engine searches across a /17 network requires massive numbers of IcmpOutMsgs, IpOutRequests, TcpActiveOpens, and TcpOutSegs. The result is proportionally huge numbers of IcmpInDestUnreachs, TcpAttemptFails, TcpRetransSegs, TcpExtTCPTimeouts, and TcpExtTCPRetransFail. The table below shows some of the before and after snapshots:

Metric                  Baseline        After
IcmpOutMsgs                1,750      132,734
IpOutRequests             10,362    2,993,778
TcpActiveOpens                 1      736,933
TcpOutSegs                    49      737,454
IcmpInDestUnreachs         1,750      130,555
TcpAttemptFails              510       81,418
TcpRetransSegs               730      103,477
TcpExtTCPTimeouts              1      733,583
TcpExtTCPRetransFail       7,833    2,020,104
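
These numbers come from the before/after snapshots written by phase(). As a minimal sketch (not the published tooling), the /17 deltas could be lined up like this, assuming the log file names produced above:

# Pair metric names and values from the before/after nstat snapshots for the /17 phase.
# Assumes the 17-before-nstat-*.log and 17-after-nstat-*.log files written by phase().
join <(awk '/^[A-Za-z]/ {print $1, $2}' 17-before-nstat-*.log | sort) \
     <(awk '/^[A-Za-z]/ {print $1, $2}' 17-after-nstat-*.log | sort) \
  | awk '{printf "%-22s %12s %12s\n", $1, $2, $3}'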

Clearly the number of bytes on the network is not an issue. But the question remains whether this approach is too “chatty.” This is where I’ll ask a networking expert to weigh in, but from what I’ve seen this is viable. The benefit of choosing a reasonably sized static candidate pool is one fewer point of failure.

As usual, all logs, templates, code, and input have been made available on GitHub.

TL;DR I abuse flat file Swarm engine discovery and avoid KV stores without sacrificing cluster flexibility. I tested the Swarm manager’s ability to discover up to 32,768 potential engines on a predefined network without consulting a KV store. Network impact is minimal, but the CPU gets a bit hot scanning subnets bigger than a /18. It is my opinion that inspection-based discovery should be preferred when possible.


I'm a cofounder of Topple, a technology consulting, training, and mentorship company. I'm also a Docker Captain and a software engineer. https://gotopple.com