We’re all searching for something… Photo by Alex Franzelin

Docker Swarm: Flat File Engine Discovery

Jeff Nickoloff
Published in On Docker
Mar 28, 2016

The other day I was looking at Docker Swarm documentation and rediscovered that you can provide flat file lists instead of key-value store URLs. The research for Evaluating Container Platforms at Scale was fresh in my mind and the idea of eliminating the primary scaling bottleneck was tempting. Too tempting to let go without a test.

From my experience there are a fair number of use cases where engines in a cluster will come up in a predetermined IP space. Expanding on that, I found myself wondering how a Swarm manager would perform if I provided a full subnet of IP address/port pairs and let it inspect each one for a running Docker engine. Traditional wisdom led me to guess this would either beat up the unsuspecting network or the CPU on the machine. I wanted to see the data before making that kind of assertion, so I designed a performance test.

If you’re learning Docker and find this article useful you might consider picking up a copy of my new book, Docker in Action.

Flat File Discovery and Membership

Cluster membership is not static. Engines come and go, and Swarm needs to adjust to those changes quickly in order to maintain a healthy cluster. I verified this behavior by creating one such cluster in an AWS VPC with randomly assigned IP addresses. Given a file containing all possible IP addresses in a /24 network, the Swarm manager was able to discover the running engines. I then terminated some of those instances and let the autoscaling group redeploy new engines. Those new engines came up with new IP addresses (already included in the candidate list), and Swarm discovered them automatically. Everything worked as you would hope.

ubuntu@swarm-manager:~$ docker -H tcp://localhost:3376 \
  info | grep Status | sort | uniq -c
   5 └ Status: Healthy
4091 └ Status: Pending

<kills four instances>

ubuntu@swarm-manager:~$ docker -H tcp://localhost:3376 \
  info | grep Status | sort | uniq -c
   1 └ Status: Healthy
4095 └ Status: Pending

<waits a few minutes>

ubuntu@swarm-manager:~$ docker -H tcp://localhost:3376 \
  info | grep Status | sort | uniq -c
   3 └ Status: Healthy
4093 └ Status: Pending

<waits a few more minutes>

ubuntu@swarm-manager:~$ docker -H tcp://localhost:3376 \
  info | grep Status | sort | uniq -c
   5 └ Status: Healthy
4091 └ Status: Pending

Resource Impact as the Candidate IP Address Space Grows

This approach is only feasible as a general, reusable pattern when the candidate IP space is broadly defined and simple to maintain. If a user creates a topology such that all engines will run on IP addresses in 10.0.128.0/24, then the candidate address pool can remain the same as engines come and go, or as the cluster scales up and down. I designed a simple scenario and measurement tool to determine how this strategy performs as the size of the subnet increases.

The test used real AWS instances and a VPC to create a small cluster in an overwhelmingly large address space. It then launched Swarm managers with candidate IP address pools of increasing magnitude and measured the impact on CPU and network usage at each magnitude.

The test uses eight subnets of different sizes. Each subnet was expanded into IP address and port pairs in a separate file (a sketch of how such a file might be generated follows the list). Those files and their contents are defined as follows:

slash-17-list.txt contains the block 10.0.128.0/17 ~32767 addresses
slash-18-list.txt contains the block 10.0.128.0/18 ~16383 addresses
slash-19-list.txt contains the block 10.0.128.0/19 ~8191 addresses
slash-20-list.txt contains the block 10.0.128.0/20 ~4095 addresses
slash-21-list.txt contains the block 10.0.128.0/21 ~2047 addresses
slash-22-list.txt contains the block 10.0.128.0/22 ~1023 addresses
slash-23-list.txt contains the block 10.0.128.0/23 ~511 addresses
slash-24-list.txt contains the block 10.0.128.0/24 ~255 addresses
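
The repository linked at the end of this post contains the actual lists. As a minimal sketch, a file like slash-24-list.txt could be generated along these lines (assuming engines listen on the default port 2375; the port and expansion are illustrative):

# Minimal sketch: expand 10.0.128.0/24 into the <ip>:<port> lines that
# Swarm's file:// discovery backend reads, one candidate per line.
# Assumes engines listen on the default 2375; adjust as needed.
for host in $(seq 1 255); do
  echo "10.0.128.${host}:2375"
done > slash-24-list.txt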

This test uses expanded CIDR blocks, but there is no reason the range needs to be so simple. You could combine peer blocks or choose arbitrary addresses, but doing so trades ease of maintenance for flexibility.

The test was conducted in AWS using this CloudFormation template, with the following layout:

I tested the stack above using an m3.medium for the Swarm manager and t2.micros for the live engines. This test does not run any containers on those engines, but it does require some target engines for discovery.

The harness framework is a set of simple shell functions with the “test” function at the core:

# $1 is the size of the CIDR bitmask
# The candidate IP list is injected into the container via a volume.
# Use --net host for several reasons
# Use --restart always just in case the manager tips over
test() {
  docker $OPTS run \
    -d --name manager \
    --restart always \
    --net host \
    -v "$(pwd)"/slash-$1-list.txt:/tmp/cluster.txt \
    swarm:latest \
    manage -H tcp://0.0.0.0:3376 \
    --strategy spread \
    file:///tmp/cluster.txt
}
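
The OPTS variable is not defined in the article; as a hypothetical usage example, it is assumed here to point the docker client at the local engine:

# Hypothetical invocation; OPTS is assumed to target the local engine.
OPTS="-H unix:///var/run/docker.sock"
test 24   # start a manager against the /24 candidate list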

Each phase of the test is instrumented and conducted using the following function:

# $1 is the size of the CIDR bitmask
phase() {
  echo Starting phase $1
  date
  TS=$(date +%s)
  echo $(date +%s) Starting phase $1 >> event.log
  # Gather baseline numbers
  nstat > $1-before-nstat-$TS.log
  ss -s > $1-before-ss-s-$TS.log
  netstat -i > $1-before-netstat-i-$TS.log
  ip -s link > $1-before-ip-s-link-$TS.log
  # Start the test
  test $1
  # The test is running, let it soak for a bit
  echo Soaking at /$1…
  echo $(date +%s) Soaking phase $1 >> event.log
  # sleep for 5 minutes (data points are 5 min)
  sleep 300
  echo Gathering network deltas and cluster health.
  date
  TS=$(date +%s)
  health > $1-final-health-$TS.log
  nstat > $1-after-nstat-$TS.log
  ss -s > $1-after-ss-s-$TS.log
  netstat -i > $1-after-netstat-i-$TS.log
  ip -s link > $1-after-ip-s-link-$TS.log
  echo $(date +%s) Cleaning up phase $1 >> event.log
  # Stop the test and clean up the container
  clean
  # Get to the next data point at near rest
  echo Calming down from /$1…
  sleep 300
}
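
The health and clean helpers are not reproduced here (the repository contains the real versions), but minimal sketches might look like the following, with an illustrative loop driving the phases from the smallest candidate pool to the largest:

# Minimal sketches of the helpers phase() relies on; the published harness may differ.
# health assumes the manager listens on 3376, as in the examples above.
health() {
  docker -H tcp://localhost:3376 info
}
clean() {
  docker $OPTS rm -f manager
}

# Illustrative driver: walk the candidate pools from /24 up to /17.
for mask in 24 23 22 21 20 19 18 17; do
  phase $mask
done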

Before diving into the results from nstat, netstat, ss, and ip, take a look at the CloudWatch metrics. The summary and breakouts below were measured on the instance running the Swarm managers.

All phases of the test were performed on the same EC2 instance, with 5 minutes of rest between phases (the low points). The phases were not timed to start on exact data-point boundaries, so the data straddles those points (meaning averages and sums will never be at peak or totally at rest).

The default CloudWatch instance dashboard uses 5-minute data points and averages.

Looking at the summary, you can quickly see that the CPU and network-out graphs show growth over time. The status check graphs also indicate that there were some failures. These three deserve a closer look.

When I drilled into the outbound network traffic, I examined the sum of all reported data during each 5-minute interval. The reason is that we care about the amount of data written to the network, not the average of the data points reported during the interval. The sum is a tangible number that is a reasonable proxy for network usage.

The graph is somewhat misleading because of the non-zero valleys; those are an artifact of the data-point straddle. Even so, adding any two adjacent data points together never breaches 20 MB of data written to the network. Modern machines have gigabit or 10-gigabit network interfaces, so relative to other types of network traffic, 20 MB per 5 minutes seems like light usage. I'll need to dig into the network statistics to learn more.
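
For a rough sense of scale (a back-of-the-envelope calculation, not a measured figure), 20 MB written over a 5-minute window is a tiny fraction of even a 1 Gbit/s link:

# 20 MB over 300 seconds, expressed as a fraction of a 1 Gbit/s link
echo "scale=6; (20*8)/(300*1000)" | bc   # ~0.000533, roughly 0.05% of capacity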

When examining CPU utilization I thought it more important to look at maximum data points rather than averages. Doing so reveals how hard the machine had to work at any point in the 5-minute window.

CPU utilization grew exponentially, in direct correlation with the candidate IP pool doubling at each step. Since the largest tests pegged the CPU at 100%, it is clear that CPU is the limiting factor. But it is important to note that the service remained healthy until it evaluated candidate IP pools with more than ~16,383 entries. It wasn't until the pool reached ~8,191 entries that the CPU even hit peaks of 40%.

The last CloudWatch metric to examine is the 1-minute sum of failed status checks. EC2 uses a set of common machine- and network-level status checks to determine instance health. With the exception of the data point at 2pm, this graph indicates a strong correlation between 100% CPU utilization and failed status checks. Few should be surprised by that revelation.

Analyzing nstat, netstat, ss, and ip output

All of the logs from my test are available for inspection, but the most extreme case should reveal the network impact most clearly. For that reason this section focuses on the test of the /17 network. It turns out that while each of these tools emits interesting data, nstat offers the highest fidelity.

The output from nstat tells us what one might expect. Performing active engine searches across a /17 network requires massive numbers of IcmpOutMsgs, IpOutRequests, TcpActiveOpens, and TcpOutSegs. The result is proportionally huge numbers of IcmpInDestUnreachs, TcpAttemptFails, TcpRetransSegs, TcpExtTCPTimeouts, and TcpExtTCPRetransFail. The table below shows some of the before and after snapshots:

Metric                  Baseline        After
IcmpOutMsgs                1,750      132,734
IpOutRequests             10,362    2,993,778
TcpActiveOpens                 1      736,933
TcpOutSegs                    49      737,454
IcmpInDestUnreachs         1,750      130,555
TcpAttemptFails              510       81,418
TcpRetransSegs               730      103,477
TcpExtTCPTimeouts              1      733,583
TcpExtTCPRetransFail       7,833    2,020,104
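
These numbers come from the before/after snapshots written by phase(). As a minimal sketch (not the published tooling), the /17 deltas could be lined up like this, assuming the log file names produced above:

# Pair metric names and values from the before/after nstat snapshots for the /17 phase.
# Assumes the 17-before-nstat-*.log and 17-after-nstat-*.log files written by phase().
join <(awk '/^[A-Za-z]/ {print $1, $2}' 17-before-nstat-*.log | sort) \
     <(awk '/^[A-Za-z]/ {print $1, $2}' 17-after-nstat-*.log | sort) \
  | awk '{printf "%-22s %12s %12s\n", $1, $2, $3}'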

Clearly the number of bytes on the network is not an issue. But the question remains whether this approach is too “chatty.” This is where I’ll ask a networking expert to weigh in, but from what I’ve seen this is viable. The benefit of choosing a reasonably sized static candidate pool is one fewer point of failure.

As usual, all logs, templates, code, and input have been made available on GitHub.

TL;DR I abuse flat file Swarm engine discovery and avoid KV stores without sacrificing cluster flexibility. I tested the Swarm manager’s ability to discover up to 32,768 potential engines on a predefined network without consulting a KV store. Network impact is minimal, but the CPU gets a bit hot scanning subnets bigger than a /18. It is my opinion that inspection-based discovery should be preferred when possible.


I'm a cofounder of Topple, a technology consulting, training, and mentorship company. I'm also a Docker Captain and a software engineer. https://gotopple.com