1000 EC2 Instances in a Cluster

Evaluating Container Platforms at Scale

This article addresses three questions about scaling Docker Swarm and Kubernetes. What is their performance at scale? Can they operate at scale? What does it take to support them at scale?


  • We need use-case focused benchmarks for comparing all container platforms so adopters can make informed decisions.
  • I built a benchmark tool to test both Docker Swarm and Kubernetes from a user perspective and using common methods. I only evaluated common functionality: container startup time and container list time.
  • Swarm performed better than Kubernetes.
  • Both Kubernetes and Swarm operate with clusters of 1000 nodes with a bit of work.
  • Kubernetes is a system that is both more mature and complicated from an architecture and feature perspective. Consequently it requires more effort to adopt and support.
  • The data, analyzed and presented here can help you choose a platform based on your needs.
Visual TL;DR


I am happy to disclose that this article is the product of sponsored independent research and development efforts. Docker made this work possible by paying for both the engineering time and large AWS bill for iteration and testing. It was our mutual goal to release all relevant content for inspection, reuse, and 3rd party testing. If you or your organization are interested in funding other R&D please contact me through my website: http://allingeek.com.

I learned many things during this study that go beyond the results of the benchmarking, and cluster architecture. If you’d like to discuss them sometime follow me on Twitter or watch my blog as they leak out of my head over the next few months.

This project brought a dark reality — that the cloud is not infinite — screaming to the forefront of my cluster placement decisions when I exhausted us-west-2a of all m3.medium capacity. Oops. Rollback_in_Progress.

How do Swarm and Kubernetes perform in real large clusters?

I’ll admit when I was initially asked this question I had no idea how to respond. I had read a few articles like, Kubernetes Performance Measurements and Roadmap and Scale Testing Swarm to 30,000 Containers but I had a major problem with both articles. It was difficult to understand exactly what the author tested, and what the test environment included. It was more frustrating that neither provided clear steps or tools for test reproduction. Each presented digested numbers, but little in the way of hard data because the reader was missing context. It is my opinion that those numbers are more dangerous for the reader than unsupported claims. A reader wants to believe your numbers even when they are difficult to understand.

Another problem with existing research and documentation is that few articles measure the same thing. That makes it impossible to compare reports. What we need is a common or complete set of feature benchmarks. Making that kind of tool available for testing every platform (not just Swarm and Kubernetes) would help the DevOps community in the same way QuirksMode.org helped web developers.

The challenge I faced in this project was to build a common framework for evaluating common features in a realistic deployment context while documenting the process for the reader. After all that I’d need to be able to make the resulting information accessible for a general reader.

It is important to establish a common nomenclature before proceeding. Docker works with containers. A “pod” is the smallest unit of deployment in Kubernetes. A pod consists of at least one container. The work described here used single-container pods and so for the rest of this article I will use, “container” in place of, “pod.” Later in the article I will refer to Kubernetes jobs. A job is a unit of work managed by the Kubernetes control plane which the user expects to finish. A job is performed by a pod and in this case a single container.

Selecting Operations to Measure

These systems provide different APIs and functional abstractions. Swarm is one of the Docker projects and exposes the same container primitives as Docker at a machine cluster level. You can use it to manage individual containers and list the containers running in your cluster. Kubernetes provides parallel launch features, autoscaling, a distinction between batch and service lifecycles, load balancers, etc. After a full evaluation I determined that the two best candidate operations for measurement are container startup delay and container list time.

Minimizing container startup delay is important in latency sensitive applications. For example a just-in-time container service such as AWS Lambda will not start a container until it receives a request with a dependency on that specific function. Just-in-time containers could also have practical application in the IoT space. Job or batch-processing software is by its nature, “on-demand.” Stateless microservices and immutable deployment units are regularly advocated and adopted. I believe that as these trends continue moving toward just-in-time services and reactive deployment infrastructure is natural. Low latency is a requirement for that type of system.

There may not be the similar set of use-cases for listing containers that require low latency. But, I can attest that delay in listing the containers in a cluster is directly related to operator frustration from my now considerable experience. If an operation like this is so slow that an operator considers building a background task that updates a local cache then we can do better.

Developing the Test Harness

Both Kubernetes and Swarm offer retrieval of container lists via a synchronous API call. That means the command line client will not exit until the client receives the entire list. For that reason it is simple to use the time command to measure container list times. Swarm offers a synchronous interface for creating containers so it would be possible to use the same pattern to measure container start delay. But, Kubernetes uses an asynchronous interface for this use-case. That means that the client will exit before the new container has actually started. The test required a different approach.

It might be tempting to use the list containers interface to poll for completion, or use the “running time” to figure out how long the container took to start. That strategy would rely on being able to poll with reasonable resolution. The results will show that the list container delay is too great to use for measuring operations in tens or hundreds of milliseconds. Measuring that way would introduce too much error.

It is better to use common functionality. In this case both platforms will eventually start a container. It is possible to measure precise container start delays using time by listening on a port on the test machine using netcat (a network utility), and in parallel starting a container on the cluster that sends a single character to netcat running on the test machine. The timed process will exit when it receives the message from the new container.

Measuring startup delay using this method does introduce some overhead, but that overhead is common to both platforms and does not impact the comparative results.

The Test Context

I tested both Kubernetes and Swarm in clusters of 1000 nodes on Amazon Web Services. This was a natural choice given the popularity of the AWS platform. Nodes ran on m3.medium instances (1 vCPU, 3.75 GB). Infrastructure machines used m3.xlarge instances (4 vCPU, 15GB). Key-value databases ran on c4.xlarge instances (4 vCPU, 7.5 GB). The Kubernetes API tier ran 6 redundant copies. The tests ran on a single machine (t2.micro) inside of the same VPC.

I selected these instance types based on those chosen for other research, and observed performance on early iterations. The performance characteristics of common components: test node, worker nodes, and etcd are identical. The resources offered by each node in a fleet will not impact performance of these tests unless those resources cannot support the platform components. Make instance type and cluster size decisions for your own deployment with overall resource requirements, scaling resolution, and cost in mind. In this test the software has modest requirements.

Filling each cluster used sleeping programs that were preloaded onto the machines. Preloading images is a common strategy used by adopters to reduce variability and startup latency. Kubernetes nodes used the “gcr.io/google_containers/pause:go” image, and Swarm used the sleep command provided in the “alpine:latest” image from Docker Hub. The measured containers were created from the “alpine:latest” image and only execute nc.

Both clusters use documented scaling best practices and guidance from domain experts. These configuration differ from a production environment in the following ways:

  • Neither cluster used multi-host networking facilities.
  • Neither cluster used a high-availability key-value database configuration.
  • Neither cluster uses secure TLS communication and the network configuration is not hardened.
  • The Kubernetes authentication and authorization features were not enabled.

The software I tested was Kubernetes v1.2.0-alpha.7 and Docker Swarm v1.1.3-rc2 (as pulled from dockerswarm/swarm:master). Both clusters used etcd v2.2.1 for the key-value database.

The aim of the benchmark is to establish some statistical confidence in performance probabilities at predetermined levels. To do so it measures (or samples) the same operation many times. The performance distribution of the sample set at each target level helps us understand the performance probabilities of “the next” operation. Greater sample sizes increase precision and reduce the impact of noise or anomalous samples.

The benchmark gathers 1000* samples at target levels of usage. Those target levels are 10%, 50%, 90%, 99%, and 100% full. In this context a node is full when it is running 30 containers. A 1000 node cluster is full when it has 30,000 running containers. Measurements are collected one at a time as not to create any sort of resource contention on the test box or API servers.

*I lowered the Kubernetes list operations sample size to 100 at some levels to help tests complete. Later in the benchmark each operation was taking over 100 seconds. At 1000 samples it would have taken approximately 28 hours to complete for each target level.

Benchmark Results

Statistics are often misinterpreted. The charts below and surrounding conversation are as simple as possible as not to confuse the reader. Further, I would never ask you to just “trust me” when it comes to interpreting data. I’ve made all the results available in GitHub. You are free to perform your own analysis.

The following four charts provide data that helps answer the question, “How long will it take to start the next container when my cluster is X% full?” Each chart has a title like, “10% of Sample Containers Started in Less Than…” This means that there is a 10% chance that the next container you start at the target level will start in less time than indicated on the chart. All values are in seconds.

If you were hoping to start a container on Kubernetes in less than a second, it doesn’t seem probable if you already have some load on the system. There is only a 10% chance that the 3001st container will start in less than 1.83 seconds.

Half the time Swarm will start a container in less than .5 seconds as long as the cluster is not more than 90% full. Kubernetes will start a container in over 2 seconds half of the time if the cluster is 50% full or more.

Even the 90th percentile values have clear separation between the two platforms. Swarm is between 4and 6 times faster at starting containers.

Before commenting on the results, something went awry during the 90% and 99% full test for Kubernetes. In discussing it with colleagues we speculate that the linear nature of the test and “state buildup” caused problems for etcd. Those problems caused the system to experience an anomalous slow down and become unstable in the latter part of the 99% test. I have included the samples in the graph because it is interesting data that might show a stability problem as clusters age. I’ve heard word that there are several performance enhancements and cleanup tasks implemented in etcd v3 that might address the stability problem. I used a fresh cluster to test Kubernetes at 100% full. For these reasons I’ve discounted the 90% and 99% full tests from the following analysis.

In analyzing these charts and the complete data you can make a few important observations:

  • The 30001st container on Kubernetes will usually start in 2.52 to 3.1 seconds and .56 to .67 seconds on Swarm.
  • Kubernetes container startup delay will deteriorate between .69 and .74 seconds as the cluster fills from 3000 containers to 30000 containers but only .22 to .27 seconds on Swarm.
  • The longest observed startup delay for Swarm was 1.14 seconds (30001st container), which was shorter than the shortest observed startup delay for Kubernetes (1.69 seconds on 3001st container) by more than half a second.

Large sample sizes are important when performing this type of benchmark because they help you better understand the probability of encountering a specific result. In this case we know that Kubernetes will take a few seconds to start a single container. Swarm will always start a container in less than a second, and less than .7 seconds in all but 10% of cases. That low latency is sure to have a practical application in real infrastructure decisions.

If I were building a just-in-time container system like AWS Lambda, I would clearly base that system on Swarm.

Anecdotally, I’d like to add that the Kubernetes parallel container scheduling provided by replication controllers is remarkable. Using a Kubernetes replication controller I was able to create 3000 container replicas in under 155 seconds. Without using parallel requests it would take approximately 1100 seconds to do with Swarm and almost 6200 seconds on Kubernetes.

Its a great idea to use parallelism when possible and that is where orchestration platforms shine.

The results for list containers follow. Remember, the Kubernetes results for 99% full were anomalous and I used a healthy cluster to benchmark at 100% full.

The Kubernetes list container performance for 90% and 100% full were consistent with other test iterations.

Both Swarm and Kubernetes perform well at lower container densities. At 10% full Kubernetes will almost always list all containers in less time than it ever takes to create a container.

The data shows that somewhere between 50% and 90% full Kubernetes passes some threshold where performance degrades much quicker than it did between 10% and 50% full. Drilling into the data I found that the 50th percentile was 6.45 seconds while the 75th percentile was 28.93 seconds. I’d interpret that to mean that Kubernetes list container performance began to tip when the Kubernetes cluster was 50% full.

The Swarm performance data is less dramatic but it does present insight. Like Kubernetes, Swarm has a performance tipping point where flat growth jumps to flat growth at a much higher level. That tipping point effects fast times between 90% and 99% full, and slower times between 50% and 90% full.

The 99th percentile Swarm performance is between 4 and 6 times faster than Kubernetes 10th percentile performance at all tested levels.

After I ran the benchmark I wanted to confirm my findings against existing sources. I found a Kubernetes GitHub issue from January 2015 investigating container startup time. They were able to hit their goal, “99% of end-to-end pod startup time with prepulled images is less than 5s on 100 node, 3000 pod cluster; linear time to number of nodes and pods.” I’m not sure what environment they were using to test, but the numbers I collected have the same order of magnitude.

The comments in the GitHub issue pointed out that testing end-to-end container startup delay is different from scheduling throughput. This article released a few weeks ago by Hongchao Deng at CoreOS targets the Kubernetes scheduler and measures performance in a virtual or hollow environment. That test isolates scheduler performance from other sources of noise like the network, or resource contention. Real environments do not have that isolation and it may be unrealistic to expect that performance in the wild. The CoreOS article was particularly well written and I read every word. It is a best-in-class example of applied engineering for enhancing a single component.

In summary, these benchmarks show that the tested operations are faster on Swarm than Kubernetes. Both clusters used the same database technology, test harness, and network topology. We can infer that the reason for the difference is rooted in architecture or algorithm choice. The next section dives into the two architectures and provides some insight.

Can Kubernetes and Swarm actually operate with 1000 nodes?

I was able to build out both clusters to 1000 nodes. It took something like 90 iterations, more than 100 hours of research and trials. This section describes the architectures, and resource usage anecdotes.

I started with the default tooling for each platform. Kubernetes ships with a script called kube-up.sh which launches a single master cluster in one of many clouds. Most Swarm tutorials and guides instruct the reader to provision clusters using Docker Machine (including my own). Neither approach scales to clusters with 1000 nodes.

The single master created for Kubernetes was running the etcd database, API server, kubelet, scheduler, controller-manager, flannel, kube-proxy, and a DNS server. Kube-up.sh was able to provision the cluster but as soon as a replication controller with 300 replicas was scheduled performance on the master node dropped. Memory was never a problem but the load exhausted the CPU. Resource exhaustion creates instability and performance problems which render the cluster useless.

Provisioning machines with Docker Machine is too slow and experiences unrecoverable errors too often to be viable when creating a cluster with 1000 nodes. My initial experiment used something like 10 executions in parallel. That took approximately 4 hours to create the first 100 nodes. I increased the parallelism to 100 and immediately regretted doing so.

That resulted in my first AWS rate limiting experience which made the AWS web console unusable.

After several more hours and creating about 100 extra nodes to make up for errors (approximate 10% failure rate — machine up, not Docker) I concluded that this was a bad idea and terminated the experiment.

Custom Templates

At the time I was only targeting tests on AWS. I selected CloudFormation because it was native to that cloud. In retrospect I should have used Terraform to port tests to other clouds.

I designed each template according to published best practices and tribal expert knowledge. The first thing I noticed is that the best practice large Swarm cluster has exactly the same architecture as the demo clusters I had created a hundred times before. The Kubernetes cluster required some changes from the architecture provisioned by kube-up. A diagram of both architectures is below.

Left: Swarm nodes and manager have a direct dependency on the KV store. Clients connect directly to the Swarm manager. Right: All Kubernetes nodes, master components, and clients communicate with a horizontally scalable API through a load balancer. Only the API servers communicate with etcd.

In the above diagrams “Docker Group” and “Node Group” are the group of 1000 nodes for each cluster. Those are the machines that run the containers requested by a user. The machine labeled “test-node” ran the tests.

An interesting note is that the Swarm architecture is similar to Omega as described in the Google article, Borg, Omega, and Kubernetes. In both systems trusted control plane components have direct interaction with the database. Kubernetes centralizes all communication between components through a set of horizontally scalable API servers. I suspect that difference both allows Kubernetes to do more with a single database at scale and is a source of increased operation latency.

The Kubernetes API service allows you to federate the database by resource and group types by setting the --etcd-servers-overrides flag. I segregated hot data (such as metrics and events) from operational resources such as jobs and pods. The Kubernetes large cluster configuration guide describes federating the database. Achieving clusters of 1000 or more nodes requires doing so.

Consider the different sequences for starting a container and how that might be reflected in the performance data. Each grey column represents a different host on the network.

To start a container in Swarm (1) a client connects to the Swarm manager. The manager executes a scheduling algorithm based on resource data that it has already collected. Then (2) the manager makes a direct connection to the target node, and starts the container. When the node starts the container it responds to the manager. The manager immediately responds to the client. The client’s connection is open for the duration of the operation.

This sequence is a based on my rough understanding of the system architecture and may contain errors. Most intra-process and noncausal interactions (an API server or component did something because it is running, not because it is reacting to a request) have been ignored.

In Kubernetes (1) a client connects to an API server through a load balancer, which (2) creates a resource in etcd. Etcd executes its state machine and (3, 4) updates watchers about the new resource through the API servers. Next, (5, 6) the controller-manager will react to the new job request and create a new pod resource by calling the API servers again. Again, (7, 8) etcd updates watchers about the new resource via API servers and (9) the scheduler reacts by making a scheduling decision and (10, 11) assigning the pod to a node via API servers. Finally, (12, 13) etcd and the API server notify the target Kubernetes node about the assignment and (14) it starts the container. After the container has started (15, 16) the state change is recorded in etcd via the API servers.

The Kubernetes workflow involves several more trips over the network, through the API, and resources in the database than Swarm. At the same time the Swarm architecture lacks a central point of control and a component to relieve pressure on the database as features are added in the future.

At scale Kubernetes API machines were running between 15 and 30% CPU. The database performed similarly. The Swarm manager used about 10% CPU usage and the database somewhere near 8%.

I suspect that Swarm could scale up to another few thousand nodes before the database began to experience problems. As Kubernetes scales the architecture will need more API server hosts (you’ll know when by keeping an eye on your CPU utilization on those machines).

Can I support these systems at scale?

Everyone has bias and opinions. Marketing material, painful experiences, comment threads, and water-cooler conjecture fuel that bias. Instead of making unquantified blanket statements I propose two means of estimating adoption and support effort required to own a system. I’d love to hear from you if you’d like to propose alternative definitions or improve my estimates. The results of my analysis (for the impatient) are:

Docker Swarm is quantitatively easier to adopt and support than Kubernetes clustering components.

Adoption effort is the effort required to go from beginner to operationally proficient on a new technology. This applies when teams adopt a new technology and each time that team adds a new member. Observing that the learning curve for any given component is proportional to the number of abstractions provided by that component, I propose calculating an adoption effort index (AEI) with the following:

adoption effort = component count * abstraction count

Like the AEI, a support effort index (SEI) is an indicator for the level of effort required to support a platform on an ongoing basis. Based on my own experience I propose that this index should be proportional to the average inter-component interaction sequence length. The SEI reflects the number of opportunities for an operation to fail. I propose calculating it with the following formula:

support effort = component count * interaction sequence length 

It is important to have criteria like these so that the community can focus on specific talking points. Conversations otherwise tend to form nebulous and emotional quagmires. It is difficult to make progress under those conditions.

Kubernetes provides a superset of functionality to Swarm, but to compare apples to apples these lists of abstractions only include those required by the test. The tested Kubernetes configuration has the following components:

  • Docker Engine
  • kubelet binary
  • Pod-master
  • Controller-manager
  • Scheduler
  • API server
  • etcd

A production configuration with networking would also include:

  • Flannel
  • kube-proxy
  • kube-dns

The tested Swarm configuration has the following components:

  • Docker Engine
  • Swarm Manager
  • Swarm in Join Mode
  • etcd

A Swarm cluster with multi-host networking would only add a single component: libnetwork — which is part of the Docker daemon — or another tool like Flannel, Weave, or Project Calico.

Eliminating the common components (Docker Engine and etcd) and discounting several low impact components like the load balancer results in a component count of 5 for Kubernetes and 2 for Swarm.

Docker and Swarm provide the same functional abstractions, but Swarm additionally abstracts the cluster itself as a single machine. These common abstractions are:

  • Containers
  • Images

Kubernetes also provides the “cluster” abstraction and additionally uses the following abstractions in the test:

  • Containers
  • Images
  • Pods
  • Jobs

The result is an abstraction count of 3 for Swarm and 5 and for Kubernetes. Kubernetes is feature-rich and this number omits “replication controllers,” “services.” Adopting those abstractions would increase the AEI.

Plugging these values in to the adoption effort index equation results in an adoption effort index of 25 for Kubernetes and 6 for Swarm.

Comparing support effort indexes can be evaluated with fuzzier math. To my knowledge all Swarm operations involve a single inter-component interaction to complete. List operations might use many such interactions in parallel. The shortest operation in Kubernetes might be the list operation which requires at least two steps. It is my understanding of the job creation operation that it requires 5 or more interactions.

The support effort index for Swarm is 2. Kubernetes is at least 25.

In reflecting on these numbers I want to make sure it is clear that it is not my intent to pick on Kubernetes. These numbers reflect that Kubernetes is a larger project, with more moving parts, more facets to learn, and more opportunities for potential failure. Even though the architecture implemented by Kubernetes can prevent a few known weaknesses in the Swarm architecture it creates opportunities for more esoteric problems and nuances.

The effort investment might be one an adopter is willing to make in exchange for the expanded feature-set currently offered by Kubernetes.

Call to Action

The engineering community at large seems enamored with containers and these wonderful tools. I am delighted each time I have the opportunity to experiment with or adopt one. These force multiplying tools are helping the whole community push technological progress faster than ever before.

But there is a difference between having a tool and understanding that tool’s strengths and weaknesses relative to the greater ecosystem. It is difficult to improve any tool that you have not measured. Further research and development like this would benefit the whole community. Open and repeatable benchmarks, approachable analysis, and clear specific criteria for evaluation will help decision makers do so from an informed and data-driven context.

Please help us further the conversation. We need data on other stacks, cloud providers, and cluster configurations. We need an expanded common benchmark that covers more features. We need a more extendable test harness. We need more people interested in taking the conjecture out of decision making.

All the code and templates I used are hosted on GitHub. Please run the tests yourself and make or suggest improvements. If you’re inspired and build something on your own don’t hesitate to share it with the world. We need the data.

Cloud Container Cluster Common Benchmark: https://github.com/allingeek/c4-bench

Evaluating Container Platforms at Scale Data: https://github.com/allingeek/ecps-data