Comparing Amazon Elastic Container Service and Google Kubernetes

If you work in software you’ve heard the words “containers” and “docker” often enough over the last couple of years. As with any real, substantive change there is good reason for the swell in attention. Containers are a huge leap forward in the deployment of software, and enough has been said about them that I won’t add to it here. Instead I’m going to talk about why containers alone are not enough, and then take a look at Amazon’s Elastic Container Service and Google’s Kubernetes, two platforms that aim to close the gap between “docker build” and “Yay it works!”

This piece doesn’t attempt to be a detailed hands-on review of either platform. While I have worked with Kubernetes daily for almost six months I have only been looking at Amazon for a week. That’s enough time to get the lay of the land, but not enough to develop the expertise for a deep dive. Instead I would like to give a high level comparison of the features and design trade-offs as I understand them, and perhaps it’s best to begin with what these platforms are, and why they exist at all.

“Orchestrating” Containers

A docker container is a beautiful thing, but once you’ve built your image what do you have? Like a desalination plant sitting in the customs yard of an arid nation your container is an engineering marvel of unrealized potential. It’s not plugged into anything; it’s not doing anything valuable. What is needed is the site, wiring and piping to power the plant up and get its water to thirsty people. Likewise a container by itself is not a working system. A system is a collection of containers running on hosts in some environment, connected to clients and each other. This is even more clearly true if you pursue a “microservices” architecture.

In order to get a system you have to deploy several, perhaps even many containers, and there’s the rub. Deploying a single container to a host is an operation you can get wrong. You can deploy the wrong image, set the wrong environment variables, open the wrong ports, run under the wrong user. The list of things you can screw up is still unacceptably long, and when you multiply that risk across many individual container deployments, well guess what? Your life still sucks. What we want is to make building systems as reliable and repeatable as building containers.

Cluster management software gets us a good part of the way there. A “cluster” is a group of hosts optimized for running containers, and cluster management software aims to make defining and provisioning clusters of container hosts as easy as possible. The requirement is to provide a means of defining the physical makeup of a cluster (how many machines, what base OS, what networks) and of repeatably recreating the cluster from that definition. That’s a big win from an infrastructure standpoint but it gets us only part of the way to where we’re going. The abstraction we want to focus on is our applications, not clusters of hosts.

Cloudy, with a chance of docker

Not shockingly, much of the evolution of software deployment toward containers and orchestration platforms is playing out in the clouds. Cloud provisioning of compute resources became a thing because it provided an easy “API” to all the expensive drudgery of managing hardware and telecom assets. In the same way containers provide an interface that hides away much of the complexity of configuring a specific host to run a specific application. The two ideas are a natural fit, and so cloud providers have been among the most aggressive adopters of these technologies, and have rolled out management platforms to take advantage of them.

Amazon is the nimble giant of cloud providers, doing almost $8 billion in business in 2015. In April of that year they announced general availability of Elastic Container Service, a container orchestration offering built on their “we actually do run the Internet” platform: EC2. Google is a comparative baby in the cloud business, but a major player in many other things, and a major adopter and driver of container technology. Their internal orchestration platform, known as “Borg,” was the inspiration for their open source Kubernetes project, which is in turn the foundation of Google Container Engine. From here on I will refer to these two platforms as ECS and GKE (Google already has Compute Engine, so GCE was taken).

Comparing these two platforms is interesting because it shows the differences which arise from varying constraints. On the one hand you have one of the industry’s most committed advocates of containers using their own open source technology to build out a cloud business, while on the other you have the biggest cloud provider in the world assembling a solution from the mostly-proprietary pieces of its already wildly successful platform. They’re both offering solid products that try to take you to the same place, but as I went through Amazon’s ECS quickstart and compared it with my experiences on GKE one impression began to solidify: perhaps the most important difference between the two platforms is their origin.

Google’s GKE is built on Kubernetes, and Kubernetes was designed from the ground up to be a provider-independent, functionally complete container orchestration platform. As such it can be run just about anywhere. Google runs it on GCP. Many people run it on AWS. Some people I know have installed and run the whole thing at home. In order to succeed at this game the platform has to be complete, providing all or most of the services needed to achieve the goals described above. For that reason Kubernetes feels more like a cohesive “thing” than ECS. It offers a consistent set of abstractions, and under the hood they are mapped into the provider-specific world.

Getting Started

These differences begin to become apparent right from the beginning. The first couple of steps are nearly identical on both platforms. You create an account, you download an SDK. Both providers have functional command line clients, both have one client for the main cloud platform (aws and gcloud) and another for specific container service commands (ecs-cli and kubectl). Both have web consoles that are pretty highly developed, although we won’t dwell on them any more than we have to because console actions don’t capture intent, can’t be version controlled and are in general evil.

If you are going to use the command line tools (and even if you don’t your build automation eventually will) one of the first things you’re going to run into is how the two providers handle authentication. I’ll focus on the client side experience for the developer, and leave server-side permissions and accounts for another time. Google uses oauth to authenticate your client, and then manages everything for you, including authenticating the docker client to the registry. They do this by wrapping the docker client commands and handling the auth setup and teardown. So for example to push an image to the registry you do ‘gcloud docker push’ and it just works. If you do ‘gcloud compute config-ssh’ it will set up keys and host file entries so you can ssh into your cluster nodes very easily.

Amazon does not use oauth, and instead provides an ssh key pair for use when accessing your nodes with ssh, and separate credentials for use in making API calls. There are at least two different authentication realms. The first is access to the image repositories in your account. Instead of wrapping the docker client in something like ‘aws docker push’ there is an ‘aws ecr get-login’ command that returns a pre-built ‘docker login’ command to be run with your local docker client. Once done the client will be able to push and pull from your registry for twelve hours. If you want to use the docker client with a different registry in the meantime you’ll have to ‘docker login’ to that one first.

The second authentication realm is that of the ECS command line tool ecs-cli, and is paralleled by the kubectl tool that Google uses. These utilities provide interfaces to the cluster management software itself, and need to be authenticated to be able to execute API calls against it. In both cases you first authenticate the general-purpose cloud tool, and then call a command to authenticate the cluster management tool. The main difference here is that again Google wraps this all up pretty seamlessly. You just oauth your main client and then execute one command to config the cluster management client for a specific cluster. In the Amazon case you have to run a command and paste in your access and secret keys.

Those are minor differences, and in either case you’ll soon be past them and ready to get some containers running. With the basic stuff out of the way what do you have to do to launch your first cluster? On GKE there are two steps: 1) enable the GKE-related APIs for the project you’re working in; 2) execute the ‘gcloud container clusters create’ command with the appropriate arguments. When you’re done you have a cluster with running hosts and all the necessary software, networks, and routing ready to go. The Amazon process is a bit more complicated, and some of this stems from the fact that they have IAM roles and policies and Google does not. On Google’s cloud you have projects, and you can assign accounts to them as owners, editors, or viewers. It’s granular enough for many things, but nowhere near as configurable and powerful (or some might say, incomprehensible) as Amazon IAM.

So step 1 on the Amazon platform (even before configuring the command line tools discussed above) is to create an IAM user and log into it. You then create a key pair and save it so that you can ssh into your container instances later. With the key pair in hand you have two paths available: go through the ECS console first-run experience, or use the cli to set things up yourself. If you choose the former (and relatively newly available) option then the console’s CloudFormation script will take care of some things for you. Otherwise you will have to create two IAM roles and associate specific policies with them. One is to allow ECS to make calls into EC2 to get stuff done, and the other is to allow ECS to make calls into ELB (Elastic Load Balancer) to get stuff done.

Having done that you are now ready to run one command to create your cluster… er, no you’re not. If you continue with the console wizard then you will eventually get a cluster. Amazon calls this the “default” cluster and it is special in that it is the cluster the system will deploy stuff to if you don’t tell it differently. Maybe the single default cluster is all you need, and that will be a topic in the next section, but either way if you create it in the console CloudFormation is again going to do a boatload of stuff for you. This includes creating an auto-scaling group, a vpc, a subnet, instances in the scaling group, etc. If instead of using CloudFormation you just go to the ECS console and click “Create cluster” and then provide a cluster name … you get none of this.

Clusters and Nodes and Container Instances

A cluster, as defined above, is a group of hosts capable of running docker containers. More than that, to actually be a cluster the group must have some shared state that is persistent, and that describes the current condition of the hosts, what is running on them, and so on. It is this control plane that actually grants them the property of membership in the cluster. Because the state is shared it must be accessible to all the hosts. Because it is persistent it must not reside exclusively on any of them. The solving of this problem is a big part of what a cluster management platform does.

Kubernetes solves it with a replicated key-value store based on etcd from CoreOS. Again because Kubernetes is a thing in its own right, this database is implemented in containers that run on the cluster. Amazon used one of its own core systems, a paxos-based transactional database that I believe is implemented as a service off the cluster, according to the architectural drawings I’ve seen. They both fill the same role as a central source of truth for the code running inside the cluster.

In addition to this central store each host needs some code running on it to interact with the cluster control plane, manage the containers running locally, etc. On Amazon this is the ECS Agent, and on GKE it is a set of components including the kubelet and kube-proxy. Again they do, very generally speaking, the same job. With this software installed and configured the hosts cease to be mere instances and become members of the cluster. In Kubernetes the cluster members are referred to as “nodes”, which replaced the earlier and more colorful “minions.” ECS refers to cluster members as “container instances” which is as colorful as cottage cheese. For the rest of this post I will use the term “nodes” because it’s only four letters.

So how do we create nodes and get this stuff running on them? As described in the last section GKE has a single command that creates a cluster and all its resources. Actually the command is implemented by the ‘gcloud’ tool, which uses the Google Cloud Platform management APIs to do its work. Cluster creation and functions directly related to the health of nodes (scaling and failover) are a function of the cloud platform; management of the stuff running on the cluster is a function of the container orchestration system. On GKE this distinction is clearly perceptible in the API and tools. In ECS these tasks span multiple services and their interfaces unavoidably come bubbling up to the top.

So in the Amazon world if you want to create a cluster you first use whatever interface to EC2 you wish, and either create individual instances or an auto-scaling group. But this is not enough by itself. If you spin up instances and then go create a cluster you won’t see them listed when it comes time to populate it. That is because the ECS agent software is not on them, so it has not run when they started, and has not announced itself as belonging to your cluster. To launch “container instances” rather than regular instances you must either select the Amazon standard ECS optimized AMI as the image for the instances, or you must select another AMI and run the install scripts for ECS yourself. You do this either by launching the instances manually or creating an auto-scaling group with a CloudFormation script to configure them.

Once you’ve done this and launched the instances they will be available to your cluster… if your cluster is named “default.” If your cluster is not named “default” then you will need to paste a small bit of script into the “user data” field of the advanced settings for each instance to set an environment variable with the cluster name. Now you will see the instances available to be added to your cluster in the ECS console. If that seems like a lot of complexity to you, I would agree. On the other hand most or all of this will eventually be scripted, and in at least one way Amazon does give you a bit more flexibility than Google. On GKE you must select from a list of machine types, and the machine type must be the same across the cluster. On Amazon you can use any kernel-compatible AMI as long as you’re willing to install the ECS Agent.
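
For reference, the “small bit of script” for a cluster that is not named “default” can be sketched as below. This is a minimal illustration, assuming the ECS agent reads its cluster name from the ECS_CLUSTER setting in /etc/ecs/ecs.config, as it does on the ECS-optimized AMI. The helper name ecs_user_data and the cluster name “my-cluster” are made up for the example.

```python
def ecs_user_data(cluster_name: str) -> str:
    """Build an EC2 user data script that points the ECS agent at a cluster.

    Assumption: the agent reads ECS_CLUSTER from /etc/ecs/ecs.config,
    as on the ECS-optimized AMI. The function name is hypothetical.
    """
    if not cluster_name or any(ch.isspace() for ch in cluster_name):
        raise ValueError("cluster name must be non-empty with no whitespace")
    return (
        "#!/bin/bash\n"
        f"echo ECS_CLUSTER={cluster_name} >> /etc/ecs/ecs.config\n"
    )


if __name__ == "__main__":
    # Paste the printed script into the instance's "user data" field.
    print(ecs_user_data("my-cluster"))
```

Generating the script from a function rather than pasting it by hand is one way to keep the cluster name in source control along with everything else.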

That last bit is important, and takes us back to the idea of the Amazon console creating a default cluster for you. One of the key questions that fascinates me as a developer working with container platforms on a daily basis is this: what is the ideal make-up of a cluster? Should clusters be tailored to workloads? Some services require a lot of memory. Some need more CPU. For others network I/O may be the only bottleneck. Theoretically, once this evolution has reached its logical conclusion I should simply be able to deploy a single pool of compute resources sufficient to meet my needs, and use control groups and other mechanisms to partition those resources between applications.

Running Containers

Now we can get down to the meat of it. Running containers with our software in them is what our cluster is supposed to do, and while running

docker run -d -p 80:8000 -e "VAR=val" my-web-app

is sufficient for local testing, recall that one of the things we desire from container orchestration is to take the risk of human error out of the equation: we would like everything about the way our containers are deployed to the cluster to be declared in source controlled scripts that make it possible to recreate them accurately whenever we wish.

In that context the name of an image gives you only a very small piece of the puzzle. A way of configuring all the other stuff is needed, and both Kubernetes and ECS do this with similar abstractions. Indeed the differences here are primarily a matter of who went first and who is playing catch-up, along with the subtle shifts in perspective you get whenever two different brains look at a problem. Although I am going to discuss these things as abstractions from a certain distance, the best way to think about them is as objects which are created and saved in the cluster state, are accessible to a CRUD API, and drive state changes in the software that runs the cluster and nodes.

At the lowest level both platforms define a minimum building block that consists of 1..n containers running as a composite unit on the same host. In Kubernetes this building block is known as a pod, and on ECS it is called a task. Both specify which containers make up the group, and for each container what image it is based on. Beyond that there are additional properties to express the port mappings, volume mounts, environment variables, resource limits, and other parameters. The differences here are pretty slight and may relate mostly to maturity. Google offers more options for different kinds of volume mounts, for example. You can create these objects directly in both platforms — to run a one-time task for example. Typically, though, they will be created from higher level resources.
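
To make the similarity concrete, here is a rough sketch that renders one container description as both a Kubernetes pod and an ECS task definition. The field names follow the Kubernetes v1 Pod and ECS task definition formats as I understand them; the neutral "container" dictionary, the image tag, the label, and the memory value are assumptions made up for the example.

```python
import json

# A hypothetical, platform-neutral description of one container.
container = {
    "name": "web",
    "image": "my-web-app:1.0",
    "container_port": 8000,
    "host_port": 80,
    "env": {"VAR": "val"},
}


def to_pod(c: dict) -> dict:
    """Render the container as a Kubernetes v1 Pod manifest."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": c["name"], "labels": {"app": c["name"]}},
        "spec": {
            "containers": [{
                "name": c["name"],
                "image": c["image"],
                "ports": [{"containerPort": c["container_port"],
                           "hostPort": c["host_port"]}],
                "env": [{"name": k, "value": v}
                        for k, v in c["env"].items()],
            }]
        },
    }


def to_task_definition(c: dict) -> dict:
    """Render the same container as an ECS task definition."""
    return {
        "family": c["name"],
        "containerDefinitions": [{
            "name": c["name"],
            "image": c["image"],
            "memory": 128,  # MiB; illustrative value only
            "portMappings": [{"containerPort": c["container_port"],
                              "hostPort": c["host_port"]}],
            "environment": [{"name": k, "value": v}
                            for k, v in c["env"].items()],
        }],
    }


if __name__ == "__main__":
    print(json.dumps(to_pod(container), indent=2))
    print(json.dumps(to_task_definition(container), indent=2))
```

Both outputs carry the same information as the docker run command earlier in this section, just written down as data that can be checked into source control.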

And that is where things start to head in interestingly different directions. A single pod or task can be used to represent a single functional aspect of a complex system. A good example might be a logging pod that runs redis and logstash containers, and exposes the redis port 6379 on the host. To install this service you could just create a pod or task as mentioned above. The result would be more durable than the same software installed onto “bare metal” because the pod or task that owns the containers is at least responsible for keeping them running. Failed containers will be restarted, but if some external condition is causing them to fail then they might just keep crashing too. It isn’t hard to imagine situations in which our logging service becomes unavailable for a period of time.

The standard solution to this problem is to have more than one instance of each service with a reverse proxy directing requests to them. If one goes down the proxy sends the requests to the other until it is back up. To implement this pattern on a clustered architecture we need some concepts that can group like pods and tasks together as a functional unit. At the very least we need some structure in which we can say how many we want, preferably giving minimums and maximums along with some scaling criteria. At best we would also get service discovery, naming, and request routing. In this area especially there is a pretty wide gap between the two services, as we’ll see.

Kubernetes defines two independent abstractions that govern the behavior of pods, and in character and function they are quite different from the approach ECS has taken. The first of these is called a replication controller. A replication controller is a piece of software that creates and controls pods of a specific kind. Typically the kind of pod is defined by a section of the replication controller object called a template, in which all the properties of a pod as described above can be set. Replication controllers are responsible for creating a certain number of pods and for keeping them running. In order to respond to variations in request volume a replication controller can collaborate with another component called an auto-scaler, which monitors load via defined checks and then adjusts the replica count of the replication controller accordingly.
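
The core idea of a replication controller can be sketched as a small control loop. This is a toy model for illustration only, not how Kubernetes implements it: the class, its reconcile method, and the pod naming scheme are all made up here.

```python
import itertools


class ReplicationController:
    """Toy model: keep `replicas` pods built from `template` running."""

    def __init__(self, name: str, replicas: int, template: dict):
        self.name = name
        self.replicas = replicas
        self.template = template
        self._ids = itertools.count(1)

    def reconcile(self, running: list) -> list:
        """Return the pod list after one pass of the control loop."""
        pods = list(running)
        # Too few pods: stamp out new ones from the template.
        while len(pods) < self.replicas:
            pods.append({"name": f"{self.name}-{next(self._ids)}",
                         **self.template})
        # Too many pods: drop the surplus from the end of the list.
        return pods[:self.replicas]


if __name__ == "__main__":
    rc = ReplicationController("web", 3, {"image": "my-web-app:1.0"})
    pods = rc.reconcile([])          # nothing running yet: start three
    pods = rc.reconcile(pods[1:])    # one pod died: a replacement appears
    rc.replicas = 2                  # an auto-scaler lowers the count
    pods = rc.reconcile(pods)
    print([p["name"] for p in pods])
```

Note that the auto-scaler in this sketch never touches pods directly. It only changes the replica count, and the controller does the rest on its next pass.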

The second of the two independent abstractions that govern pods in Kubernetes is the service, and looking at it brings into view a whole area of functionality that ECS does not yet have: service discovery, name resolution, and internal load balancing. This subject is a complex one, with interesting solutions to puzzling problems, but probably needs an article of its own to be treated in any detail. For now what is important to know is that a Kubernetes service selects a set of pods and provides routing to them by service name. Underlying this capability is a component called kube-proxy running on each node, along with SkyDNS for name resolution. In practice this means that any service running in a Kubernetes cluster can address any other service by name, and the requests will be round-robin load balanced to as many instances of the pods running that service as are available to handle it.
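
The behavior described above, a stable name in front of a changing set of pods with requests rotated among them, can be sketched in a few lines. This is a toy model under stated assumptions: the Service class, the cluster_dns dictionary, and the addresses are invented for illustration and do not reflect how the real proxy and DNS components are built.

```python
class Service:
    """Toy model: a stable name in front of a changing set of pod addresses."""

    def __init__(self, name: str, endpoints: list):
        self.name = name
        self.endpoints = list(endpoints)
        self._next = 0

    def route(self) -> str:
        """Pick the pod address for the next request, round-robin."""
        if not self.endpoints:
            raise LookupError(f"no pods available for service {self.name!r}")
        address = self.endpoints[self._next % len(self.endpoints)]
        self._next += 1
        return address


# Stand-in for cluster name resolution: service name -> service object.
cluster_dns = {
    "logging": Service("logging", ["10.0.1.5:6379", "10.0.2.7:6379"]),
}

if __name__ == "__main__":
    # A client only knows the name "logging", never a pod address.
    for _ in range(3):
        print(cluster_dns["logging"].route())
```

The point of the indirection is that pods can come and go behind the name without any client configuration changing.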

Amazon’s solution, by comparison, is missing a lot of these pieces. ECS defines an abstraction called a service that groups tasks. In the main it has the same role as a Kubernetes replication controller, but also does the work of the autoscaler, and a small piece of the job of a Kubernetes service. A service on ECS, like a replication controller, is responsible for keeping a certain number of tasks of a certain type running. It encompasses auto-scaling rules, and has properties that will connect it to an automatically provisioned ELB if you want routing of requests from outside. It lacks Kubernetes’ support for service discovery, internal name resolution, and internal load balancing. In those cases where you need “horizontal” connections between services it is typical to use tools like Consul or Weave to implement discovery.

There are some other differences, with which I’ll wrap up this section. The first thing worth discussing is health checks, or more specifically liveness and readiness checks. By default on both platforms the controller in charge of a pod or task will consider a container to have died when the process inside of it exits normally or otherwise. When the container dies the controller will attempt to restart it. However it is not unusual to have cases where a process is running but unable to service requests. Kubernetes supports a fairly powerful set of separate checks for liveness and readiness. You can define an http endpoint or run a custom command in the container and decide based on its output. ECS has some of this capability when you wire a service to an ELB. The load balancer health checks come into play at that point, but for internal services there seems to be no way to refine what “alive” means.
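
As a rough illustration of the two kinds of Kubernetes checks, the sketch below builds a pod manifest with an HTTP liveness check and a command-based readiness check. The probe field names are those of the Kubernetes v1 API as I understand it; the image, the /healthz path, the /tmp/ready file, and the timing values are assumptions chosen for the example.

```python
import json

# Container section of a pod spec with separate liveness and readiness
# checks. Paths, ports, and timings are illustrative values.
container = {
    "name": "web",
    "image": "my-web-app:1.0",
    "ports": [{"containerPort": 8000}],
    # Restart the container if this HTTP check keeps failing.
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8000},
        "initialDelaySeconds": 15,
        "timeoutSeconds": 1,
    },
    # Hold traffic back until this command exits with status 0.
    "readinessProbe": {
        "exec": {"command": ["cat", "/tmp/ready"]},
        "initialDelaySeconds": 5,
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {"containers": [container]},
}

if __name__ == "__main__":
    print(json.dumps(pod, indent=2))
```

Keeping the two checks separate matters: a process that is alive but still warming up should be kept out of rotation, not restarted.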

The second thing to look at is the relationship between the kinds of controllers we’ve discussed in the last few paragraphs, and the pods or tasks they are controlling, and here we see another significant difference in design. Replication controllers and services in Kubernetes select pods to manage based on the value of “selectors” which are just labels or tags applied to a pod. This gives the structure a loosely coupled character. You can destroy and recreate a service without affecting the replication controller or pods, for example. You can remove a label from a pod and it will drop out of a service, or add one and it will join. In ECS the relationships are hard-coded in the task and service definitions, so the structure is much more tightly coupled in that sense.
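
The loose coupling is easy to see in a small sketch of label matching. This models only equality-based selectors, and the pod names and labels are made up for the example.

```python
def matches(labels: dict, selector: dict) -> bool:
    """True when the labels contain every key/value pair in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(pods: list, selector: dict) -> list:
    """Names of the pods currently picked up by a selector."""
    return [pod["name"] for pod in pods if matches(pod["labels"], selector)]


if __name__ == "__main__":
    pods = [
        {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
        {"name": "web-2", "labels": {"app": "web", "tier": "frontend"}},
        {"name": "spare-1", "labels": {"tier": "frontend"}},
    ]
    service_selector = {"app": "web"}

    print(select(pods, service_selector))   # the two web pods
    del pods[1]["labels"]["app"]            # remove a label: web-2 drops out
    pods[2]["labels"]["app"] = "web"        # add a label: spare-1 joins
    print(select(pods, service_selector))
```

Nothing in the service or the controller names a pod directly, which is why either side can be destroyed and recreated without disturbing the other.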

The last point to touch on before I wrap up with a brief look at image repositories is versioning. ECS treats task and service definitions quite differently from Kubernetes. On Kubernetes a pod, replication controller, or service is a single named object serviced by a CRUD API. You can create, update, and remove the objects at any time, and the state of the cluster will follow along with your actions. On ECS a task or service definition is registered with the system, and then instances of this registered definition are applied to the cluster. So in a sense you have classes as well as objects. Moreover ECS stores a chain of named versions attached to each definition. Your first is myTask:1, for example, and an edit will create myTask:2.
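
The “classes as well as objects” idea can be sketched as a registry that never overwrites a definition and hands back a new revision on every registration. This is a toy model, not the ECS API: the class and method names are invented, and the only detail borrowed from ECS is that it writes revisions in the form family:revision.

```python
class TaskDefinitionRegistry:
    """Toy model of registered, versioned task definitions."""

    def __init__(self):
        self._revisions = {}  # family name -> list of definitions

    def register(self, family: str, definition: dict) -> str:
        """Store a new revision; earlier revisions stay untouched."""
        revisions = self._revisions.setdefault(family, [])
        revisions.append(dict(definition))
        return f"{family}:{len(revisions)}"

    def get(self, reference: str) -> dict:
        """Look up a definition by its 'family:revision' reference."""
        family, _, revision = reference.rpartition(":")
        revisions = self._revisions.get(family, [])
        number = int(revision)
        if not 1 <= number <= len(revisions):
            raise KeyError(reference)
        return revisions[number - 1]


if __name__ == "__main__":
    registry = TaskDefinitionRegistry()
    first = registry.register("myTask", {"image": "my-web-app:1.0"})
    second = registry.register("myTask", {"image": "my-web-app:1.1"})
    print(first, second)
    # Falling back means running the earlier revision again.
    print(registry.get(first)["image"])
```

On Kubernetes the equivalent history would live in your source control rather than in the platform itself.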

Ultimately the version of a pod or task is a combination of the version of the object that defines it, as well as the version of the image used to launch the containers inside it. If you are careful about tagging images when you push builds, then falling back is possible on both platforms, but the process would be a little different. On ECS you would activate the previous stored version of the task, which will reference the previous tagged version of the image. On Kubernetes you would most likely revert the pod (more likely the replication controller) definition to the last version, which also references the previous image tag, and then recreate the controller. Other scenarios are certainly possible, but in all of them the bottom line is that what runs in the cluster is what was installed into the image you pulled. And that is about as good a segue as I can come up with into the final topic of image registries and repositories.

Storing Images

Docker images are composed of binary layers that, taken together with each one representing a diff from the previous, make up a complete file system. The mechanism for transferring images, i.e. named collections of layers, between hosts is a central registry. When you build an image it is stored in your local image cache, and you then execute commands to “push” it to a centrally located registry that can serve as a distribution point… a download mirror of sorts.
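
The layering idea can be sketched as a sequence of diffs applied in order. This is a simplified model, assuming each layer is a mapping of paths to contents; the DELETED marker and the file names are made up for the example and say nothing about how Docker stores layers on disk.

```python
# Marker meaning "this path was deleted in this layer" (made up for the sketch).
DELETED = object()


def flatten(layers: list) -> dict:
    """Apply each layer's diff in order and return the resulting file system."""
    filesystem = {}
    for layer in layers:
        for path, content in layer.items():
            if content is DELETED:
                filesystem.pop(path, None)
            else:
                filesystem[path] = content
    return filesystem


base = {"/bin/sh": "shell", "/etc/motd": "hello"}
app = {"/app/server": "binary v1", "/etc/motd": DELETED}
patch = {"/app/server": "binary v2"}

if __name__ == "__main__":
    print(sorted(flatten([base, app, patch]).items()))
```

Because the base layer is unchanged between builds, a push or pull only needs to move the layers the other side does not already have.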

For many users the Docker Hub is the central repository of choice. Not long ago the hub replaced the original Docker Registry with a new content-addressable storage scheme and a more consistent API. The registry software itself is open source, and anyone can operate a private registry. Both ECS and Kubernetes can pull images from public or private Docker Hub repositories, so this remains an option, but each also offers private registries to users and for most production deployments these should be the best option, being both private and presumably closer in a network sense to the hosts that want the images. This is no small consideration. Docker containers streamline so much of the deployment process that in many cases push/pull times end up being the major factor in how long it takes to deploy a new build.

I don’t want to go into much detail about container registries in this post, primarily because they really have nothing to do with container orchestration per se. You do have to get your images from somewhere. You could stream them to your hosts without using a registry of any kind, but chances are very good you will use one, and that if it isn’t the Docker Hub then it will be either Google’s or Amazon’s, so I’ll briefly describe the differences as I see them. I haven’t had as much time to explore the way Amazon’s registry works, and I have some unanswered questions I am still pursuing, but even so a general description won’t hurt.

One of the problems for me in talking about image registries and repositories is that I’ve never been able to get the terms absolutely clear in my own mind. The Docker docs and tools don’t help, as they often use the word tag to mean either the name of an image or the version stamp that follows the : at the end of one, and they also use the words repository and image in similar situations. One way to get some clarity is ironically just avoiding the docs. If you do what comes naturally as a developer things seem to work out. That is, if you build an image to run redis you would likely name that image “redis” or “my-redis” or something, and it wouldn’t be ridiculous to tag it with the version. So you might push and pull it as “my-redis:3.0” for example. If you had an account on the Docker Hub as “george” then you would push it as “george/my-redis:3.0”. Easy peasy. Makes a ton of sense.
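
One way to pin the pieces down is to split a reference like “george/my-redis:3.0” into its parts. The sketch below is deliberately simplified: it assumes the account/name:tag shape used in this post, ignores registry hostnames and ports, and uses field names (account, name, tag) that are my own rather than Docker’s terminology.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    account: Optional[str]   # "george", or None for a bare name
    name: str                # "my-redis"
    tag: str                 # "3.0"; docker assumes "latest" when omitted


def parse_image(reference: str) -> ImageRef:
    """Split 'account/name:tag' into parts. Simplified: no registry host."""
    path, sep, tag = reference.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the path.
        path, tag = reference, "latest"
    account, _, name = path.rpartition("/")
    return ImageRef(account or None, name, tag)


if __name__ == "__main__":
    print(parse_image("george/my-redis:3.0"))
    print(parse_image("my-redis"))
```

Seen this way, “my-redis:3.0” and “my-redis:3.1” are two tags on one image name, which is the relationship the rest of this section leans on.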

Is “george” in that example an account? Definitely yes. A registry? A repository? I have no idea. What I do know is that on the Docker Hub and on Kubernetes “george” is 1:n with images+tags. That is, I can have “george/my-redis:3.0” and “george/my-elastic:2.0” etc. It looks and acts like a library of individual images, which is what I expect. If I push “george/my-redis:3.1” then I would expect to see both tagged versions and indeed that is the case, and one of the main features supporting graceful updates and fallbacks.

Amazon’s offering is called Elastic Container Registry, and what appears to be the case is that each account gets an instance of this registry and can then create repositories in it. Here is where my understanding gets murky, but I can at least describe what I experienced and observed. On Google I can push an image to any project at any time and that image appears in the list of images in the registry for that project and can be immediately used in pods. If I execute the ‘gcloud docker push’ command and specify the registry url and the image “my-redis:3.0” then “my-redis:3.0” shows up in the library. This is good.

On Elastic Container Registry I was asked to create a “repository” during the first-run ECS console experience. I read “library” and named it “mark.” Later I authenticated the Docker client with the registry and tried to push “haproxy:1.6.” The push failed (after retrying many times over more than a minute) with the message that the repository did not exist in my registry. On a hunch I retagged the image as “mark:1.6” and it pushed. So at least in Amazon’s world “repository” and “image” do seem to mean the same thing, and it seems to be the case that you can’t push a new image without first creating a repository to hold it. The likelihood that this is true is probably equal to the chance that I’ve just misunderstood the whole thing. If I learn more I will follow up.

Final Thoughts

Container orchestration platforms like Kubernetes and ECS are going to completely change the way we deploy software in large-scale back end applications. If you’ve been in the business long enough then you have seen other changes come along that clearly were destined to succeed existing practices, and in my opinion this is one of them. So you’re very likely to adopt containerized builds and deployment in the future, and when you do it is very likely you’ll be running your containers on one of the big cloud providers. The intent of this article was not to advise on which, but just to give some high level impressions of the differences. In any case, having worked with Google’s offering for almost six months and Amazon’s for just a week I am not in a good position to judge between them.

Both platforms are fully-featured and capable. Both are production ready, a fact I know from personal experience where Google is concerned and which I am prepared to accept in Amazon’s case based on what I have read and been told. Where the Kubernetes platform possesses features that ECS does not have there are open source alternatives that can complete the picture. You can run Kubernetes on EC2, but let’s face it: if you buy into a cloud provider’s infrastructure the tendency is to go all in. It’s always easier, and in many cases more cost effective and strategically beneficial to let them deal with it.

From my perspective, and I think as demonstrated in this post, the Kubernetes offering is more cohesive and complete. Because it was designed and created to be portable and stand on its own it presents a more unified and cluster-centric view of your applications. The differences are in the details, in the way information is presented, and in how many different systems you must interact with to get things done. I don’t think there’s any question that ECS is more proprietary, and that implementing an application on it will result in more vendor lock-in than on the Kubernetes side. Some vendor lock-in to a cloud provider is probably unavoidable, but if that is a key factor for your business then Kubernetes is probably your better choice. If it’s not and you’re already an AWS customer then ECS is going to be a natural candidate, and hopefully this piece has given you some idea of what you can do, and what to expect when you make the move to containers in the cloud.

Written by

Senior Devops Engineer at Olark, husband, father of three smart kids, two unruly dogs, and a resentful cat.
