Container Orchestration with Kubernetes: An Overview

Adrian Chifor
Onfido Product and Tech
12 min read · May 30, 2017

In this post we’ll go through:

  • An introduction to containers and container orchestration and what the landscape looks like in 2017.
  • An overview of Kubernetes and how we provision, monitor and scale it on AWS.
  • How we deploy our microservices to Kubernetes at Onfido.

🐳 Containers are hot right now

If you’ve heard the term “containers” in a tech context before, it was most likely a reference to Docker or rkt (pronounced “rocket”). They are both open-source container engines that provide a layer of abstraction and automation of OS-level virtualisation on Windows or Linux.

You can think of containers as lightweight, scalable and isolated VMs in which you run your applications. You can link containers together, set security policies, limit resource utilisation and more.

There are a few key differences between Docker and rkt worth noting.

Docker was released first and it was already quite popular when CoreOS launched rkt in late 2014. I personally prefer rkt because of the lighter and more security-oriented implementation, but in this post we’re just going to look at Docker because it’s more popular and has more mature, battle-tested tooling.

So what happened to Docker in the last few years? As of 1st May 2017:

  • More than 14M Docker hosts, with more than 900K Docker apps.
  • 77,000% growth in Docker job listings in the last 3 years.
  • Docker adoption has increased 40% in just the last year.
  • More than 12B Docker image pulls (accounting for 390,000% growth).
  • Docker meetups are held in more than 280 cities, with more than 170K members worldwide.

Container Orchestration

Container Orchestration refers to the automated arrangement, coordination, and management of software containers.

Why do we need this? Let’s start with the following diagram:

If your current software infrastructure looks something like this (maybe Nginx/Apache plus a PHP/Python/Ruby/Node.js app running on a few containers that talk to a replicated DB), then you might not require container orchestration; you can probably manage everything yourself.

What if your application keeps growing? Let’s say you keep adding more and more functionality until it becomes a massive monolith that is almost impossible to maintain and eats way too much CPU and RAM. You finally decide to split the application into smaller chunks, each responsible for one specific task and maintained by a single team: microservices.

Your infrastructure now kind of looks like this:

You now need a caching layer — maybe a queuing system as well — to increase performance, be able to process tasks asynchronously and quickly share data between the services. You also might want to run multiple instances of each microservice spanning multiple servers to make it highly available in a production environment…you see where I’m going with this.

You now have to think about challenges like:

  • Service Discovery
  • Load Balancing
  • Secrets/configuration/storage management
  • Health checks
  • Auto-[scaling/restart/healing] of containers and nodes
  • Zero-downtime deploys

This is where container orchestration platforms become extremely useful and powerful, because they offer a solution for most of those challenges.

So what choices do we have? Today, the main players are Kubernetes, AWS ECS and Docker Swarm, in order of popularity. Kubernetes has the largest community and is the most popular by a big margin (usage doubled in 2016 and is expected to grow 3–4x in 2017). Personally, I also like Kontena a lot, mainly because it’s much easier to set up than Kubernetes, though it’s not as configurable and not as mature.

Kubernetes (aka. k8s)

Kubernetes is an open-source platform for automating deployments, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure.

It addresses all the challenges we’ve described above. It’s very portable (it can run on most cloud providers, on bare metal, in hybrid setups, or a combination of all of the above), configurable and modular, and it handles auto-placement, auto-restart, auto-replication and auto-healing of containers extremely well. By far the best thing about Kubernetes, though, is the awesome community: online and in-person meetups in every major city in the world, KubeCon (yes, there’s a Kubernetes conference), tutorials, blog posts, and the massive amount of support you can get from Google, the official Slack group and the major cloud providers (Google Cloud Platform, AWS, Azure, DigitalOcean etc.).

The project was started by Google in 2014 and builds upon Borg, the cluster management system Google uses internally, plus ideas and best practices from the community.

Kubernetes Core Concepts

  • Master node: Runs multiple controllers that are responsible for the health of the cluster, replication, scheduling, endpoints (linking Services and Pods), Kubernetes API, interacting with the underlying cloud providers etc. Generally it makes sure everything is running as it should be and looks after worker nodes.
  • Worker node (minion): Runs the Kubernetes agent that is responsible for running Pod containers via Docker or rkt, requests secrets or configurations, mounts required Pod volumes, does health checks and reports the status of Pods and the node to the rest of the system.
  • Pod: The smallest and simplest unit in the Kubernetes object model that you can create or deploy. It represents a running process in the cluster. Can contain one or multiple containers.
  • Deployment: Provides declarative updates for Pods (essentially the template for the Pods), for example the Docker image(s) to use, environment variables, how many Pod replicas to run, labels, node selectors, volumes etc. (see the minimal example after this list).
  • DaemonSet: It’s like a Deployment but instead runs a copy of a Pod (or multiple) on all (or some) nodes. Useful for things like log collection daemons (sumologic, fluentd), node monitoring daemons (datadog) and cluster storage daemons (glusterd).
  • ReplicaSet: Controller that ensures a specified number of Pod replicas (defined in the Deployment) is running at any given time.
  • Service: An abstraction which defines a logical set of Pods and a policy by which to access them (determined by a label selector). Generally it’s used to expose Pods to other services within the cluster (via the Service port and targetPort) or externally (using the NodePort or LoadBalancer Service types).
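
To make these concepts more concrete, here’s a minimal sketch of a Deployment manifest for a hypothetical example service (the name, image, labels and ports are placeholders, not taken from a real project):

apiVersion: apps/v1            # use extensions/v1beta1 on the older cluster versions current when this was written
kind: Deployment
metadata:
  name: example
spec:
  replicas: 2                  # desired number of Pod replicas, enforced by a ReplicaSet
  selector:
    matchLabels:
      app: example
  template:                    # the Pod template
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: example:1.0
          ports:
            - containerPort: 5000
          env:
            - name: LOG_LEVEL
              value: "info"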

For more information about concepts, visit the Kubernetes docs.

The easiest way to run Kubernetes at the moment is on Google Cloud Platform, mainly because of the great k8s support and tooling, free master nodes (you only pay for worker node instances) and easy upgrades; it’s also cheaper to run overall compared to other cloud providers.

Azure recently announced native support for Kubernetes clusters but I’ve heard some horror stories (probably a topic for another blog post).

At Onfido we already had everything running on AWS, so we decided to stick with it, as moving would have taken a lot of time and resources. AWS also has more mature tooling and great overall support compared to other public cloud providers.

That does mean that we have to do a lot of things manually in k8s that Google would automatically take care of, but ultimately this gives us more control over how we do provisioning, maintenance and what we run on our clusters.

Provisioning k8s on AWS

Terraforming

There are 4 recommended ways of provisioning a Kubernetes cluster on AWS:

  • Kraken
  • kube-aws
  • kops
  • CoreOS Tectonic

At Onfido, we were looking for a tool that can deploy a highly available and production grade k8s cluster easily (bonus if it supports Terraform and it’s free).

Initially we tried the first version of Kraken. It had support for Terraform and it was free, but it didn’t have a big enough user base, it was quite complicated, it wasn’t production-ready and it didn’t produce HA clusters. Similar case with kube-aws: not production-ready and no HA clusters.

We finally went with kops, because it satisfied all our requirements: HA and production-grade clusters, support for Terraform, it’s free, and in addition, it’s maintained by the core Kubernetes team.

CoreOS Tectonic came out after we already had our production cluster setup with kops, so we haven’t had a chance to try it out yet. However, if you’re looking for a managed solution (free up to 10 nodes) for provisioning and administration of production ready k8s clusters on bare metal or AWS, Tectonic looks like a good bet.

Setting up k8s on AWS with kops + Terraform

Install kops and Terraform:

  • macOS: brew install kops && brew install terraform
  • Linux: Get the latest kops binary from the kops GitHub releases page and the latest Terraform binary from the Terraform downloads page.

Create a new bucket on S3 to keep kops, Terraform and k8s related configurations and state files. Add the following line to your .bash_profile or .bashrc:

export KOPS_STATE_STORE=s3://your-k8s-bucket/kops

Go into a new folder and start the initial setup of the cluster with kops:
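
A minimal sketch of those kops commands, assuming a placeholder cluster name (kops expects a DNS name) and availability zones; adjust the name, zones, counts and instance sizes to your own setup:

kops create cluster \
  --name=k8s.example.com \
  --cloud=aws \
  --zones=eu-west-1a,eu-west-1b,eu-west-1c \
  --master-zones=eu-west-1a,eu-west-1b,eu-west-1c \
  --node-count=3 \
  --node-size=m4.large \
  --master-size=m4.large

kops edit cluster k8s.example.com

kops update cluster k8s.example.com --target=terraform --out=.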

The create cluster command will create the blueprint for the cluster (templates, state files and configurations on S3) according to the options we’ve set. Note that this will not create any AWS resources (EC2, EBS, LB, DNS etc.) just yet.

The edit cluster command will show the kops template for the cluster. In this step we just want to pick the k8s version to run on our cluster and make sure that everything else looks ok.

The update cluster command will update the kops state and generate the Terraform file for the whole cluster infrastructure setup, role policies and launch configurations for the masters and nodes, and put them in the current directory. Still no AWS resources created.

At this stage, you might want to go through the Terraform file and modify things to match your desired setup:

  • Create a Terraform state S3 backend
  • Update base AMI (defaults to Debian, but we recommend going for stable CoreOS)
  • AWS Key pair name
  • VPC Subnets
  • Security Groups
  • Instance types
  • EBS sizes
  • Extra AWS resources

Try to standardise the Terraform file (with variables) so future updates and new cluster setups will be much faster. Double check role policies. Also, append additional things to the launch configurations if you have to at this stage as it will be harder to modify them after the cluster is up and running.

If you want additional services (the k8s dashboard, kube-dns, kube-state-metrics, log collectors like fluentd or Sumologic, monitoring agents like Datadog or Prometheus etc.) to be part of the cluster startup, you can upload their k8s templates to s3://your-k8s-bucket/kops/<cluster name>/addons. kops creates a bootstrap-channel.yaml at that location listing all the services that will be applied at cluster startup; just add your services to that bootstrap file and reference their name and location in S3.
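
For reference, an entry in bootstrap-channel.yaml has roughly this shape (the dashboard addon here is only an illustration; the exact names, versions and manifest paths depend on what you upload):

kind: Addons
metadata:
  name: bootstrap
spec:
  addons:
    - name: kubernetes-dashboard
      version: 1.6.0
      selector:
        k8s-addon: kubernetes-dashboard
      manifest: kubernetes-dashboard/v1.6.0.yaml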

Finally, to setup all the AWS resources and bootstrap the cluster, in the same folder as your Terraform file, do:

terraform plan and if everything looks good, terraform apply

After the apply is finished, you should now have an HA k8s cluster with nodes and masters spanning multiple availability zones, ready for your microservices.

In case you need to teardown your cluster:

terraform destroy && kops delete cluster <cluster name>

Deploying microservices to k8s

We use Jenkins with a customised pipeline for CI within Onfido. We’ve written a DSL on top of Jenkins Pipeline that makes it easy for a team to configure a standardised deployment for a new service. The overall deployment flow is described in the diagram above.

Let’s take a closer look at the steps of our custom pipeline:

Before pushing, a microservice needs to satisfy two requirements:

  • The project folder must contain a deploy folder which contains the k8s Deployment templates for development and production, respectively. We use ktmpl to write the Docker image build tag or number of replicas per Pod into these templates and generate the final deployment files at the deploy stage in the Jenkins build. The naming convention we chose for the templates is <deploymentName>-template.<development/production>.yml.
  • The Jenkinsfile must declare a kubernetes context, which is where deployment options such as the k8s secrets files to apply (see the Secrets section below) are configured.

When the code is pushed to the development branch of the project:

  • BitBucket will trigger a build in Jenkins that will build, test and push a new Docker image to the Docker registry (in our case AWS ECR)
  • Create or update the deployment on the development k8s cluster
  • The cluster then pulls the new Docker image from the registry and does a RollingUpdate of the deployment (a strategy that allows for zero-downtime deployments; see the fragment below).
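
For reference, the relevant strategy block in a Deployment template looks something like this (the surge/unavailable values here are illustrative choices, not something prescribed by our pipeline):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never take a Pod down before its replacement is ready
    maxSurge: 1         # allow one extra Pod during the rollout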

For the master branch, manual approval is required before the deployment goes into the production k8s cluster.

Secrets

If the environment variables of the microservice have to come from k8s secrets, a secrets('example.yml') entry can be specified in the kubernetes context of the Jenkinsfile. The example.yml file is kept in an AWS S3 bucket (preferably encrypted) and is a regular k8s Secret manifest, along the lines of the sketch below.
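
A minimal sketch of such a file, assuming hypothetical variable names (values under data are base64-encoded):

apiVersion: v1
kind: Secret
metadata:
  name: example
type: Opaque
data:
  DATABASE_URL: cG9zdGdyZXM6Ly8uLi4=   # base64 of "postgres://..."
  API_KEY: c2VjcmV0LWtleQ==            # base64 of "secret-key"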

Multiple k8s secret files can be specified in the secrets(..) configuration, and they will all be applied to the k8s cluster (development/production) at the deploy stage of the Jenkins build.

Ingress

In order to make our microservice accessible from outside the k8s cluster we need to create a Service for it:
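
Here’s a sketch of that Service manifest; the example name and port numbers follow the description in the next paragraph, and the development namespace is inferred from the internal DNS name mentioned below:

apiVersion: v1
kind: Service
metadata:
  name: example
  namespace: development
spec:
  type: NodePort
  selector:
    app: example
    env: development
  ports:
    - protocol: TCP
      port: 5000        # cluster-internal port (example.development.svc.cluster.local:5000)
      targetPort: 5000  # port the Pods listen on
      nodePort: 30101   # exposed on every node of the cluster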

In the example above, we are creating a Service which listens on nodePort 30101 and will round-robin TCP (by default, can also be UDP) traffic to Pods with label selectors app: example and env: development listening on port 5000.

If we take out type: NodePort and nodePort: 30101 then the example microservice will only accept traffic from within the k8s cluster on example.development.svc.cluster.local:5000 (internal DNS if you run kube-dns).

The Service nodePort is opened on every node in the cluster, with the Pod networking handled by our CNI (Container Network Interface) plugin, in our case Flannel, which runs as a DaemonSet. That means that if we hit any node of our cluster on port 30101, the traffic will get forwarded to the example microservice Pods.

If we run our cluster inside an AWS VPC, we need a public AWS ALB (Application Load Balancer, aka ELBv2) pointing to our cluster worker nodes to make our microservice publicly accessible. To set up and configure the DNS + ALB for the microservice we use Terraform at Onfido, but another great alternative is the k8s ALB Ingress controller.

An overview diagram of the traffic flow:

Logging, monitoring and autoscaling

For logging we run a fluentd collector as a DaemonSet that sends all container, Docker and Kubernetes logs to Sumologic, where we can filter by cluster, Pod, time range, use fancy regex etc. If you want to host the logs privately, I would suggest setting up Elasticsearch + Kibana and running a Logstash collector on your cluster as a DaemonSet (aka the ELK stack).

For monitoring and metrics we run a Datadog agent as a DaemonSet that sends the data to our Datadog account. Alternatively, you can run Prometheus, which also works very well but is a bit more of a hassle to set up and maintain. For simple metrics you can also run Heapster + InfluxDB + Grafana.

In terms of autoscaling the cluster nodes (managed as AWS Auto Scaling groups), we tried the official node autoscaler recommended by kops and the OpenAI k8s autoscaler, but they didn’t work very well for us and were slightly over-complicated for our needs: the kops-recommended autoscaler doesn’t support multi-AZ ASGs, and the OpenAI autoscaler only scales up when Pods are pending (batch-optimised, as they mention). We wrote our own simple node autoscaler that has worked quite well so far.

For generic Pod autoscaling, we use the Kubernetes HorizontalPodAutoscaler that scales Pods based on the CPU usage and the target you set. We also have Pods that consume RabbitMQ for which we implemented a custom but simple pod autoscaler.
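
A minimal sketch of a HorizontalPodAutoscaler for a hypothetical example Deployment (the replica bounds and CPU target are placeholders):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: example
spec:
  scaleTargetRef:
    apiVersion: apps/v1            # or extensions/v1beta1 on older clusters
    kind: Deployment
    name: example
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70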

Maintenance

There’s a few things that you should keep an eye on if you run your Kubernetes cluster outside of Google Cloud Platform:

  • Every container you run within your Pods should request an accurate amount of resources (CPU and RAM); see the fragment after this list. If you do this, Kubernetes will always schedule your Pods exactly where they have to be and you will always have full control over the provisioning percentage in your cluster.
  • Make sure container logs on your nodes are rotated appropriately (this should be done automatically on CoreOS).
  • Run a Docker cleanup Pod on every node to make sure you don’t have random exited containers and dangling images/volumes.
  • Clean up old ReplicaSets by declaring .spec.revisionHistoryLimit in your Deployments, clean up old and finished Jobs, or just run a ReplicaSet/Job cleanup CronJob.
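
As an illustration of the first and last points, here’s the shape of the relevant fields inside a Deployment spec (the numbers are placeholders you should tune per service):

spec:
  revisionHistoryLimit: 3            # keep only the last few old ReplicaSets around
  template:
    spec:
      containers:
        - name: example
          image: example:1.0
          resources:
            requests:                # what the scheduler uses to place the Pod
              cpu: 250m
              memory: 256Mi
            limits:                  # hard caps enforced at runtime
              cpu: 500m
              memory: 512Mi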

As an extra tip, make sure your masters and nodes have these ports open if you use CoreOS as your distribution.

Takeaways

  • If you’re not using containers yet, give them a try; you might be pleasantly surprised.
  • Split your huge applications into microservices (if feasible).
  • If you want to easily manage and scale your microservices, run them on Kubernetes, preferably on Google Cloud Platform if you’re not tied to any cloud provider or you don’t have a dedicated ops team to manage your cluster.
  • For provisioning Kubernetes on AWS, kops and Terraform work brilliantly. Run your masters and nodes on CoreOS if possible.
  • Sumologic works well for logs, DataDog/Prometheus for monitoring and metrics.
  • Always request resources and try to keep things clean and tidy.
