Creating a Kubernetes Cluster on AWS

Cristian Russo
Connect.CD · Sep 5, 2017

Amazon Web Services is currently the most widely used cloud provider by market share; however, it doesn’t yet have full native support for Kubernetes. That’s why we decided to first focus our efforts on providing a clean and extensible pathway to create and bootstrap a non-trivial Kubernetes cluster on AWS.

How we got here

Our efforts with Kubernetes on AWS began in May 2016 with Kubernetes v1.2.2. Although we understood and fully appreciated where the technology would allow us to take our projects, the tooling and setup remained quite disparate — it’s fair to say it was a bit rough around the edges, and we encountered several roadblocks along the way…

We used kube-aws to instrument the creation of our cluster, with CloudFormation handling the tear-up. Because CloudFormation locked us in to AWS, it wasn’t the most future-proof way of doing things; however, we had engineers hungry to get their code into a working cluster! Kube-aws ‘just worked’ enough to meet our immediate need.

So while kube-aws provided us with what we needed to get going, back then it lacked high-availability support and the ecosystem surrounding it was moving rapidly. Regular breaking changes were introduced, which meant that in order to upgrade our cluster we often had to completely destroy and recreate it. In an enterprise environment, we knew this wasn’t sustainable.

Another challenge we faced was a common one in software — architecture. As with almost everything in software, there are numerous ways of architecting and performing various tasks. We knew that if we were going to invest more heavily in Kubernetes, we’d want more customizability and a set of tools we understood in greater detail — a bit like our own version of Kelsey Hightower’s seminal ‘Kubernetes the Hard Way’.

Getting the Boat Ready to Tack

At ClearPoint, we’re strong proponents of the guiding principles of Continuous Delivery as espoused by Jez Humble, Martin Fowler et al — in particular the focus on immutable, verifiable infrastructure as code. We also have a well-developed organizational focus on test automation, and all our projects that use continuous integration and deployment have a large number of automated end-to-end tests. These tests exercise the numerous internal and external integration points that all large projects have, which enables us to release code to customers as frequently as possible.

It’s this goal of delivering releases to customers as quickly as possible that drives us all. A green CI build is like gold to us, yet anyone who has worked in a full CI environment knows that achieving that goal consistently usually results in a treadmill of anguish, frustration, and numerous late nights.

In a distributed, microservices-focussed architecture, assuring the overall integrity and cohesion of the entire software stack is crucial — especially the infrastructure stack. When we started this journey, one of our goals was to work towards a software development environment that was entirely codified. We quickly found our Achilles heel was our infrastructure code. As our projects often evolved at a breakneck pace, this would frequently manifest itself in infrastructure that was mostly codified and instrumented, but still had areas where something ‘just had to be done manually’ — usually as a ‘temporary’ measure (I put this in quotes because we’ve all seen something that was meant to be temporary yet still finds itself in production well past anyone’s definition of ‘temporary’!).

Our infrastructure stack and best practices have evolved over the years, well before we met Kubernetes — from bash scripts, to Chef and Puppet, to Ansible, we’ve tried (and used in anger) almost every tool out there. We inevitably found that while our automated tests and software development process gave us the ability to reason about our system and its stability in incredible detail, our infrastructure layer was not as decoupled as we would have liked. This meant that as we progressed through various environments — sandbox, integration, development, test, UAT, and finally production — there were often subtle differences in setup that made debugging and solving various issues fairly difficult. Similarly, when we faced production readiness test cycles, we knew that performing various destructive testing scenarios, such as automatically recovering from a catastrophic failure, would be challenging.

The solution to these issues is a code and infrastructure stack which is entirely committed to VCS — no manual changes should ever need to be made. That’s much easier said than done though! While fully containerised and immutable microservices, end-to-end tests and Kubernetes for blue/green deployments brought us tantalisingly close to that goal, our infrastructure still wasn’t as formally verifiable or immutable as we would have liked.

Introducing Terraform

Terraform is built with a number of architectural principles in mind that we really like — parity across environments, easy collaboration, testability, and composability. We are big users of Ansible, as it’s incredibly flexible and powerful; however, we found that in a large environment with numerous engineers working on various components of our microservice architecture, it became difficult to maintain consistency of style and cohesion across the stack as we scaled out.

We believe in giving engineers the flexibility to choose the tools and methods they believe are fit-for-purpose. As the tooling available to build our infrastructure was evolving so quickly, we needed a way of ensuring we allowed for that flexibility in a way which was controlled and tested to the same standard as our actual microservice code.

A number of somewhat philosophical discussions ensued:

  • How decoupled should our infrastructure be?
  • Should a microservice create its own infrastructure or should it just configure it?
  • How could we make our services as stateless and scalable as possible and what trade-offs would we have to make to achieve that?
  • How could we adhere to architectural guidelines and requirements from our clients that were sometimes in conflict with our desire for a completely immutable and continuously integrated stack?

Many of these questions fuelled the genesis of the ClearPoint Connect project. We wanted Connect to be a way we could offer open source solutions to the problems we encounter every day — problems we assumed would also be felt by teams around the world trying to build modern, microservice-based, cloud-first solutions.

Let’s Tack!

Tack is an open source Terraform module that helps us create a highly available, redundant Kubernetes cluster on AWS. Tack works in three phases which are defined in a Makefile:

  • Pre-Terraform: This stage prepares the environment using a shell script by doing things that are hard or messy to do in Terraform
  • Terraform: The Terraform stage does all the actual resource sequencing, tearing up and configuration of our environment
  • Post-Terraform: Once Terraform has done its magic, we then need to wait until the master ELB shows a healthy status — this is a good indicator that our cluster is ready to go (see the sketch below)
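
To make that post-Terraform wait a little more concrete, here is a minimal sketch of how you might poll a classic ELB for healthy instances with awscli. The load balancer name is a placeholder, and Tack’s own wait logic may differ, so treat this purely as illustration:

# Poll the master ELB until every registered instance reports InService.
# ELB_NAME is a placeholder - use the load balancer Tack actually creates.
# (Assumes the instances have already been registered with the ELB.)
ELB_NAME=placeholder-k8s-master-elb
while aws elb describe-instance-health --load-balancer-name "$ELB_NAME" \
    --query 'InstanceStates[].State' --output text | tr '\t' '\n' | grep -qv '^InService$'; do
  echo "waiting for $ELB_NAME instances to report InService..."
  sleep 10
done
echo "master ELB looks healthy"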

The Makefile contains the required parameters for our cluster, such as the AWS region, cluster name and network IP addresses. It’s also possible to define the release channel for CoreOS and the Hyperkube image — an all-in-one image that bundles the Kubernetes components. Let’s step through how we can use Tack to try and solve some of the problems we’ve encountered.

Creating a New Cluster

In this tutorial, we’re going to explain how to deploy a new cluster using Tack. First, we need to ensure that the following packages are installed in our environment: awscli, cfssl, jq, kubernetes-cli and terraform.

If you’re using macOS, you should be able to just run

brew update && brew install awscli cfssl jq kubernetes-cli terraform

to get underway. This ensures we have the AWS command line tools, CFSSL for managing TLS certificates and keys, jq for processing JSON on the command line, the Kubernetes command line tools for managing Kubernetes, and Terraform for infrastructure instrumentation.
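
If you want to double-check the toolchain before moving on, each tool prints a version string (the exact output varies between releases):

aws --version               # AWS command line tools
cfssl version               # CloudFlare's TLS toolkit
jq --version                # JSON processor
kubectl version --client    # Kubernetes CLI
terraform version           # Terraform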

Once we’ve got all these prerequisites installed, we need to configure the AWS credentials required by awscli. More info on how to do this is available in the official documentation. If you’re doing this from scratch, follow the instructions in the documentation; once you’ve configured your AWS credentials, running aws iam get-user should output the details of your IAM user and account.
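
As a rough sketch (assuming you’re setting up a fresh default profile rather than reusing an existing one):

# Set up a default profile interactively; you'll be prompted for your access key,
# secret key, default region and output format.
aws configure

# Sanity check: this should print your IAM user name, ARN and creation date
# (your IAM user needs permission to call iam:GetUser).
aws iam get-user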

Now we’re ready to get started with Tack. Let’s begin by cloning the Tack repo. Run

git clone git@github.com:kz8s/tack.git && cd tack

One of the reasons we chose to use Tack was that once all your pre-requisites are installed, creating your cluster is as easy as running make all, and destroying it is as easy as running make clean. But let’s not get ahead of ourselves: we’ll first walk through what this Makefile does and how you can use it to customise your Kubernetes cluster.
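
Put another way, the end-to-end workflow with the default settings boils down to a handful of commands:

git clone git@github.com:kz8s/tack.git && cd tack
make all      # create the whole cluster
# ...use the cluster...
make clean    # destroy everything again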

Tack comes with a Makefile with some helpful defaults that allow you to get started immediately. As you’ll see below, we can configure some basic settings of our cluster:

export AWS_REGION ?= us-west-2
export COREOS_CHANNEL ?= stable
export COREOS_VM_TYPE ?= hvm
export CLUSTER_NAME ?= test
export AWS_EC2_KEY_NAME ?= kz8s-$(CLUSTER_NAME)
export AWS_EC2_KEY_PATH := ${DIR_KEY_PAIR}/${AWS_EC2_KEY_NAME}.pem
export INTERNAL_TLD := ${CLUSTER_NAME}.kz8s
export HYPERKUBE_IMAGE ?= quay.io/coreos/hyperkube
export HYPERKUBE_TAG ?= v1.7.4_coreos.0
export CIDR_VPC ?= 10.0.0.0/16
export CIDR_PODS ?= 10.2.0.0/16
export CIDR_SERVICE_CLUSTER ?= 10.3.0.0/24
export K8S_SERVICE_IP ?= 10.3.0.1
export K8S_DNS_IP ?= 10.3.0.10
export ETCD_IPS ?= 10.0.10.10,10.0.10.11,10.0.10.12
export PKI_IP ?= 10.0.10.9

Most of these configuration options should be fairly self-explanatory if you’re familiar with AWS. You can choose which region to deploy into, the CoreOS channel and VM type, the Hyperkube image version to use, and what you want to call your cluster. You can also customise the various networking components, such as the CIDR ranges for your VPC, Pods and service cluster, the cluster DNS IP, the etcd IPs, and so on.
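
Because the Makefile uses ?= assignments, you can also override any of these defaults from the environment or the make command line without editing the file at all; the region and cluster name below are just examples:

# spin up a cluster called 'demo' in Sydney instead of the defaults
AWS_REGION=ap-southeast-2 CLUSTER_NAME=demo make all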

As explained in the Tack documentation, once you’ve edited your Makefile to your preferred configuration, run make all and Tack will create the following artifacts and infrastructure components for you:

  • AWS Key Pair (PEM file)
  • AWS VPC with private and public subnets
  • Route 53 internal zone for VPC
  • Bastion host
  • Certificate Authority server
  • etcd3 cluster bootstrapped from Route 53
  • High Availability Kubernetes configuration (masters running on etcd nodes)
  • Autoscaling worker node group across subnets in selected region
  • kube-system namespace and addons: DNS, UI, Dashboard
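
Once make all finishes and the master ELB reports healthy, you can point kubectl at the new cluster to confirm the nodes and kube-system addons are up. The kubeconfig path below is a placeholder; check the Tack README for where it writes its generated credentials.

# adjust the --kubeconfig path to wherever Tack writes its generated config
kubectl --kubeconfig=path/to/generated/kubeconfig get nodes
kubectl --kubeconfig=path/to/generated/kubeconfig get pods -n kube-system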

We’ve put together an asciinema demo of the cluster creation process if you don’t want to create your own cluster right now. But you really should, since tearing it down is as easy as make clean, so you won’t be leaving any machines or resources running inadvertently.

https://asciinema.org/a/135708

Once you’ve got yourself familiar with the Makefile process, have a look through the README and the contents of the Tack repo for more detailed instructions on configuration.

Our thanks go out to the team that have worked on Tack!

We hope that we can contribute back to the project once we get the ball rolling on Connect.

Prev: Bootstrap Continuous Delivery with Connect

Next: Watch this space for Part 2 of our series where we’ll run through how we can use our new Kubernetes cluster to setup CI using Jenkins.

Originally published at blog.connect.cd on September 5, 2017.
