A Whole Year of Amazon EKS
Is EKS all it’s cracked up to be?
I’ve been working with Amazon EKS since it became available in eu-west-1. During my time with it, I’ve had some frustrations and surprises. From the outset, it was clear that this managed solution came with trade-offs that required some thought. The past nine months have been spent uncovering these limitations and benefits, and now, you lucky ducks, I’m going to lay down some of my key experiences.
Pro: Protection by Default
I’ll cover this with a story. Some time ago, when we were evaluating CI/CD solutions for our Kubernetes cluster, we decided to try out the Spinnaker Helm chart. Spinnaker is a very interesting cloud-native CI/CD solution and some of its features were incredibly inviting. So what happened?
EC2 nodes were spinning up out of nowhere, consuming resources left, right, and centre. This issue has been reported but hasn’t been actioned by the folks who maintain the Spinnaker Helm chart. The interesting part is what’s contained within that issue: the original author reports that their Kubernetes API became “unresponsive”. Why is this?
You gotta protect those master nodes
When you run your own Kubernetes cluster, you have to take great pains to ensure that nothing can be accidentally deployed onto those master nodes. If you don’t, you risk losing your API and control of your cluster.
EKS gives you this out of the box by hiding those managed master nodes in an AWS-managed VPC that you never see. When we installed Spinnaker, our API remained blazing fast and it was trivial to delete the Helm release. We didn’t need to do anything for this.
When we started out on our Kubernetes journey, we would not have had the necessary knowledge to put those protections in place. It provided some guardrails to stop us from straying away from good sense.
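For the curious, here’s roughly what those protections look like when you have to build them yourself. On a self-managed cluster, the usual mechanism is a taint on the master nodes so that ordinary workloads can’t be scheduled there. This is a sketch; the node name is a placeholder, and EKS does the equivalent for you by keeping the masters out of your account entirely.

```shell
# Taint a (hypothetical) self-managed master node so normal pods
# won't be scheduled onto it. kubeadm applies this taint for you;
# on a hand-rolled cluster you'd need to remember it yourself.
kubectl taint nodes my-master-node node-role.kubernetes.io/master=:NoSchedule
```

Forget that one command on a hand-rolled cluster and a greedy Helm chart can land workloads right next to your API server.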
Con: Low Customisation
As is often the case with a managed service, you lose a little of the flexibility that you would get by running your own. Nine times out of ten, this is fine. That flexibility might just be a toy that’s a little too dangerous for your experience level. However, there are some really useful tweaks that you can make to optimise your Kubernetes cluster.
pod-eviction-timeout determines how long the control plane waits before evicting pods from an unresponsive node and rescheduling them onto a healthy one. By default, this is five minutes. That’s a sensible default, but for plenty of applications, five minutes of waiting is five minutes of money lost. Presently, you can’t change this in EKS, because you’re not the one running the commands that bootstrap the control plane when your cluster starts up.
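On a cluster where you run the control plane yourself, this is a flag on the kube-controller-manager process. A sketch of what that tuning looks like (the one-minute value is just an illustration, and on EKS this process isn’t yours to touch):

```shell
# Self-managed control plane only: shorten how long pods may sit on an
# unresponsive node before being evicted. The default is 5m0s.
kube-controller-manager --pod-eviction-timeout=1m0s
```

EKS owns this process, which is exactly why the flag is out of reach.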
But they’re working on it!
I raised an issue a couple of months ago about this and the AWS Container team are currently investigating it. Fingers crossed, they’ll be able to get it actioned and we’ll be able to configure low-level cluster parameters that will allow us to tune our cluster and improve on K8s’ already excellent resilience.
Pro: IAM Integration
This is a big favourite of mine. An EKS cluster ships with a ConfigMap resource called aws-auth in the kube-system namespace. With this, you can map AWS IAM roles or users to internal Kubernetes user accounts.
This is incredibly useful when you want to organise developer RBAC permissions based on your existing IAM permission infrastructure. The file can get a little messy so just be careful how much stuff you decide to throw in there.
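To make that concrete, here’s a minimal sketch of what such a mapping looks like. The account ID, role name, username, and group below are all placeholders, not anything from a real cluster:

```shell
# Write out a minimal aws-auth ConfigMap that maps a hypothetical IAM role
# to a Kubernetes username and group. All identifiers here are made up.
cat > aws-auth-example.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/developers
      username: developer
      groups:
        - system:basic-user
EOF
# You'd then apply it with: kubectl apply -f aws-auth-example.yaml
```

Your RBAC RoleBindings can then grant permissions to the developer user or its group, exactly as they would for any other Kubernetes subject.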
Con (Maybe a Pro?): Behind the latest versions
A few weeks ago, EKS released their AMI for Kubernetes v1.13. Now, v1.13 was released in December. The latest released version is v1.15, so as it stands, EKS users are two minor versions behind.
They plan to move to v1.14 around September, which means they’re going to be about six months behind the release. It’s a little bit frustrating when interesting new functionality comes out that you can’t access and it links back to the previous point about customisation. The nature of a managed service often means you can’t have the control you’d like.
But it’s not all grey…
This has a glimmer of goodness to it. These version bump releases often come with a few bugs and vulnerabilities, which are then quickly patched. If you don’t need the clever functionality in the new releases, then being six months behind will lower your operational overhead and keep things simple.
Pro: AWS Patches Quick
On the 3rd of December 2018, a critical vulnerability in the Kubernetes API server was announced. This set the world in a bit of a spin and was a major cause for panic. On the 4th of December, we ran kubectl version and saw that the vulnerability had already been patched. We didn’t have to lift a finger.
This was incredibly reassuring for both our engineering and our InfoSec teams. EKS comes with a flat charge of around $144 a month, and this is proof of it earning its keep: the engineers can focus on other things, and our InfoSec reps know that some of the best engineers in the world have our backs.
Pro (But for us, was a Con): The AWS CNI Implementation
In Kubernetes, the CNI (Container Network Interface) plugin is responsible for allocating IP addresses to pods, amongst other things. In EKS, AWS provides its own CNI implementation that integrates nicely with existing AWS machinery, so, for example, VPC flow logs still work with your applications.

It does this by allocating an IP address from your subnet range to every single running pod. In a typical overlay CNI setup, the pods would get internal virtual network addresses and only the node itself would be allocated an IP from your subnet. So, this is good for integration with existing AWS networking, but it has a drawback.
Limited IP environments
For example, if you’re using Direct Connect to link your private subnet to other AWS accounts or some on-premises infrastructure, you’re not going to have unlimited IP addresses.
In our case, we only had a few hundred to work with. Had EKS shipped with a typical CNI implementation, this wouldn’t have been that scary. We didn’t envision getting over a hundred nodes in our cluster… but pods?
Pods spawn like rabbits on cocaine, so we knew we would blow through our limit quickly. This posed an additional engineering challenge that we had to overcome. We got there in the end with a little networking magic (which I’ll detail in another article).
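To put numbers on why this is scary: because every pod consumes a real VPC address, AWS caps the pods per node based on the instance type’s ENI limits. The published formula is max pods = ENIs × (IPv4 addresses per ENI − 1) + 2. A quick back-of-the-envelope for an m5.large (3 ENIs, 10 IPv4 addresses per ENI):

```shell
# Pods-per-node cap under the AWS VPC CNI, using AWS's formula:
#   max_pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# m5.large values:
enis=3
ips_per_eni=10
max_pods=$(( enis * (ips_per_eni - 1) + 2 ))
echo "max pods on m5.large: ${max_pods}"   # 29
```

So even a modest node can legitimately claim 29 of your subnet’s addresses, and a few dozen nodes will chew through a few hundred IPs in no time.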
Con: It’s Behind Other Cloud Providers
GCP got there first since, you know, they actually invented Kubernetes. Their offering lists out your services, lets you edit YAML directly, and shows ingresses, ports, and all sorts of other good stuff. The GCP offering is also free: you only pay for compute. If you’re in an AWS world, EKS makes a lot of sense, but if you have some freedom across cloud providers, GCP provides an outstanding offering.
So would I go EKS again?
The answer is EK-YES! (sorry). Seriously though, there are some limitations and complexities, but having a properly configured, highly available, multi-AZ K8s API ready to go was fantastic. The overhead of $144 a month is nothing for any organisation with more than a couple of apps. It’s a strong recommend from me.
I am regularly ranting about EKS, DevOps, SRE, Testing and much more on Twitter.