For those who do not know anything about Kubernetes, Docker or AWS, I advise digging deeper into the subjects before starting this journey with me. A basic understanding is required to gain insight into the choices I have made and the hurdles I have crossed. I have been actively experimenting with Kubernetes for about 8 months now and I have learned a lot about the underlying Kubernetes infrastructure, tweaking and setting up your cluster with kops and deploying it on Amazon AWS. The road to where I am now was sometimes very painful with a lot of sleepless nights.
Kubernetes Operations (kops)
First things first. There are several ways to create a Kubernetes cluster on AWS. You can do it all by yourself by creating Auto Scaling Groups on AWS, setting up the right base image, creating the right Security Groups, etc. That is a lot of work, and reasonably hard to get right by hand. Kops does it all for you, with a simple command and minor tweaking.
kops create cluster [flags]
I won’t go into a lot of detail here; there are other posts that go in depth on setting up a Kubernetes cluster with kops.
A few things to keep in mind when deploying your cluster with kops. Before you know it, you’ll be tearing it down again because you didn’t think things through! This happened to me several times. It is important to think before you set up your cluster; it is easier to do this at the start than with a live cluster. Some things to consider:
- kops uses t2.medium EC2 instances by default
These are already costly if you plan on launching a small cluster. This is exactly what happened to me: I didn’t specify the instance type I wanted to use, so my bill went up very fast.
- think ahead about which topology you want
It is rather difficult to change your topology (without downtime) once you have a working cluster. You can choose between a private topology and a public topology. In a public topology, each master node and regular node is open to the outside; in a private topology these nodes sit behind AWS load balancers. Beware that a private topology requires AWS NAT gateways, which are billed both hourly and for the traffic they process; these costs can get quite high if you didn’t anticipate this (again, I learned this the hard way).
- use bastions or don’t use bastions
If you have a private topology, meaning the individual nodes are not directly accessible, you will want to think about setting up bastion nodes. These get their own Auto Scaling Group in AWS; you SSH into a bastion and from there access the nodes inside your private topology.
- the number of master nodes
You can choose how many master nodes you want. It is possible to run on one master node (which I am doing right now), but there is a trade-off to make. The more ideal scenario is three master nodes (an uneven number, to avoid split-brain issues). Running on one master node has the following consequences: a rolling upgrade will take your master down, and with it your whole cluster for that moment, and if AWS has issues in an availability zone, you will have no other master nodes to fall back on. If a 99% SLA isn’t your priority, you’re good with one master.
- size of the master node’s EC2 instance
Another important thing is the machine type of your master node. If your master node doesn’t have enough resources, it will begin to act strangely: random outages will occur and you’ll scratch your head wondering why. Always put your master node(s) on instances with enough memory and CPU; too much is better than too little.
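Putting those considerations together, a kops create command could look something like the sketch below. The cluster name, S3 state store and zones are placeholders; substitute your own, and double-check the flags against the kops docs for your version.

```shell
# Hypothetical example: substitute your own cluster name, state store and zones.
# --node-size overrides the default t2.medium; --master-count=3 gives an uneven
# number of masters to avoid split brain; a private topology needs a CNI plugin
# (--networking) and, optionally, a bastion.
kops create cluster \
  --name=cluster.example.com \
  --state=s3://my-kops-state-store \
  --zones=eu-west-1a,eu-west-1b,eu-west-1c \
  --node-count=3 \
  --node-size=t2.small \
  --master-size=m5.large \
  --master-count=3 \
  --topology=private \
  --networking=calico \
  --bastion

# Nothing is created until you apply the configuration:
kops update cluster cluster.example.com --yes
```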
Kubernetes Dashboard
Now that your cluster is running, you need a way of monitoring what it is doing, and that is what the Kubernetes dashboard is for. The UI can be accessed by setting up kubectl on your computer and running the following command; afterwards you can reach it via http://127.0.0.1:8001 in your browser.
kubectl proxy --port=8001
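If you haven’t deployed the dashboard yet, something along these lines should work; the manifest URL is the one the kubernetes/dashboard repo documented around this time, so check the project’s README for the current one.

```shell
# Deploy the dashboard (verify the manifest URL against kubernetes/dashboard)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml

# Start a local proxy to the API server
kubectl proxy --port=8001

# The dashboard is then served behind the proxy at:
# http://127.0.0.1:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/
```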
I already went into detail in another Medium post about all of the Kubernetes components like Deployments, Replica Sets, Stateful Sets, Daemon Sets, Persistent Volume Claims, Storage Classes, etc.
EDIT: I recently switched to Traefik, check out my new post.
Kong
Next in line is Kong, which nicely gathers all of your APIs in one place. If you don’t use Kong, you will probably have to deploy an AWS Elastic Load Balancer (ELB) for each of your applications and wire each one up to, for example, a Route53 record. AWS load balancers are costly for what they do; I personally have only one load balancer, which sits in front of Kong and which all of my domains are routed to.
Deploying Kong is a little tricky if you don’t know Kubernetes that well. If you follow the Kong deployment guide, you will deploy Kong on your cluster and it will work great, except for one little problem: if your Postgres pod dies, so does your Kong routing data (I again learned this the hard way). I recommend using Stolon for persisting Postgres data on Kubernetes; it works extraordinarily well. After you have deployed Stolon on your cluster and created a Kong database, you can deploy Kong and point it at the persisted Postgres database in your Stolon cluster. This ensures your Kong configuration survives pod restarts.
By deploying the Kong service, you will notice that it contains “type: LoadBalancer”; be aware that this will create an ELB in AWS. It sets up the Kong proxy (HTTP/HTTPS) endpoints behind ports 80 and 443 on the ELB, so all traffic routed to the ELB via AWS Route53 ends up in Kong. There is also a dashboard for Kong which you can deploy and use; it simplifies your interaction with Kong.
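A trimmed-down sketch of what that Kong proxy Service can look like; the name, selector and port numbers are illustrative, so check the Kong deployment guide for the real manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kong-proxy     # hypothetical name
spec:
  type: LoadBalancer   # this line makes Kubernetes provision an AWS ELB
  ports:
  - name: kong-proxy
    port: 80           # HTTP on the ELB...
    targetPort: 8000   # ...forwarded to Kong's proxy port
  - name: kong-proxy-ssl
    port: 443          # HTTPS on the ELB...
    targetPort: 8443   # ...forwarded to Kong's TLS proxy port
  selector:
    app: kong          # must match the labels of your Kong pods
```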
So now you have a Kubernetes cluster on AWS deployed with kops, a working Postgres cluster, and a working Kong deployment with a configured AWS ELB. Congrats! Don’t cheer too soon, there is still some work ahead, bear with me. Next in line is making your applications accessible from the outside through regular URLs, which is done by setting up Route53. The only thing you have to do now is point your name servers, at Namecheap or wherever you bought your domain, to the name servers in Route53. Then point all of your new domains in Route53 to the freshly created ELB; from there on out, Kong takes over every request that enters that ELB.
Spot Instances on AWS
If you are familiar with AWS, you probably know that you can utilise spot instances. AWS has a lot of infrastructure it does not use from time to time, depending on the demand for instances. These instances are sold to bidders, usually for just a fraction of the price of an on-demand instance; the only downside is that Amazon can reclaim them without prior notice. I thought it was a good idea to use them, because if you think about it, Kubernetes is built to withstand the loss of underlying nodes. With kops this is very easy: just add a maxPrice line to your instance group config. It denotes the maximum price you are willing to pay for a spot instance; set it to the on-demand price and you will never pay more for a spot instance than you would for an on-demand one.
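In the instance group spec that kops opens for editing, the spot price cap is a single line. A sketch, with maxPrice set roughly to the on-demand price of a t2.medium (the exact figure depends on your region and machine type, so look it up on the EC2 pricing page):

```yaml
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
spec:
  machineType: t2.medium
  maxPrice: "0.050"   # bid cap in USD/hour; set to the on-demand price
  minSize: 3
  maxSize: 3
  role: Node
```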
The underlying configuration in kops is done by editing files, which you can access by running the following commands (assuming the instance group of your nodes is called “nodes” and that of your masters is called “masters”). You can also edit your cluster file, which contains general information about your cluster, like the Kubernetes version it runs.
kops edit cluster
kops edit ig nodes
kops edit ig masters
There is however a huge downside to spot instances, a trade-off you will have to make: all of your nodes can disappear in the blink of an eye. Amazon can always reclaim spot instances, and if you are out of luck, all your instances will be gone. POOF! PANIC!
Although I have seen this happen once or twice, chances are very slim that all of your instances go away at the same time; mostly it is only one instance at a time. The time before they come back is not that long either, ranging from a few minutes to a few tens of minutes, depending on the supply of spot instances of that specific EC2 machine type.
AWS EBS Volumes
An important thing to keep in mind is the usage of EBS volumes on AWS. There are several types of EBS volumes; the most widely used are gp2 and io1. EBS volumes of type gp2 are far less expensive, but hold a burst balance, which means they can only sustain high rates of IO for a short amount of time. If the burst balance hits 0, your EBS volume basically stops doing any IO until it can recharge. Very important to keep in mind if you have applications which need very high IO for longer periods of time, for example if you are running ElasticSearch and need to rebuild indices. The other type is io1; these volumes don’t have a burst balance and perform consistently, but are a lot costlier. Choose wisely, and think about your burst balance if your application suddenly slows down on IO.
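If you want certain Stateful Sets on io1 while the rest of the cluster stays on the default gp2, you can define a separate StorageClass and reference it from your Persistent Volume Claims. A sketch, with an assumed class name and IOPS ratio:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast-io          # hypothetical name, referenced by storageClassName in a PVC
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  iopsPerGB: "50"        # provisioned IOPS per GiB: no burst balance, consistent performance
  fsType: ext4
```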
I was setting up a cryptocurrency wallet on my Kubernetes cluster, which obviously requires syncing the entire blockchain, and noticed it was going very, very slowly. I didn’t know what was going on at first, until I decided to check the underlying EBS volume, because syncing a blockchain is of course very IO-heavy. As you can expect, the burst balance was indeed 0; the volume was basically doing nothing at all. I switched the EBS volume to the io1 type and everything went blazing fast from that point; after syncing the blockchain I switched back to gp2, et voilà.
AWS CPU Burst Balance
Another important thing to think about is the CPU burst balance some instances on AWS have. The T2 series, for example, does not have dedicated CPU power; it relies on a burst balance which can be monitored in the AWS console. The T2 series is meant for applications which receive bursts of load, during which the CPU burst balance declines. If it reaches 0, your CPU performance is rubbish and slow. It recharges continuously, but that will not be enough if your instances run long-running jobs at all times. Switch over to, for example, the M5 series, which has dedicated CPU power at its disposal.
T2 instances are not ideal for running a Kubernetes cluster. I had performance issues with my applications and did not think of the CPU burst balance at first; I thought it was related to my Postgres and Redis clusters running on Kubernetes. After some further digging I noticed that the CPU burst balance was at 0. Your burst balance begins to decline when the load on your system exceeds the instance’s baseline, which means that you cannot use your cluster to its full potential. Running a Kubernetes cluster probably means you are doing monitoring with Prometheus or logging with an EFK stack, and these all require constant CPU performance. After switching to the M5 series on AWS, my cluster has never run better, at just a small increase in cost.
Kubernetes Requests and Limits
It is very important that you always set your requests and limits as accurately as possible, for Kubernetes and for yourself. They are a little tricky to understand at first, but quite self-explanatory after that. You can set requests and limits on Deployments, Stateful Sets, Daemon Sets, Pods, etc. A request denotes what your application wants to use and is preferably going to use: if an application is happy with 500Mi of memory and 1 CPU, it will preferably consume 500Mi of the host’s memory and one of the host’s CPUs. Whereas a request is a soft limit, a limit is a hard one: if your application exceeds its memory limit, it will be killed. Something to keep in mind: if you don’t see any errors but your pod keeps restarting, chances are high it is getting killed for exceeding its limits. The resources are set in the yaml configuration file of your Deployment, Pod, etc.
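For the 500Mi / 1 CPU example above, the container spec would contain something along these lines (the 750Mi memory limit is the headroom idea discussed below):

```yaml
resources:
  requests:
    memory: "500Mi"   # what the scheduler reserves for this container on a node
    cpu: "1"
  limits:
    memory: "750Mi"   # hard cap; exceeding this gets the container OOMKilled
    cpu: "1"
```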
Downsides of not setting your requests just right:
- let’s say your application uses 1000Mi of memory instead of the 500Mi you declared, and Kubernetes has scheduled it on an underlying EC2 instance which only had 600Mi of free memory. What is going to happen? Your EC2 instance will become unresponsive, because all of its memory will be used to the point where it cannot work properly anymore. So who’s to blame? Kubernetes should know this and fix it, right? It is not as easy as you’d think: how would Kubernetes know how much your application will consume? The only one who can guess this and get it right is you.
- let’s say your application only uses 250Mi of memory. This is good, right? Not quite. You will not run into other issues, because the underlying EC2 instances will have enough spare room, but you will underutilise them. You will waste a lot of precious money by not setting this right. After some investigation I found out I could save $50 on my $180 monthly bill just by setting my requests right and not wasting resources.
Downsides of not setting your limits just right:
- let’s say your application uses 500Mi of memory and your request is set to 500Mi. Everything is good, unless you also set your limit to 500Mi: then, the moment your application uses its 500Mi, there is no headspace to burst into before getting OOMKilled (OutOfMemoryKilled). It is best to set your limits a bit higher than your requests, for example 750Mi; that way, if the underlying EC2 instance has some memory left, it can give it to your application when needed.
Eviction thresholds are used to evict pods from nodes which are suffering from a lack of resources. There are soft and hard eviction thresholds. Soft eviction thresholds add an extra time parameter to the condition: say, when the available memory on the node is below 200Mi for 5 minutes straight. Hard eviction thresholds have no time parameter; pods are evicted as soon as the available memory drops below 200Mi. These options are passed to the kubelet process, which is easily done if you use kops.
The default is set to 128Mi, which I find risky, so I have set mine to 256Mi. If you have set your requests and limits, your node should not run out of memory, but if a lot of pods use a little more memory than requested, they can still clog up your node. So it is never a bad idea to give your nodes some headspace.
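With kops, this goes into the cluster spec that `kops edit cluster` opens, under the kubelet section. A sketch of how I would express the 256Mi hard threshold plus a soft one; the field names are taken from the kops kubelet config spec, so verify them against the docs for your kops version:

```yaml
spec:
  kubelet:
    # hard threshold: evict as soon as available memory drops below 256Mi
    evictionHard: memory.available<256Mi
    # soft threshold: only evict after the condition holds for the grace period
    evictionSoft: memory.available<512Mi
    evictionSoftGracePeriod: memory.available=5m
```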
If the available memory limit has been exceeded, your node will be labeled with a MemoryPressure node taint. A DiskPressure taint will be set if any of the disk-related limits are exceeded.
If you are as unlucky as me, this will happen on your only master node, and you will not even be able to do anything about it, because your API server will not be responsive enough to make changes to the cluster. My only option was to use a bigger instance for my master node, so the kubelet process and the API server would stay responsive enough to make changes or evict pods when necessary.
Kubernetes has two kinds of probes: readiness probes and liveness probes. Readiness probes determine whether traffic should be sent to a pod; if a pod fails its readiness probe, Kubernetes stops routing traffic to it. Liveness probes are a little more aggressive: if a pod does not respond to its liveness probe, it is terminated and another one takes its place. Setting these on your pods can really help you. I had a problem where Kubernetes scaled down all my pods and put up new ones after a deployment update, which is how it should be, but it resulted in my applications being down for a brief moment, because it takes a moment before new pods can process requests. If you set a readiness probe, you avoid your pods receiving traffic before they are actually ready. You can specify the initial delay and which kind of probe to use, an HTTP probe or a TCP probe.
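In a container spec, the two probes look roughly like this; the path and port are whatever your application exposes for health checking (the /health endpoint here is hypothetical):

```yaml
readinessProbe:
  httpGet:
    path: /health           # hypothetical health endpoint of your application
    port: 8080
  initialDelaySeconds: 10   # give the pod a moment to boot before probing
  periodSeconds: 5
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 30   # more patience here; failing this kills the pod
  periodSeconds: 10
```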
Each entity in Kubernetes has a restartPolicy, which can have the values Always, OnFailure or Never. Be aware that the restartPolicy of a Job can only be set to OnFailure or Never; it cannot be set to Always. Keep in mind that if a Job’s pod is killed, due to limits being crossed or because a ChaosMonkey killed it, it will not be rescheduled; use Deployments for indefinitely running applications. Jobs are meant for running a certain task to completion and nothing more.
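A minimal Job sketch, showing where restartPolicy lives (the name, image and command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-task           # hypothetical name
spec:
  template:
    spec:
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "echo done"]
      restartPolicy: OnFailure # Jobs only accept OnFailure or Never
```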
Node Taints and Tolerations
Nodes can be tainted to inform the scheduler to schedule, or to avoid scheduling, certain pods on a node. You can view all taints on your nodes with the following command:
kubectl describe nodes | grep Taints
For example, your master node is tainted so that pods which do not tolerate this taint are not scheduled on it. If you have issues with scheduling pods on certain nodes, chances are this is due to a tainted node. You can find more information in the Kubernetes documentation on taints and tolerations.
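If you deliberately want a pod scheduled on a tainted node, the master for example, the pod spec needs a matching toleration. The master taint below is the standard one for this Kubernetes era:

```yaml
tolerations:
- key: node-role.kubernetes.io/master  # the taint kops puts on master nodes
  operator: Exists                     # tolerate the taint regardless of its value
  effect: NoSchedule
```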
I encountered a certain taint that told me there were impaired volumes: NodeWithImpairedVolumes=true:NoSchedule. It means that an EBS volume has been stuck in an attaching state for more than 30 minutes. It was not easy to find out why this happened exactly. After some digging, I found AWS documentation indicating that each node has a maximum number of EBS volumes that can be attached; for my m5 instances, this number was 28.
I haven’t really found the reason why this taint kept being set on my nodes, but it disrupted my cluster fairly regularly, so I created a deployment to remove the taint whenever it occurs.
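Manually, the taint can be removed with kubectl (the trailing minus removes a taint); a small in-cluster deployment can run the equivalent of this periodically:

```shell
# The trailing "-" removes the taint from every node that carries it
kubectl taint nodes --all NodeWithImpairedVolumes:NoSchedule-
```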
Lessons Learned: Recap
- Think about your initial setup with kops
- Pick the right EBS volumes for your cluster and Stateful Sets
- Consider spot instances to greatly reduce costs
- Monitor your EC2 CPU burst balance and EBS IOPS burst balance
- Always set appropriate requests and limits for Kubernetes and for yourself
- Set up Kong/Traefik to make a structured API pool
- Set up Kubernetes dashboard to have a clear overview of your cluster
- Carefully pick your restart policies
- Always set your readiness probes to avoid timed out requests
- Think about setting soft or hard eviction thresholds to protect your nodes
That’s all folks, thanks for bearing with me! Remember to educate yourself about Kubernetes but don’t forget it is alright to make mistakes and fail, the more you fail, the faster you’ll learn.