Production grade Kubernetes on AWS: 3 lessons learned scaling a cluster

Guy Maliar
Published in Tailor Tech
6 min read · Sep 10, 2017

Articles about our lessons learned, tips and tricks running Kubernetes in production

We’re no strangers to errors and outages; every moment our services are down we’re losing customers and money, so it is important for us to be able to scale up easily and maintain availability at all times. We’d like to share some of Kubernetes’ sharp edges and how we handled them, covering kube-dns and internal DNS, cluster autoscaling, high availability and horizontal pod scheduling, and pod requests and limits.

This is an article in our Production grade Kubernetes on AWS series, other parts are available here:

  1. Production grade Kubernetes on AWS: Primer (Part 1)
  2. Production grade Kubernetes on AWS: 4 tools that made our lives easier (Part 2)
  3. Production grade Kubernetes on AWS: 3 tips for networking, ingress and micro services (Part 3)
  4. Production grade Kubernetes on AWS: 3 lessons learned scaling a cluster (Part 4)

8. DNS and kube-dns

Three months into our Kubernetes journey in production, we started seeing DNS issues between our microservices. At first we thought it was happening around deployment times of those services and added some retry logic, but soon enough the error rate rose well beyond normal levels and we lost a few days’ worth of conversions to it.

Once we investigated, we quickly understood, using this article from the Kubernetes website, that our kube-dns deployment should be scaled according to the number of nodes and cores we’re running in the cluster.

While kops sets up the DNS Cluster Autoscaler for you, it is up to you to tune the autoscaling parameters.

kubectl edit configmap kube-dns-autoscaler --namespace=kube-system
# look for the line
# linear: '{"coresPerReplica":256,"min":1,"nodesPerReplica":16}'
# and tweak it to better suit your needs

We found that 64 cores per replica and 4 nodes per replica work best for us, as we are still experimenting with both more nodes with fewer vCPUs and fewer nodes with more vCPUs.
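
For reference, the tweaked ConfigMap looks roughly like this (the min value below is illustrative, not a recommendation):

# Sketch of the kube-dns-autoscaler ConfigMap after the tweak
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":64,"min":1,"nodesPerReplica":4}'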

The lesson we learned here is that once you see DNS issues inside the cluster, it’s probably the kube-dns deployment that needs scaling.

9. Pod Disruption Budgets, Node Anti-affinity with Persistent Nodes, Cluster Autoscaler and Horizontal Pod Scheduling

We have been running the Cluster Autoscaler, as setting it up is quite easy, and after hashing out all the issues with Weave we are very pleased with its performance.

Not long ago though, during a normal maintenance window, we mistakenly deleted the node that hosted our ingress controller. Kubernetes quickly resolved the issue by spinning up a new nginx-ingress pod on another available node, but the obvious lesson is that we want more than one replica of critical services. We went one step further and created three persistent nodes in three different availability zones; using nodeSelectors and anti-affinity rules, we now run three ingress controller pods on three different nodes, one in each availability zone.

As we use Terraform, this is quite easy to achieve. All we have to do is spin up three new auto scaling groups, one in each availability zone with min/max set to 1, label those nodes in Kubernetes as persistent nodes, and tweak the ingress controller configuration.

terraform plan # see that the plan only consists of creating three ASGs
terraform apply # apply the change

These auto scaling groups spin up three new nodes and reuse the kops launch configuration to connect them to our Kubernetes cluster.

Now we need to label those nodes as persistent nodes. Luckily, assigning labels to Kubernetes nodes is quite easy; all that’s left is to find the node names of the newly created instances and label them.
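
For example (the node name and label key below are placeholders, not our actual values):

# list the nodes with their EC2 private DNS names to spot the three new ones
kubectl get nodes -o wide
# label each of the new nodes as persistent
kubectl label nodes ip-172-20-45-67.eu-west-1.compute.internal persistent=true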

This could have been solved much more easily using kops instance-group cloud labels, but I was not aware of that possibility at the time; the following link shows how to add labels to specific instance groups. I believe that at the moment Terraform and kops labels don’t work that well together, but I might be wrong.

Then we changed our cluster autoscaler deployment and ingress controllers accordingly.
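
For the ingress controller, the relevant part of the deployment spec looks roughly like this (the persistent label key matches the placeholder above, and the zone topology key is the one used at the time; newer clusters use topology.kubernetes.io/zone):

# nginx-ingress controller deployment: pin to persistent nodes, spread across zones
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      nodeSelector:
        persistent: "true"
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx-ingress
            # never co-locate two ingress pods in the same availability zone
            topologyKey: failure-domain.beta.kubernetes.io/zone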

And the persistent nodes are up and running with our ingress controller on them.

Another improvement we’ve made is to add a Pod Disruption Budget and Horizontal Pod Autoscaler rules to the ingress controller service, so that we can never drain all of the nodes that run our ingress controller pods at once.
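
A sketch of both, assuming the ingress pods carry the app: nginx-ingress label (the numbers are illustrative):

# keep at least two ingress pods standing during drains and other voluntary disruptions
apiVersion: policy/v1beta1   # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: nginx-ingress
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx-ingress
---
# keep three replicas as a floor and scale out under CPU pressure
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1   # apps/v1 on newer clusters
    kind: Deployment
    name: nginx-ingress
  minReplicas: 3
  maxReplicas: 6
  targetCPUUtilizationPercentage: 80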

10. Requests and Limits

Diagram taken from http://www.noqcks.io/note/kubernetes-resources-limits/

Kubernetes pod requests and limits are, in my opinion, one of the hardest concepts to grasp and one of the most important for successfully running your production workloads without any clusterf*cks in your cluster (pun intended).

Memory requests and limits are quite straightforward: you request some memory up front (just be sure to request enough for your service to start) and you limit the amount a single pod can use. If a pod exceeds that limit, it gets killed with an OOMKilled message.
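
A quick way to confirm that is to look at the container’s last state (the pod name is a placeholder):

kubectl describe pod my-api-pod
# ...
# Last State:  Terminated
#   Reason:    OOMKilled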

CPU requests and limits are a bit more complicated to understand at first; I think that after reading our notes and the links at the end of this section it becomes easier.

Kubernetes CPU requests and limits are based around Docker’s CPU share constraint, which in turn is based on the CFS scheduler.
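
Concretely, the CPU request becomes a relative cpu-shares weight and the CPU limit becomes a hard CFS quota; a minimal sketch of that mapping:

resources:
  requests:
    cpu: 500m    # roughly --cpu-shares=512: a relative weight that only matters under contention
  limits:
    cpu: "1"     # enforced as a CFS quota (--cpu-quota / --cpu-period): a hard cap even on an idle node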

It took us time to realize this, but CPU shares are proportional between containers running on the same node, which means, as the Docker reference examples put it:

For example, consider three containers, one has a cpu-share of 1024 and two others have a cpu-share setting of 512. When processes in all three containers attempt to use 100% of CPU, the first container would receive 50% of the total CPU time. If you add a fourth container with a cpu-share of 1024, the first container only gets 33% of the CPU. The remaining containers receive 16.5%, 16.5% and 33% of the CPU.

The proportion only matters when all containers are trying to use 100% of the CPU at the same time. If there are more cores than busy containers, or some pods are sitting idle, a container with a cpu-share of 512 can still receive 100% of a CPU.

As shown here:

For example, consider a system with more than three cores. If you start one container {C0} with -c=512 running one process, and another container {C1} with -c=1024 running two processes, this can result in the following division of CPU shares:

PID    container    CPU    CPU share
100    {C0}         0      100% of CPU0
101    {C1}         1      100% of CPU1
102    {C1}         2      100% of CPU2

What we found works best for us is to first set requests for both CPU and memory, set a limit for memory only, and watch how the services behave.
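
In pod spec terms, that starting point looks something like this (the numbers are placeholders to be tuned per service):

resources:
  requests:
    cpu: 250m        # scheduling hint and cpu-shares weight
    memory: 256Mi    # enough for the service to start
  limits:
    memory: 512Mi    # exceeding this gets the pod OOMKilled
  # no CPU limit to begin with: watch how the service behaves first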

You should especially monitor how software that utilizes multiple CPUs in parallel behaves.

https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt

https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt

Closing notes

We’re very pleased with our transition to Kubernetes and confident in our infrastructure. Most importantly, by extracting services, switching technologies and shipping faster, we are more confident in the code we ship to production. We’ve improved our site performance and our visibility into the underlying software, seen an increase in site usage, become more confident in our ability to scale, and been able to better support our marketing, product and data teams’ efforts, tests and ideas. We’re deploying and rolling back in minutes instead of hours, and we’re generally happier!

If you’d like to learn more about Tailor Brands, you are more than welcome to also try out our state-of-the-art branding services.

You can follow us here, on Twitter, Facebook and GitHub to see more exciting things that Tailor Tech is doing.

If you find this interesting and you like writing exciting features, creating beautiful interfaces and doing performance optimization, we’re hiring!
