Kubernetes Recipe aka Kuberator

Humayun Jamal
Published in Alef Education
Aug 1, 2019

In this blog, I have summarized our technical journey of moving our platform from Docker Swarm clusters to a production-ready Kubernetes cluster.


Where we were before:

  • Docker Swarm clusters running in production and dev environments, with no autoscaling and fixed ports for every service, so scaling up quickly was not possible.
  • Monitoring and alerting spread across multiple tools with no consolidation.
  • Deployments would require recreating the whole Swarm cluster to keep the IAC (bash scripts) in line and tested against the production environment (which took ages).
  • Because of the cluster recreation, a typical release meant definite downtime, which led the dev team to acquire bad habits.
  • IAC was an archaeology mission for the dev and ops teams.
  • Standardization was a hard target.
  • New environment onboarding/creation would take at least one week.

We are an AWS shop, so everything I will cover revolves around AWS services, including EKS. The journey begins with the network setup. Make sure the VPC has subnets in each availability zone and that every subnet has at least 2,000 allocatable IP addresses; a /21 subnet, for example, gives just over 2,000 usable addresses after AWS reserves its five per subnet. (Please refer to the Gotchas section first.)

Infrastructure as Code (IAC):
Starting with the core infrastructure, which includes the EKS cluster, worker nodes, etc., I wanted to control each and every aspect of it. Therefore I chose the Terraform EKS module
(https://github.com/terraform-aws-modules/terraform-aws-eks).
But we did a lot of customization, starting with the worker nodes. We divided the worker nodes into two main groups:
a) System worker nodes
b) App worker nodes

For both groups, we created a separate ASG per availability zone, so we ended up with six ASGs in total to serve our EKS cluster. We applied specific Kubernetes node labels via the UserData of the ASGs to identify whether an ASG hosts system worker nodes or app worker nodes, and which AZ it sits in. We then used required node affinity to make sure all management Kubernetes add-ons launch and run only on the system worker nodes, except when they need to run as DaemonSets on every node, in which case no affinity is needed.
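As an illustration, here is a trimmed-down sketch of how such a required node affinity could look on one of the management add-ons. The deployment itself is hypothetical; only the nodeType label matches what we apply in the ASG UserData.

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-management-addon      # hypothetical add-on deployment
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: some-management-addon
  template:
    metadata:
      labels:
        app: some-management-addon
    spec:
      affinity:
        nodeAffinity:
          # hard requirement: schedule only on the system worker nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nodeType
                    operator: In
                    values:
                      - systemNode
      containers:
        - name: some-management-addon
          image: example/addon:latest   # placeholder image
```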

EKS-optimized worker node AMIs do not come pre-built with a kubelet protection strategy, so after a while we started seeing our worker nodes frequently going into NotReady state due to OOM conditions and the like. I would recommend reading this blog explaining the strategy. I used the following:

```
/etc/eks/bootstrap.sh \
  --kubelet-extra-args '--node-labels=lifecycle=Ec2OnDemand,az=1c,nodeType=systemNode --kube-reserved=cpu=250m,memory=1Gi,ephemeral-storage=1Gi --system-reserved=cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard=memory.available<1Gi,nodefs.available<10%' \
  --apiserver-endpoint '${aws_eks_cluster.k8-cluster.endpoint}' \
  --b64-cluster-ca '${aws_eks_cluster.k8-cluster.certificate_authority.0.data}' \
  '${var.cluster-name}'
```

You can see above how we apply the node labels used for affinity, and how the kubelet is protected by reserving resources for the system and hard-evicting problematic pods.

Kubernetes Add-ons:
After going through multiple public blogs and docs, we finally ended up with the following mix of add-ons in dev and production.

  1. Nginx Ingress Controller
  2. External DNS
  3. Grafana
  4. Descheduler
  5. Cluster Autoscaler
  6. AWS CNI Helper
  7. Helm
  8. Metrics Server
  9. Kubernetes Dashboard
  10. Prometheus Adapter
  11. ALB Ingress Controller
  12. Fluentd
  13. WeaveCloud with gitops (Flux)

I don't think I need to explain each add-on's functionality and purpose, but I can shed some light on a few that we used differently than usual, starting with the ALB ingress controller and the NGINX ingress controller. We had a specific use case where we wanted all traffic to enter via an ALB only, so we first installed both ingress controllers without any entry point and then created a combined ingress:

```
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "nginx-ingress-via-alb"
  namespace: "kube-system"
  labels:
    app: "myALBApp"
  annotations:
    # trigger the alb-ingress-controller
    kubernetes.io/ingress.class: "alb"
    external-dns.alpha.kubernetes.io/hostname: "test-k8.entrypoint.com"
    # set ALB parameters
    alb.ingress.kubernetes.io/scheme: "internet-facing"
    alb.ingress.kubernetes.io/target-type: "instance"
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80, "HTTPS": 443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": {"Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/certificate-arn: arn:xyz:1234
    alb.ingress.kubernetes.io/subnets: subnet-public1a, subnet-public1b, subnet-public1c
    alb.ingress.kubernetes.io/inbound-cidrs: 0.0.0.0/0
    # allow 404s on the health check
    alb.ingress.kubernetes.io/healthcheck-path: "/"
    alb.ingress.kubernetes.io/success-codes: "200,404"
spec:
  # forward all requests to nginx-ingress-controller
  rules:
    - http:
        paths:
          - backend:
              serviceName: "nginx-ingress-controller"
              servicePort: 80
```

From the manifest above, the ALB ingress controller creates an ALB that is attached to the NGINX ingress controller pods and forwards all incoming traffic to them. The DNS entry for the domain can also be seen in one of the annotations: the external-dns add-on configures Route 53 with the appropriate ALB CNAME, and multiple entry points that need to be accessible from outside can be listed in the external-dns annotations. Each application ingress then needs to be created with the nginx class so the NGINX ingress controller knows where to route the traffic.
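For context, here is a minimal sketch of what such an application ingress might look like; the service name, namespace, hostname and port below are hypothetical.

```
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: service1
  namespace: qa
  annotations:
    # handled by the NGINX ingress controller, not the ALB ingress controller
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
    - host: service1.qa.example.com     # hypothetical hostname
      http:
        paths:
          - path: /
            backend:
              serviceName: service1
              servicePort: 8080
```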

Continuous Delivery:
Continuous delivery, or as we call it, maintaining the state of the system, was a challenge. We first started thinking about using the same old pipeline strategy via some CI/CD tool like Jenkins or GoCD, and then we came across the idea of GitOps, which was basically what we intended to implement anyway. Looking for Kubernetes-specific GitOps tools, we stumbled upon WeaveWorks. WeaveWorks not only offers a fully functional GitOps tool, it also comes with a Prometheus monitoring setup, which attracted us even more.

To cater to the GitOps tool Flux, we arranged all our Kubernetes manifests in environment-based GitHub repositories, and each environment runs in a dedicated namespace. For example, we have a dedicated repo for QA, i.e. qa-k8, with the following structure:

```
qa-k8/
  qa/
    service1.yaml
    service2.yaml
```
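As a rough sketch, one of these service manifests might look like the following; the deployment name, image and tag filter are hypothetical, and the automation annotations shown are the Flux v1 ones we would expect to use here.

```
# qa/service1.yaml -- hypothetical manifest kept in sync by Flux
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service1
  namespace: qa
  annotations:
    # let Flux automatically release new image tags matching the filter
    flux.weave.works/automated: "true"
    flux.weave.works/tag.service1: glob:qa-*
spec:
  replicas: 2
  selector:
    matchLabels:
      app: service1
  template:
    metadata:
      labels:
        app: service1
    spec:
      containers:
        - name: service1
          image: registry.example.com/service1:qa-1a2b3c   # placeholder image
          ports:
            - containerPort: 8080
```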

Using WeaveWorks saved us the investment of running and maintaining our own Prometheus solution: they store the metrics data on their own servers, so we get an HA/DR solution out of the box. On top of that, Weave Cloud provides very detailed dashboards with notifications, auto releases, container state, descriptions, etc. It is almost like a visual kubectl.

Monitoring and Alerting:
Due to our specific requirements, we need at least two parallel monitoring and alerting streams. Since we were already using New Relic for application performance monitoring, we decided to also use their Kubernetes-specific infrastructure agent, which behind the scenes runs a kube-state-metrics container to collect all the cluster metrics. New Relic receives those metrics, and based on them we used the Terraform New Relic provider to create alert policies, e.g. for container restarts, failing pods, and pod memory usage against their memory limits.

The second monitoring/alerting stream is WeaveWorks' out-of-the-box monitoring and alerting solution, where we have added our own customized alerts on top of their default ones. The alerting channels are Slack and Opsgenie, which we have configured for critical alerting thresholds.

Grafana can also easily use the WeaveWorks Prometheus as a data source to build custom dashboards, for example for Kubernetes ingress monitoring.
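A minimal Grafana datasource provisioning sketch for a hosted Prometheus endpoint might look like this; the URL and token handling below are assumptions rather than our actual configuration.

```
# Grafana datasource provisioning -- URL and token are placeholders
apiVersion: 1
datasources:
  - name: weave-cloud-prometheus
    type: prometheus
    access: proxy
    url: https://cloud.weave.works/api/prom   # assumed hosted Prometheus endpoint
    jsonData:
      httpHeaderName1: Authorization
    secureJsonData:
      httpHeaderValue1: "Bearer <weave-cloud-service-token>"
```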

Gotchas:

There were a lot of hurdles that we came across during this journey. The following are a few gotchas worth mentioning:

  • For StatefulSets, there must be worker node ASGs that are unique per AZ, with AZ-specific labels applied. StatefulSets should also have affinity configured so that they launch only in one selected AZ; this avoids EBS volume reattachment issues (see the sketch after this list).
  • Reading up on VPC IP address range restrictions is a must before any Kubernetes implementation on EKS: https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html#vpc-resize
  • If you decide to have all the worker nodes in private subnets, you must consider the AWS CNI plugin configuration, especially the env var "AWS_VPC_K8S_CNI_EXTERNALSNAT".
  • You must use a DNS autoscaler for the CoreDNS service if the cluster runs 1,000+ pods/containers.
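For the first gotcha, here is a minimal sketch of pinning a StatefulSet to a single AZ using the az node label applied in the bootstrap args; the StatefulSet name, image and storage sizes are hypothetical.

```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db                      # hypothetical StatefulSet
  namespace: qa
spec:
  serviceName: my-db
  replicas: 1
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # pin to the ASG labelled for one AZ so the EBS volume can
                  # always be re-attached to a node in the same zone
                  - key: az
                    operator: In
                    values:
                      - "1c"
      containers:
        - name: my-db
          image: postgres:11        # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```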

To Do:

We are still a long way from the perfect setup, but we are following the principle of continuous improvement; the plugins I mentioned above and the per-AZ worker node ASG strategy were part of the improvements we made while evolving with Kubernetes.

I will share my GitHub repo, which contains the full Terraform EKS module including the add-on plugin configurations (it still needs to be sanitized).
