This is Part 2 of our Kubernetes migration series; you can read Part 1, on why we decided to adopt Kubernetes, here.
This part assumes technical expertise, along with some working knowledge of Amazon Web Services (AWS), Linux, Docker and Kubernetes.
If you are new to Kubernetes, the Linux Foundation’s Introduction to Kubernetes course helps to get started, and Jérôme Petazzoni’s Container Training workshop covers it in depth.
This part consists of three topics: Preparation, Execution and Learnings.
We believe that it is important to take care of the following things during the implementation.
Production readiness of running Kubernetes
Production Readiness of running Kubernetes, for us, means that the base infrastructure is architected to meet a certain Service Level Objective (SLO) based on business requirements.
Roughly speaking, it involves the following:
- Designing for high availability at all layers (Compute, Network, etc.)
- Network/VPC planning to accommodate requirements put forth by EKS.
- Hardening and layered, defence-in-depth security (RBAC, non-root containers, TLS, whitelisted network CIDRs)
- Right-sizing worker nodes based on workloads.
- Workload limits: ensuring applications (deployments) running inside a cluster have resource limits, pod disruption budgets, pod affinity based on worker node groups, horizontal/vertical pod autoscalers, etc.
- Stress-testing the workloads to understand their behaviour and to ensure that Horizontal Pod Autoscalers (HPAs) and the Cluster Autoscaler work as expected.
- Rudimentary chaos engineering tests of killing worker nodes at random and ensuring that nodes get re-provisioned and PodDisruptionBudgets (PDBs) work as expected.
This list is not exhaustive, but it gives you an idea of where we are going with this.
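To make the workload-limit items above concrete, here is a minimal, hypothetical sketch of those guardrails; the service name (payments-api), image and all numbers are illustrative, not taken from our actual setup.

```yaml
# Hypothetical deployment with resource limits, plus a PDB and an HPA for it
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: example.com/payments-api:1.0.0  # placeholder image
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
# Keep at least 2 pods up during voluntary disruptions (drains, upgrades)
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: payments-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payments-api
---
# Scale between 3 and 10 replicas based on CPU utilisation
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```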
Knowledge on Failure Modes
When running infrastructure, it is important to understand the failure modes. With Kubernetes, there are even more moving parts to take care of.
Kubernetes has come a long way since its inception in 2014, but there are still some rough edges. There are hundreds of ways to shoot oneself in the foot while operating Kubernetes.
kubernetes-failure-stories contains a list of stories and post-mortems. It is imperative that all those experiences are read carefully and understood to avoid repeating the same mistakes.
Monitoring and Observability
You can’t manage what you don’t monitor
Apart from monitoring regular Infrastructure components, we now have additional things to monitor.
It is important to understand the additional key metrics associated with running Kubernetes and monitor them. Some of them include CoreDNS metrics, CNI metrics, etcd metrics, etc. We use Datadog for Kubernetes monitoring.
Similarly, log the events and look out for anomalies. Often, the log messages have accurate information on the underlying problem. All our logs are shipped to Scalyr.
There is a huge ecosystem of alternatives to these tools. All of them have their own advantages and challenges; pick one that suits your use case.
Operations and Upgrades
Operating Kubernetes is a lot like piloting an airplane: although the scheduler takes many decisions autonomously, knowledge of how its internal components behave is essential to operate it.
Irrespective of a managed or unmanaged setup, operating Kubernetes entails understanding its internals and how the scheduler behaves under different conditions.
Having the kubectl cheatsheet bookmarked will help, as it covers most of the commands that operators use on a day-to-day basis.
From time to time, it is important to upgrade the control plane, which also involves upgrading its dependencies. As the project is still maturing, upgrading brings new features and functionality along with vulnerability and bug fixes. A zero-downtime upgrade for workloads is very much possible, as Kubernetes is designed to allow worker nodes to go offline at any time for maintenance.
There are lots of posts and workshops out there on working with different Kubernetes implementations. We will focus on the execution strategy for the migration that worked for us.
- Scoping: We decided to run only stateless workloads on Kubernetes. Any stateful system would be a managed service from AWS or provisioned on EC2. As we have a few product lines, we would start with one of the least complex and go from there.
- Spike Phase: Automate the cluster and node-group creation using Infrastructure as Code tools. Configure the cluster and ingress. Set up peering to access stateful resources like RDS, ElastiCache, etc. Get the application working. Understand the caveats.
- CI/CD pipelines: Since the environments are actively used by different teams, we decided to create a parallel environment in Kubernetes. The pipelines had to support that.
- Non-Production Implementation: Proper manifests for applications with resource limits, Pod Disruption Budgets (PDBs), Horizontal Pod Autoscalers (HPAs), Kubernetes Services, Ingress definitions, node affinity, etc. All deployments from the pipelines were able to update both the legacy/existing environment and the Kubernetes cluster, based on certain flags.
- Application validation: Ensure that smoke and extended regression tests validate that applications work correctly.
- Factor in timelines for data migration and cutover in DNS.
- Repeat this for other non-production environments (like staging), and finally for production with similar data.
- OnCallDuty (OCD) process: Establish an OCD process for engineers and equip them with the training and expertise to operate Kubernetes.
Our Infrastructure Stack
- Amazon EKS with Amazon Linux AMI for worker nodes and therefore Amazon CNI as the networking plugin with a customised bootstrap script, all automated with terraform-aws-vpc and terraform-aws-eks.
- Separate Clusters on different VPCs and accounts for Non Production and Production Environments.
- Separate Labelled AutoScaling NodeGroups (Worker Nodes) within each cluster for different product lines to contain the blast radius.
- Datadog for Monitoring, Scalyr for Log management.
- Helm, Kubernetes Ingress (Nginx), ClusterAutoScaler, ClusterProportionalAutoscaler, DeScheduler, etc.
- AWS Secrets Manager for Secrets Management.
Along the way, we learned a number of things during the migration; they are covered in the last section below.
Learnings a.k.a The Aha! moments
Not all “managed” offerings are the same.
We use a lot of managed AWS Services, so it was a no brainer to go ahead with Amazon EKS, the managed offering from Amazon.
Unlike other “managed” services like RDS or ElastiCache, which are usable right after provisioning, “managed” EKS requires a lot of plumbing before it can be used.
While using Infrastructure as Code (IAC) tools like Terraform or AWS CloudFormation helps, there is quite a bit of work to be done to get the cluster operational and production ready.
EKS authentication and AWS IAM
Marcin Kaszyński has written an excellent article on how authentication works in EKS with IAM users and roles.
Access to our AWS accounts is authenticated with our federated Single Sign-On system based on ADFS. We use saml2aws to authenticate and issue temporary AWS credentials under different AWS profiles. Like most implementations, we have a direct mapping of our Active Directory (AD) groups to specific IAM roles to allow SAML-federated access to AWS resources.
The Terraform module we use for EKS provisioning helped us map the same IAM role(s) in the aws-auth ConfigMap. This made our onboarding (and offboarding) process much easier.
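As an illustration, an aws-auth ConfigMap that maps both the worker node role and a SAML-federated engineer role might look like the following; the account ID, role names and the system:masters binding are placeholders, and the right groups depend on your RBAC policy.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # Role assumed by the EKS worker nodes (required for nodes to join)
    - rolearn: arn:aws:iam::111122223333:role/eks-worker-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    # SAML-federated role mapped from an AD group via ADFS (placeholder)
    - rolearn: arn:aws:iam::111122223333:role/adfs-platform-engineers
      username: platform-engineer
      groups:
        - system:masters
```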
Note: Access control depends on your Workload’s threat model and Organization’s Security policy. YMMV.
Worker Nodes and IP address availability
EKS, by default, favours fast IP address allocation to pods, which results in a limit on the number of pods that can run on a worker node. This happens because the Container Networking Interface (CNI) plugin used by EKS reserves a lot of IP addresses through the worker node EC2 instance’s VPC Elastic Network Interfaces (ENIs).
The actual number of available IP addresses per node depends on the instance family and size. eni-max-pods.txt contains an up-to-date, parseable list of the maximum number of pods per node for various instance families and sizes.
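That limit can be derived from the published ENI limits for the instance type: max pods = ENIs * (IPv4 addresses per ENI - 1) + 2. A quick sketch for an m5.large (3 ENIs with 10 IPv4 addresses each):

```shell
# Derive the EKS per-node pod limit from the instance type's ENI limits.
# Values below are for m5.large: 3 ENIs, 10 IPv4 addresses per ENI.
ENIS=3
IPS_PER_ENI=10

# One IP per ENI is the ENI's primary address and cannot go to a pod;
# the +2 accounts for pods on the host network (e.g. aws-node, kube-proxy).
MAX_PODS=$(( ENIS * (IPS_PER_ENI - 1) + 2 ))
echo "$MAX_PODS"   # 29, matching the m5.large entry in eni-max-pods.txt
```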
Newer versions of the CNI plugin allow a configurable IP address pool via the WARM_IP_TARGET environment variable on the CNI DaemonSet.
The benefit of using this parameter is that it allows effective utilisation of the VPC address space and potentially running even more pods on the nodes.
The trade-off of using this parameter is that, once the warm pool is exhausted, the CNI plugin has to request additional IPs through the EC2 AssignPrivateIpAddresses() API call, which can add latency to pod startup times.
Now, choosing between faster pod startup and effective utilisation of the VPC address space depends on the workload and on engineering and business needs.
Dependent Component Upgrades
Amazon EKS handles control plane upgrades only.
Upgrades to CoreDNS, kube-proxy, CNI plugin are one-off manual activities and have to be handled separately before finally “replacing” worker nodes.
When using automation, changing NodeGroup names terminates the running worker nodes and will cause downtime.
Therefore, the following strategy works to upgrade nodes with zero downtime. Ensure that the Kubernetes Cluster Autoscaler (CA) is running and functional before doing the step below.
Cordon => Drain => Removal of node from AutoScalingGroup (ASG) [using aws-cli].
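A sketch of that sequence, with hypothetical node, instance and ASG names. DRY_RUN=echo (the default here) prints the commands instead of executing them; note that --delete-local-data is the older spelling of the drain flag (newer kubectl uses --delete-emptydir-data):

```shell
# Zero-downtime node rotation: cordon, drain, then detach from the ASG.
# All names below are placeholders; unset DRY_RUN to execute for real.
DRY_RUN="${DRY_RUN:-echo}"
NODE="ip-10-0-1-23.eu-west-1.compute.internal"
INSTANCE_ID="i-0123456789abcdef0"
ASG="eks-workers-blue"

# Stop new pods from landing on the node, then evict the running ones;
# PDBs throttle the evictions and the CA adds capacity for rescheduled pods
$DRY_RUN kubectl cordon "$NODE"
$DRY_RUN kubectl drain "$NODE" --ignore-daemonsets --delete-local-data

# Remove the now-empty node from the ASG and terminate it
$DRY_RUN aws autoscaling detach-instances \
  --instance-ids "$INSTANCE_ID" \
  --auto-scaling-group-name "$ASG" \
  --should-decrement-desired-capacity
$DRY_RUN aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```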
Rebalancing pods for High Availability
This is important when worker nodes are terminated during an upgrade. As scheduling happens only once, during pod creation, pods can end up unevenly distributed across the nodes.
From time to time, descheduler should be run as a job to ensure that the pods are (re)balanced across all the nodes.
It is important to run this after worker nodes have been upgraded one by one with a zero-downtime strategy.
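Assuming descheduler is installed as a CronJob named descheduler in kube-system (as in its upstream example manifests; your names may differ), an ad-hoc rebalancing run can be triggered from the CronJob template. As above, DRY_RUN=echo just prints the command:

```shell
# Trigger a one-off descheduler run from its CronJob definition.
# CronJob name and namespace are assumptions; unset DRY_RUN to execute.
DRY_RUN="${DRY_RUN:-echo}"
$DRY_RUN kubectl -n kube-system create job "descheduler-manual-$(date +%s)" \
  --from=cronjob/descheduler
```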
Effectively using Kubernetes Infrastructure components
The migration from spring-api-gateway to Kubernetes Ingress was very straightforward, and our tests even showed reduced request latency between services.
The api-gateway component also hosted Swagger for our API documentation. It was replaced with the official Swagger Docker image, which directly consumes the JSON API endpoints that Swagger needs to render the documentation.
Similarly, we were able to validate JWT tokens at the Ingress using its external authentication feature. That way, we didn’t have to validate tokens in each microservice; it was handled transparently at the infrastructure layer. This one feature brought the implementation time down from weeks to hours.
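With ingress-nginx this is a single annotation: the controller calls the auth-url for every request and only forwards requests that get a 2xx response back. A hypothetical sketch, where the hostnames, paths and the validation service are placeholders:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: payments-api
  annotations:
    kubernetes.io/ingress.class: nginx
    # Every request is sent here first; non-2xx responses are rejected,
    # so the backing service only receives requests with a valid JWT
    nginx.ingress.kubernetes.io/auth-url: "http://token-validator.auth.svc.cluster.local/validate"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /payments
            backend:
              serviceName: payments-api
              servicePort: 80
```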
As of this writing, we have completed the migration of a few product lines to Kubernetes and feel comfortable operating it and running the OnCallDuty (OCD) process. We are also aware that our knowledge of Kubernetes has room for improvement, and it will hopefully improve as we migrate more product lines.
On that note, we have lots of work to be done in that area: a few more product lines and data science workloads, all at a big scale. And we are hiring, so if you are an experienced DevOps person passionate about working on such challenges, you are welcome to apply here.