Fully Equipped Kubernetes Cluster

Published in Clearwater Analytics Engineering · Nov 3, 2023

Kubernetes is the de facto platform for managing containerized workloads, and every major cloud provider offers a managed service to support it. The Kubernetes ecosystem is large, and several components are often available to perform the same task. To ensure smooth operations, administrators must choose and configure them correctly.

This article gives an overview of some of the ecosystem's critical components and their purpose, to ensure the cluster is fully equipped.

Overview

To start with, Kubernetes works on a client-server model: a control plane (formerly known as the master node) controls every aspect of the cluster, while teams deploy their workloads to worker nodes. With any cloud provider, the Kubernetes control plane is managed by the provider itself and abstracted from us. We can set configuration parameters, of course, but only to a limited extent.

Depending on the cloud provider, the worker nodes may or may not be directly accessible to us. Either way, all workloads are deployed on these nodes.

Managed Kubernetes Cluster Architecture

Now, what makes the above cluster fully equipped? At this point, the cluster is just a group of EC2 instances wired to listen to one control plane through API requests. We can deploy applications on it, but we would not consider it a production-grade, fully equipped cluster.

We need a layer of frameworks to handle distinct functions, such as the following:

· How request routing from external networks into Kubernetes is handled
· How DNS for services deployed within the Kubernetes cluster operates
· How scaling works for different pods and nodes
· How application telemetry is handled
· How the cluster is configured to be highly available while maintaining its RPO and RTO

Let us discuss some of the frameworks that make the cluster fully equipped.

Fully Equipped Kubernetes Cluster

Ingress Controller

This is the entry point for requests coming from outside the Kubernetes cluster (hence "ingress"). Unlike other controllers, the ingress controller does not start automatically with the cluster. Many ingress controllers are available, with varying support across cloud providers.

There can be multiple ingress controllers within one cluster, distinguished by ingress class and the type of traffic they serve (public-facing or private).
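As an illustration, here is a minimal Ingress sketch; the host, the backend Service name, and the nginx ingress class are hypothetical and assume an NGINX ingress controller is installed in the cluster:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: nginx        # selects which ingress controller handles this resource
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service   # hypothetical backend Service
                port:
                  number: 80

The ingressClassName field is what allows several controllers to coexist in one cluster: each controller only reconciles Ingress objects that reference its class.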

Autoscaling

Kubernetes clusters are known for availability and scalability. Two types of autoscalers are used in the Kubernetes world.

1. Node Autoscaler — Scales the compute instances backing the cluster's worker nodes, which are managed either by us or by the cloud provider. The cloud provider may also abstract the scaling entirely and scale the worker nodes behind the scenes (AWS Fargate being an example).

2. Pod Autoscaler — Can be horizontal (HPA) or vertical (VPA): either the number of pods with the same configuration increases or decreases, or the CPU and memory configuration changes and new pods spin up with the changed configuration (a minimal HPA sketch follows this list).
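For reference, a minimal HorizontalPodAutoscaler sketch using the autoscaling/v2 API; the Deployment name "web" and the 70% CPU target are assumptions for illustration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU exceeds 70% of requests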

Node Autoscaler (Karpenter)

Karpenter is a popular open-source node autoscaler that scales nodes based on pending pods, not just the usual CPU and memory metrics. It is known for providing just-in-time nodes for any cluster. Karpenter was built and open-sourced by AWS, which supports it alongside the earlier Cluster Autoscaler.

Karpenter observes the aggregate resource requests of unschedulable pods and launches or terminates nodes to minimize scheduling latencies and infrastructure costs. This node autoscaler makes the concept of node groups disappear, as it provisions nodes based directly on the pending, unschedulable pods.

Using an autoscaler like Karpenter helps optimize overall spend in the cluster. It also makes the cluster more efficient by consolidating nodes that run similar types of pods.
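A hedged sketch of a Karpenter NodePool follows (the v1beta1 API introduced in late 2023; field names vary across Karpenter versions, and the CPU limit and node class name here are assumptions):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        name: default              # references an EC2NodeClass holding AMI/subnet details
  limits:
    cpu: "100"                     # cap on total CPU Karpenter may provision
  disruption:
    consolidationPolicy: WhenUnderutilized   # consolidate underused nodes to cut cost

Note that there is no node group anywhere in the definition: Karpenter picks instance types that satisfy the listed requirements and the pending pods' requests.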

Pod Autoscaler (KEDA)

KEDA (Kubernetes Event-Driven Autoscaling) ties autoscaling to events such as queue messages, database entries, Kafka topic lag, and even plain CPU metrics, among many others.

When scaling pods, metrics play a significant role, tracking the CPU and memory used by each pod so that scaling can be based on actual utilization. For any pod autoscaler to work, however, the pod's resources must be declared in its manifest, specifically the CPU and memory requests. These requests are the estimate of how many resources a pod will use, and it is relative to them that the autoscaler scales the given Kubernetes object.

KEDA depends on this resource utilization, together with events, to trigger the creation of new pods as needed.

Both types of pod scaling can now be integrated with events, and this can be done through KEDA. Various scalers associate pod autoscaling with specific event sources and determine whether horizontal or vertical autoscaling is applied.
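For example, here is a hedged ScaledObject sketch using KEDA's AWS SQS scaler; the Deployment name, queue URL, and region are placeholders, and authentication configuration is omitted:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker                 # hypothetical Deployment to scale
  minReplicaCount: 0             # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs
        queueLength: "5"         # target messages per replica
        awsRegion: us-east-1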

Disaster Recovery and Backup

Velero

We always talk about high availability and scalability; however, Velero comes first when discussing disaster recovery of the cluster. Velero provides the cluster's backup and restore functionality. Backups can be taken manually or on a schedule derived from the RPO (Recovery Point Objective) and RTO (Recovery Time Objective).

In short, the RPO defines how old the most recent recoverable copy of the data may be after an outage. For example, if the last available good copy of the data at the time of an outage is from 5 hours ago, and the business's RPO is 10 hours, then we are still within the parameters of the business continuity plan.

In short, the RTO is the time by which service should be restored; it answers the question, "How much time did it take to recover after notification of a business process disruption?" Moreover, persistent volumes attached to pods (for example, in StatefulSets) can also be backed up and restored using Velero.

The feature that makes Velero stand out is the ability to back up and restore a specific namespace, deployment, or persistent volume. Backup storage options depend on the cloud provider: on AWS, for example, S3 for backups and EBS snapshots for persistent volumes, with similar options on other clouds.
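As an illustration, a hedged Velero Schedule sketch (velero.io/v1); the protected namespace and the cron cadence are assumptions, chosen here to support a roughly 6-hour RPO:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: payments-every-6h
  namespace: velero
spec:
  schedule: "0 */6 * * *"        # cron: take a backup every 6 hours
  template:
    includedNamespaces:
      - payments                 # hypothetical namespace to protect
    ttl: 240h0m0s                # keep each backup for 10 days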

External-DNS

Inspired by Kubernetes' internal DNS, this framework lets Kubernetes Services and Ingresses talk to the external world through public DNS, making Kubernetes resources discoverable from outside. It is not a DNS server itself, but a set of configurations used to expose Kubernetes Services and Ingresses through the public DNS offerings of different cloud providers.

In a broader sense, External-DNS allows you to control DNS records dynamically via Kubernetes resources in a DNS provider-agnostic way.
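In practice, External-DNS watches annotated resources. A minimal sketch follows, where the hostname and Service details are placeholders and a DNS provider such as Route 53 is assumed to be configured:

apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com   # record External-DNS will create
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080

When the cloud load balancer is provisioned, External-DNS points the annotated hostname at it; deleting the Service removes the record again.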

Telemetry & Governance

Every piece of software needs telemetry in place to help support the application, troubleshoot issues, and monitor performance. The same goes for any application deployed in Kubernetes.

Kubernetes has evolved around telemetry, and many dashboards are available. However, all of these dashboards need to be fed data through an agent-based or agentless approach.

In fact, Kubernetes has its own dashboard that can be used for monitoring and governance.

Logging

Fluent Bit is an agent that processes and forwards logs, shipping them to backends such as OpenSearch. It can run as a sidecar container alongside each pod or as a DaemonSet with one instance per node. This gives developers a view of all the logs any Kubernetes application writes to standard output and standard error.
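A trimmed DaemonSet sketch for the node-level approach; the namespace, image tag, and ConfigMap name are assumptions, and RBAC is omitted for brevity:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.1
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true       # container logs live under /var/log/containers
            - name: config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config   # holds fluent-bit.conf with the INPUT/OUTPUT sections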

Governance

Governance is a set of restrictions on the Kubernetes cluster that drive the adoption of best practices. Some are essential to meet governance and legal requirements; others help ensure adherence to best practices and institutional conventions.

For example, a company may want to restrict all developers to accessing clusters from a single region only, to adhere to compliance and legal requirements, or an administrator may want to restrict developers to deploying workloads within one namespace only.

OPA Gatekeeper is the tool used to define policies and governance for the Kubernetes cluster. It drives adoption of best practices through policies that prevent workloads from being deployed in the cluster if the policies are not followed.
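As an illustration, here is the well-known required-labels pattern from the Gatekeeper documentation, lightly adapted; the "team" label requirement is a hypothetical policy:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        # deny any object that is missing one of the required labels
        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-team
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]    # apply the policy to namespaces
  parameters:
    labels: ["team"]            # every namespace must carry a team label

With this in place, the admission webhook rejects any new namespace that lacks a team label, which is exactly the kind of institutional convention governance is meant to enforce.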

The tools and frameworks described in this article are mostly cloud-agnostic or have an alternative from the cloud provider itself. The ones detailed above are available at no cost. There is no limit on how many such frameworks can be provisioned within a cluster; we can add more to optimize cost, govern the cluster, enforce best practices through open policy management, and much more.

For more curious minds, the Cloud Native Landscape shows the CNCF landscape, with frameworks across domains to make the cluster more than fully equipped.

We have shown ways to make any Kubernetes cluster, be it cloud-provider managed or standalone vanilla, fully equipped. The list of frameworks is endless, and each framework tends to have its own exclusive features. We equipped the cluster with an ingress controller, node and pod autoscalers, Velero for disaster recovery, External-DNS for DNS resolution, OPA Gatekeeper for governance, and Fluent Bit for telemetry. The main driver for any new equipment is the business need.

About the Author

Akash is a Senior Cloud Engineer at Clearwater Analytics and a Kubernetes enthusiast. He has over 8 years of experience, specializing in cloud computing, DevOps engineering, and refactoring and designing cloud-native applications. He firmly believes in the statement, "A problem is nothing but a question that has not been answered yet." Connect with him on LinkedIn: https://www.linkedin.com/in/akash-r-gupta
