Kubernetes Architecture: Powerful and simple yet complex at the same time

Is using Kubernetes the right choice for the application?

Lutfi Ichsan Effendi
Life at Telkomsel
8 min read · Mar 29, 2023


(Image source: dilbert.com)

That question may be difficult for me to answer directly. However, after the several migration processes we have gone through, we can learn a number of things regarding the decision to use Kubernetes.
Some of the cases I will explain later may relate to your problem; some may not.

On the one hand, Kubernetes can be simple for some parts; on the other, it always comes back to the complexity of its configuration.

Before going too far: what are microservices and Kubernetes?

Microservices (as opposed to monolithic), according to AWS, are an architectural and organizational approach to software development where software is composed of small independent services that communicate over well-defined APIs. These services are owned by small, self-contained teams.

Microservices architectures make applications easier to scale and faster to develop, enabling innovation, accelerating time-to-market for new features, and improving code reuse and resilience.

What are Microservices?

How are they packaged? Microservices are often packaged as container images using container technologies such as Docker, then published to an image registry. One of the advantages of adopting microservices is granularity: it avoids the single point of failure found in a monolithic architecture.

The next question is: how do we manage the deployment and distribution of these containerized applications?

Container orchestration tools, e.g. Kubernetes.

(Image source: faun.pub)

What is Kubernetes/K8s?

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. It groups the containers that make up an application into logical units for easy management and discovery. Kubernetes' jobs include deploying images and containers, managing the scaling of containers and clusters, balancing resources across containers and clusters, and managing traffic for services.
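As an illustration of these "logical units", here is a minimal Deployment manifest sketch (the app name and image are hypothetical placeholders): it declares a desired state of three identical replicas, and Kubernetes keeps them running and discoverable via the shared label.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app            # hypothetical application name
spec:
  replicas: 3               # desired state: three identical pods
  selector:
    matchLabels:
      app: demo-app         # label grouping the pods into one logical unit
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: app
          image: registry.example.com/demo-app:1.0   # placeholder image
          ports:
            - containerPort: 8080
```

If a pod dies or a node fails, the controller notices the actual state no longer matches the declared three replicas and recreates the missing pod elsewhere; this is the self-healing behavior described below.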

(Image source: Akana)

Kubernetes as the orchestrator sits in the middle: it manages deployment and scheduling and ensures availability of the services running in containers, thanks to its self-healing capabilities. I won't go into detail about all the components and functions in Kubernetes; you can find detailed information and references here.

If we don't want to adapt our infrastructure for Kubernetes, or would rather avoid the complexity of building the cluster ourselves, we can turn to a managed Kubernetes provider to do the job for us, e.g. EKS on AWS, AKS on Azure, GKE on GCP, VMware Tanzu, OCP from Red Hat, etc.

(Image source: mobiquity.com)

For example, in AWS EKS the user is not responsible for managing the controller manager, scheduler, etcd, or API server located in the control plane. We as users are responsible for customer data, authentication, authorization, cluster and pod configurations, and the VPC as the underlying network. This convenience makes it easier to focus on the parts of the data plane that we can manage. The drawback is that we have no control over the control plane/master nodes, because AWS manages them from the start.

Back to the original question: is using Kubernetes the right choice for the application?

The answer is: it depends.

Most applications start as a monolith. With the entire application in one place, deployment is quick and straightforward.
For a monolithic application with a static user base, this may be more than sufficient. Does that mean it is time to move to microservices and Kubernetes? Probably not.

However, if the monolithic application and its traffic keep growing, we will soon need to find ways to scale it.

Currently Telkomsel is on a transformation journey to migrate some application workloads to the cloud. Some applications are being refactored from monolith to microservices, some are being replatformed because they had already adopted microservices, and some still use a monolithic architecture.

Moving on, I will only highlight applications in Telkomsel that use the Kubernetes platform, more precisely managed Kubernetes on cloud providers: the challenges we face, and how we mitigate them.

  • How to optimize resource deployment, since we were facing service degradation caused by overcommitted nodes.
  • How to optimize resource utilization, given unbalanced deployment between nodes.
  • How to optimize networking utilization, mitigating the constraints of IP and interface limitations.
  • How to optimize the skills and capabilities of the teams involved during the cloud transformation journey.

Overcommitted Nodes

(Image source: learnk8s.io)

“Requests” drive pod scheduling, based on CPU and memory requests only. The requests of each pod immediately reserve the defined capacity on the worker node, even if not all of it is actually utilized.
“Limits” set an upper bound on the resources a pod can consume before being throttled or running out of memory. This protects a node from a single pod running away with all of its resources due to some deficiency, and also serves as a self-healing mechanism for resilience.
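To make requests and limits concrete, here is a minimal pod sketch (the name and image are placeholders): the scheduler reserves 250m CPU and 256Mi of memory on a node at scheduling time, while the limits cap what the container may actually consume.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app            # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/demo-app:1.0   # placeholder image
      resources:
        requests:           # reserved on the node when the pod is scheduled
          cpu: 250m
          memory: 256Mi
        limits:             # throttled (CPU) or OOM-killed (memory) beyond this
          cpu: 500m
          memory: 512Mi
```

Note that the scheduler only looks at requests: a node can host pods whose summed limits exceed its capacity, which is exactly the overcommitment scenario discussed next.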

A Kubernetes cluster has several worker nodes, depending on the size of the workload running (see autoscaling of Kubernetes worker nodes), and the CPU and memory of each node are reserved by the running pods.

Total CPU and memory limits of the pods deployed on a node are overcommitted

There are conditions where the total CPU or memory limits are overcommitted, meaning they exceed the capacity of the node. Overcommitting a node is fairly normal and not a problem by itself: a limit is just an upper bound, and not every workload will consume up to the limits that are set.

The problem arises when different pods spike at the same time. This can cause CPU throttling or memory pressure, leaving the node under pressure and unhealthy, so that Kubernetes starts evicting workloads from it. It should be highlighted that pods are not evicted for reaching their CPU limit, only throttled; even so, the throttling may still cause application services to degrade. However, if a pod tries to exceed its memory limit, it will be OOM-killed (Out Of Memory) and evicted.

One solution is to use the Guaranteed Quality of Service class by setting requests=limits, ensuring that a specific workload does not risk eviction as a consequence of node overcommitment. Another thing to highlight: a running pod can still be killed if it hits its memory limit; the Guaranteed class only aims to avoid overcommitment.
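A sketch of that resources stanza: when every container in a pod sets requests equal to limits for both CPU and memory, Kubernetes assigns the pod the Guaranteed QoS class, so its reservation always matches its ceiling and it is last in line for eviction under node pressure (the values below are illustrative).

```yaml
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:          # identical to requests => Guaranteed QoS class
    cpu: 500m
    memory: 512Mi
```

The trade-off is lower packing density: capacity equal to the full limit is reserved up front even when the pod is idle.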

Unbalanced Pod Scheduling

Unbalanced scheduling

Kubernetes has a workload scheduling function that decides which worker node it thinks is suitable for a given workload. There will be cases where the number of pods deployed, and the CPU and memory they consume, are not balanced between nodes. Several things can influence this, and an exact ideal balance will never be achieved because each service deployment is configured differently.

What is the strategy to reduce this imbalance while still paying attention to the performance and latency of each service?

We can implement some common Kubernetes scheduling rules, such as topology spread constraints, affinity and anti-affinity, taints and tolerations, and separating workloads into node groups. Each rule has its own purpose and must be adjusted to the needs and behavior of the service we deploy.
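As one example of these rules, a topology spread constraint can be added to a pod template to spread replicas evenly across nodes (the app name is a placeholder; `kubernetes.io/hostname` is the standard per-node topology label).

```yaml
# Fragment of a Deployment pod template spec
spec:
  topologySpreadConstraints:
    - maxSkew: 1                          # at most 1 pod difference between nodes
      topologyKey: kubernetes.io/hostname # spread across individual nodes
      whenUnsatisfiable: ScheduleAnyway   # prefer balance, but don't block scheduling
      labelSelector:
        matchLabels:
          app: demo-app                   # hypothetical label; must match the pods
  containers:
    - name: app
      image: registry.example.com/demo-app:1.0   # placeholder image
```

Using `whenUnsatisfiable: ScheduleAnyway` keeps the constraint a soft preference; switching it to `DoNotSchedule` enforces balance strictly at the cost of pods staying Pending when it cannot be met.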

IP and Interface Limitation

Primary and Secondary CIDR implementation

Each pod running on a Kubernetes cluster is assigned one IP, taken from the allocation stated in the CIDR, which is divided into subnets. The problem is that Telkomsel has centralized IP allocation, which requires that IP allocations in the cloud environment do not overlap with on-premise IPs.
Why? Because we run a hybrid cloud, where several application channels run in the cloud and others remain on-premise, so we have to keep the IP footprint used in the cloud environment optimal and limited to what is needed.

To avoid IP exhaustion due to the limited IP allocation, for workloads running on the Kubernetes platform we use custom networking with a secondary CIDR that can only be consumed internally, reserved for the Kubernetes resources that really do require a lot of IPs.
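On EKS with the AWS VPC CNI, for instance, this kind of custom networking is expressed through one ENIConfig resource per availability zone, pointing pod ENIs at subnets carved from the secondary CIDR (the subnet and security-group IDs below are placeholders, and this assumes custom networking has been enabled on the CNI via `AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true`).

```yaml
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: ap-southeast-1a            # conventionally named after the AZ it serves
spec:
  subnet: subnet-0123456789abcdef0 # placeholder: subnet from the secondary CIDR
  securityGroups:
    - sg-0123456789abcdef0         # placeholder: security group for pod ENIs
```

With this in place, worker-node primary IPs stay in the routable primary CIDR while pods draw from the secondary CIDR, which is only reachable inside the VPC.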

The obstacles we face are not only IP limitations but also interface limitations per worker node. Each worker node has CPU and memory limits as well as a limit on network interfaces (and therefore on IPs per node), which requires us to continually review and tweak so that the resources we deploy are optimized and nothing is wasted.

Comprehensive Knowledge Base

As mentioned before, Telkomsel is currently in the cloud transformation journey phase. It is not only technology that we transform; people play an important role in all activities.

Our challenge is how to balance the problems that arise from migrating application workloads, managing BAU operations, and upgrading skills. When one of these points cannot be managed, operations are disrupted; our initial goal of carrying out a cloud transformation for cost optimization, better performance, and increased reliability and availability instead ends up degraded and chaotic.

People must be transformed as well: knowledge, culture, and mindset. We can no longer manage cloud technology the legacy way, because architecture and behavior in the cloud are very granular, and the segregation between the dev team and the infra team can be very thin.

That is why the DevOps culture is so close to cloud native: the dev team and ops team are one unified unit that cannot be separated in terms of the development lifecycle. We have to keep learning, keep up with technological developments, especially in the cloud, and continue to innovate in order to arrive at best practices and see the goal of this cloud transformation achieved.

(Image source: solarwinds.com)

Simple but complex (?)

The migration steps are very challenging, with a lot of dependencies and constraints. Whatever stack or technology is adopted, we need to make sure it is suitable and that the application's objectives can be achieved.

Yes, transforming to microservices takes a lot of effort to prepare, develop/refactor, and learn to manage going forward, even for applications that are born in the cloud, i.e. cloud native.

Kubernetes is powerful; it offers granularity and resiliency. It is simple because its configuration is abstracted and declarative, yet it is very complex in terms of the learning curve and some of its configurations.

