Kubernetes onboarding journey of Myntra

Saurabh Kumar
Myntra Engineering
Mar 9, 2023

Across Myntra, we were using VMs (Virtual Machines) for all our services. A VM is dedicated hardware for a service. Once a VM was provisioned by the IT team, they also had to install all the software the service needed (e.g. Tomcat 8, Java 8) and then set up health checks and monitoring. This was a tedious, lengthy and error-prone process, as requirements differed from service to service.

We also needed to scale up our systems to handle increased traffic during sale events and scale them down after the sale. Scaling up and down was a time-consuming task, as we had to go through the same lengthy process to request a new VM with specific hardware requirements and install all the required software for each service. As a result, a lot of man-hours were wasted just scaling up and scaling down.

What is Azure Kubernetes Service (AKS)?

AKS is a fully managed container orchestration service on the Azure public cloud, built on open-source Kubernetes, for deploying, scaling and managing Docker containers and container-based applications in a cluster environment. It provisions, scales and upgrades resources in the Kubernetes cluster on demand with minimal downtime. All the software needed to run Kubernetes itself (e.g. the kubelet) is managed by Azure, while monitoring software, the ingress, logging components etc. are installed by Myntra.

In AKS, apps and supporting services run on Kubernetes nodes; an AKS cluster is made up of one or more such nodes. Nodes in the cluster are scaled up and down according to the resource (CPU and RAM utilisation) thresholds defined by Myntra.

Myntra's integration with AKS

Transition to Kubernetes

We started the Kubernetes migration with a few services. First we ran a couple of load tests on these services to certify their performance on Kubernetes. Then we started serving a small percentage of actual production traffic from Kubernetes and the rest from VMs. To achieve this, we used two levels of load balancers: one load balancer to route traffic between the VMs and the Kubernetes cluster, and another load balancer (the ingress) within the Kubernetes cluster to route traffic between the different pods of a service, as shown in the figure below.

Traffic handling between VMs and Kubernetes
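
To make the second level concrete, here is a minimal sketch of an in-cluster ingress rule. The service name, path and port are hypothetical, for illustration only; the Ingress routes matching requests to a Service, which in turn load-balances across that service's pods.

# Illustrative Ingress: routes /catalog traffic to the pods of a
# hypothetical "catalog-service" via its Service. Names, path and
# port are assumptions, not Myntra's actual configuration.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: catalog-service-ingress
spec:
  rules:
    - http:
        paths:
          - path: /catalog
            pathType: Prefix
            backend:
              service:
                name: catalog-service   # load-balances across the pods
                port:
                  number: 8080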

Kubernetes gives pod-level metrics around CPU, memory, IO utilisation etc. We used these metrics to benchmark pod performance vis-à-vis VMs. Once we were satisfied with the performance, we gradually ramped up the traffic served by the Kubernetes pods.

To our surprise, the transition was smooth as silk and we did not face any issues during or after migrating to Kubernetes. Just kidding! Here are the challenges that we faced.

Challenges Faced

1. Application performance degraded

For Vert.x applications, we set deployment options such that the number of event loops and verticle instances depends on the number of CPU cores the JVM reports (via Runtime.getRuntime().availableProcessors()).

We observed that the application was creating far more event loops and verticles than desired, so the event loops were not getting adequate CPU cycles and application performance degraded. The explosion in the number of threads created also bloated the memory footprint.

In short, the application pods were not honouring container boundaries and were using more resources than had been defined for the application.

Memory limit breached

Solution

The JVM started recognising the memory and CPU limits set by the container from Java 10 onwards, and this feature was backported to Java 8u191. Java 8 now provides a JVM argument named -XX:+UseContainerSupport that makes the JVM container-aware. After this change, the pods honoured their resource boundaries and application performance improved significantly.
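
As a sketch of how the pieces fit together, the fragment below sets container CPU and memory limits and passes the flag to the JVM through a JAVA_OPTS environment variable. It assumes the image's entrypoint forwards JAVA_OPTS to the java command; names and sizes are illustrative, not our actual values.

# Illustrative Deployment: container limits plus a container-aware JVM.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalog-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: catalog-service
  template:
    metadata:
      labels:
        app: catalog-service
    spec:
      containers:
        - name: catalog-service
          image: example.azurecr.io/catalog-service:1.0.0  # hypothetical image
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "2"      # availableProcessors() now reports 2, not the host's cores
              memory: 4Gi
          env:
            - name: JAVA_OPTS
              value: "-XX:+UseContainerSupport -Xmx2g"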

If you are using Java 8, move to Java 8u191 or later, or switch to a higher Java version; Java 11 and Java 17 are LTS releases.

2. Request timeouts on service deployment

We use a very lightweight health check on our services to keep the check as fast as possible. Such a lightweight health check has a disadvantage, though.

Image Credit: How to Fix the 504 Gateway Timeout Error on Your WordPress Site (pinterest.com)

Our health check turned green as soon as the service came up, and the load balancer immediately started sending traffic to the pod, leaving it no time to warm up caches, set up connection pools and so on. This resulted in slow processing of requests, and we would see 504 timeouts and response-time spikes for the first few minutes after a deployment.

504 timeouts during deployment

Solution

We added a warm-up time: the load balancer does not direct traffic to a pod until the first health check has passed and, on top of that, the warm-up time has elapsed, giving the pod sufficient time to reach an equilibrium state before it starts serving traffic.
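
In Kubernetes terms, one way to express such a warm-up window is a readiness probe with an initial delay: the pod is not marked Ready, and therefore receives no traffic, until the delay has elapsed and the check has passed. A sketch with illustrative values:

# Illustrative readiness probe for a pod spec. Endpoint, port and
# timings are assumptions, not our actual configuration.
readinessProbe:
  httpGet:
    path: /health            # lightweight health endpoint (hypothetical)
    port: 8080
  initialDelaySeconds: 120   # warm-up window before the first check
  periodSeconds: 10
  failureThreshold: 3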

Unintended consequence: after adding the warm-up time, it took a pod a couple of minutes before it could serve traffic. This delay caused an issue with pod autoscaling: under high load, by the time new pods started serving traffic, the existing pods had come under too much load and started flapping. We resolved this by relaxing our horizontal pod autoscaling thresholds.
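
Relaxing the thresholds means requesting new pods earlier, for example by lowering the target CPU utilisation so scale-out begins while existing pods still have headroom to cover the warm-up delay. A sketch with illustrative names and numbers:

# Illustrative HorizontalPodAutoscaler with a relaxed (lower) CPU target.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # lowered from a higher target so pods are added sooner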

3. Monitoring issues

At Myntra we have set up dashboards to monitor failures and response times in critical flows. Metrics are pushed at the node (pod/VM) level, and dashboards aggregate metrics from all nodes into a service-level view. A typical metric source looks like the following.

<metricType>.<service>.<nodeId>.<Context>.<metricName>

For VMs, the node id is constant and doesn't change with each deployment/restart. With Kubernetes, older pods are killed and new pods are spawned on every deployment/restart, so new metric sources are generated each time; a hypothetical source like timers.order-service.order-service-7d4f9.checkout.latency gets a fresh pod id, and hence becomes a brand-new source, after every rollout.

Because of this, dashboards have to aggregate data from a very large number of sources to show service-level metrics, especially over long time windows. This slowed things down to the point where dashboards took a couple of minutes to load.

Solution

Earlier, we kept 6 months of data for the monitoring dashboards. Very few use cases actually needed 6 months of data; for the vast majority, 1 month was enough. To fix the slow dashboards, we horizontally scaled our monitoring system and tightened our data retention policy accordingly. After these changes, dashboards loaded within a couple of seconds.

Benefits Post Kubernetes Migration

1. Optimised IT costs

Once the services stabilised on Kubernetes, an analysis of pod resource utilisation revealed that most of our services were under-utilising their allocated cores (well under 30%). Since Azure costing is primarily based on core usage, we took this as an opportunity to optimise our hardware cost. By following the thumb rules below, we reduced our daily core usage by around 40%.

Image Credit : https://granulate.io/

Thumb rules for deciding on the number of cores and memory (an illustrative manifest fragment follows the list):

  • Core usage under normal load ~ 40–50%
  • Memory usage under normal load ~ 60%
  • Memory limit for a pod: Xmx plus a ~50% buffer
  • Minimum pods per service: 3
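
Applied to a hypothetical service, the thumb rules translate into something like the fragment below: a JVM running with -Xmx2g gets a 3Gi memory limit (a ~50% buffer), requests are sized so normal load sits in the 40–50% band, and the Deployment keeps at least 3 replicas. All names and sizes are illustrative, not our actual numbers.

# Illustrative Deployment fragment applying the thumb rules above.
spec:
  replicas: 3                 # minimum pods per service
  template:
    spec:
      containers:
        - name: catalog-service
          env:
            - name: JAVA_OPTS
              value: "-XX:+UseContainerSupport -Xmx2g"
          resources:
            requests:
              cpu: "1"        # sized so normal load uses ~40-50% of a core
              memory: 3Gi
            limits:
              cpu: "1"
              memory: 3Gi     # -Xmx2g plus a ~50% buffer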

2. Standardised deployment process

Image Credit: Create Kubernetes Manifests files Quickly — Knoldus Blogs

Earlier, each team had its own way of doing health checks. With Kubernetes adoption, our health check became standardised and all teams started following a common deployment process.

3. Easily scalable infrastructure

Autoscaling in action (Image credit: https://www.virtuozzo.com/)

Kubernetes gave us an easy way to scale our services up and down as needed. We also integrated our deployment tool with HPA (Horizontal Pod Autoscaler), so services can auto-scale based on the CPU and RAM thresholds defined at the service level. During sale events, services scale automatically with the load.

4. Increased productivity

Kubernetes adoption, together with Horizontal Pod Autoscaling, freed up DevOps bandwidth and improved automation and application delivery.

Scale Down (Image credit: https://www.virtuozzo.com/)
Scale Up (Image credit: https://www.virtuozzo.com/)

Summary

Here we have shared our experience of Myntra's Kubernetes onboarding journey, the challenges faced along the way and the benefits reaped from it. During the transition, we kept reserved VMs as a complete fallback in case of any issues. This obviously increased infrastructure cost, added an additional hop to all request processing and made monitoring more complex. Our recommendation is to keep this transition period as short as possible.

Further Readings

  1. Kubernetes — https://kubernetes.io/
  2. Azure Kubernetes Service — https://azure.microsoft.com/en-in/products/kubernetes-service/
  3. JVM Container Aware Issue — https://spring-gcp.saturnism.me/deployment/docker/container-awareness
  4. Kubernetes best practices — https://learnk8s.io/production-best-practices

Authors: Shiva Tiwari, Saurabh Kumar
