Running Cost Efficient EKS Clusters

Browse one of our multi-brand online stores and it’s likely you will be served pages that were generated by one of our Kubernetes clusters. This post discusses how the Kubernetes Platform team at YOOX NET-A-PORTER GROUP have built the clusters to run production and non-production workloads cost-efficiently.

The Kubernetes clusters run Amazon EKS with EC2 nodes. There is an hourly cost for the EKS cluster control plane plus the costs of the EC2 nodes (instance costs billed by the hour or second, EBS costs, data transfer costs, load balancing costs). The EKS control plane cost is an hourly fee per cluster, and the only lever we have there is to not run the cluster, so the area to focus on is the EC2 nodes.

When EKS on Fargate was launched, we evaluated the costs of EC2 Spot vs Fargate. For the number of requests we serve and the 24x7 nature of the online stores, EC2 Spot is a better fit for our needs.

The two levers that are available are rates and usage where Cost = Rate x Usage.
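As a minimal sketch of how the two levers combine, the snippet below applies Cost = Rate x Usage to a made-up fleet. The hourly rate, the 50% discount, the node count and the hours are all hypothetical figures chosen for illustration, not our actual numbers.

```python
def cost(rate_per_hour: float, hours: float, instances: int = 1) -> float:
    """Cost = Rate x Usage, with usage measured in instance-hours."""
    return rate_per_hour * hours * instances


# Hypothetical figures, for illustration only.
ON_DEMAND_RATE = 0.10   # assumed price per instance-hour
NODES = 10              # assumed number of worker nodes
FULL_WEEK = 168         # hours in a week
WORKING_WEEK = 60       # 12 hours a day, 5 days a week

baseline = cost(ON_DEMAND_RATE, FULL_WEEK, NODES)
lower_rate = cost(ON_DEMAND_RATE * 0.5, FULL_WEEK, NODES)    # rate lever only
lower_usage = cost(ON_DEMAND_RATE, WORKING_WEEK, NODES)      # usage lever only
both = cost(ON_DEMAND_RATE * 0.5, WORKING_WEEK, NODES)       # both levers

for label, value in [("baseline", baseline), ("lower rate", lower_rate),
                     ("lower usage", lower_usage), ("both", both)]:
    print(f"{label:>12}: {value:.2f} per week")
```

Because the two factors multiply, a discount on the rate and a cut in usage compound rather than simply add up.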

Reducing Rates

The trade-off for the Spot discount and flexibility is the risk of instances being interrupted by Amazon EC2, with two minutes’ notice, when Amazon EC2 needs the capacity back. Because of this, Spot instances are not suitable for all types of workloads.
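To make the two-minute notice concrete: EC2 publishes it in the instance metadata under `spot/instance-action` as a small JSON document containing an `action` and a `time`. The sketch below only parses such a document and works out how long is left. The function name and the sample payload are illustrative assumptions, and a real handler would also have to poll the metadata endpoint and drain the node, which is not shown here.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def seconds_until_interruption(body: str, now: datetime) -> Optional[float]:
    """Return the seconds left before a Spot interruption, or None.

    `body` is the JSON text of the instance-action notice and `now`
    must be a timezone-aware datetime.
    """
    doc = json.loads(body)
    if doc.get("action") not in ("terminate", "stop", "hibernate"):
        return None
    # The notice carries a UTC timestamp such as 2021-03-01T12:02:00Z.
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Hypothetical notice, as it might look two minutes before termination.
sample = '{"action": "terminate", "time": "2021-03-01T12:02:00Z"}'
now = datetime(2021, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample, now)
print(f"{remaining:.0f} seconds to drain the node")
```

Whatever reacts to the notice has to finish cordoning and draining within that window, which is why stateless workloads are the natural fit.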

The micro front-ends of the YOOX NET-A-PORTER stores are stateless, so Spot instances can be used for the worker nodes in the EKS clusters.

Comparing what we were billed for Spot usage against on-demand pricing, we are seeing savings of between 47% and 66% with Spot, with an average saving of 59%. Higher discounts can be attained on other instance types (particularly some of the older generation instances).

Spot price history

Given that 3-year RIs or Savings Plans give savings of up to 72%, achieving up to 66% with Spot instances for our usage, without any term or usage commitments, is great.

Handling Spot Interruptions

Managing Usage

Scale down out of office hours

There are 168 hours in a week. If teams only need environments for 12 hours a day, 5 days a week (60 hours), a 24x7 environment sits unused (wasted) for 64% of the time. With the utility-based billing model of public cloud, this is a cost that can be avoided.
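The 64% figure follows directly from the hours involved. A small sketch of the arithmetic (the function name is just for illustration):

```python
HOURS_PER_WEEK = 7 * 24  # 168


def idle_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week a 24x7 environment runs without being needed."""
    used = hours_per_day * days_per_week
    return (HOURS_PER_WEEK - used) / HOURS_PER_WEEK


frac = idle_fraction(12, 5)   # (168 - 60) / 168
print(f"{frac:.0%} of the week is idle")
```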

With kube-downscaler the pods that don’t need to run 24x7 are scaled down.
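The decision a downscaler makes can be modelled roughly as "is now inside the agreed uptime window?". The sketch below is a simplified illustration of that idea, not kube-downscaler's own implementation or configuration. The Monday to Friday 07:00 to 19:00 window and the function name are assumptions chosen to match the 12 hours a day, 5 days a week example.

```python
from datetime import datetime, time

# Assumed uptime window: Monday to Friday, 07:00 to 19:00 local time.
UPTIME_DAYS = range(0, 5)      # Monday=0 .. Friday=4
UPTIME_START = time(7, 0)
UPTIME_END = time(19, 0)


def desired_replicas(now: datetime, normal_replicas: int) -> int:
    """Keep the normal replica count inside the window, scale to zero outside."""
    in_window = (
        now.weekday() in UPTIME_DAYS
        and UPTIME_START <= now.time() < UPTIME_END
    )
    return normal_replicas if in_window else 0


print(desired_replicas(datetime(2021, 3, 1, 10, 0), 3))   # Monday morning
print(desired_replicas(datetime(2021, 3, 1, 22, 0), 3))   # Monday night
print(desired_replicas(datetime(2021, 3, 6, 10, 0), 3))   # Saturday morning
```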

If the nodes are left running idle once the pods are scaled down, you don’t achieve any savings. This is where the autoscaler and descheduler come into play: they evict pods that can be moved off an under-utilized node and adjust the EC2 Auto Scaling Group settings so that the node can be removed. In addition to the cost avoidance from the instance hours, you also avoid associated costs such as EBS.

Scaling in nodes outside working hours and scaling as required during the working day

Auto-delete branches on PR merge

Beyond GitOps: How we release our Microservices on Kubernetes at YNAP has more details on how we release our applications into our clusters.

Things to watch out for

AWS Config Costs increase

Scaling nodes in and out like this results in lots of EC2 Auto Scaling Group changes; just look at the auto scaling chart above. Each time a node is scaled out or in, there are changes to the Auto Scaling Group values, EC2 launch/terminate events and updates to the security groups to add/remove the EC2 instance. AWS Config records all these configuration events and charges per configuration item recorded.

There is a change to how AWS Config records relationships coming in August 2021 (see Example : AWS Health event output section for the details) that will deprecate recording indirect relationships. This will reduce the number of configuration changes recorded and is particularly relevant for ephemeral workloads where there is a high volume of configuration changes for EC2 resource types, as is the case here.

In the meantime, you can elect which resources AWS Config records.
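One way to do this is to switch the configuration recorder from "all supported resources" to an explicit list of resource types. The sketch below only builds the recording group. Which types count as noisy and which are worth keeping is an assumption made for illustration, so review the list against your own compliance needs before dropping anything.

```python
# Resource types that change on every node scale-out/in (assumed noisy).
NOISY_TYPES = {
    "AWS::EC2::Instance",
    "AWS::EC2::NetworkInterface",
    "AWS::EC2::SecurityGroup",
    "AWS::EC2::Volume",
    "AWS::AutoScaling::AutoScalingGroup",
}

# Hypothetical set of types an account currently cares about.
CANDIDATE_TYPES = NOISY_TYPES | {"AWS::S3::Bucket", "AWS::IAM::Role"}


def recording_group(candidates, excluded):
    """Build a recording group that lists only the wanted types."""
    return {
        "allSupported": False,
        "includeGlobalResourceTypes": False,
        "resourceTypes": sorted(set(candidates) - set(excluded)),
    }


group = recording_group(CANDIDATE_TYPES, NOISY_TYPES)
print(group["resourceTypes"])

# With boto3 installed and credentials configured, the group could be applied
# roughly like this (recorder name and role ARN are placeholders):
#   boto3.client("config").put_configuration_recorder(
#       ConfigurationRecorder={
#           "name": "default",
#           "roleARN": "<your AWS Config role ARN>",
#           "recordingGroup": group,
#       }
#   )
```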

Persistent EBS Volumes

Setting a reclaim policy and using kube-janitor are two options available for managing the EBS volumes that are used for persistent volumes.

The wrap up

As our usage of Kubernetes grows and our cost insights deepen, we continue to look for ways to improve value. Two of the areas that we are working to improve further are increasing the overall resource utilization of the nodes, and adapting the templates used to set up the clusters so that the pods used for monitoring tooling are tuned to the environment.

YNAP Tech
