Running Cost-Efficient EKS Clusters

Bhupendra 'Bhups' Hirani
Published in YNAP Tech · 5 min read · Nov 6, 2020

Browse one of our multi-brand online stores and it’s likely you will be served pages generated by one of our Kubernetes clusters. This post discusses how the Kubernetes Platform team at YOOX NET-A-PORTER GROUP has built clusters that run production and non-production workloads cost efficiently.

The Kubernetes clusters run Amazon EKS with EC2 nodes. There is an hourly cost for the EKS control plane, plus the costs of the EC2 nodes: per-hour (or per-second) instance charges, EBS, data transfer and load balancing. As the EKS control plane cost is a flat hourly fee per cluster, where the only lever is not running the cluster at all, the area to focus on is the EC2 nodes.

When EKS on Fargate was launched, we evaluated the costs of EC2 Spot vs Fargate. For the number of requests we serve and the 24x7 nature of the online stores, EC2 Spot is a better fit for our needs.

The two levers that are available are rates and usage, where Cost = Rate × Usage.

Reducing Rates

There are five ways to pay for Amazon EC2 instances: On-Demand, Savings Plans, Reserved Instances, Spot Instances and Dedicated Hosts. Spot Instances offer large discounts without needing to commit to a 1- or 3-year term with AWS.

The trade-off for the discount and flexibility is the risk of instances being interrupted, with two minutes’ notice, when Amazon EC2 needs the capacity back. Because of this, Spot instances are not suitable for all types of workload.

The YOOX NET-A-PORTER stores’ micro front-ends are stateless, so Spot instances can be used for the worker nodes in the EKS clusters.

Comparing what we were billed for Spot usage against On-Demand pricing, we are seeing between 47% and 66% savings using Spot, with an average saving of 59%. Higher discounts can be attained on other instance types (particularly some of the older generation instances).

Spot price history

Given that 3-year RIs or Savings Plans give up to a 72% saving, achieving 66% with Spot instances for our usage, without any term or usage commitments, is great.

Handling Spot Interruptions

Using aws-node-termination-handler, we are able to gracefully drain applications off a node as soon as an interruption notice is received. To reduce the risk of a single family/size of EC2 Spot instance being unavailable in the region, Spot Fleet is used with a mix of instance families and sizes.
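As a minimal sketch of what that diversification can look like, here is the same idea expressed as an EC2 Auto Scaling group MixedInstancesPolicy in CloudFormation. The instance types, sizes and names are illustrative assumptions, not our actual configuration:

```yaml
# Illustrative only: a diversified Spot node group. Everything beyond the
# On-Demand base capacity is requested as Spot, spread across several
# instance families and sizes so no single Spot pool is a point of failure.
Resources:
  SpotNodeGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "20"
      VPCZoneIdentifier: [subnet-aaa, subnet-bbb]     # placeholder subnets
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandPercentageAboveBaseCapacity: 0      # all extra capacity on Spot
          SpotAllocationStrategy: lowest-price
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref NodeLaunchTemplate # defined elsewhere
            Version: !GetAtt NodeLaunchTemplate.LatestVersionNumber
          Overrides:                                  # the mix of families/sizes
            - InstanceType: m5.xlarge
            - InstanceType: m5a.xlarge
            - InstanceType: m4.xlarge
            - InstanceType: c5.xlarge
```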

Managing Usage

The second lever for reducing overall costs is reducing usage — running fewer instances for fewer hours (or seconds, in the case of EC2).

Scale down out of office hours

Our dev teams are based in similar timezones and only need to make use of non-production environments during working hours.

There are 168 hours in a week. If teams only need environments for 12 hours a day, 5 days a week (60 hours), a 24x7 environment sits unused (wasted) for 108 of those 168 hours — around 64% of the time. With the utility-based billing model of public cloud, this is cost that can be avoided.

With kube-downscaler, the pods that don’t need to run 24x7 are scaled down.
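As a minimal sketch, kube-downscaler can be driven by an annotation on the workloads themselves; the schedule, timezone and names below are illustrative assumptions:

```yaml
# Illustrative Deployment with a kube-downscaler uptime window: outside
# Mon-Fri 08:00-20:00 the deployment is scaled down to zero replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: storefront-preview                      # hypothetical workload
  annotations:
    downscaler/uptime: "Mon-Fri 08:00-20:00 Europe/London"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: storefront-preview
  template:
    metadata:
      labels:
        app: storefront-preview
    spec:
      containers:
        - name: web
          image: registry.example.com/storefront-preview:latest  # placeholder
```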

If the nodes are left running idle once the pods are scaled down, you don’t achieve any savings. This is where cluster-autoscaler and descheduler come into play: descheduler evicts pods that can be moved off an under-utilized node, and cluster-autoscaler adjusts the EC2 Auto Scaling Group settings to remove the emptied nodes. In addition to the cost avoidance from the instance hours, you also avoid associated costs such as EBS.
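As a sketch of the descheduler side of this, a LowNodeUtilization policy looks something like the following (the threshold percentages are illustrative assumptions):

```yaml
# Illustrative descheduler policy: pods on nodes below "thresholds" are
# candidates for eviction onto busier nodes (those below targetThresholds),
# freeing the idle nodes so the cluster-autoscaler can scale them in.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:           # a node below all of these is under-utilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:     # eviction targets must stay below these
          "cpu": 50
          "memory": 50
          "pods": 50
```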

Scaling in nodes outside working hours and scaling as required during the working day

Auto-delete branches on PR merge

We deploy every branch as a separate Helm release for testing. As a result, we end up with a fairly large number of Helm releases on the cluster at any given point. Cleaning up these Helm releases and their namespaces on PR merge is an important part of avoiding costs.
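As a hypothetical sketch of the idea (this is not our actual pipeline; the workflow, naming scheme and secret below are all assumptions), a CI job can tear a branch environment down when its PR is closed:

```yaml
# Hypothetical GitHub Actions workflow: when a pull request is closed,
# uninstall the branch's Helm release and delete its namespace.
name: cleanup-branch-environment
on:
  pull_request:
    types: [closed]
jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Remove Helm release and namespace for the branch
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG }}      # placeholder secret
        run: |
          echo "$KUBECONFIG_DATA" > kubeconfig
          export KUBECONFIG=$PWD/kubeconfig
          RELEASE="app-${{ github.head_ref }}"            # hypothetical naming
          helm uninstall "$RELEASE" --namespace "$RELEASE" || true
          kubectl delete namespace "$RELEASE" --ignore-not-found
```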

Beyond GitOps: How we release our Microservices on Kubernetes at YNAP has more details on how we release our applications into our clusters.

Things to watch out for

AWS Config costs increase

As more teams started using EKS, we noticed that AWS Config costs were increasing, especially in the non-production accounts. Our templates for non-production clusters are created to run as cost effectively as they can, so bin-packing and downscaling are more aggressive in non-production clusters.

This results in lots of EC2 Auto Scaling Group changes; just look at the auto scaling chart above. Each time a node is scaled out or in, there are changes to the Auto Scaling Group values, EC2 launch/terminate events, and updates to the security groups to add/remove the EC2 instance. AWS Config records all these configuration events and charges per configuration item recorded.

There is a change to how AWS Config records relationships coming in August 2021 (see the “Example: AWS Health event output” section for the details) that will deprecate the recording of indirect relationships. This will reduce the number of configuration changes recorded, and is particularly relevant for ephemeral workloads with a high volume of configuration changes to EC2 resource types, as is the case here.

In the meantime, you can elect which resources AWS Config records.
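As a minimal CloudFormation sketch of that selection (the resource types listed are illustrative assumptions, not a recommendation):

```yaml
# Illustrative only: an AWS Config recorder that records a chosen set of
# resource types instead of everything, so high-churn EC2/Auto Scaling
# changes are not billed as configuration items.
Resources:
  ConfigRecorder:
    Type: AWS::Config::ConfigurationRecorder
    Properties:
      RoleARN: !GetAtt ConfigRole.Arn       # IAM role for AWS Config (not shown)
      RecordingGroup:
        AllSupported: false                 # opt out of "record everything"
        IncludeGlobalResourceTypes: false
        ResourceTypes:                      # record only what you audit
          - AWS::IAM::Role
          - AWS::EC2::SecurityGroup
          - AWS::S3::Bucket
```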

Persistent EBS Volumes

Deleting a Kubernetes StatefulSet does not delete its PersistentVolumeClaims (PVCs), so the EBS volumes behind them can be left provisioned in an unattached state. This quickly adds up to many GBs or TBs of provisioned EBS storage (and possibly IOPS), costing thousands of dollars. Keeping these volumes around is also bad for the planet, so managing them properly makes a difference in reducing our environmental impact.

Setting a reclaim policy and running kube-janitor are two options available for managing the EBS volumes that back persistent volumes.
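A minimal sketch of both options, assuming the in-tree AWS EBS provisioner and illustrative names and TTLs:

```yaml
# Illustrative StorageClass: volumes dynamically provisioned through it
# are deleted when their claim is deleted (reclaimPolicy: Delete).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-delete                    # hypothetical name
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
---
# Illustrative kube-janitor annotation: delete this claim (and, via the
# Delete reclaim policy above, its EBS volume) 7 days after creation.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-data                     # hypothetical claim
  annotations:
    janitor/ttl: "7d"
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp2-delete
  resources:
    requests:
      storage: 10Gi
```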

The wrap-up

In this fast-paced area of technology, the tools, cloud provider offerings (serverless, new chipsets) and cloud pricing models are constantly evolving. Whilst writing this, AWS released Capacity Rebalancing for EC2 Spot Instances, which could lead us to change how we manage our Spot clusters.

As our usage of Kubernetes grows and our cost insights deepen, we continue to look for ways to improve value. Two of the areas we are working on are increasing the overall resource utilization of the nodes, and adapting the templates used to set up the clusters so the monitoring tooling (and the pods that run it) is tuned to each environment.

Bhupendra 'Bhups' Hirani
Senior Cloud Adoption & Governance Engineer @ YOOX NET-A-PORTER