Wanna save some bucks on infra stuff, huh? Cost-saving strategies on AWS and Kubernetes

David Carrascal
Published in adidoescode
Nov 3, 2023 · 13 min read


Image by karlyukav on Freepik

In large companies where multiple major teams collaborate, the economic costs we incur are often overlooked: we assume they are already accounted for in the budget by some “high-level sheriff”, or we don’t think about them at all.
It happens. Resources are usually oversized “just in case”, or not properly sized at all. And that initial sizing is rarely reviewed.

These seemingly minor expenses can linger indefinitely, piling on top of a multitude of other small costs. Over time, they add up to an inflated and unnecessary bill.

In today’s rapidly evolving digital landscape, managing expenses has become a top priority. It’s crucial to stay proactive and ensure your infrastructure costs are under control. Let me give you the list of steps we applied in our projects to significantly reduce our bill.

DevOps →DevSecOps → DevSecFinOps… What’s next?

The world of software development has gone through some impressive transformations. First we had DevOps, where Development and Operations joined forces and created a kind of “philosophical movement”. Then Security joined in, and it became DevSecOps.

But wait, there’s more! Now Cost Saving and Finances have become first-class citizens, so we have DevSecFinOps. It’s time to recognize that keeping those expenses in check is as crucial as ever. We need to infuse cost control into the very essence of our teams and processes, as part of every team’s DNA. In this exciting era of DevSecFinOps, minding the budget is a game-changer that should shape our software practices.

The Scenario

In my current project, the DevOps team gives horizontal support to a set of projects that work together but are managed and developed as isolated components. For each of them, we have four different environments in use: Development (DEV), System Integration Testing (SIT), User Acceptance Testing (UAT), and Production (PRD). Each environment has its own dedicated namespace. Additionally, we use an RDS Aurora database for each one.

A couple of months ago, while reviewing our AWS DB performance, we realized that the databases were heavily underused. CPU barely reached 20% at its peaks, and the same was true for memory and the number of connections.

Curious to see where else we might be wasting resources, we also checked resource utilization in Kubernetes. The numbers were no better.
We then decided to implement some “quick win” actions, to save as much as possible, as quickly as possible.

Enough. Time for “Uncle Scrooge Mode”.

Trace where you are!

Before you start applying changes, take note of how much your infra is costing. Knowing where you started from makes it easy to measure the impact of each action you apply. For AWS-based components (DBs or even EKS) you can use AWS Cost Explorer.

AWS Cost Explorer is a detailed dashboard for your costs, where you can filter and split by period, service, instance type, taxes (ouch!) or even your own tags (e.g., “dev-infra”, “prod-infra”…).

For Kubernetes cluster costs, in our case a dedicated team manages the cluster and already provides this info broken down by cost center and project. If that is not an option in your team or company, you can use a solution such as Kubecost; worst case, you can pull resource consumption from Grafana and simply try to minimize it, even if the exact cost is hard to get.

Once you understand how the landscape works, let’s begin with a saving plan:

“Uncle Scrooge Plan”

The idea is to act ASAP on the “quick wins”: the modifications that take the least effort and bring the biggest savings. Pareto’s law applies here, as probably 80% of your costs come from 20% of your infra. Thanks to AWS Cost Explorer, it is quite easy to spot the hole in your pocket… In our case, it was the AWS Aurora DBs.

At the same time, you can probably act in parallel on other parts of the cost landscape, such as the k8s cluster, depending on how much effort you can dedicate to it.

The good part is that it is not hard to align the whole team to prioritize this kind of money issue, especially if your numbers are extremely bad…

This is the specific list of actions we followed (as said, the phases can run in parallel):

  • AWS RDS Optimisation
    - Database Rightsizing
    - Use of Read Replicas and Connection Pools
    - DB Weekend Shutdown
    - Reserved DB Instances
  • K8S Optimisation
    - Container Right-Sizing
    - Replicas & HPA tuning
    - K8S Weekend Shutdown

Of course, this is our roadmap, feel free to design yours.

AWS RDS Optimisation

In our case, we’re using Aurora MySQL RDS instances. As said, they were created at the beginning without a clear idea of the actual usage (in fact, oversized “just in case”). The actions we defined to reduce Aurora costs are as follows:

Database Rightsizing

To ensure you’re using the proper size for your database, you can check AWS metrics directly in the AWS console. CPU, memory, and the number of connections must stay within reasonable limits. If the instance you’re using is underused, it is time to downgrade. Just make sure the team is not expecting a load increase in the future. As always, balance is key.
Here, CloudWatch metrics and Performance Insights will give you the proper picture.
You should check the official documentation for connection limits and instance types.
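To make the "is it underused?" check repeatable, a small script can compare the average CloudWatch CPU against a downsizing threshold. This is only a sketch: the cluster name `my-cluster-1`, the 30% threshold, and the two-week window are made-up example values, and the actual CloudWatch call is left commented out because it needs AWS credentials.

```shell
#!/usr/bin/env bash
# Sketch: flag an Aurora cluster as a downsizing candidate when its average
# CPUUtilization sits below a threshold. All concrete values are examples.

# Returns 0 (true) when the average CPU is below the threshold.
should_downsize() {
  local avg_cpu=$1 threshold=$2
  # Truncate any decimal part so we can compare as integers.
  [ "${avg_cpu%.*}" -lt "$threshold" ]
}

# Hypothetical CloudWatch query to obtain avg_cpu (not executed here):
# avg_cpu=$(aws cloudwatch get-metric-statistics \
#   --namespace AWS/RDS --metric-name CPUUtilization \
#   --dimensions Name=DBClusterIdentifier,Value=my-cluster-1 \
#   --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
#   --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
#   --period 1209600 --statistics Average \
#   --query 'Datapoints[0].Average' --output text)

if should_downsize "18.7" "30"; then
  echo "Candidate for downsizing"
fi
```

The same pattern applies to memory and connection-count metrics; only the metric name and dimensions change.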

Important: Our scenario assumes a regular and predictable system load. If you have to deal with unexpected peaks and irregular loads, your best bet is probably some kind of DB autoscaling.

As a reference, check the capacity (and price) of the Aurora instance classes. Make sure to select the proper region!

Use of Read Replicas and Connection Pools

For the PRD and UAT environments we use Read Replicas, which initially served only as failover: a backup database that automatically takes over if the primary fails.
This is needed (mostly in PRD) for obvious reasons, but… could we take advantage of our Read Replica instead of keeping it idle, just in case?
The answer is yes.
In fact, when you deploy an Aurora DB, AWS provides two different endpoints: Writer and Reader. These endpoints are available regardless of whether you have just one Writer instance, or one Writer and several Read Replicas.

We decided to use the Reader endpoint for read operations. It had no effect in DEV / SIT, where we had only one instance. But in UAT / PRD we spread the load and the required connections just by redirecting read operations to the Reader endpoint. This relieved the Writer instance, which in turn allowed us to reduce the instance sizes.

Be careful here: with a single instance (Writer only), everything works regardless of the endpoint you use. But with Read Replicas, you cannot invoke write operations through the Reader endpoint: they will fail.

Another thing we tuned was the size of the connection pools in our Spring Boot backend components. Originally we set the same size for all environments, but the lower ones don’t need the same pool size as PRD. That helped us reach an even smaller DB size.
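For illustration, per-environment pool sizing can be done with Spring profiles and HikariCP’s standard properties. The pool sizes below are made-up example values, not our actual configuration:

```yaml
# application-dev.yml -- small pool for lower environments (example values)
spring:
  datasource:
    hikari:
      maximum-pool-size: 5
      minimum-idle: 1
---
# application-prd.yml -- larger pool for production (example values)
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
```

Fewer idle connections per pod means the DB can serve the same workload from a smaller instance class.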

DB Weekend Shutdown

The weekend is for chilling, mate…
And so it is for your databases. Unless you’re unlucky enough to have to work during the weekend, your databases are better off shut down.

For lower environments, first make sure there are no processes running during the weekend (QA, data migrations…) that would need rescheduling.
Once the field is clear, you can start scheduling your weekend shutdown. There are several approaches here. You can go pure AWS with AWS Instance Scheduler, which is perfectly valid. However, you will then need access to your AWS account for modifications, or if, for any reason, you need to cancel a shutdown because Team Lightning has to deploy a delayed feature on Monday.
We took a different approach, based on cron-scheduled Jenkins jobs. In fact, the same Jenkins job shuts down not only the DBs but also the pods (more on this in the next steps).

To accomplish this, we defined two jobs: one triggered on Friday evening to shut everything down, and another on Monday morning to restart it. For the DBs we use an AWS CLI account with specific permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowClusterControl",
      "Effect": "Allow",
      "Action": [
        "rds:StopDBCluster",
        "rds:StartDBCluster",
        "rds:DescribeDBClusters"
      ],
      "Resource": [
        "arn:aws:rds:myregion:1234567890:cluster:my-cluster-1",
        "arn:aws:rds:myregion:1234567890:cluster:my-cluster-2"
      ]
    }
  ]
}

Then, using the AWS Credentials Plugin, you will be able to connect with that account.
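As an illustration, such a job could be a scheduled declarative pipeline like the sketch below. The credential ID `aws-scheduler`, the cron spec, and the cluster/region names are hypothetical placeholders, not our actual setup:

```groovy
// Sketch of a scheduled shutdown pipeline (credential ID, cron spec and
// cluster/region names are placeholders -- adapt them to your setup).
pipeline {
    agent any
    triggers {
        // Friday 20:00; a mirror job with e.g. '0 7 * * 1' handles the Monday restart
        cron('0 20 * * 5')
    }
    stages {
        stage('Stop Aurora cluster') {
            steps {
                withCredentials([[$class: 'AmazonWebServicesCredentialsBinding',
                                  credentialsId: 'aws-scheduler']]) {
                    sh 'aws rds stop-db-cluster --db-cluster-identifier my-cluster-1 --region eu-west-1'
                }
            }
        }
    }
}
```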
What the Shutdown Job does is as simple as stopping the cluster:

aws rds stop-db-cluster --db-cluster-identifier=${clusterId} --region=${awsRegion}

and then checking the cluster status in a loop for several minutes until it is “stopped”:

aws rds describe-db-clusters --db-cluster-identifier ${clusterId} --region=${awsRegion} --query 'DBClusters[0].Status' --output text

The Restart Job does a similar operation in reverse, starting the cluster and waiting until its status is “available”.
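The waiting loop can be sketched as a small shell function. The timeout and polling interval below are example values; the `aws` call is the same `describe-db-clusters` query shown above:

```shell
#!/usr/bin/env bash
# Poll an Aurora cluster's status until it reaches the desired value or we
# give up. Returns 0 on success, 1 on timeout. Timeout/interval defaults
# (30 attempts, 30s apart) are example values.
wait_for_status() {
  local cluster_id=$1 wanted=$2 max_attempts=${3:-30} interval=${4:-30}
  local attempt=0 status
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$(aws rds describe-db-clusters \
      --db-cluster-identifier "$cluster_id" \
      --query 'DBClusters[0].Status' --output text)
    [ "$status" = "$wanted" ] && return 0
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for $cluster_id to become $wanted" >&2
  return 1
}

# Usage (after the stop/start command): wait_for_status "${clusterId}" "stopped"
```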
Using Jenkins jobs for this helps our developer teams be more independent when they need to bring an environment up or down, with no need to bother the sleeping DevOps guy… ;)

But if you are already thinking about applying the weekend shutdown, hold on for a while. Things can get even better… keep reading about Reserved Instances!

Reserved DB Instances

Once you are comfortable with your DB size and instance type, AWS offers two kinds of DB instances depending on how you pay: On-Demand or Reserved.
On-Demand means you pay for the instance while you’re using it, on an hourly basis. This is the default when you spin up a new instance.
Reserved means you commit to keeping the instance for a period of time (typically 1 or 3 years), paying all or part of the cost up front and getting a substantial discount in exchange. The more you pay up front, the bigger the discount.

Reserving an instance has no impact on it: no downtime, no endpoint change… You just pay in advance, that’s all.
Keep in mind that if you go for reserved instances, the DB weekend shutdown may no longer make sense, as you have already paid for the instance…

Run your numbers: as we’re working in a worldwide distributed team, it makes no sense for us to shut environments down on a daily basis, due to the different time zones.
If daily shutdowns are feasible in your case, evaluate whether it makes more sense to apply shutdowns or to reserve instances. For us, reserving is more advantageous, but with both weekend and daily shutdowns the numbers could be different.
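To make the comparison concrete, here is a back-of-the-envelope sketch in shell. The hourly and reserved prices are invented example figures, not real Aurora pricing; plug in your own numbers from the AWS pricing pages:

```shell
#!/usr/bin/env bash
# Back-of-the-envelope: yearly cost of an on-demand instance with weekend
# shutdown vs. a 1-year reservation. All prices are invented examples.
on_demand_cents_per_hour=50      # hypothetical on-demand price: $0.50/h
reserved_cents_per_year=300000   # hypothetical 1-year reservation: $3000

hours_per_year=8760
weekend_hours=$((52 * 60))       # ~60h off per weekend (Friday evening to Monday morning)

on_demand_full=$((on_demand_cents_per_hour * hours_per_year))
on_demand_weekend_shutdown=$((on_demand_cents_per_hour * (hours_per_year - weekend_hours)))

echo "On-demand, always on:        \$$((on_demand_full / 100))"
echo "On-demand, weekend shutdown: \$$((on_demand_weekend_shutdown / 100))"
echo "Reserved (1 year):           \$$((reserved_cents_per_year / 100))"
```

With these example figures, weekend shutdown alone undercuts the reservation; with a smaller weekend window or a deeper reservation discount, the comparison flips. That is exactly why you should run your own numbers.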

K8S Optimisation

CPU and memory for pods must be shaped too. We usually tend to assign plenty of resources with the well-meaning intention of “adjusting later”. You and I both know this will be labelled “technical debt” at best or, more likely, never addressed… So now is the moment to tackle it.
The goal is to reduce resources to the minimum in the DEV / SIT environments, then apply the changes in UAT, run load tests to adjust, and finally apply the same in PRD.

Basically, there are two levers you can pull to restrict your k8s costs:
- The resources of each component
- The number of replicas per component

Container Rightsizing

Beyond the old reliable “kubectl top pod”, we use Grafana/Prometheus to get usage metrics for each component. There are plenty of references out there.
There are third-party solutions too, such as Datadog, Sysdig or Kubecost.

Grafana showing us some underused resources… Isn’t it sad?

As a reminder, resource requests indicate the minimum amount of resources required, while limits define the maximum allowed allocation. This helps the Kubernetes scheduler make informed decisions when placing and scheduling pods.

Some hints here to keep in mind:

  1. Ideally, we should not create overly large pods. It is better to have a higher number of smaller pods instead, so load can be balanced better when facing peaks or quiet periods. One of the advantages of Kubernetes is that it is responsive to load, and you can scale out more pods when needed: use HPA (Horizontal Pod Autoscaling).
  2. Determine the resource requirements of your workloads by examining their historical usage patterns. Check with the developer team or architects to identify any resource-intensive applications or services that might require higher resource limits.
  3. In the case of Kafka consumers, we have to follow the main rule:
    Nº consumers ≤ Nº topic partitions
    In other words, each partition is handled by one (and just one) consumer. If you create more, they will sit idle! You can still define an HPA, but its MAX should never be bigger than the partition count.
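As an example of the Kafka rule in an HPA, take a hypothetical `orders-consumer` deployment reading from a topic with 6 partitions (both names and all numbers here are made up for illustration):

```yaml
# Hypothetical HPA for a Kafka consumer: the topic has 6 partitions,
# so maxReplicas must not exceed 6 -- any extra consumers would sit idle.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-consumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-consumer
  minReplicas: 2
  maxReplicas: 6        # == number of topic partitions
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```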

In a nutshell: adjust your pod resources based on real usage, keep replicas as low as possible (as long as this does not affect development!) and use HPA tuning where possible.
As said before, in UAT we should validate our savings under load tests, so we can be sure our changes are good enough to be promoted to PRD.

K8S Weekend Shutdown

Exactly the same idea as the database weekend shutdown can be applied to your cluster. In fact, you can use the same Jenkins job for both.
For the shutdown, first turn off the pods and then the DB; on Monday, restart in the opposite order.
In this case, what we do is as simple as scaling the pods to 0, and then rescaling them to the number of replicas they had before (taking it not from the Deployment’s replica count but from the “last-applied-configuration” annotation, so if developers had scaled some deployment for whatever reason, we restore that same scenario).

The kubectl command to scale to 0:

kubectl --namespace=$environment.namespace get deployments -o name | xargs -I {} kubectl --namespace=$environment.namespace scale {} --replicas=0

Simplified code to scale back up to the previous status:

steps.sh """
  # Iterate over all deployments in the current namespace
  for deployment in \$(kubectl get deployments -o name); do

    # Read the replica count from the last-applied-configuration annotation
    replicas=\$(kubectl get \${deployment} -o=jsonpath='{.metadata.annotations.kubectl\\.kubernetes\\.io/last-applied-configuration}' | jq -r '.spec.replicas')

    if [ -n "\${replicas}" ] && [ "\${replicas}" != "0" ] && [ "\${replicas}" != "null" ]; then
      kubectl scale --replicas=\${replicas} \${deployment}
    else
      echo "====================================="
      echo "Skipping scale for \${deployment} due to missing or zero replicas"
      echo "====================================="
    fi
  done
"""

There are alternatives and tools that can help you do this without involving Jenkins jobs.
kube-downscaler lets you do something similar using annotations:

apiVersion: v1
kind: Namespace
metadata:
  name: foo
  labels:
    name: foo
  annotations:
    downscaler/uptime: Mon-Sun 07:30-18:00 CET

This is a cool alternative too. Personally, I prefer to manage both the DB and k8s together, and give teams control in case they need to launch the environment by themselves.

And those are basically the straightforward “quick wins” we applied.
But wait, there are more ideas to investigate…

Serverless ideas… for Penniless Teams!

Serverless is on everyone’s lips.
The idea is great: pay for what you consume. Spin up the resources you need for a specific task, and no more.
Of course, this is no silver bullet for every problem, but check your landscape. Do you have components running all the time for short, infrequent tasks? Then you should probably explore serverless.

It mostly fits:
- Event-driven tasks, which may be infrequent, or even bursty and unpredictable (serverless scales with no downtime)
- Short-duration, scalable tasks
- Stateless tasks, where no state needs to be stored
- Tasks that can tolerate occasional cold-start latency (strict low-latency or real-time requirements are usually a poor fit)

In other words: predictable, frequent and regular tasks, long-running tasks, and/or stateful ones are probably out of scope, and you are likely better off keeping a dedicated resource for them.

Beware your pocket: stay alert!

Just like keeping up with household chores, monitoring costs on AWS RDS and Kubernetes is a never-ending task... It’s crucial to stay on top of your spending to avoid surprises and optimize your resources. Here are a few quick tips:

  1. Keep an eye on resource utilization: Make sure you’re not paying for more than you need. Adjust your RDS instance sizes or Kubernetes pod configurations to match your actual requirements.
  2. Embrace the power of automation: Set up autoscaling for your databases and clusters. Let your infrastructure flexibly adapt to the workload, saving you from unnecessary expenses during low-traffic periods.
  3. Tag it right: Use cost allocation tags to categorize your resources. They help you identify where the costs are coming from, making it easier to allocate expenses accurately. For example, in AWS Cost Explorer you can filter resources by tags, so you can easily check what comes from DEV vs. PRD components.
  4. Don’t overlook reserved capacity: If you have predictable workloads, consider purchasing Reserved Instances, or Savings Plans for the compute behind your Kubernetes cluster. They can provide substantial cost savings in the long run.
  5. Get friendly with monitoring tools: Leverage tools like AWS Cost Explorer, Prometheus, Grafana, or third-party options to track your spending, analyze usage patterns, and identify opportunities for optimization.

By staying vigilant and implementing these tips, you’ll keep your AWS RDS and Kubernetes costs in check while ensuring your cloud infrastructure is both efficient and budget-friendly.

Remember, even small adjustments can add up to significant savings over time. So, start taking proactive steps today and enjoy the benefits of a leaner and more cost-efficient operation!

The views, thoughts, and opinions expressed in the text belong solely to the author, and do not represent the opinion, strategy or goals of the author’s employer, organization, committee or any other group or individual.


David Carrascal
adidoescode

Tech guy, working at #adidas. Full-time parent. Trying to keep on rockin’. Old-School Heavy Metal fond. Keen on #serverless #AWS #IaC and #thebrightsideofthings