AWS Cost Optimisation
3 min read · Jun 1, 2024
Key takeaways on slimming the bill
Things to keep in mind:
- Always ask, “is this worth it?” — the answer shouldn’t be “for the time being”
- Understand your application and development requirements, and make tradeoffs accordingly. Optimising is a fine line away from being a foolish miser — that fine line is called an outage.
- Know which parts of your system can tolerate faults. Not everything needs to be accounted for: non-prod workloads have an appetite for data loss, logging solutions can tolerate downtime, and read latencies from buffers won’t immediately break the system.
- Reservations on AWS are super cost-efficient. [WAIT] — first answer whether you need the system for the next 365 days; an absolute YES is the only go-ahead. AWS allows no backsies.
- Go 80–20. Address the low-hanging fruit before you come up with architectural, code, or breaking infrastructure changes.
- You don’t need to solve every problem — systems scale, and as long as spending is in line with business growth, you’re doing great, buddy!
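The reservation point above is really a break-even question. A minimal sketch of that arithmetic — the prices here are hypothetical placeholders, not AWS pricing; plug in real numbers for your instance type and region:

```python
# Break-even sketch for a 1-year reservation vs On-Demand.
# All dollar amounts below are HYPOTHETICAL example values.

def breakeven_months(on_demand_monthly: float, reserved_annual: float) -> float:
    """Months of steady On-Demand usage after which the reservation wins."""
    return reserved_annual / on_demand_monthly

# Example: $100/month On-Demand vs $720/year reserved (made-up prices).
months = breakeven_months(on_demand_monthly=100.0, reserved_annual=720.0)
print(f"Reservation pays off after {months:.1f} months of steady use")
```

If your honest answer to “will this run all year?” puts you past the break-even month, reserve; otherwise stay On-Demand.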
Logging/Monitoring
- CloudWatch log data retention: keep only what you need and expire the rest. AWS defaults log groups to never expire.
- OpenSearch: use lifecycle policies to move data to cheaper storage tiers (add a hot → warm → cold storage lifecycle)
- Avoid large queries: by default, DENY running CloudWatch data queries on a large single index for all users. You will need to add rules to your account/org-wide roles.
- Moving out of a managed solution not only gives you more control but can also be cheaper. For example, spend some time developing expertise in ELK instead of going with AWS OpenSearch.
- Standards: pick logging libraries and, over time, mandate their use across all applications as a best practice
- Optimise at the source — account for runaway conditions and build better-behaved microservices
- Alarms: If a certain log group or index is growing rapidly, report it.
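The last bullet — reporting a rapidly growing log group or index — boils down to comparing storage snapshots. A minimal sketch of that check; the group names, sizes, and the 50% growth threshold are all made up for illustration:

```python
# Flag log groups whose stored bytes grew past a threshold between
# two measurements (e.g. yesterday vs today). Example data only.

def runaway_log_groups(before, after, growth_threshold=0.5):
    """Return names of log groups that grew by more than the threshold."""
    flagged = []
    for group, old_bytes in before.items():
        new_bytes = after.get(group, old_bytes)
        if old_bytes > 0 and (new_bytes - old_bytes) / old_bytes > growth_threshold:
            flagged.append(group)
    return flagged

yesterday = {"/app/api": 10_000, "/app/worker": 5_000}
today = {"/app/api": 11_000, "/app/worker": 9_000}  # worker grew 80%
print(runaway_log_groups(yesterday, today))  # -> ['/app/worker']
```

In practice you would feed this from whatever metric store holds your per-group storage numbers and wire the output into an alert.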
Storage
- Do not put everything on autoscale and sleep peacefully — runaway conditions will choke your budget.
- Know your peak load and decide how much to provision/over-provision. A headroom of 20–30% over your peak traffic requirements is okay. Going 2x of peak is stupid.
- Use gp3 volumes — cheaper than gp2 and with higher baseline throughput
- Clean up all unattached volumes and unused public IPs
- You don’t need all data since inception — know what to back up and for how long
- Move data from S3 Standard to Glacier using lifecycle policies
- Clean up DynamoDB data by employing a TTL (time-to-live) attribute
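DynamoDB's TTL feature deletes items once the epoch-seconds timestamp in the table's designated TTL attribute is in the past, so cleanup is just stamping items at write time. A minimal sketch — the attribute name `expires_at` and the 30-day window are arbitrary example choices:

```python
# Stamp DynamoDB items with a TTL attribute (epoch seconds).
# "expires_at" must match the TTL attribute configured on the table.
import time

def with_ttl(item, days=30, now=None):
    """Add an epoch-seconds TTL attribute to a DynamoDB item dict."""
    now = time.time() if now is None else now
    item["expires_at"] = int(now + days * 86_400)
    return item

record = with_ttl({"pk": "user#42", "event": "login"}, days=30)
```

Note that TTL deletion is lazy (items can linger for some time past expiry), so treat it as cost cleanup, not as a security control.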
Compute
- Pick the latest generation in the instance family
- Use Graviton (ARM-based) instance types where your stack supports them
- Switch from On-Demand to Spot for interruption-tolerant workloads
- Use instance reservations
- Buy savings plans
- Use Lambda for on-demand or scheduled runs instead of always-on instances
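The Lambda bullet is easy to sanity-check with back-of-envelope numbers: Lambda bills per request and per GB-second of execution, so infrequent jobs cost pennies. The rates below are approximate placeholders, not current AWS pricing — check the pricing page for your region:

```python
# Rough monthly cost of a scheduled job on Lambda. Rates are
# APPROXIMATE placeholders; verify against current AWS pricing.

def lambda_monthly_cost(invocations, duration_s, memory_gb,
                        gb_second_rate=0.0000167, request_rate=0.0000002):
    """Estimate monthly Lambda cost from compute (GB-seconds) + requests."""
    gb_seconds = invocations * duration_s * memory_gb
    return gb_seconds * gb_second_rate + invocations * request_rate

# Hypothetical: a daily 60-second job with 1 GB memory, 30 runs/month.
cost = lambda_monthly_cost(invocations=30, duration_s=60, memory_gb=1.0)
print(f"~${cost:.2f}/month")
```

Compare that against the monthly price of the smallest instance you would otherwise leave running 24/7, and the decision usually makes itself.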
Databases
- Avoid replicas, snapshots, and multi-AZ deployments for non-prod workloads
- Know what to snapshot and how long to keep them
- Vacuum to reclaim disk space before scaling up storage
- Stay close to the engine’s LTS version for the most efficient usage
- Pick instance types based on your use case — know if you are solving for latency or storage efficiency or something else.
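“Know what to snapshot and how long to keep them” usually turns into a tiered retention rule. A minimal sketch of one such rule — keep daily snapshots for a week and weekly (Monday) snapshots for a month; the tiers and windows are example values, not a recommendation:

```python
# Tiered snapshot retention: daily for 7 days, weekly for 30 days.
# Windows are EXAMPLE values; tune them to your recovery requirements.
from datetime import date, timedelta

def snapshots_to_keep(snapshot_dates, today):
    """Return the subset of snapshot dates the policy retains."""
    keep = set()
    for d in snapshot_dates:
        age = (today - d).days
        if age <= 7:                          # daily tier: last week
            keep.add(d)
        elif age <= 30 and d.weekday() == 0:  # weekly tier: Mondays
            keep.add(d)
    return keep

today = date(2024, 6, 1)
dates = [today - timedelta(days=n) for n in range(0, 40)]
kept = snapshots_to_keep(dates, today)
```

Everything outside `kept` is a deletion candidate — that set difference is exactly what the cleanup automation should act on.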
HELP yourself by:
- Having monitors in place to inform you of runaway scenarios
- Automating cleanup of unused resources (volumes, EC2 instances, public IPs, etc.)
- Tagging everything — it answers operational questions like who owns what, what to keep, and what to clean
- Regularly monitoring costs (lack of visibility creates bigger bills than bad engineering)
- Setting a budget in AWS so that anything crossing the high watermarks comes to notice
- Educating teams about best practices and leading by example
- Having efficient IaC in place and avoiding exceptions
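The budget-watermark idea can be sketched as a tiny classifier over month-to-date spend. The 80%/100% thresholds here are example values — AWS Budgets lets you configure your own alert thresholds:

```python
# Classify spend against a budget with example watermarks:
# warn at 80% of budget, alert at 100%. Thresholds are arbitrary.

def budget_status(spend, budget, warn_at=0.8, alert_at=1.0):
    """Return 'ok', 'warn', or 'alert' for month-to-date spend."""
    ratio = spend / budget
    if ratio >= alert_at:
        return "alert"
    if ratio >= warn_at:
        return "warn"
    return "ok"

print(budget_status(850.0, 1000.0))  # -> warn
```

Run something like this daily against your cost data and route `warn`/`alert` to the owning team (which the tagging bullet above makes possible).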