AWS Cost Optimisation
3 min read · Jun 1, 2024
Key takeaways on slimming the bill
Things to keep in mind:
- Always ask, “is this worth it?” — the answer shouldn’t be “for the time being”
- Understand your application and development requirements, and make tradeoffs accordingly. Optimising is a fine line away from being a foolish miser — that fine line is called an outage.
- Know which parts of your system can tolerate faults. Not everything needs to be accounted for: non-prod workloads have an appetite for data loss, logging solutions can tolerate downtime, and read latencies from buffers won’t immediately break the system.
- Reservations on AWS are super cost-efficient. [WAIT] — first answer whether you need the system for the next 365 days; an absolute YES is the only go-ahead. AWS allows no backsies.
- Go 80–20. Address the low-hanging fruit before you come up with architectural, code, or breaking infrastructure changes.
- You don’t need to solve every problem — systems scale, and as long as spending is in line with business growth, you’re doing great, buddy!
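The reservation point above is really a break-even question. A minimal sketch of that arithmetic — the prices here are hypothetical placeholders, not AWS pricing; plug in real numbers for your instance type and region:

```python
# Break-even sketch for a 1-year reservation vs On-Demand.
# All dollar amounts below are HYPOTHETICAL example values.

def breakeven_months(on_demand_monthly: float, reserved_annual: float) -> float:
    """Months of steady On-Demand usage after which the reservation wins."""
    return reserved_annual / on_demand_monthly

# Example: $100/month On-Demand vs $720/year reserved (made-up prices).
months = breakeven_months(on_demand_monthly=100.0, reserved_annual=720.0)
print(f"Reservation pays off after {months:.1f} months of steady use")
```

If your honest answer to “will this run all year?” puts you past the break-even month, reserve; otherwise stay On-Demand.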
Logging/Monitoring
- CloudWatch log data retention: keep only what you need and expire the rest. AWS defaults log groups to never expire.
- OpenSearch: use lifecycle policies to move data to cheaper storage tiers (add a hot → warm → cold storage lifecycle)
- Avoid large queries: by default, DENY running CloudWatch data queries on a large single index for all users. You will need to add rules to your account/org-wide roles.
- Moving out of a managed solution not only gives you more control but can also be cheaper. For example, spend some time developing expertise in ELK instead of going with AWS OpenSearch.
- Standards: pick logging libraries and, over time, mandate their use across all applications as a best practice
- Optimise at the source — account for runaway conditions and build better-behaved microservices
- Alarms: If a certain log group or index is growing rapidly, report it.
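The last bullet — reporting a rapidly growing log group or index — boils down to comparing storage snapshots. A minimal sketch of that check; the group names, sizes, and the 50% growth threshold are all made up for illustration:

```python
# Flag log groups whose stored bytes grew past a threshold between
# two measurements (e.g. yesterday vs today). Example data only.

def runaway_log_groups(before, after, growth_threshold=0.5):
    """Return names of log groups that grew by more than the threshold."""
    flagged = []
    for group, old_bytes in before.items():
        new_bytes = after.get(group, old_bytes)
        if old_bytes > 0 and (new_bytes - old_bytes) / old_bytes > growth_threshold:
            flagged.append(group)
    return flagged

yesterday = {"/app/api": 10_000, "/app/worker": 5_000}
today = {"/app/api": 11_000, "/app/worker": 9_000}  # worker grew 80%
print(runaway_log_groups(yesterday, today))  # -> ['/app/worker']
```

In practice you would feed this from whatever metric store holds your per-group storage numbers and wire the output into an alert.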
Storage
- Do not put everything on autoscale and sleep peacefully — runaway conditions will choke your budget.
- Know your peak load and decide how much to provision/over-provision. A headroom of 20–30% over your peak traffic requirements is okay. Going 2x of peak is stupid.
- Use gp3 volumes — cheaper than gp2 and with higher baseline throughput
- Clean up all unattached volumes and unused public IPs
- You don’t need all data since inception — know what to back up and for how long
- Move data from S3 Standard to Glacier using lifecycle policies
- Clean up DynamoDB data by employing a TTL (time-to-live) attribute
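DynamoDB's TTL feature deletes items once the epoch-seconds timestamp in the table's designated TTL attribute is in the past, so cleanup is just stamping items at write time. A minimal sketch — the attribute name `expires_at` and the 30-day window are arbitrary example choices:

```python
# Stamp DynamoDB items with a TTL attribute (epoch seconds).
# "expires_at" must match the TTL attribute configured on the table.
import time

def with_ttl(item, days=30, now=None):
    """Add an epoch-seconds TTL attribute to a DynamoDB item dict."""
    now = time.time() if now is None else now
    item["expires_at"] = int(now + days * 86_400)
    return item

record = with_ttl({"pk": "user#42", "event": "login"}, days=30)
```

Note that TTL deletion is lazy (items can linger for some time past expiry), so treat it as cost cleanup, not as a security control.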
Compute
- Pick the latest generation in the instance family
- Use Graviton (ARM-based) instance types where your stack supports them
- Switch from On-Demand to Spot for interruption-tolerant workloads
- Use instance reservations
- Buy savings plans
- Use Lambda for on-demand or scheduled runs instead of always-on instances
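The Lambda bullet is easy to sanity-check with back-of-envelope numbers: Lambda bills per request and per GB-second of execution, so infrequent jobs cost pennies. The rates below are approximate placeholders, not current AWS pricing — check the pricing page for your region:

```python
# Rough monthly cost of a scheduled job on Lambda. Rates are
# APPROXIMATE placeholders; verify against current AWS pricing.

def lambda_monthly_cost(invocations, duration_s, memory_gb,
                        gb_second_rate=0.0000167, request_rate=0.0000002):
    """Estimate monthly Lambda cost from compute (GB-seconds) + requests."""
    gb_seconds = invocations * duration_s * memory_gb
    return gb_seconds * gb_second_rate + invocations * request_rate

# Hypothetical: a daily 60-second job with 1 GB memory, 30 runs/month.
cost = lambda_monthly_cost(invocations=30, duration_s=60, memory_gb=1.0)
print(f"~${cost:.2f}/month")
```

Compare that against the monthly price of the smallest instance you would otherwise leave running 24/7, and the decision usually makes itself.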
Databases
- Avoid replicas, snapshots, and multi-AZ deployments for non-prod workloads
- Know what to snapshot and how long to keep them
- Vacuum to reclaim disk space before scaling up storage
- Stay close to the engine’s LTS version for the most efficient usage
- Pick instance types based on your use case — know if you are solving for latency or storage efficiency or something else.
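“Know what to snapshot and how long to keep them” usually turns into a tiered retention rule. A minimal sketch of one such rule — keep daily snapshots for a week and weekly (Monday) snapshots for a month; the tiers and windows are example values, not a recommendation:

```python
# Tiered snapshot retention: daily for 7 days, weekly for 30 days.
# Windows are EXAMPLE values; tune them to your recovery requirements.
from datetime import date, timedelta

def snapshots_to_keep(snapshot_dates, today):
    """Return the subset of snapshot dates the policy retains."""
    keep = set()
    for d in snapshot_dates:
        age = (today - d).days
        if age <= 7:                          # daily tier: last week
            keep.add(d)
        elif age <= 30 and d.weekday() == 0:  # weekly tier: Mondays
            keep.add(d)
    return keep

today = date(2024, 6, 1)
dates = [today - timedelta(days=n) for n in range(0, 40)]
kept = snapshots_to_keep(dates, today)
```

Everything outside `kept` is a deletion candidate — that set difference is exactly what the cleanup automation should act on.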
HELP yourself by:
- Having monitors in place to inform you of runaway scenarios
- Automating cleanup of unused resources (volumes, EC2 instances, public IPs, etc.)
- Tagging everything — it answers operational questions like who owns what, what to keep, and what to clean
- Regularly monitoring costs (lack of visibility creates bigger bills than bad engineering)
- Setting a budget in AWS so that anything crossing the high watermarks comes to notice
- Educating teams about best practices and leading by example
- Having efficient IaC in place and avoiding exceptions
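The budget-watermark idea can be sketched as a tiny classifier over month-to-date spend. The 80%/100% thresholds here are example values — AWS Budgets lets you configure your own alert thresholds:

```python
# Classify spend against a budget with example watermarks:
# warn at 80% of budget, alert at 100%. Thresholds are arbitrary.

def budget_status(spend, budget, warn_at=0.8, alert_at=1.0):
    """Return 'ok', 'warn', or 'alert' for month-to-date spend."""
    ratio = spend / budget
    if ratio >= alert_at:
        return "alert"
    if ratio >= warn_at:
        return "warn"
    return "ok"

print(budget_status(850.0, 1000.0))  # -> warn
```

Run something like this daily against your cost data and route `warn`/`alert` to the owning team (which the tagging bullet above makes possible).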