Real-life AWS infrastructure cost optimization strategy
How we keep good Amazon Web Services billing hygiene at Teads
AWS recently announced per-second billing for its EC2 instances and EBS volumes. This is perfect timing to talk about cost optimization. After a short intro, we will guide you through some real-world examples and best practices that we use at Teads to optimize our infrastructure costs.
The cloud computing opportunity and its traps
One of the advantages of cloud computing is its ability to fit the infrastructure to your needs: you only pay for what you really use. That is how most hyper-growth startups have managed their incredible ascents.
Most companies migrating to the cloud embrace the “lift & shift” strategy, replicating what was once on premises.
You most likely won’t save a penny with this first step.
Main reasons being:
- Your applications do not support elasticity yet,
- Your applications rely on complex backends that you need to migrate along with them (RabbitMQ, Cassandra, Galera clusters, etc.),
- Your code relies on being executed in a known network environment and most likely uses NFS as a distributed storage mechanism.
Once in the cloud, you need to “cloudify” your infrastructure.
Then, and only then, will you have access to virtually infinite computing power and storage.
Watch out, this apparent freedom can lead to very serious drifts: over provisioning, under optimizing your code or even forgetting to “turn off the lights” by letting that small PoC run more than necessary using that very nice r3.8xlarge instance.
Essentially, you have just replaced your need for capacity planning with a need for cost monitoring and optimization.
The dark side of cloud computing
At Teads we were “born in the cloud” and we are very happy about it.
One of our biggest pains today with our cloud providers is the complexity of their pricing.
It is designed to look very simple at first glance (usually based on simple metrics like $/GB/month or $/hour or, more recently, $/second), but as you expand into a multi-region infrastructure mixing lots of products, you will have a hard time tracking the ever-growing cost of your cloud infrastructure.
For example, the cost of putting a file on S3 and serving it from there includes four different lines of billing:
- Actual storage cost (80% of your bill)
- Cost of the HTTP PUT request (2% of your bill)
- Cost of the many HTTP GET requests (3% of your bill)
- Cost of the data transfer (15% of your bill)
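To make these four lines concrete, here is a minimal sketch that adds them up for one month. The unit prices are hypothetical parameters (real ones depend on the region and change over time), so the shares it returns will not necessarily match the percentages above.

```python
from dataclasses import dataclass


@dataclass
class S3Prices:
    """Hypothetical unit prices, to be replaced with the ones on your bill."""
    storage_per_gb_month: float
    put_per_1000: float
    get_per_1000: float
    transfer_out_per_gb: float


def s3_monthly_bill(prices, stored_gb, puts, gets, transferred_gb):
    """Return the four billing lines, their total and the share of each line."""
    lines = {
        "storage": stored_gb * prices.storage_per_gb_month,
        "put_requests": puts / 1000 * prices.put_per_1000,
        "get_requests": gets / 1000 * prices.get_per_1000,
        "data_transfer": transferred_gb * prices.transfer_out_per_gb,
    }
    total = sum(lines.values())
    # Avoid dividing by zero when nothing was billed.
    shares = {name: (cost / total if total else 0.0) for name, cost in lines.items()}
    return lines, total, shares
```

Feeding it the volumes of a single bucket quickly shows which of the four lines dominates for that workload.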
Our take on Cost Optimization
- Focus on structural costs - Never block a short-term cost increase that would speed up the business or enable a technical migration.
- Everyone is responsible - Provide tooling to each team to make them autonomous on their cost optimization.
The limit of cost optimization for us is when it drives more complexity in the code and less agility in the future, for a limited ROI.
This way of thinking also helps us tackle cost optimization in our day-to-day development.
Overall we can extend this famous quote from Kent Beck:
“Make it work, make it right, make it fast” … and then cost efficient.
It is of the utmost importance to keep strict billing hygiene and know your daily spend.
In some cases, it will help you identify suspicious uptrends, like a service stuck in a loop and writing a huge volume of logs to S3, or a developer who left their test infrastructure up and running over a weekend.
You need to arm yourself with detailed monitoring of your costs and spend time looking at it every day.
You have several options to do so, starting with AWS’s own tools:
- Billing Dashboard, giving a high level view of your main costs (Amazon S3, Amazon EC2, etc.) and a rarely accurate forecast, at least for us. Overall, it’s not detailed enough to be of use for serious monitoring.
- Detailed Billing Report, this feature has to be enabled in your account preferences. It sends you a daily gzipped .csv file containing one line per billable item since the beginning of the month (e.g., instance A sent X MB of data to the Internet).
The detailed billing is an interesting source of data once you have added custom tags to your services so that you can group your costs by feature / application / part of your infrastructure.
Be aware that this file lags by approximately two days, as it takes time for AWS to compute it.
UPDATE (June ‘18) Detailed Billing is officially deprecated, use the Cost and Usage Report instead.
- Trusted Advisor, available at the business and enterprise support level, also includes a cost section with interesting optimization insights.
- Cost Explorer, an interesting tool since its update in August 2017. It can be used to quickly identify trends, but it is still limited as you cannot build complete dashboards with it. It is mainly a reporting tool.
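Grouping the detailed billing lines by a custom tag can be sketched in a few lines. This is only an illustration: the column names `user:team` and `Cost` are assumptions, so check the header of your own report before reusing it.

```python
import csv
import io
from collections import defaultdict


def cost_by_tag(csv_text, tag_column="user:team", cost_column="Cost"):
    """Sum the cost column per tag value; rows without a tag go to '(untagged)'."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            value = float(row.get(cost_column) or "")
        except ValueError:
            continue  # skip summary rows that carry no numeric cost
        tag = row.get(tag_column) or "(untagged)"
        totals[tag] += value
    return dict(totals)
```

The "(untagged)" bucket is worth watching: it tells you how much of the bill you cannot attribute to a team yet.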
Then you have several other external options to monitor the costs of your infrastructure:
- SaaS products like Cloudyn / Cloudhealth. These solutions are really well made and will tell you how to optimize your infrastructure. Their pricing model is based on a percentage of your annual AWS bill, not on the savings that the tools will help you make, which was a show stopper for us.
- The open source project Ice, initially developed by Netflix for their own use. Recently, the leadership of this project was transferred to the French startup Teevity, which also offers a SaaS version for a fixed fee. This could be a great option as it also handles GCP and Azure.
Building our own monitoring solution
At Teads we decided to go DIY using the detailed billing files.
We built a small Lambda function that ingests the detailed billing file into Redshift every day. This tool helps us slice and dice our data along numerous dimensions to dive deeper into our costs. We also use it to spot suspicious usage uptrends, down to the service level.
On top of that, we still use a spreadsheet to integrate the reservation upfronts in order to get a complete overview and the full daily costs.
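As an illustration of the uptrend check mentioned above, here is a minimal sketch that flags days whose cost is well above the average of the preceding days. It is not our production code: the 7-day window and the 1.5 threshold are hypothetical defaults, and the input is assumed to be a list of (day, cost) pairs already aggregated per service.

```python
from statistics import mean


def suspicious_days(daily_costs, window=7, threshold=1.5):
    """Flag days whose cost exceeds `threshold` times the trailing average.

    daily_costs: list of (day, cost) tuples ordered by day.
    Returns a list of (day, cost, baseline) tuples.
    """
    flagged = []
    for i in range(window, len(daily_costs)):
        baseline = mean([cost for _, cost in daily_costs[i - window:i]])
        day, cost = daily_costs[i]
        if baseline > 0 and cost > threshold * baseline:
            flagged.append((day, cost, baseline))
    return flagged
```

A fixed ratio is crude (it ignores weekly seasonality, for instance), but it is enough to catch a service stuck in a loop.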
Now that we have the data, how do we optimize?
Here are the 5 pillars of our cost optimization strategy.
1 - Reserved Instances (RIs)
First things first, you need to reserve your instances. Technically speaking, RIs will only make sure that you have access to the reserved resources.
At Teads our reservation strategy is based on two reservation batches per year, and we are also evaluating higher frequencies (3 to 4 batches per year).
The right frequency should be determined by the best compromise between flexibility (handling growth, having leaner financial streams) and the ability to manage the reservations efficiently.
In the end, managing reservations is a time consuming task.
Reservation is mostly a financial tool: you commit to pay for resources for 1 or 3 years and get a discount over the on-demand price:
- You have two types of reservations: standard and convertible. Convertible reservations let you change the instance family but come with a smaller discount (on average 54%, versus 75% for standard). They are the best option to leverage future instance families in the long run.
- Reservations come with three different payment options: Full Upfront, Partial Upfront, and No Upfront. With partial and no upfront, you pay the remaining balance monthly over the term. We prefer partial upfront since the discount rate is really close to the full upfront one (e.g. 56% vs 55% for a convertible 3-year term with partial).
- Don’t forget that you can reserve a lot of things and not only Amazon EC2 instances: Amazon RDS, Amazon ElastiCache, Amazon Redshift, Amazon DynamoDB, etc.
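Since a reservation is paid for whether or not the instance runs, a quick way to reason about it is the break-even utilization: with a discount d, the reservation wins as soon as the instance runs more than 1 - d of the term. The sketch below works this out. The on-demand price is a hypothetical input, and upfront payment timing is ignored.

```python
HOURS_PER_YEAR = 8760  # 365 days, leap years ignored


def reservation_summary(on_demand_hourly, discount, term_years, utilization):
    """Compare a reservation with on-demand pricing for a single instance.

    discount: fraction off the on-demand price (e.g. 0.55 for 55%).
    utilization: fraction of the term during which the instance actually runs.
    """
    hours = HOURS_PER_YEAR * term_years
    # The reservation is paid for the whole term, used or not.
    reserved_cost = on_demand_hourly * (1 - discount) * hours
    # On-demand is only paid for the hours actually used.
    on_demand_cost = on_demand_hourly * hours * utilization
    return {
        "reserved_cost": reserved_cost,
        "on_demand_cost": on_demand_cost,
        "savings": on_demand_cost - reserved_cost,
        "break_even_utilization": 1 - discount,
    }
```

With the 55% discount quoted above, the break-even utilization is 45%: an instance running less than that over the term would have been cheaper on-demand.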
2 - Optimize Amazon S3
The second source of optimization is the object management on S3. Storage is cheap and infinite, but it is not a valid reason to keep all your data there forever. Many companies do not clean their data on S3, even though several trivial mechanisms could be used:
The Object Lifecycle option enables you to set simple rules for objects in a bucket:
- Infrequent Access Storage (IAS): for application logs, set the object storage class to Infrequent Access Storage after a few days.
IAS will cut the storage cost by a factor of two but comes with a higher cost for requests.
The main drawback of IAS is that every object is billed as at least 128 KB, so if you store a lot of smaller objects it can end up more expensive than standard storage.
- Glacier: Amazon Glacier is a very long term archiving service, also called cold storage.
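The small-object caveat of IAS is easy to check with a little arithmetic. The sketch below assumes, as stated above, that IAS halves the storage price and uses a 128 KB minimum billable size. It works in arbitrary price units and ignores request costs.

```python
MIN_BILLABLE_KB = 128  # minimum size billed per object in IAS, as quoted above


def ia_vs_standard(object_kb, standard_price=1.0, ia_price_ratio=0.5):
    """Return (standard_cost, ia_cost) for storing one object.

    standard_price is an arbitrary price per KB; ia_price_ratio=0.5 encodes
    the "factor of two" saving on storage.
    """
    standard_cost = object_kb * standard_price
    ia_cost = max(object_kb, MIN_BILLABLE_KB) * standard_price * ia_price_ratio
    return standard_cost, ia_cost
```

Under these assumptions the two costs meet at 64 KB: below that size, an object costs more to store in IAS than in standard storage.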
Here is a nice article from Cloudability if you want to dig deeper into optimizing storage costs and compare the different options.
Also, don’t forget to set up a delete policy when you think you won’t need those files anymore.
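Such rules can be expressed as a lifecycle configuration and pushed with boto3's `put_bucket_lifecycle_configuration`. The sketch below is only an example: the `logs/` prefix and the 30 / 90 / 365 day thresholds are hypothetical, and S3 enforces minimum delays before some transitions (30 days before a move to Standard-IA, as far as we know), so check the current limits.

```python
def log_lifecycle_rule(prefix="logs/", ia_after_days=30,
                       glacier_after_days=90, delete_after_days=365):
    """Build a lifecycle configuration: Standard-IA, then Glacier, then delete."""
    if not ia_after_days < glacier_after_days < delete_after_days:
        raise ValueError("expected ia_after_days < glacier_after_days < delete_after_days")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


def apply_rule(bucket, configuration):
    """Push the configuration to a bucket (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rule can be built without AWS access

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )
```

Keeping the rule builder separate from the API call makes it easy to review the rule, or store it in version control, before applying it.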
Finally, enabling a VPC Endpoint for your Amazon S3 buckets will suppress the data transfer costs between Amazon S3 and your instances.
3 - Leverage the Spot market
Spot instances enable you to use AWS’s spare computing power at a heavily discounted price. This can be very interesting depending on your workloads.
Spot instances are bought using a sort of auction model: if your bid is above the spot market rate, you get the instance and only pay the market price. However, these instances can be reclaimed if the market price exceeds your bid.
At Teads, we usually bid the on-demand price to be sure that we can get the instance. We only pay the “market” rate, which gives us a rebate of up to 90%.
It is worth noting that:
- You get a 2-minute termination notice before your spot instance is reclaimed, but you need to look for it.
- Spot Instances are easy to use for non-critical batch workloads and interesting for data processing. They are a very good match with Amazon Elastic MapReduce.
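"Looking for" the termination notice means polling the instance metadata service from the instance itself. Here is a minimal sketch of such a poller. It assumes the legacy `spot/termination-time` metadata path and unauthenticated (IMDSv1) access. Newer setups expose `spot/instance-action` and may require a session token, so treat the URL and the 5-second interval as assumptions to adapt.

```python
import time
import urllib.error
import urllib.request

# Metadata path that only exists once the instance is marked for termination.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_time(url=TERMINATION_URL, timeout=1.0):
    """Return the termination timestamp as text, or None if no notice is posted."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode().strip() or None
    except urllib.error.HTTPError as error:
        if error.code == 404:  # no notice yet
            return None
        raise


def wait_for_notice(on_notice, fetch=fetch_termination_time,
                    sleep=time.sleep, interval=5, max_polls=None):
    """Poll until a notice appears, then call on_notice(timestamp) once."""
    polls = 0
    while max_polls is None or polls < max_polls:
        when = fetch()
        if when is not None:
            on_notice(when)
            return when
        polls += 1
        sleep(interval)
    return None
```

The `on_notice` callback is where you would drain connections or checkpoint a batch job. `fetch` and `sleep` are injectable so the loop can be tested off EC2.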
4 - Data transfer
Back in the physical world, you were used to paying for the network link between your data center and the Internet.
Whatever data you sent through that link was free of charge.
In the cloud, data transfer can grow to become really expensive.
You are charged for data transfer from your services to the Internet but also between AWS Availability Zones (AZs).
This can quickly become an issue when using distributed systems like Kafka and Cassandra, which need to be deployed in different zones to be highly available and constantly exchange data over the network.
- If you have instances communicating with each other, you should try to locate them in the same AZ
- Use managed services like Amazon DynamoDB or Amazon RDS, as their inter-AZ replication cost is built into their pricing
- If you serve more than a few hundred terabytes per month, you should discuss pricing with your account manager
- Use Amazon CloudFront (AWS’s CDN) as much as you can when serving static files. The data transfer out rates are cheaper from CloudFront and free between CloudFront and EC2 or S3.
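To get a feel for the inter-AZ charges of a replicated system such as Kafka or Cassandra, here is a rough estimate. It assumes every replica sits in a different AZ from the node that received the write, and `price_per_gb` is a hypothetical input: the total charged for one GB crossing an AZ boundary, to be taken from your own bill.

```python
def inter_az_replication_cost(ingest_gb_per_day, replication_factor,
                              price_per_gb, days=30):
    """Estimate the monthly cost of replicating ingested data across AZs.

    Each ingested GB is copied to (replication_factor - 1) replicas,
    each copy crossing an AZ boundary once.
    """
    copies = max(replication_factor - 1, 0)
    return ingest_gb_per_day * copies * price_per_gb * days
```

This only counts replication traffic. Consumers reading from another AZ add their own transfer on top of it.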
5 - Unused infrastructure
With a growing infrastructure, you can rapidly forget to turn off unused and idle things:
- Detached Elastic IPs (EIPs): they are free when attached to a running EC2 instance, but you have to pay for them when they are not.
- EBS volumes created with your EC2 instances are preserved when you stop the instances. As you will rarely re-attach a root EBS volume, you can delete these volumes. Snapshots also tend to pile up over time, so you should look into them too.
- A Load Balancer (ELB) with no traffic is easy to detect and obviously useless. Still, it will cost you ~$20/month.
- Instances with no network activity over the last week. In a cloud context, keeping them running doesn’t make a lot of sense.
Trusted Advisor can help you in detecting these unnecessary expenses.
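If you do not have access to Trusted Advisor, the first two items of the list can be checked with a short script. The sketch below filters the responses of the EC2 `describe_addresses` and `describe_volumes` calls. It is a starting point under our own assumptions (a single region, default credentials), not a complete audit tool.

```python
def unattached_elastic_ips(addresses_response):
    """Public IPs of Elastic IPs that are not associated with anything."""
    return [
        address["PublicIp"]
        for address in addresses_response.get("Addresses", [])
        if not address.get("AssociationId") and not address.get("InstanceId")
    ]


def unattached_volumes(volumes_response):
    """IDs of EBS volumes in the 'available' state, i.e. attached to nothing."""
    return [
        volume["VolumeId"]
        for volume in volumes_response.get("Volumes", [])
        if volume.get("State") == "available"
    ]


def scan(region_name):
    """Query one region (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the filters above work without AWS access

    ec2 = boto3.client("ec2", region_name=region_name)
    volumes = []
    for page in ec2.get_paginator("describe_volumes").paginate():
        volumes.extend(unattached_volumes(page))
    return {
        "elastic_ips": unattached_elastic_ips(ec2.describe_addresses()),
        "volumes": volumes,
    }
```

The script only lists candidates. Deleting them is better left to a human who knows whether that volume still holds the only copy of something.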
If you want to know more about Engineering at Teads: