Real-life AWS infrastructure cost optimization strategy

How we keep a good Amazon Web Services billing hygiene at Teads

Damien Pacaud
Oct 30, 2017 · 9 min read
Image for post
Image for post

AWS recently announced its new per second billing for its EC2 instances and EBS volumes. This is perfect timing to talk about cost optimization. After a short intro we will guide you through some real world examples and best practices that we use at Teads to optimize our infrastructure costs.

The cloud computing opportunity and its traps

Most companies migrating to the cloud embrace the “lift & shift” strategy, replicating what was once on premises.

You most likely won’t save a penny with this first step.

Main reasons being:

  • Your applications do not support elasticity yet,
  • Your applications rely on complex backend you need to migrate with (RabbitMQ, Cassandra, Galera clusters, etc.),
  • Your code relies on being executed in a known network environment and most likely uses NFS as distributed storage mechanism.

Once in the cloud, you need to “cloudify” your infrastructure.

Then, and only then, will you have access to virtually infinite computing power and storage.

Watch out, this apparent freedom can lead to very serious drifts: over provisioning, under optimizing your code or even forgetting to “turn off the lights” by letting that small PoC run more than necessary using that very nice r3.8xlarge instance.

Essentially, you have just replaced your need for capacity planning by a need for cost monitoring and optimization.

The dark side of cloud computing

One of our biggest pain today with our cloud providers is the complexity of their pricing.

It is designed to look very simple at the first glance (usually based on simple metrics like $/GB/month or $/hour or, more recently, $/second) but as you expand and go into a multi-region infrastructure mixing lots of products, you will have a hard time tracking the ever-growing cost of your cloud infrastructure.

For example, the cost of putting a file on S3 and serving it from there includes four different lines of billing:

  • Actual storage cost (80% of your bill)
  • Cost of the HTTP PUT request (2% of your bill)
  • Cost of the many HTTP GET requests (3% of your bill)
  • Cost of the data transfer (15% of your bill)

Our take on Cost Optimization

  • Everyone is responsible - Provide tooling to each team to make them autonomous on their cost optimization.

The limit of cost optimization for us is when it drives more complexity in the code and less agility in the future, for a limited ROI.
This way of thinking also helps us to tackle cost optimisation in our day to day developments.

Overall we can extend this famous quote from Kent Beck:

“Make it work, make it right, make it fast” … and then cost efficient.

Billing Hygiene

In some cases, it will help you identify suspicious uptrends, like a service stuck in a loop and writing a huge volume of logs to S3 or a developer that left its test infrastructure up & running during a week-end.

You need to arm yourself with a detailed monitoring of your costs and spend time looking at it every day.

You have several options to do so, starting with AWS’s own tools:

  • Billing Dashboard, giving a high level view of your main costs (Amazon S3, Amazon EC2, etc.) and a rarely accurate forecast, at least for us. Overall, it’s not detailed enough to be of use for serious monitoring.
  • Detailed Billing Report, this feature has to be enabled in your account preferences. It sends you a daily gzipped .csv file containing one line per billable item since the beginning of the month (e.g., instance A sent X Mb of data on the Internet).
    The detailed billing is an interesting source of data once you have added custom tags to your services so that you can group your costs by feature / application / part of your infrastructure.
    Be aware that this file is accurate within a delay of approximately two days as it takes time for AWS to compute the files.
    UPDATE (June ‘18) Detailed Billing is officially deprecated, use the Cost and Usage Report instead.
  • Trusted Advisor, available at the business and enterprise support level, also includes a cost section with interesting optimization insights.
Image for post
Image for post
Trusted Advisor cost section - Courtesy of AWS
  • Cost Explorer, an interesting tool since its update in august 2017. It can be used to quickly identify trends but it is still limited as you cannot build complete dashboards with it. It is mainly a reporting tool.
Image for post
Image for post
Example of a Cost Explorer report — AWS documentation

Then you have several other external options to monitor the costs of your infrastructure:

  • SaaS products like Cloudyn / Cloudhealth. These solutions are really well made and will tell you how to optimize your infrastructure. Their pricing model is based on a percentage of your annual AWS bill, not on the savings that the tools will help you make, which was a show stopper for us.
  • The open source project Ice, initially developed by Netflix for their own use. Recently, the leadership of this project was transferred to the french startup Teevity who is also offering a SaaS version for a fixed fee. This could be a great option as it also handles GCP and Azure.

Building our own monitoring solution

We built a small Lambda function that ingests the detailed billing file into Redshift every day. This tool helps us slice and dice our data along numerous dimensions to dive deeper into our costs. We also use it to spot suspicious usage uptrends, down to the service level.

Image for post
Image for post
This is an example of our daily dashboard built with chart.io, each color corresponds to a service we tagged
Image for post
Image for post
When zoomed on a specific service, we can quickly figure out what is expensive

On top of that, we still use a spreadsheet to integrate the reservation upfronts in order to get a complete overview and the full daily costs.

Now that we have the data, how to optimize?

1 - Reserved Instances (RIs)

At Teads our reservation strategy is based on bi-annual reservation batches and we are also evaluating higher frequencies (3 to 4 batches per year).

The right frequency should be determined by the best compromise between flexibility (handling growth, having leaner financial streams) and the ability to manage the reservations efficiently.
In the end, managing reservations is a time consuming task.

Reservation is mostly a financial tool, you commit to pay for resources during 1 or 3 years and get a discount over the on-demand price:

  • You have two types of reservations, standard or convertible. Convertible lets you change the instance family but comes with a smaller discount compared to standard (avg. 75% vs 54% for a convertible). They are the best option to leverage future instance families in the long run.
  • Reservations come with three different payment options: Full Upfront, Partial Upfront, and No Upfront. With partial and no upfront, you pay the remaining balance monthly over the term. We prefer partial upfront since the discount rate is really close to the full upfront one (e.g. 56% vs 55% for a convertible 3-year term with partial).
  • Don’t forget that you can reserve a lot of things and not only Amazon EC2 instances: Amazon RDS, Amazon Elasticache, Amazon Redshift, Amazon DynamoDB, etc.

2 - Optimize Amazon S3

The Object Lifecycle option enables you to set simple rules for objects in a bucket :

  • Infrequent Access Storage (IAS): for application logs, set the object storage class to Infrequent Access Storage after a few days.
    IAS will cut the storage cost by a factor of two but comes with a higher cost for requests.
    The main drawback of IAS is that it uses 128kb blocks to store data so if you want to store a lot of smaller objects it will end up more expensive than standard storage.
  • Glacier: Amazon Glacier is a very long term archiving service, also called cold storage.
    Here is a nice article from Cloudability if you want to dig deeper into optimizing storage costs and compare the different options.

Also, don’t forget to set up a delete policy when you think you won’t need those files anymore.

Finally, enabling a VPC Endpoint for your Amazon S3 buckets will suppress the data transfer costs between Amazon S3 and your instances.

3 - Leverage the Spot market

Spot instances are bought using some sort of auction model, if your bid is above the spot market rate you will get the instance and only pay the market price. However these instances can be reclaimed if the market price exceeds your bid.

At Teads, we usually bid the on-demand price to be sure that we can get the instance. We only pay the “market” rate which gives us a rebate up to 90%.

It is worth noting that:

  • You get a 2 min termination notice before your spot is reclaimed but you need to look for it.
  • Spot Instances are easy to use for non critical batch workloads and interesting for data processing, it’s a very good match with Amazon Elastic Map Reduce.

4 - Data transfer

Whatever data you sent through that link was free of charge.

In the cloud, data transfer can grow to become really expensive.

You are charged for data transfer from your services to the Internet but also in-between AWS Availability Zones.

This can quickly become an issue when using distributed systems like Kafka and Cassandra that need to be deployed in different zones to be highly available and constantly exchange over the network.

Some advice:

  • If you have instances communicating with each other, you should try to locate them in the same AZ
  • Use managed services like Amazon DynamoDB or Amazon RDS as their inter-AZ replication costs is built-in their pricing
  • If you serve more than a few hundred Terabytes per months you should discuss with your account manager
  • Use Amazon CloudFront (AWS’s CDN) as much as you can when serving static files. The data transfer out rates are cheaper from CloudFront and free between CloudFront and EC2 or S3.

5 - Unused infrastructure

  • Detached Elastic IPs (EIPs), they are free when attached to an EC2 instance but you have to pay for it if they are not.
  • The block stores (EBS) starting with the EC2 instances are preserved when you stop your instances. As you will rarely re-attach a root EBS volume you can delete them. Also, snapshots tend to pile up over time, you should also look into it.
  • A Load Balancer (ELB) with no traffic is easy to detect and obviously useless. Still, it will cost you ~20 $/month.
  • Instances with no network activity over the last week. In a cloud context it doesn’t make a lot of sense.

Trusted Advisor can help you in detecting these unnecessary expenses.

Key takeaways

Image for post
Image for post

Thank you for reading. This article was inspired by the talks I made during the #2 AWS Montpellier Meetup and Devops D-Day conference.

Image for post
Image for post
Devops D-Day 2017 — Marseille

If you like working on big cloud infrastructures and growth challenges, feel free to contact us, we are constantly looking for great teammates.

If you want to know more about Engineering at Teads:

Teads Engineering

150+ innovators building the future of digital advertising

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store