Do you really know what your cloud apps are costing you? 2/2

Melina Schweizer
My Local Farmer Engineering
6 min read · Jun 8, 2021

Track down runaway cost in Cost Explorer

Now that we’ve set up a tagging practice and cost allocation tags, it’s time to use them to figure out what our AWS apps really cost and to keep tabs on any abnormal cost spikes draining our company’s resources.

Disclaimer
I Love My Local Farmer is a fictional company inspired by customer interactions with AWS Solutions Architects. Any stories told in this blog are not related to a specific customer. Similarities with any real companies, people, or situations are purely coincidental. Stories in this blog represent the views of the authors and are not endorsed by AWS.

Up until now we have been using Cost Explorer, an AWS service for cost analysis, to look at our costs. To do so, we enabled Cost Explorer following these instructions a few months ago. We did the tutorial, which taught us the basics, but over time, as we experimented with slicing and dicing the data, we discovered that this service can do much more than basic cost reporting.

For the first months, we kept track of the costs broken down by account (using Group By=Linked Account) and using the Daily view (click on the Monthly dropdown to switch) in order to see trends. We looked for unexpected cost spikes and made sure that any temporary increases went back to their baseline.
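If you’d rather pull the same numbers programmatically (to feed a dashboard of your own, for example), here is a minimal sketch of the equivalent query using the Cost Explorer API via boto3. It assumes credentials on the payer account with ce:GetCostAndUsage permission; the date range is just an example, and swapping the GroupBy key to SERVICE reproduces the per-service view we describe further down.

```python
# Minimal sketch: the API equivalent of "Daily view, Group By=Linked Account".
# Assumes payer-account credentials with ce:GetCostAndUsage; dates are placeholders.
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer is served from us-east-1

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-05-01", "End": "2021-06-01"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    # Swap the Key to "SERVICE" to reproduce the Group By=Service view
    GroupBy=[{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
)

# Print the daily cost per linked account so spikes stand out in plain text
for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        account_id = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f'{day["TimePeriod"]["Start"]} {account_id}: ${amount:.2f}')
```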

Traffic patterns and their impact on cost

Below you can see that our production costs decreased in February, which is expected as the weather turns very cold and our sales go down:

Production costs decreased in February, as sales go down

We also found the line chart to be especially useful for discerning spikes, since bar charts might hide an increase in one service with a decrease in another. Below you can easily see the cost impact of the traffic spike during our Winter Vegetable Sale event on April 1st:

Cost impact of the traffic spike during our Winter Vegetable Sale event on April 1st

We also looked at the breakdown by AWS service (using Group By=Service) with the Daily view, to see if there were any unexpected cost increases for a particular service, or perhaps even a service that we weren’t expecting at all.

Who caused the big bill?

In one particular case, we noticed a cost increase for an account. To see what was causing the increase, we added a filter for just that account (choose a Linked Account in the Filters section on the far right). We then added a Group By=Service. Now we could see that the increases around June were in EC2 (purple bar) and database costs (green bar):

significant upward trend in cost for EC2 and RDS database instances

To investigate further, we added a filter of Service=EC2 and then a GroupBy=InstanceType. This allowed us to see that very large EC2 instances were being used. We repeated this with Service=RDS and observed the same. A quick investigation with the responsible team revealed the culprit: a new (expected) workload. The developers had chosen EC2 and RDS instance types with enough storage and performance to meet expected demand for the next 5 years, which led them to pick expensive instances.
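For the curious, the same drill-down can be sketched with the Cost Explorer API: an And filter narrowing to one linked account and the EC2 compute service, then grouping by instance type. The account ID and dates below are placeholders.

```python
# Hypothetical drill-down: filter to one linked account and to EC2, group by instance type.
# Account ID and dates are placeholders; cost amounts come back as strings in USD.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-06-01", "End": "2021-07-01"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={
        "And": [
            {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": ["111122223333"]}},
            {"Dimensions": {"Key": "SERVICE",
                            "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
        ]
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "INSTANCE_TYPE"}],
)

for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        print(day["TimePeriod"]["Start"], group["Keys"][0],
              group["Metrics"]["UnblendedCost"]["Amount"])
```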

Although projecting server needs well into the future is normal for on-premises infrastructure, it is unnecessary for EC2 and RDS instances, since they can easily be resized to a bigger instance type with only a small outage penalty, sometimes even in seconds. The team was instructed to recalculate how much computing power they would need for the next 6 months instead, and then downsized their instances to what turned out to be a fraction of the compute and memory footprint they had originally selected, driving down the cost of their project considerably.
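For reference, a resize is just a stop/modify/start cycle on the EC2 side. Here is a rough boto3 sketch with a placeholder instance ID and target type; RDS has an analogous modify_db_instance call that takes a DBInstanceClass.

```python
# Rough sketch of resizing an EC2 instance (instance ID and target type are placeholders).
# The instance is unavailable between stop and start, which is the "small outage penalty".
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"   # hypothetical instance

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Switch to a smaller (or larger) instance type
ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "m5.large"})

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```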

Spotting trends with your new BFF: the Cost Explorer Dashboard

What became especially useful was the Cost Explorer Dashboard (Home link on the left navbar), which compares this month’s usage with last month’s and puts its findings in the “trend” section. We have someone glance at it on the main payer account on a weekly basis to see if anything stands out. Some cost-focused dev teams even look at it on a monthly basis in their individual accounts.

Cost Explorer dashboard with trends section highlighted
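The dashboard does this comparison for us, but here is a rough sketch of the same month-over-month check done per service with the Cost Explorer API, in case you want to wire it into a script of your own; the 20% threshold and the dates are arbitrary examples.

```python
# Sketch of the month-over-month comparison the dashboard "trend" section shows,
# done per service via the Cost Explorer API. Dates and threshold are placeholders.
import boto3

ce = boto3.client("ce")

def monthly_costs_by_service(start, end):
    """Return {service_name: cost_in_usd} for one calendar month."""
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    month = resp["ResultsByTime"][0]
    return {g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"])
            for g in month["Groups"]}

last_month = monthly_costs_by_service("2021-05-01", "2021-06-01")
this_month = monthly_costs_by_service("2021-06-01", "2021-07-01")

# Flag services whose spend changed by more than 20% month over month
for service, cost in sorted(this_month.items(), key=lambda kv: -kv[1]):
    previous = last_month.get(service, 0.0)
    if previous and abs(cost - previous) / previous > 0.20:
        print(f"{service}: {previous:.2f} -> {cost:.2f} USD")
```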

In March of last year, it helped us identify that a significant increase in EC2 costs came from load-testing EC2 instances left idling after the tests were done. We shut them down around mid-March so that they didn’t incur any more unnecessary costs, and we are looking into automatically catching these ‘surprises’ via AWS Budgets.
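We haven’t settled on the exact setup yet, but as a sketch, a simple monthly cost budget with an email alert would look roughly like this with boto3; the account ID, budget amount, and email address are placeholders.

```python
# Hedged sketch: a monthly cost budget that emails an alert at 80% of the limit,
# one way to catch idle-resource "surprises" automatically.
# Account ID, budget amount, and address are placeholders.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="111122223333",
    Budget={
        "BudgetName": "monthly-cost-guardrail",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80,            # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@example.com"}],
        }
    ],
)
```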

Spotting the rebels

A month after we activated the cost allocation tags, we checked the breakdown by the servicename tag (in the GroupBy section, choose More → Tag → servicename, then click on the Monthly dropdown and choose Daily). Unfortunately, with December vacations on the horizon, we still had a large majority of workloads without the servicename tag applied (see below).

lagging adoption of servicename tags
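As a side note, the same breakdown can be pulled through the API by grouping on the servicename tag; untagged spend comes back with an empty tag value, which is how we measure adoption. A sketch, with example dates:

```python
# Sketch: group daily costs by the servicename cost allocation tag.
# Untagged spend comes back with an empty value after the "$" separator.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-12-01", "End": "2021-01-01"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "servicename"}],
)

for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        key = group["Keys"][0]                      # e.g. "servicename$checkout"; "servicename$" if untagged
        value = key.partition("$")[2] or "UNTAGGED"
        cost = group["Metrics"]["UnblendedCost"]["Amount"]
        print(day["TimePeriod"]["Start"], value, cost)
```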

In January, we resumed reminding teams on the weekly company call to sort their tags out.

..another cost spike? Again??

By February, the majority of the workloads had their tags applied, and we started to get a nice breakdown of our costs by application. Thanks to this new way of slicing and dicing the data, we noticed an unexpected uptick in a cicd workload in March and asked the corresponding team to investigate. The bar graph below lets us see the cumulative charges, but we also switch to line graphs whenever we want to see trends within a particular tag.

Using the servicename tag to track which apps have unexpected cost increases

Out with the old, in with the new

Another thing that caught our eye was that a few previous generation EC2 instances (c3) had found their way into our AWS accounts, under the false assumption that they would be cheaper than current generation instances (c5). These showed up in Cost Explorer when applying Group By=Instance Type.

Current generation instances are similarly priced or even cheaper than their older counterparts, and they also offer better performance. The teams looked into how to upgrade, and over the next few weeks we monitored the migration by watching the usage hours of the previous generation EC2 instances go down (Filter by UsageType=BoxUsage) while those of the new instances went up.

You can also watch the cost of one diminish while the other goes up (GroupBy=InstanceType):

monitoring the costs of previous generation instances going down while the newer generation instance cost goes up
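If you prefer hard numbers over eyeballing the chart, here is a sketch of the same check done on usage hours rather than cost: group EC2 usage by instance type and compare the c3 and c5 lines over time. Note that UsageQuantity mixes units unless you narrow the filter further (for example to the BoxUsage running-hours usage types), so treat this as an approximation; the dates are examples.

```python
# Sketch: track the c3 -> c5 migration by usage quantity per instance type.
# Dates are placeholders; see the caveat above about mixed units in UsageQuantity.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-04-01", "End": "2021-06-01"},
    Granularity="DAILY",
    Metrics=["UsageQuantity"],
    Filter={"Dimensions": {"Key": "SERVICE",
                           "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "INSTANCE_TYPE"}],
)

for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        instance_type = group["Keys"][0]
        quantity = group["Metrics"]["UsageQuantity"]["Amount"]
        if instance_type.startswith(("c3", "c5")):
            print(day["TimePeriod"]["Start"], instance_type, quantity)
```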

Conclusion

We’ve detailed the first steps of our cloud journey towards getting properly set up to analyze our costs, and provided a few examples of what to look for and the lessons that can be learned from them. We will probably pick up more tips and tricks as we gain experience and move more workloads onto the cloud, so don’t forget to check back every once in a while!

We’d love to hear from you, so please let us know if this post was good, bad, or ugly 😄

Useful Links

  • (Part 1/2 of this post) How to set up a proper tagging strategy for cost analysis:


Melina Schweizer is a Sr. Partner Solutions Architect at AWS and an avid do-it-yourselfer.