Do you really know what your cloud apps are costing you?

Published in

My Local Farmer Engineering

7 min read · Jun 1, 2021

Keeping tabs on your cloud costs, Part 1/2

The question “what is the Total Cost of Ownership for this app?” has plagued the IT world for many years. And it’s no surprise, because the answer gets more difficult to figure out as companies grow larger, and assets and human capital get shared across applications… which is exactly when the question gets asked.

Disclaimer
I Love My Local Farmer is a fictional company inspired by customer interactions with AWS Solutions Architects. Any stories told in this blog are not related to a specific customer. Similarities with any real companies, people, or situations are purely coincidental. Stories in this blog represent the views of the authors and are not endorsed by AWS.

Now that we have resources on the cloud, some of the guesstimating of on-prem app costs goes away, since we receive an AWS bill for the cost of our infrastructure. But since it’s now easier to see trends, other questions start popping up: we see costs sometimes increasing, but cannot always tell from the bill which resources are causing those increases.

This issue gets aggravated every now and then when we get unexpected increases in our bills as well. For example, a couple of times we had unexpected spikes of 15% and 22% in our AWS bill.

Now… we understand that the bill fluctuates based on the on-demand usage of our AWS resources, which is definitely a plus, as we no longer have to buy capacity for our data center to meet peak demand.

However… on one occasion we saw a cost increase even though we weren’t expecting an increase in traffic… so what had caused it? Which resources? Was the source still a legitimate increase in traffic? Were we getting hacked? Was a disgruntled employee standing up servers to mine Bitcoins? Did a developer stand up their own personal infrastructure on company resources?

There are two kinds of issues here: one is security related, and the other is about cost management. While our dedicated security team looked into the security issues, a new cross-functional team was formed to research our options for tracking costs. The team comprised a couple of financial analysts and several IT Directors, and they recommended following the guidelines below to help us track cost allocation.

Tag your apps, not your friends

To make sure that we can discern the cost between applications as well as other factors, we are going to pursue a tagging strategy.

But first, some background: a tag is a key/value pair assigned to AWS resources such as EC2 instances, RDS databases, or networking resources such as VPNs. Examples of tags are:

servicename: farmerapp
application: farmer-backend
release: 1.1.2

In the first example, servicename is the tag’s key and farmerapp is the tag’s value. Tags can be used for:

cost analysis, e.g. what is the cost for the farmerapp application across all environments and AWS accounts?
grouping resources that belong together, e.g. what are all the resources belonging to the farmerapp application?
scoping automation or management actions performed by other services, e.g. only upgrading EC2 servers tagged as belonging to the farmerapp application.
applying tag-based conditions to IAM policies, e.g. only allow farmerapp tagged resources to access an EBS volume’s data.
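
To make the cost analysis use case more concrete, here is a minimal sketch of grouping billed line items by the value of one tag. The LineItem shape, the costByTag helper, and the sample data (including the deliveryapp service) are hypothetical and made up for illustration; a real billing export has many more fields.

```typescript
// Hypothetical shape of one billed line item (illustration only).
interface LineItem {
  resourceId: string;
  costUsd: number;
  tags: Record<string, string>;
}

// Sum the cost per value of one tag key. Items without that tag are grouped
// under "(untagged)" so their cost stays visible instead of disappearing.
function costByTag(items: LineItem[], tagKey: string): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const item of items) {
    const value = item.tags[tagKey] ?? "(untagged)";
    totals[value] = (totals[value] ?? 0) + item.costUsd;
  }
  return totals;
}

// Made-up sample data.
const sampleItems: LineItem[] = [
  { resourceId: "i-001", costUsd: 40, tags: { servicename: "farmerapp", environment: "prod" } },
  { resourceId: "i-002", costUsd: 10, tags: { servicename: "farmerapp", environment: "test" } },
  { resourceId: "db-001", costUsd: 25, tags: { servicename: "deliveryapp", environment: "prod" } },
  { resourceId: "vol-001", costUsd: 5, tags: {} },
];

// Cost per application across all environments.
const costPerService = costByTag(sampleItems, "servicename");
// costPerService: { farmerapp: 50, deliveryapp: 25, "(untagged)": 5 }
```

The same helper answers the per-environment question by passing "environment" as the tag key. Keeping an explicit "(untagged)" bucket is a design choice: it shows how much spend is still unaccounted for.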

A tagging strategy is an agreement on a set of tags to use across all AWS resources company-wide, such that the tags can be used later for cost allocation by our financial team, or for resource management by the infrastructure and developer teams. Tags that are to be used for cost analysis need to match the type of financial reporting we usually do; we will cover this in the next section. We have decided that the following tags will be mandatory:

servicename
environment
teamemail
costcenter

Note: we failed to stress to our teams that tags are case-sensitive, and wound up with a few resources carrying camel-case versions of the tags. This rendered them useless for data mining or automation through other services, and the teams had to spend extra time manually correcting them.
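
A check along the following lines could have caught the casing problem early. This is only a sketch under our own assumptions: checkTags and TagReport are hypothetical names, and the tags are assumed to be available as a plain key/value object rather than fetched from AWS.

```typescript
// The four mandatory tag keys, in their exact (lowercase) spelling.
const MANDATORY_TAGS = ["servicename", "environment", "teamemail", "costcenter"];

interface TagReport {
  missing: string[];   // mandatory keys that are absent altogether
  wrongCase: string[]; // keys present only with different casing
}

// Compare a resource's tag keys against the mandatory list, case-sensitively.
function checkTags(tags: Record<string, string>): TagReport {
  const keys = Object.keys(tags);
  const result: TagReport = { missing: [], wrongCase: [] };
  for (const required of MANDATORY_TAGS) {
    if (keys.indexOf(required) !== -1) continue; // exact match, nothing to report
    // Look for the same key spelled with different casing.
    const lookalike = keys.filter((k) => k.toLowerCase() === required)[0];
    if (lookalike !== undefined) {
      result.wrongCase.push(lookalike);
    } else {
      result.missing.push(required);
    }
  }
  return result;
}

// Made-up example: one camel-case key and one missing key.
const tagReport = checkTags({
  serviceName: "farmerapp",
  environment: "prod",
  teamemail: "farmers@example.com",
});
// tagReport.wrongCase: ["serviceName"], tagReport.missing: ["costcenter"]
```

Reporting wrong-case keys separately from missing ones tells a team whether a tag needs to be renamed or added from scratch.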

The servicename tag will allow us to discern cost between applications and know which resources belong to which applications at a glance. This can help us track which applications and AWS resources seem unusually expensive or are rising in cost, so that we can make decisions on whether to optimize for cost or performance, or perhaps build or buy an alternative solution altogether.

The environment tag will help us analyze cost between dev, test, and production environments. This allows us to track costs for performance loads on the test environment, traffic spikes in production, or proofs of concept on the dev environment. We will be able to double-check that costs for the dev and test environments match development and testing cycles, which could help us decide whether to turn off infrastructure during nights and weekends. In fact, we have already reaped the benefits of this tag: one of our admins noticed that the cost of a test environment for an application went up drastically, then decreased slightly but never returned to its base level, not even after 3 weeks. An investigation uncovered that a performance load test had been executed. The test servers for that application had been scaled back down after the requests died down, but the cluster of large EC2 instances used to generate the thousands of requests had been left idling, incurring unnecessary costs.

The teamemail tag was included not for cost analysis, but for easily pinpointing whom to contact when questions arise concerning a resource. For example, a developer stood up resources a month ago on a shared account and forgot to terminate them. On another occasion, a developer stood up costly resources for a POC and left the company, which left the resources accumulating charges for several months until someone finally concluded that they were actually unnecessary. Having the resources tagged with a team email pinpoints who should know whether a resource is still needed, and saves us days of chasing our own tails.

The costcenter tag is used primarily by the project owners and financial team for knowing which budget to subtract workload costs from, and gives us a way of reporting costs up the chain of command.

The pain of getting there

Rolling out this tagging strategy was somewhat painful, as it required all developer teams to add the tags to all of their existing resources. Adding tags to all children of a CDK construct or a CloudFormation template is very easy: you can use Tags.of(myConstruct).add('key', 'value'); for CDK, or add the tags to a CloudFormation stack when deploying it. However, the rollout still took a couple of months to complete. There were competing priorities, a few people failed to take it seriously until the last minute, others fell through the cracks completely and had to be chased down by their manager, and some ran into resource types for which CloudFormation did not support adding tags programmatically, so they had to add them manually.

Besides those limitations, we noticed other ones as well. For example, some resources might not support tagging at all, and we had to think about what to do with resources that are shared between apps (think security, authentication, etc). For the latter we decided to leverage the servicename tag to depict such shared resources so that their costs don’t fall through the cracks.

All in all, rolling out a tagging strategy turned out to be a project in itself, with an informal “tagging compliance officer” appointed to make sure all teams complied for all of their resources. Once the teams gained momentum on applying the changes, they finished quickly and we were able to have a much more thorough insight into our cost and usage.

After the initial push of applying tags was done, we were left with the challenge of making sure people keep complying with the tagging strategy when spinning up new resources. We will cover this in a later blog post, so stay tuned…

Cost Allocation Tags

Once we had a tagging strategy in place, it was time to register the tags as “cost allocation tags” so that they could be used in cost and usage analysis, for example within the Cost Explorer service, which we will cover next. There are two types of cost allocation tags:

User defined: these are tags added to resources by users, either manually through the console or programmatically (e.g. through a CloudFormation stack). We are going to register all the tags that are part of our tagging strategy as cost allocation tags.

AWS generated: these are tags predefined by AWS and automatically added to resources when they are created. They are prefixed with “aws:”, e.g.:

aws:cloudformation:stack-name and aws:cloudformation:stack-id: information about the CloudFormation stack that created the resource
aws:createdBy: information about the IAM User, assumeRole, etc that created the resource

These are only visible within the Billing and Cost Management console, and we are going to register all the AWS-generated tags as cost allocation tags. Since these tags are populated on a best-effort basis, we will only use them as a fallback mechanism for resources that didn’t get tagged properly.
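
The fallback idea can be sketched as follows: prefer our own teamemail tag, and only fall back to the AWS-generated aws:createdBy tag when it is missing. The findContact helper, the Contact type, and the sample tag values are hypothetical and only for illustration; the tags are assumed to be available as a plain key/value object, for example from a billing report.

```typescript
// Who to ask about a resource, and which tag the answer came from.
interface Contact {
  value: string;
  source: "teamemail" | "aws:createdBy";
}

// Prefer the user-defined teamemail tag; use aws:createdBy only as a fallback.
function findContact(tags: Record<string, string>): Contact | undefined {
  const email = tags["teamemail"];
  if (email !== undefined && email.trim() !== "") {
    return { value: email, source: "teamemail" };
  }
  const createdBy = tags["aws:createdBy"];
  if (createdBy !== undefined && createdBy !== "") {
    return { value: createdBy, source: "aws:createdBy" };
  }
  return undefined; // nothing to go on: this resource needs manual follow-up
}

// Made-up tag values for illustration.
const taggedContact = findContact({ teamemail: "farmers@example.com", "aws:createdBy": "example-creator" });
const fallbackContact = findContact({ "aws:createdBy": "example-creator" });
const unknownContact = findContact({});
```

Returning the source alongside the value makes it easy to list the resources that are only covered by the fallback, and therefore still need proper tagging.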

In order to activate a tag as a cost allocation tag:
1. Go to the Billing and Cost Management console
2. In the navigation pane, choose Cost Allocation Tags
3. Select all the tags you would like to activate as cost allocation tags
4. Choose Activate (allow 24 hours for the activation to take place).

OK, now that the tedious part is done, you’re ready to learn what to do with all this! We will go through several examples using these tags which helped us recognize significant trends… and a few rookie mistakes.

How much do your cloud Apps really cost? 2/2

Track down runaway cost in Cost Explorer
