Primary KPIs For Cloud Cost Optimization

Sethmcdonaldcode
4 min read · May 14, 2024

The benefits of moving operations to the cloud are undeniable. Beyond the practical advantages, it’s rare to find an organization for which cloud infrastructure doesn’t represent significant savings over managing on-premises hardware when the total cost of each is compared. Still, cloud operations often come with a bit of sticker shock. Unexpected charges are common for many reasons, and it’s easy to incur unnecessary costs, whether recurring or unexpected. Any organization will want to limit those unnecessary costs as much as possible.

Before we can try to combat wasted cloud spend, we have to know what we’re looking at. Simply deciding that your bill “shouldn’t be this high” isn’t an adequate starting point. This piece looks at some basic KPIs that go a long way towards shedding light on what exactly is going on. We have to understand how much each resource costs, what it is supposed to be doing, what it is actually doing, and what it all means. The first two points (how much things cost and what those things are supposed to be doing) are handled via AWS cost allocation tags. This article assumes that your resources are tagged with cost allocation tags, including tags for the usual suspects such as environment. To find out what things are actually doing and derive some meaning from it all, we can look at the following metrics:

  • System Utilization Rate: This is a general overall percentage that illustrates how much of your system is being used compared to the absolute most that your system could handle without auto scaling. It’s up to you to decide what to include here but it should be limited to the capacity that you pay for without accounting for any extra capacity that jumps into play in response to usage spikes, etc.
  • Formula: SUR = (Utilized Resource Capacity/Total Resource Capacity) * 100
  • Target Allocation Metric: This is how big a piece of the pie a resource or a group of resources takes every billing cycle. It is a general formula that you can tailor to see how much of your overall cloud cost is caused by any particular thing. If you have targeted a resource or group of resources for optimization over the longer term, this is often the metric to check every month to see how you’re doing. A smaller and smaller percentage every month means that whatever you are doing is working.
  • Formula: TAM = (Cost of Specified Resource or Resource Group/Total Cloud Cost Per Billing Cycle) * 100
  • Development Allocation Metric: A common application of the above metric is to understand what percentage of your cloud costs is spent on development and testing versus the actual delivered product. You want to include absolutely everything that wouldn’t be there if you decided to stop all development forever. That means dev and staging environments, feature environments, and all databases that are not production databases. You should not include anything associated with production, which also rules out any mechanisms that serve the needs of both development and production (deployment items, etc.).
  • Formula: DAM = (Total Costs of All Resources Not Associated With Production/Total Cloud Cost Per Billing Cycle) * 100
  • Wasted Resource Metric: A fairly simple and yet often a very eye-opening metric. It’s simply the percentage of your resources that have sat idle during any given billing cycle. Using this metric paired with the SUR above can start to give an idea of where true wasted cloud spend exists.
  • Formula: WRM = (Number of Idle Resources/Total Number of System Resources) * 100
  • Resource Inactivity Rate: How much of the time a particular resource or group of resources is actually active. Despite the name, the formula below measures the active share of the billing cycle, so a low value points to a mostly inactive resource. Keep in mind that some resources exist only to come to life under a particular set of circumstances that may or may not ever materialize (a standby instance, for example), so a low value is not always a sign of waste.
  • Formula: RIR = (Number of Seconds of Resource Activity/Total Number of Seconds in Billing Cycle) * 100
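  • Freemium Allocation Metric: What percentage of your resources is dedicated solely to freemium products. Because the formula counts resources rather than dollars, it only approximates the share of your cloud costs that goes towards freemium products.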
  • Formula: FAM = (Number of Resources Dedicated Solely to Any Freemium Product/Total Number of Resources) * 100
  • Freemium Allocated Cost: Roughly how much of your cloud costs is dedicated solely to freemium products. This is an estimate that assumes costs are spread evenly across resources.
  • Formula: FAC = (FAM (from above) / 100) * Total Cloud Costs
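The formulas above are simple enough to sketch in a few lines of Python. What follows is only an illustration under stated assumptions, not a definitive implementation: the `Resource` record, its field names, the `NON_PROD_ENVIRONMENTS` tag values, and the sample inventory are all hypothetical, and "idle" is taken to mean zero seconds of activity in the billing cycle. FAM is treated as a percentage, so it is divided by 100 before being multiplied by total cost to get FAC.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical "environment" tag values that count as development spend.
# Shared tooling (deployment items, etc.) is deliberately left out.
NON_PROD_ENVIRONMENTS = {"dev", "staging", "feature"}


@dataclass
class Resource:
    """One billed resource; every field name here is an assumption."""
    name: str
    cost: float            # cost for the billing cycle
    environment: str       # value of the environment tag
    used_capacity: float   # capacity actually used (any consistent unit)
    total_capacity: float  # paid-for capacity, excluding auto scaling headroom
    active_seconds: int    # seconds of activity during the billing cycle
    freemium_only: bool = False


def pct(part: float, whole: float) -> float:
    """Percentage helper that returns 0.0 instead of dividing by zero."""
    return 100.0 * part / whole if whole else 0.0


def sur(resources: Sequence[Resource]) -> float:
    return pct(sum(r.used_capacity for r in resources),
               sum(r.total_capacity for r in resources))


def tam(resources: Sequence[Resource], selected: Callable[[Resource], bool]) -> float:
    return pct(sum(r.cost for r in resources if selected(r)),
               sum(r.cost for r in resources))


def dam(resources: Sequence[Resource]) -> float:
    # DAM is TAM applied to everything tagged as non-production.
    return tam(resources, lambda r: r.environment in NON_PROD_ENVIRONMENTS)


def wrm(resources: Sequence[Resource]) -> float:
    idle = sum(1 for r in resources if r.active_seconds == 0)
    return pct(idle, len(resources))


def rir(resource: Resource, cycle_seconds: int) -> float:
    # As written in the formula: active seconds over total seconds.
    return pct(resource.active_seconds, cycle_seconds)


def fam(resources: Sequence[Resource]) -> float:
    return pct(sum(1 for r in resources if r.freemium_only), len(resources))


def fac(resources: Sequence[Resource]) -> float:
    return fam(resources) / 100.0 * sum(r.cost for r in resources)


CYCLE_SECONDS = 30 * 24 * 3600  # a 30-day billing cycle
inventory = [
    Resource("api", 600.0, "prod", 30.0, 100.0, 2_592_000),
    Resource("dev-db", 200.0, "dev", 5.0, 50.0, 648_000),
    Resource("staging-web", 100.0, "staging", 0.0, 50.0, 0),
    Resource("free-tier-api", 100.0, "prod", 15.0, 50.0, 1_296_000, freemium_only=True),
]
print(f"SUR {sur(inventory):.1f}%  DAM {dam(inventory):.1f}%  "
      f"WRM {wrm(inventory):.1f}%  FAM {fam(inventory):.1f}%  FAC ${fac(inventory):.2f}")
```

Writing DAM as a call to `tam` mirrors the text, where the development metric is described as one application of the general allocation formula. Swapping in a different selector gives the TAM for any other tag or resource group.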

Obviously, your own organization’s specific needs will have to be addressed using a more customized set of metrics. The ones that we’ve explored above make very good basic starting points and when viewed in combination with each other can reveal a lot about where your cloud dollars are going and whether they need to keep going there.
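As one example of viewing the metrics in combination, the sketch below pairs per-resource activity with cost to list likely waste while skipping resources that are meant to sit idle. It rests on assumptions: the `Usage` record, the `standby` flag (imagined here as coming from a tag), and the 5% activity threshold are all hypothetical choices to adapt to your own environment.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Usage(NamedTuple):
    """Hypothetical per-resource usage record for one billing cycle."""
    name: str
    cost: float
    active_seconds: int
    standby: bool  # True for resources that exist only to wait (e.g. failover)


def waste_candidates(
    usages: Sequence[Usage],
    cycle_seconds: int,
    max_active_pct: float = 5.0,  # assumed threshold, not a standard value
) -> List[Tuple[str, float, float]]:
    """Return (name, cost, active_pct) for low-activity, non-standby
    resources, most expensive first."""
    flagged = []
    for u in usages:
        active_pct = 100.0 * u.active_seconds / cycle_seconds
        if not u.standby and active_pct <= max_active_pct:
            flagged.append((u.name, u.cost, active_pct))
    return sorted(flagged, key=lambda row: row[1], reverse=True)


cycle = 30 * 24 * 3600
usage_report = [
    Usage("old-batch", 120.0, 0, False),
    Usage("dr-replica", 300.0, 0, True),      # idle by design, so skipped
    Usage("api", 600.0, 2_000_000, False),
    Usage("feature-env", 80.0, 3_600, False),
]
for name, cost, active_pct in waste_candidates(usage_report, cycle):
    print(f"{name}: ${cost:.2f} per cycle, active {active_pct:.2f}% of the time")
```

Sorting by cost puts the most expensive mostly idle resources first, which is usually where a review pays off soonest.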
