The Definitive Guide to Cloud Cost Optimization with Terraform

Michael Fonseca
Aug 18 · 14 min read

This article was cross-published on the HashiCorp Blog.

The Problem: An Engineer’s New Role as Cloud “Financial Controller”

If you’re reading this, chances are you work in DevOps (or some other type of Engineering) and are wondering: why on earth should I care about Cloud Cost Optimization? That’s not my job; I’m not in Finance, right? WRONG!

Engineers are the new cloud Financial Controllers. If you are interested in defining this new role, automating your newfound responsibility, and implementing a process for Cloud Cost Optimization with Terraform Cloud for Business & Enterprise, read on. Cloud Cost Optimization is important, and in this article we will address it in the context of an overall model of Cloud Cost Management (also known as FinOps).

Please note that the majority of features reviewed in this article focus on Terraform paid functionality such as Cost Estimation and Governance & Policy but the core use case around cost optimization can be achieved with open-source.

The New Cloud Financial Model

With the continuous shift to consumption-based cost models for infrastructure and operations, i.e., Cloud Service Providers (CSPs), you pay for what you use, but you also pay for what you provision and don’t use. If you do not have a process for continuous governance and optimization, then there is real potential for waste.

In a recent survey respondents stated:

  • 45% of the organizations reporting were over-budget for their cloud spending
  • More than 55% of the respondents are using either cumbersome manual processes, or simply do not implement actions and changes to optimize their cloud resources
  • 34.15% of respondents believe they can save up to 25% of their cloud spend and 14.51% believe they can save up to 50%. Even worse, 27.46% said, “I don’t know”.

First, let’s unpack why there is an opportunity, and then get to the execution.

In moving to the cloud, most organizations have put thought into basic governance models where a team, sometimes referred to as the Cloud Center of Excellence, looks over things like strategy, architecture, operations, and, yes, cost. Most of these teams contain a combination of IT Management and Cloud technical specialists from common IT domains and Finance. Finance is primarily charged with cost planning, migration financial forecasting, and optimization. Due to financial pressures, they tend to say “We need to do something about getting a handle on costs, savings, forecasting etc.” but have no direct control over costs. It is now Engineers that directly manage infrastructure and the costs.

The business case is simple: it is a financial paradigm shift in which:

  • Engineers are not only responsible for Operations but now also Costs.
  • Engineers now have the tools and capabilities to automate and directly impact cost controls.
  • Cost planning and estimation of running cloud workloads are not easily understood or forecasted by Finance.
  • Traditional forms of financial budgeting and on-prem hardware demand planning (such as contract-based budgets and capitalized purchases) do not account for cost variability in consumption-based models.

Finance lacks control in the two primary areas of cost-saving:

  • Pre-provisioning: Limited governance and control in the resource provisioning phase.
  • Post-provisioning: Limited governance and control in enforcing infrastructure changes for cost savings.

In the rest of this article, we will define the people, processes, and technologies associated with managing cloud financial practices with Terraform.

The People

To simplify things, we will assume there is some sort of team, i.e., the Cloud Center of Excellence, that is responsible for managing the overall cloud posture.

On this team there are four core roles:

  • IT Management
  • Finance
  • Engineering (consisting of DevOps and Infrastructure & Operations)
  • Security

In managing the “Cost of Cloud”, we will view Engineering’s role in the management of costs in the following context:

  • Planning — relating to Pre-Cloud Migration & Ongoing Cost Forecasting
  • Optimizing — Operationalizing and Realizing Continuous Cost Savings
  • Governance — Ensuring Future Cost Savings & Waste Avoidance

The following RASCI model can be used as a baseline of expectations for your team. Moving forward, we will focus on Engineering’s role of “Responsibility” in these three primary areas.

[Image: Cloud Center of Excellence RASCI Model]

As you can see, Engineering has a higher level of responsibility in today’s infrastructure operations. I have used this and similar models to define the roles and responsibilities of the Cloud Center of Excellence for many organizations. The RASCI model is effective, but make sure to also account for Frequency and Workflow, as these will differ from current IT cost models and you will want to set expectations accordingly.

The Process — Planning, Optimization, & Governance

Now we are going to take a look at how Engineers can use Terraform at each level of the Cloud Cost Management process to deliver value and minimize additional work. To get started, below is a visualization of how Terraform fits into the Cloud Cost Management Lifecycle.

[Image: Cloud Cost Management Lifecycle with Terraform]

The overall process can be summed up as:

  • Start by identifying workloads migrating to the cloud
  • Create Terraform configuration
  • Run terraform plan to perform cost estimation
  • Run terraform apply to provision the resources
  • Once provisioned, workloads will run and Vendors will provide Optimization Recommendations
  • Integrate Vendor’s Optimization Recommendations into Terraform and/or CI/CD pipeline
  • Investigate/analyze Optimization Recommendations and implement Terraform Sentinel for Cost & Security Controls
  • Update Terraform configuration and run plan & apply
  • Newly optimized and compliant resources are now provisioned

Section 1 — Planning — Pre-Migration & Ongoing Cost Forecasting

Cloud migrations require a multi-point assessment to determine the potential to move an application/workload to the cloud. Primary factors for the assessment are architecture, business case, the estimated cost for the move, and the ongoing utilization costs budgeted/forecasted for the next 1–3 years on average.

Old models of capitalization and amortization for application/workload costing done by Finance are a thing of the past, and now Engineering is responsible for managing operational costs. With Terraform, users can more clearly communicate expected costs with Terraform’s Cost Estimation functionality.

Using Terraform configuration files as a standard definition of how an application/workload is costed, you can now use the Terraform Cloud & Enterprise APIs to automatically supply Finance with estimated cloud financial data, or use Terraform’s user interface to give Finance direct access to review costs. Either way, you eliminate manual engineering oversight.

Planning Recommendations:

  • Use Terraform configuration files as the standard definition of costing across AWS, Azure & GCP for cloud cost planning and forecasting. Provide this information through the Terraform API, or through role-based access controls within the Terraform user interface, to give Finance staff a self-service workflow.
  • Note: Many organizations conduct planning within Excel, Google Sheets, and Web-based tools. To make data usable within these systems we would recommend using Terraform’s Cost Estimates API to extract the data.
  • Use Terraform Modules as standard units of defined infrastructure for high-level cost assessments and cloud demand planning. For example, define a standard set of modules for a standard Java application so that module A + B + C = $X per month. If we plan to move 5 Java apps this year, the estimated run cost is 5 × $X per month. This is a quick way to assess potential application run costs before defining the actual Terraform configuration files.
  • Use Terraform to understand application/workload financial growth over time, i.e., cloud sprawl costs.
  • Attempt to structurally align Terraform Organization, Workspace, and Resource naming conventions to the financial budgeting/forecasting process.
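
The module-based estimate in the recommendations above is simple arithmetic, and it can be kept in Terraform itself. The following is a minimal sketch, assuming three hypothetical modules with made-up monthly costs (none of these figures come from a real price list). It needs Terraform 0.13 or later for the sum function.

```hcl
# Hypothetical planning figures: the module names and dollar amounts are
# assumptions for illustration only, not real pricing.
locals {
  java_app_module_costs = {
    network  = 40  # module A
    compute  = 180 # module B
    database = 95  # module C
  }

  # $X per month for one standard Java application: 40 + 180 + 95 = 315
  java_app_monthly_cost = sum(values(local.java_app_module_costs))

  # Number of Java applications planned for migration this year.
  planned_java_apps = 5

  # 5 x 315 = 1575 per month once all planned apps are running.
  planned_monthly_run_cost = local.java_app_monthly_cost * local.planned_java_apps
  planned_annual_run_cost  = local.planned_monthly_run_cost * 12
}

output "planned_monthly_run_cost" {
  description = "Rough monthly run cost for all planned Java applications."
  value       = local.planned_monthly_run_cost
}

output "planned_annual_run_cost" {
  description = "Rough annual run cost for all planned Java applications."
  value       = local.planned_annual_run_cost
}
```

Once real configuration exists, the per-module figures could be replaced with numbers taken from actual Cost Estimation runs.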

Getting started with Terraform Cost Estimation is easy, and the basic steps for Terraform Cloud for Business & Enterprise are provided in our Learn Guide. Once enabled, when a Terraform Plan is run, Terraform will reach out to the AWS, Azure, and GCP cost estimation APIs to present the estimated cost for that Terraform Plan, which can then be used within your financial workflow.

Example of Cost Estimation Output in Terraform

[Image: Terraform Cost Estimation]

Example of the Cost Estimation API JSON Payload from Terraform

[Image: Terraform Cost Estimation API JSON Payload]

Note that Terraform Cost Estimation reports costs per workspace. If you would like a higher-level, cross-workspace view, you will need to use the Terraform Cost Estimation API and a reporting tool of your choice. To give it a try, there is a great little project named Tint from Peyton Casper, a HashiCorp Senior Solutions Engineer. The blog post “Multi-Cloud Cost Visualization for Terraform” explains how to get started with Tint, and the project is hosted on GitHub at peytoncasper/tint. In reality, any standard corporate reporting tool (e.g., Microsoft Power BI, Tableau, etc.) will work, depending on in-house requirements.

[Image: Example Dashboard from Tint]
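
For the cross-workspace view described above, a dedicated "reporting" configuration could pull several cost estimates and add them up. This is only a rough sketch under several assumptions: the cost estimate IDs are collected elsewhere and passed in through a hypothetical cost_estimate_ids variable, the estimates are read from the /api/v2/cost-estimates/:id endpoint, and each response carries a proposed-monthly-cost attribute (please verify both against the current API documentation). It uses the hashicorp/http provider, version 3 or later.

```hcl
terraform {
  required_providers {
    http = {
      source  = "hashicorp/http"
      version = ">= 3.0"
    }
  }
}

variable "tfc_hostname" {
  description = "Terraform Cloud or Terraform Enterprise hostname."
  type        = string
  default     = "app.terraform.io"
}

variable "tfc_token" {
  description = "API token with read access to the workspaces being reported on."
  type        = string
  sensitive   = true
}

# Hypothetical input: workspace name => cost estimate ID, gathered elsewhere.
variable "cost_estimate_ids" {
  description = "Cost estimate IDs keyed by workspace name."
  type        = map(string)
}

# One HTTP GET per workspace.
data "http" "cost_estimate" {
  for_each = var.cost_estimate_ids

  url = "https://${var.tfc_hostname}/api/v2/cost-estimates/${each.value}"

  request_headers = {
    Authorization = "Bearer ${var.tfc_token}"
  }
}

locals {
  # Assumed response shape: data.attributes["proposed-monthly-cost"] as a string.
  monthly_cost_by_workspace = {
    for workspace, estimate in data.http.cost_estimate :
    workspace => tonumber(jsondecode(estimate.response_body).data.attributes["proposed-monthly-cost"])
  }

  # The leading 0 keeps sum() from failing when no estimates are supplied.
  total_monthly_cost = sum(concat([0], values(local.monthly_cost_by_workspace)))
}

output "monthly_cost_by_workspace" {
  value = local.monthly_cost_by_workspace
}

output "total_monthly_cost" {
  value = local.total_monthly_cost
}
```

A script or reporting tool calling the same endpoint is probably the more common route; the Terraform version is shown only to stay in one language.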

Section 2 — Optimizing — Operationalizing and Realizing Continuous Cost Savings

Optimization is the continued practice of evaluating the cost-benefit ratio of usage vs. provisioned resources and then adjusting that ratio to be most advantageous to your organization.

That said, many organizations have access to optimization recommendations from Cloud Service Providers such as AWS, Azure, or GCP or popular third-party tools. The main challenge of using these optimization tools is that organizations are not properly taking advantage of the recommendations.

What we see is a disconnect from the Engineering/DevOps workflow (CI/CD pipeline): Engineering does not engage with these optimization systems, so there is no feedback mechanism. Even when Engineering does engage, consuming the recommendations involves a high level of manual intervention.

Automating Optimization Insights into the Provisioning Workflow

It is safe to say that the major CSPs (AWS, Azure, GCP) and the vast majority of third-party tools provide access to export optimization recommendations via an API or an alternative method. For the purposes of this guide, we are going to focus on the basic steps/approach to automate the process of ingesting optimization recommendations which will come directly from the CSPs or from third parties such as Densify who maintain a Terraform Module.

[Image: Densify EC2 Optimization Example]

The concepts and code can be used as a model for your own deployment. Please note that each Vendor provides a different set of recommendations, but universally all provide insights on compute, so we will focus on compute as a norm, but any insight that you receive can be consumed based on the pattern below (e.g. compute, storage, DB, etc.).

Basic patterns for consuming optimization recommendations

Establish a mechanism for Terraform to access the optimization recommendations. We see several common patterns:

  • Manual Workflow: Review the optimization recommendations in the provider’s portal and manually update the Terraform files. Note: this is not optimal because there is no automation, but a feedback loop for optimization must start somewhere!
  • File Workflow: Create a mechanism where optimization recommendations are imported into a local repository via a scheduled process (usually daily).
      • For instance, Densify customers use a script to export recommendations into a densify.auto.tfvars file, which is downloaded and stored in a locally accessible repository.
      • The Terraform lookup function is then used to look up specific optimization updates that have been set as variables.
  • API Workflow: Create a mechanism for optimization recommendations to be extracted directly from the Vendor and stored within an accessible data repository, and use Terraform’s http data source to import and reference the dataset.
  • Ticketing Workflow: This workflow is similar to the File and API workflows, but some organizations insert an intermediary step where the optimization recommendations first go to a change control system like ServiceNow or Jira. These systems have built-in workflow & approval logic, where a flag is set for an acceptable change and passed as a variable to be consumed later in the process.
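
To make the API Workflow above more concrete, here is a minimal sketch using the http data source from the hashicorp/http provider (version 3 or later). The endpoint, the variable names, and the JSON shape (a map of system ID to settings, including a recommendedType key) are assumptions for illustration; your vendor or internal data store will differ.

```hcl
terraform {
  required_providers {
    http = {
      source  = "hashicorp/http"
      version = ">= 3.0"
    }
  }
}

# Hypothetical endpoint that serves the exported recommendations as JSON.
variable "recommendations_url" {
  description = "URL of the stored optimization recommendations."
  type        = string
}

variable "system_id" {
  description = "Unique ID of the system to look up in the recommendations."
  type        = string
}

variable "current_instance_type" {
  description = "Instance type to keep when no recommendation exists."
  type        = string
}

data "http" "recommendations" {
  url = var.recommendations_url

  request_headers = {
    Accept = "application/json"
  }
}

locals {
  # Assumed shape: { "<system id>": { "recommendedType": "t3.medium", ... } }
  recommendations = jsondecode(data.http.recommendations.response_body)

  # Use the recommendation when present; otherwise keep the current size.
  instance_type = try(
    local.recommendations[var.system_id].recommendedType,
    var.current_instance_type
  )
}

output "instance_type" {
  value = local.instance_type
}
```

The try function falls back to the current size whenever the system or the key is missing, which keeps a plan from failing on new infrastructure that has no recommendation yet.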

Terraform Code Update Examples

In any of these cases, especially if automation is to take place, it is important to maintain key pieces of resource data as variables. Optimization is a function of provisioned size and usage, and the optimization provider will recommend a size for the resource or service (e.g., Compute, DB, Storage) accordingly. We will use Compute as the example, but the pattern is representative of all of them.

At a minimum, it is recommended that you have three variables set to perform the optimization Terraform update with some basic logic. Those variables and logic being:

[Image: Terraform Variable update recommendations]

As an example, we will use Densify as a vendor-supported optimization process, but there are many HashiCorp customers & users that create their own Providers for similar processes. Their Terraform Module can be found via the Terraform Registry and the code found on GitHub Densify-dev.

In the following, you will see some basic updates of Terraform code with variables and logic to get you started. Below is an example of the variables created.

variable "densify_recommendations" {
  description = "Map of maps generated from the Densify Terraform Forwarder. Contains all of the systems with the settings needed to provide details for tagging as Self-Aware and Self-Optimization"
  type        = map(map(string))
}

variable "densify_unique_id" {
  description = "Unique ID that both Terraform and Densify can use to track the systems."
  type        = string
}

variable "densify_fallback" {
  description = "Fallback map of settings that are used for new infrastructure or systems that are missing sizing details from Densify."
  type        = map(string)
}

Next, you will see the Terraform lookup function used to look up updates/changes in the local optimization recommendations file (i.e., densify.auto.tfvars). The optimization recommendations can also be auto-delivered by Densify using Webhooks and subscription APIs.

locals {
  # Start from the fallback settings for this system, then let any Densify
  # recommendation for the same ID override them.
  temp_map     = merge(tomap({ (var.densify_unique_id) = var.densify_fallback }), var.densify_recommendations)
  densify_spec = local.temp_map[var.densify_unique_id]

  # Each lookup returns "na" when the key is missing.
  cur_type            = lookup(local.densify_spec, "currentType", "na")
  rec_type            = lookup(local.densify_spec, "recommendedType", "na")
  savings             = lookup(local.densify_spec, "savingsEstimate", "na")
  p_uptime            = lookup(local.densify_spec, "predictedUptime", "na")
  ri_cover            = lookup(local.densify_spec, "reservedInstanceCoverage", "na")
  appr_type           = lookup(local.densify_spec, "approvalType", "na")
  recommendation_type = lookup(local.densify_spec, "recommendationType", "na")
}
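
For reference, a densify.auto.tfvars file that these lookups could read might look like the sketch below. The keys match the ones used in the lookups above; the system ID, the instance types, the "Downsize" value, and all numbers are hypothetical and not taken from real Densify output.

```hcl
# densify.auto.tfvars (illustrative values only)

# Recommendations keyed by the unique ID shared between Terraform and Densify.
densify_recommendations = {
  "app-server-01" = {
    currentType              = "m4.large"
    recommendedType          = "t3.medium"
    approvalType             = "all"
    recommendationType       = "Downsize"
    savingsEstimate          = "32.85"
    predictedUptime          = "98.5"
    reservedInstanceCoverage = "0"
  }
}

# Used for new infrastructure or systems that have no recommendation yet.
densify_fallback = {
  currentType     = "m4.large"
  recommendedType = "m4.large"
}
```

With densify_unique_id set to "app-server-01", the recommended type resolves to "t3.medium". For an ID that is missing from densify_recommendations, the fallback map is used and any key it lacks resolves to "na".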

Lastly, you will want to insert some logic to ensure that you are properly handling the usage reference, i.e., if an approved recommendation is available, use it; otherwise keep the current size. Note: Densify also adds some code as part of a change control process for their customers that are using ServiceNow or Jira (but this can be any change control/ticketing system). They have an option to first pass the optimization recommendation to one of these external systems for approval, and then pass an approval flag in as a variable to ensure that it is an approved change.

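locals {
  # No sizing data at all:           "na"
  # Recommendation is "Terminate":   keep the current type
  # Approval is "all" or names the
  # recommended type:                use the recommended type
  # Anything else:                   keep the current type
  instance_type = (
    local.cur_type == "na" ? "na" :
    local.recommendation_type == "Terminate" ? local.cur_type :
    local.appr_type == "all" ? local.rec_type :
    local.appr_type == local.rec_type ? local.rec_type :
    local.cur_type
  )
}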

For customers not using or not wanting a third-party approval system, the recommended changes will be visible in the Terraform plan output. They can also manually update a variable, such as appr_type = false, to avoid using the recommendation, or use similar methods such as feature flags and conditional expressions in Terraform to control applied functionality.
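
A feature flag of the kind mentioned above can be as small as one boolean variable and one conditional expression. The sketch below is independent of the Densify module, and all three variable names are hypothetical.

```hcl
# Hypothetical feature flag: set to false to ignore recommendations entirely.
variable "apply_recommendations" {
  description = "Whether optimization recommendations should be applied."
  type        = bool
  default     = false
}

variable "current_instance_type" {
  description = "Instance type currently provisioned."
  type        = string
}

variable "recommended_instance_type" {
  description = "Instance type suggested by the optimization provider, or \"na\" if none."
  type        = string
  default     = "na"
}

locals {
  # Use the recommendation only when the flag is on and a recommendation exists.
  instance_type = (
    var.apply_recommendations && var.recommended_instance_type != "na"
    ? var.recommended_instance_type
    : var.current_instance_type
  )
}

output "instance_type" {
  value = local.instance_type
}
```

Defaulting the flag to false means a recommendation only takes effect after someone opts in, for example with -var="apply_recommendations=true" on a reviewed run.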

The important point is that we now have a defined process, which can be partially or fully automated, for changing our environment to optimize and save.

Section 3 — Governance — Ensuring Future Cost Savings

The last and critical component of the Cloud Cost Management Lifecycle is governance: how do we stop costs from overrunning again, and how do we ensure a continuous feedback loop for control? I have had this conversation with many organizations that have done optimization exercises, only to see costs shoot back up, turning the effort into a game of waste and delayed recovery. So let’s focus on waste avoidance from the start.

So how do we do that with Terraform Cloud & Enterprise? It is with Sentinel, a product embedded within Terraform for governance & policy. In the following steps, it is assumed that you will apply learnings from the optimization recommendations in order to apply policy for cost control.

Cost Compliance as Code = Sentinel Policy as Code

Terraform Sentinel is a Policy as Code engine that evaluates the resources that Terraform is managing against policy definitions. Sentinel can be used to define policy on any and all data defined within a Terraform file. Common uses of Sentinel are to ensure that provisioned resources are secure, tagged, and within allowable usage policies and cost.

Specifically focusing on costs, Terraform customers implement policy around three primary areas (but there is no limit, so you can get creative).

Cost Control Areas:

  • Amount — Control the amount of spend
  • Provisioned size — Control the size/usage of the resource
  • Time to live — Control the time to live of the resource

In all three of these areas, you are able to apply policy or controls around things like Terraform Workspaces (e.g. apps/workloads), environments (e.g. prod, test, dev), and tags to ensure that spend and controls are aligned to optimize resources and avoid unnecessary spend.

The following is an example Sentinel policy output when running Terraform Plan. We will focus on three policies:

  1. passed — aws-global/limit-cost-by-workspace-type
  2. advisory failed — aws-compute-nonprod/restrict-ec2-instance-type
  3. passed — aws-global/enforce-mandatory-tags

Note: Sentinel has three Enforcement Levels: Advisory, Soft-Mandatory, and Hard-Mandatory. Please refer to the Sentinel documentation for definitions. The Enforcement Level dictates the workflow and resolution of policy violations. In addition, please note that Terraform Cloud for Business & Enterprise is fully API enabled: users may interact with Terraform through the UI, the CLI, or the API to integrate policy workflow control into their CI/CD pipelines, and use VCS systems such as GitLab, GitHub, and Bitbucket for policy creation and management.

[Image: Terraform Policy Check]
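
The three policies and their enforcement levels listed above would typically be tied together in a policy set configuration file (sentinel.hcl). A minimal sketch follows; the source paths are assumptions about how the policy files might be laid out in the policy repository.

```hcl
# sentinel.hcl: policy set configuration (source paths are hypothetical)

# Can be overridden by an authorized user when the check fails.
policy "limit-cost-by-workspace-type" {
  source            = "./limit-cost-by-workspace-type.sentinel"
  enforcement_level = "soft-mandatory"
}

# Reports a failure but never blocks the run.
policy "restrict-ec2-instance-type" {
  source            = "./restrict-ec2-instance-type.sentinel"
  enforcement_level = "advisory"
}

# Must pass; cannot be overridden.
policy "enforce-mandatory-tags" {
  source            = "./enforce-mandatory-tags.sentinel"
  enforcement_level = "hard-mandatory"
}
```

Keeping this file in version control alongside the policies makes a change of enforcement level reviewable like any other code change.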

Sentinel Cost Compliance Code Examples

In the aws-global/limit-cost-by-workspace-type policy defined for this Workspace (which can be individual or globally defined), we have applied monthly limits on how much spend can be provisioned by this Workspace and also an enforcement level. A snippet of the policy is visible below where we have defined a Monthly limit of cost (Dev = $200) and an enforcement level. Again depending on enforcement level users will need to address the policy violation accordingly, but most important is that we have a mechanism to control costs before those resources are provisioned.

Sentinel Cost Compliance — Monthly Limits

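##### Monthly Limits #####
# Snippet from the policy code; decimal.new needs the decimal import.
import "decimal"

limits = {
  "dev":   decimal.new(200),
  "qa":    decimal.new(500),
  "prod":  decimal.new(1000),
  "other": decimal.new(50),
}

# Snippet from the policy set configuration (sentinel.hcl)
policy "limit-cost-by-workspace-type" {
  enforcement_level = "soft-mandatory"
}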

Sentinel Cost Compliance — Instance Types

As a second example, for a multitude of reasons including compliance and costs many customers will restrict what compute instance types can be provisioned and potentially configuration limits based on environment or team. An example of the full policy can be seen here: aws-compute-nonprod/restrict-ec2-instance-type. In the example below, we have a policy that controls instance sizes on non-prod environments to ensure lower costs in these areas and we can apply a different policy to production if we so choose.

# Allowed EC2 Instance Types
# We don't include t2.medium or t2.large as they are not allowed in dev or test environments
allowed_types = [
  "t2.nano",
  "t2.micro",
  "t2.small",
]

# Snippet from the policy set configuration (sentinel.hcl)
policy "restrict-ec2-instance-type" {
  enforcement_level = "advisory"
}
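
For teams on open-source Terraform without Sentinel, a similar guard can be approximated inside a module with a variable validation block (Terraform 0.13 or later). This is a sketch of a related technique, not the Sentinel policy itself: it only protects modules that declare it, whereas Sentinel evaluates every run in the workspace. The variable name is hypothetical.

```hcl
# Module-level guard that mirrors the allowed_types list above.
variable "instance_type" {
  description = "EC2 instance type for non-production environments."
  type        = string
  default     = "t2.micro"

  validation {
    condition     = contains(["t2.nano", "t2.micro", "t2.small"], var.instance_type)
    error_message = "The instance_type must be one of t2.nano, t2.micro, or t2.small in non-production environments."
  }
}

output "instance_type" {
  value = var.instance_type
}
```

A disallowed value such as t2.large makes terraform plan fail with the error message before anything is provisioned.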

Sentinel Cost Compliance — Enforce Tagging

Lastly, tagging is a critical factor in understanding costs. Tagging enables you to group, analyze, and provide policy around cost optimization.

Terraform with Sentinel provides the capability to enforce tagging at provisioning and during updates to ensure that optimization can be targeted and governed. Tagging is managed in a simple key/value format and can be enforced across all CSPs. The sample policy excerpted below enforces tags on AWS, but Sentinel is highly flexible and can be similarly configured for any cloud provider.

### List of mandatory tags ###
mandatory_tags = [
  "Name",
  "ttl",
  "owner",
  "cost center",
  "appid",
]

# Snippet from the policy set configuration (sentinel.hcl)
policy "enforce-mandatory-tags" {
  enforcement_level = "hard-mandatory"
}
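
On the configuration side, one way to satisfy a policy like this on AWS is the provider-level default_tags block (available in recent versions of the AWS provider), with the per-resource Name tag set on each resource. The sketch below assumes hypothetical tag values and variable names; only the tag keys come from the list above.

```hcl
variable "region" {
  description = "AWS region to deploy into."
  type        = string
  default     = "us-east-1"
}

variable "ami_id" {
  description = "AMI to launch (hypothetical input)."
  type        = string
}

provider "aws" {
  region = var.region

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      ttl           = "72h"           # hypothetical value
      owner         = "platform-team" # hypothetical value
      "cost center" = "cc-1234"       # hypothetical value
      appid         = "java-app-01"   # hypothetical value
    }
  }
}

resource "aws_instance" "example" {
  ami           = var.ami_id
  instance_type = "t2.micro"

  # Name is resource-specific, so it is set here rather than as a default.
  tags = {
    Name = "java-app-01-dev"
  }
}
```

Whether a given tag policy reads provider-level default tags depends on how the policy inspects the plan, so it is worth testing the combination before relying on it.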

Summary

Yes, this article is long, but the title did promise a “Definitive Guide”. The purpose is to articulate that accountability for Cloud Cost Management has shifted to Engineering. Engineering controls the mechanisms for costs and savings like never before.

In addition, as organizations continue to invest in Terraform, IaC, and Cloud Platforms, they can no longer operate with today’s siloed financial and operational processes. To enact savings, you need to enact change, and that is where Terraform comes in: reclaiming and optimizing resources as part of an ecosystem of solutions and feedback mechanisms.

If anyone has worked on projects in this space with Terraform that you would like to highlight or if you want more information on the subject, please feel free to reach out.

Note: Special thanks to Tony Pulickal — for insight and review

HashiCorp Solutions Engineering Blog
