How to run a successful AWS Cost-Optimization Program

Keiran Holloway
8 min read · Mar 5, 2023

I’ve personally been involved in, and have run, many cost-saving initiatives across various cloud platforms. Some of these initiatives have been incredibly successful. In certain enterprise engagements, we were able to find over $1,000,000/month in savings, which represents a significant amount of waste in the system. I have discussed this topic extensively in various forms in the past.

That said, I have also seen cost optimization activities go awry on many occasions, and there are typically a small number of traits that separate very successful programs from those that fail. This blog post discusses the elements that most commonly exist in successful cost-saving programs.

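Cost optimization typically consists of a number of stages that are generally linear and look something like this:

1. Identify potential cost savings opportunities
2. Add context around these opportunities
3. Devise a high-level plan
4. Review with stakeholders and get commitment
5. Build out an implementation plan with low-level detail
6. Get approval
7. Implement the cost-saving recommendations
8. Record and report on cost savings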

This list is particularly high-level, and the exact process itself will largely vary from company to company. However, a successful cost optimization program will incorporate all of the above elements.

To provide a bit more substance, I am going to walk through each of the 8 phases below and discuss the actual activities that are completed. I will also discuss the types of skills and knowledge required at each phase to ensure success.

1. Identify Potential Cost Savings Opportunities

There are generally two approaches to this activity:

  • Using the native tooling that the hyperscaler provides: AWS provides tools like AWS Cost Explorer, AWS Trusted Advisor and AWS Compute Optimizer. All of these are really good starting points for understanding which services are being consumed and where your cost is concentrated within AWS. These tools also provide some guidance on where services are unused or underutilized and could be right-sized. They also provide some hints about services which are likely misconfigured or simply not required. This is a good place to start looking for ways to save some money.
  • Using third-party cloud management platforms: There are numerous third-party companies out there which will consume all of the raw data from the cloud hyperscalers, review the data and then provide semi-intelligent recommendations on how you can save money. Some of these include VMware’s CloudHealth, CloudZero, CloudCheckr and various other vendors.

These tools will give you a good head start in looking for opportunities to save some money. Typically, this activity can be reasonably completed by anyone who is halfway confident with the cloud and doesn’t require a lot of specialist knowledge. If you can log into the AWS console, you can likely access Cost Explorer and Trusted Advisor (note that the full set of Trusted Advisor checks requires a Business or Enterprise support plan!). Plugging in a third-party cloud management platform is also relatively straightforward, and vendors generally provide simple guides to get this working. Naturally, additional costs will apply when using third-party platforms, but ultimately, they will likely save you substantial time.
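As a rough illustration of finding where cost is concentrated, the sketch below totals spend per service from a response shaped like the output of the Cost Explorer `get_cost_and_usage` API when grouped by the `SERVICE` dimension. The sample figures are made up for illustration, and the function names `cost_by_service` and `top_services` are my own, not part of any AWS SDK.

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_service(response):
    """Sum UnblendedCost per service across every period in the response."""
    totals = defaultdict(Decimal)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            # Cost Explorer returns amounts as strings, so parse them exactly.
            totals[service] += Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


def top_services(response, n=3):
    """Return the n most expensive services as (name, total) pairs."""
    totals = cost_by_service(response)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


def _group(service, amount):
    """Build one group entry in the Cost Explorer response shape."""
    return {
        "Keys": [service],
        "Metrics": {"UnblendedCost": {"Amount": amount, "Unit": "USD"}},
    }


# Hypothetical sample data (two months, three services), for illustration only.
# With boto3 and credentials, a real response could come from:
#   boto3.client("ce").get_cost_and_usage(
#       TimePeriod={"Start": "2023-01-01", "End": "2023-03-01"},
#       Granularity="MONTHLY",
#       Metrics=["UnblendedCost"],
#       GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
#   )
SAMPLE_RESPONSE = {
    "ResultsByTime": [
        {"Groups": [
            _group("Amazon Elastic Compute Cloud - Compute", "1200.50"),
            _group("Amazon Simple Storage Service", "300.25"),
            _group("Amazon Relational Database Service", "800.00"),
        ]},
        {"Groups": [
            _group("Amazon Elastic Compute Cloud - Compute", "1000.00"),
            _group("Amazon Simple Storage Service", "310.25"),
            _group("Amazon Relational Database Service", "800.00"),
        ]},
    ]
}

if __name__ == "__main__":
    for service, cost in top_services(SAMPLE_RESPONSE):
        print(f"{service}: ${cost:,.2f}")
```

A ranked list like this only shows where the money goes; it says nothing yet about whether that spend is justified.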

An external supplier could be of great benefit in setting these up and then looking for and reporting on immediate cost-saving opportunities.

2. Add Context around these cost savings opportunities

The recommendations provided in phase one are generated by automated tools. While some of these tools boast about using ML/AI technologies to make recommendations, the reality is that they simply look at the data within the platform and make recommendations.

What they lack is context, which is really important. For example, I can tell you how to reduce your cloud bill to $0 immediately. The obvious approach to accomplish this is by turning everything within AWS off. However, this is not likely to yield a particularly great business outcome, especially when the AWS platform is what generates the revenue to pay your wage.

Context means understanding what each resource within the cloud is used for and understanding what business value it generates. There could be some very good reasons why EC2 instances are sitting there idle (for example, if participating in an active/standby cluster configuration). Turning off the standby node is likely to result in some pretty major unintended consequences!

To understand the context of the cloud environment requires specific tribal/institutional knowledge from within the organization. This cannot easily be provided by individuals outside of the application owners or the platform team that runs the cloud environment. Do not expect external contractors or vendors to be able to provide meaningful context against the recommendations initially provided.
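One way to picture this step is as a triage of the automated recommendations against context supplied by the application owners. The sketch below is illustrative only: the `Recommendation` and `Context` types, the resource IDs, and the figures are all hypothetical, and in practice the context might live in tags, a CMDB, or a spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    resource_id: str
    action: str
    monthly_saving: float


@dataclass
class Context:
    owner: str
    purpose: str
    safe_to_change: bool


def triage(recommendations, context_by_resource):
    """Split recommendations into actionable, needs-review, and rejected."""
    actionable, needs_review, rejected = [], [], []
    for rec in recommendations:
        ctx = context_by_resource.get(rec.resource_id)
        if ctx is None:
            # Nobody has explained this resource yet, so do not touch it.
            needs_review.append(rec)
        elif ctx.safe_to_change:
            actionable.append(rec)
        else:
            rejected.append(rec)
    return actionable, needs_review, rejected


# Hypothetical tool output and owner-supplied context.
RECOMMENDATIONS = [
    Recommendation("i-0aaa", "terminate idle instance", 140.0),
    Recommendation("i-0bbb", "terminate idle instance", 140.0),
    Recommendation("i-0ccc", "downsize instance", 60.0),
]
CONTEXT = {
    "i-0aaa": Context("payments", "standby node in active/standby cluster", False),
    "i-0ccc": Context("reporting", "nightly batch host", True),
}

if __name__ == "__main__":
    ok, review, no = triage(RECOMMENDATIONS, CONTEXT)
    print("actionable:", [r.resource_id for r in ok])
    print("needs review:", [r.resource_id for r in review])
    print("rejected:", [r.resource_id for r in no])
```

Here the idle standby node is rejected despite looking like an easy saving, and the resource with no recorded context is held back until someone who knows it has weighed in.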

3. Devise a high-level plan

Once the cost savings have been identified and sufficient context has been added, a high-level plan can be devised. Some examples of high-level plans may include the following (certainly non-exhaustive!):

  • Changing compute instances: Underutilized EC2 instances could be converted to smaller instances. The same applies to other compute-backed resources such as ElastiCache nodes or RDS instances. Using instances that are better sized (or of a different type) could save you money. Using current-generation hardware will usually give you a better ROI.
  • Tuning Auto Scaling groups: Making sure that you’re scaling up and down with traffic volumes is key to getting the best value from the cloud.
  • Re-architecting solutions: Using cloud-native primitives (such as containers and serverless) can often be more cost-effective. In the cloud, architecture and cost optimization are often the same thing. This could be a substantial amount of work but can significantly reduce costs (I’ve seen cost reductions of 10x and more).
  • Thinking about data retention policies: I’ve seen savings of many, many thousands a month when data was being kept indefinitely within S3, DynamoDB, or even RDS. Think about approaches that will move data to low-cost storage or even expire content if it retains no value.
  • Turning off services when not in use: Do you need your dev environments 24x7 when the developers only use them 60 hours/week? Coming up with a plan to turn these off outside of hours will save a lot of money.
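To make the data retention idea above more concrete, here is a minimal sketch that builds an S3 lifecycle rule which moves objects to a cheaper storage class and later expires them. The helper name `build_lifecycle_rule`, the prefix, and the day counts are assumptions for illustration; the right retention periods depend entirely on your own business and compliance requirements.

```python
def build_lifecycle_rule(prefix, transition_days, expire_days,
                         storage_class="STANDARD_IA"):
    """Build one S3 lifecycle rule: transition first, then expire."""
    if expire_days <= transition_days:
        raise ValueError("expire_days must be greater than transition_days")
    # Derive a readable rule ID from the prefix (hypothetical naming scheme).
    name = prefix.strip("/").replace("/", "-") or "all"
    return {
        "ID": "retention-" + name,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": transition_days, "StorageClass": storage_class}],
        "Expiration": {"Days": expire_days},
    }


if __name__ == "__main__":
    rule = build_lifecycle_rule("logs/", transition_days=30, expire_days=365)
    print(rule)
    # With boto3 and credentials, the rule could be applied with something like:
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="my-example-bucket",  # hypothetical bucket name
    #       LifecycleConfiguration={"Rules": [rule]},
    #   )
```

Applying a lifecycle configuration replaces whatever configuration the bucket already has, so this is exactly the kind of change that needs the context from phase 2 before it goes anywhere near production data.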

It is important to devise these high-level plans with the context gained within phase 2 and then plan the approach correctly with all of the stakeholders, especially if there will be an element of service disruption while the changes are implemented!

These plans can be prepared in isolation, together with the engineers who can enact the change, but you need to make sure that all context has been obtained in phase 2; otherwise, getting buy-in and commitment later in the process will be hard!

Be sure to properly quantify how much this saving represents to the business as part of this plan!
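As a worked example of quantifying a saving, take the dev environment case: a week has 168 hours, so an environment that is only needed 60 hours/week could be off roughly 64% of the time. The sketch below assumes cost scales linearly with running hours, which is a simplification (storage such as EBS volumes keeps charging while instances are stopped), and the $10,000/month figure is hypothetical.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def off_hours_saving(monthly_cost_always_on, hours_used_per_week):
    """Estimate the monthly saving from running only during used hours.

    Assumes cost is proportional to running hours (a simplification).
    """
    if not 0 <= hours_used_per_week <= HOURS_PER_WEEK:
        raise ValueError("hours_used_per_week must be between 0 and 168")
    fraction_off = 1 - hours_used_per_week / HOURS_PER_WEEK
    return monthly_cost_always_on * fraction_off


if __name__ == "__main__":
    monthly = off_hours_saving(10_000, 60)  # hypothetical $10,000/month dev spend
    print(f"Estimated monthly saving: ${monthly:,.2f}")
    print(f"Estimated annual saving:  ${monthly * 12:,.2f}")
```

Expressing the result as both a monthly and an annual figure tends to make the conversation with finance and other stakeholders in the next phase much easier.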

4. Review with stakeholders and get commitment

During this phase, the savings (dollar value) can be understood, and the proposed high-level plan can be articulated to all stakeholders. The stakeholders should include the application owners, end users (where appropriate), cloud office, platform engineering, as well as finance, security, and compliance. It is important to ensure that all parties understand the rationale for why this is being done, appreciate and understand the high-level approach, and have the opportunity to speak up if any context has been missed.

5. Build out an implementation plan with low-level detail

At this stage, the “what” has been proposed, and there is a level of buy-in from all the necessary stakeholders. The next step is the “how”: the way this is going to be practically implemented. This needs to consider some (or all) of the following:

  • What is the impact while this is being implemented? For example, will there be disruption to services? How does this impact services both up and downstream of the infrastructure being changed?
  • How will this practically be completed? For example, is this something that can be implemented using your IaC tool chain, or does it require some other approach to transition? For instance, a blue/green deployment pattern, etc.
  • Who will be doing this work? While a cloud engineer can likely complete and execute the change, what involvement is required from the application owners?
  • What is the testing plan? How do we know that the change has not had an adverse impact on service? How do we complete this in a lower environment, validate success, and then promote it to higher environments?
  • What is the rollback plan? If this doesn’t work, how do we untie all of this?
  • Decide when this change should be implemented.

There is a lot to consider in this phase. The above should not be considered authoritative, as it will largely depend on how your cloud team is organized. Consideration also needs to be given to change management and change approval processes (for example, presenting the change at CAB) where these are required to have the change approved.
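The considerations above can be treated as a checklist that must be complete before a change is put forward for approval. The sketch below is one possible way to capture that; the `ChangePlan` type and its field names are hypothetical and simply mirror the questions in the list.

```python
from dataclasses import dataclass, fields


@dataclass
class ChangePlan:
    impact: str = ""         # disruption, upstream and downstream effects
    method: str = ""         # e.g. IaC change, blue/green deployment
    owner: str = ""          # who does the work, who must be involved
    test_plan: str = ""      # how success is validated in each environment
    rollback_plan: str = ""  # how the change is undone if it fails
    schedule: str = ""       # when the change will be implemented

    def missing_items(self):
        """Return the names of checklist items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def ready_for_approval(self):
        return not self.missing_items()


if __name__ == "__main__":
    plan = ChangePlan(
        impact="Brief restart of the reporting host",
        method="Instance type change via the IaC pipeline",
    )
    print("Ready:", plan.ready_for_approval())
    print("Still missing:", plan.missing_items())
```

Keeping the plan in a structured form like this makes it obvious what is still unanswered before the change reaches the approval step.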

6. Get approval!

Assuming that you have successfully completed all the previous steps, you will hopefully have the green light from all key stakeholders to progress with the change.

7. Implement the cost-saving recommendations

Now we get to save some money!!

During this phase, the change should be implemented in accordance with the proposed plan. Importantly, it should be tested thoroughly and validated as working. This validation should involve not only the application owners but also the end users and other stakeholders.

The change can be implemented by a cloud engineer, and success can be validated at the time, but the end users should also be consulted.

8. Record and Report on cost savings

Finally, the most important piece is to actually recognize the savings! Cost savings in the cloud can be particularly challenging to pinpoint. Due to the self-service nature of the cloud, as you optimize for cost, you can find that other users are putting more workloads in, and the cost savings get eroded.

It is crucial to track these cost savings (remember, these are month-on-month savings) to ensure that the effort doesn’t go undetected!
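A simple way to track this is to compare each month’s actual cost against the pre-optimization baseline and keep a running total. The sketch below assumes the costs are scoped to the optimized workload only (for example via cost allocation tags), so that growth elsewhere in the account does not mask the saving; all figures are hypothetical.

```python
def savings_report(baseline_monthly, actual_by_month):
    """Return (month, saving, cumulative_saving) for each month.

    baseline_monthly: cost of the workload before optimization.
    actual_by_month:  list of (month_label, actual_cost) pairs.
    """
    report = []
    cumulative = 0
    for month, actual in actual_by_month:
        saving = baseline_monthly - actual
        cumulative += saving
        report.append((month, saving, cumulative))
    return report


if __name__ == "__main__":
    # Hypothetical workload that cost $50,000/month before optimization.
    actuals = [("2023-01", 42_000), ("2023-02", 40_000), ("2023-03", 41_000)]
    for month, saving, cumulative in savings_report(50_000, actuals):
        print(f"{month}: saved ${saving:,} (cumulative ${cumulative:,})")
```

A negative saving in any month is a useful signal that new usage has eroded the gain and that the workload is worth another look.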

For bonus points, think about how you can move from cost optimization into a more mature way of working where you think about cost governance (preventing the waste before it occurs!). This is something that I talk about in my blog post “Focusing on cost optimization? You’ve already wasted money.”

Thank you for taking the time to read this article. If you’ve found value in it, please consider supporting me by liking and following me. As a content creator, I strive to provide valuable insights and knowledge for free, and your support means a lot. It’s not about financial gain, but rather a way to show appreciation for the effort and time put into creating this content. So, let’s make a gentleman’s agreement — you keep reading and learning and liking my posts, and I’ll keep sharing my expertise with you. Thank you for your support!

Keiran Holloway

Technical Lead and Engineering Manager with over 20 years running complex public infrastructure. Strongly passionate about continuous learning and improvement.