Tales of AWS (or Any Other Cloud) Cost Optimisations

Luis Sousa
9 min read · Sep 18, 2020


Let me tell you how most, if not all, uneducated cost optimisation efforts start, hit a wall and never achieve the long-term results they set out to realise, and how you can avoid some of the pitfalls and save a few thousand, if not millions, of $/£/€ along the way.

This post is meant for all those companies that are not flooded with VC capital and don’t have the budget to hire seven-figure teams of engineers to look after their cloud estate (a reasonable/unreasonable argument can be made that if you can’t afford said teams, you’re better off going the “buy” route instead of “build”, but I digress; that can be a topic for a different post).

The beginning — Reservations

As with any “build-vs-buy” tale, there comes a time in every project that uses a cloud provider when your Finance/Accounting colleagues schedule a meeting with a title that goes a little something like this: “AWS Cost Analysis”; “Cloud Costs Review”; “Finance — Cloud Sync”. This is usually because some high-level person has seen the current cloud spend and uttered some expletives at how much a project or department is over budget. Said meeting ends up becoming a multi-part engagement in “explain to me what these line items in our bill are, why they’re so huge and how you can make them smaller” that would be better replaced by having said colleagues attend the AWS Cost Management Curriculum or enrol in the AWS Cloud Financial Management Course.

Amongst the back and forth between both parties, the engineering team will inevitably respond to some comment about the EC2 spend with the following: “Well, if you’d allow us to reserve instances for 1 or even 3 years, we could reduce that spend by up to 60%”.

And herein lies the first pitfall. Thinking that the answer to a large expenditure halfway through the year is to double down and pony up a multi-year commitment in the same financial calendar is like seeing that the living room is on fire and proposing that the best solution is to start a controlled burn in the hallway to protect the rest of the house. While technically this might be the “smart” thing to do, to the uneducated it sounds like a crazy plan, and it usually ends up resulting in a long process that involves creating a spreadsheet that looks something like this:

Random numbers for illustration purposes

Said spreadsheet will inevitably go ten rounds up and down the org chart until it’s deemed “too expensive to do this year” and is left on the shelf of “todo projects”.

Reservations, much like Volume Discounts, are a cost-saving strategy that requires forethought, time, reliable forecasting data and a mature understanding of the organisation’s use of a given resource. None of which is clearly in place, given that you’re having to do this exercise in the first place. Instead, Reservations should be a mid or final step in the cost optimisation plan. Once you’ve gathered most or all of the above requirements, then you can approach your cloud partner of choice to discuss terms.

Bonus savings: RDS and ElastiCache instances can also be reserved, and since they are usually always-on assets, it’s a no-brainer to reserve capacity in these categories.
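To see why the forecasting matters so much, here’s a back-of-the-envelope comparison of a 1-year no-upfront reservation against on-demand pricing. The instance type and rates are made up for illustration; the point is that a reservation is billed for every hour of the term, used or not, so it only wins above a certain utilisation.

```python
# Back-of-the-envelope reservation maths with made-up rates; plug in your own
# numbers from the AWS pricing pages before drawing any conclusions.
HOURS_PER_YEAR = 8760

on_demand_rate = 0.192   # $/hour, hypothetical on-demand price
reserved_rate = 0.121    # $/hour, hypothetical 1-year no-upfront reserved rate

def yearly_on_demand_cost(rate_per_hour, utilisation):
    """Cost of one on-demand instance for a year at a given utilisation (0.0 to 1.0)."""
    return rate_per_hour * HOURS_PER_YEAR * utilisation

for utilisation in (1.0, 0.7, 0.5, 0.3):
    od = yearly_on_demand_cost(on_demand_rate, utilisation)
    # A reservation is billed for every hour of the term, whether you use it or not.
    ri = reserved_rate * HOURS_PER_YEAR
    verdict = "reserve" if ri < od else "stay on-demand"
    print(f"utilisation {utilisation:>4.0%}: on-demand ${od:>8.2f} vs reserved ${ri:>8.2f} -> {verdict}")
```

With these illustrative rates the break-even sits at roughly 63% utilisation, which is exactly the kind of number you can only trust once you have reliable usage forecasts.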

The middle — rightsizing and scheduling

Once you’ve surpassed that first hurdle, you start arriving at more reasonable solutions such as turning off or scaling down your environments during off-hours and weekends and rightsizing your instances according to their use.

Assuming that you’re using infrastructure as code in the form of CloudFormation, Azure Resource Manager templates or Terraform, this should be a quasi-trivial task of rolling out a new version of your infra. However, if you have a less mature infrastructure setup, or you’ve had enough staff turnover that most of it hasn’t been run in months or years, then this might end up being just the opportunity you needed to get that “modernisation” plan implemented.

The second pitfall, which is all too easy to fall into, is reducing and downsizing all the development environments and tooling instances to the point where your teams can go grab a full 3-course meal between build/deployment pipeline runs. You should focus instead on the following strategies:

Multi-Tenancy

Make use of cloud multi-tenancy (ECS, EKS, GKE, AKS, Fargate, …) to use your compute estate to its fullest capacity. If your product stack is partially or even fully dockerised, then it makes no sense to have each service running on its own machine. Economies of scale matter here, and container schedulers are there to ensure your workloads come up and stay up.
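As a rough illustration of what “using the estate to its fullest” can look like on ECS, here’s a minimal boto3 sketch (cluster, service and task definition names are hypothetical) that uses a binpack placement strategy, so the scheduler fills existing instances before spreading tasks onto new ones:

```python
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# Hypothetical cluster and service names; the interesting part is the
# placement strategy: "binpack" on memory tells the scheduler to pack tasks
# onto the fewest instances possible instead of spreading them out.
ecs.create_service(
    cluster="nonprod",
    serviceName="inventory-service",
    taskDefinition="inventory-service:1",
    desiredCount=3,
    placementStrategy=[
        {"type": "binpack", "field": "memory"},
    ],
)
```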

Spot instances are not (that) dangerous or scary

Consider shifting all non-production workloads/environments partially (or fully) to spot instances. With mixed-instance auto scaling groups, it’s a no-brainer to switch your non-critical and non-persistent workloads (read: web and application servers, not DBs) to spot or a mix of pricing models. Just beware that if there’s an AZ outage, all that spot capacity can be pulled back to serve on-demand requests. So if your workloads are even mildly important, consider setting an “on-demand” percentage value >0% to guarantee that if your spots do get pulled, at least your environment will be a little less broken. An extra step you might want to consider is switching your bastions to Spot Instances as well; these are usually only transient machines and don’t hold any persistent data.
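Here’s a minimal boto3 sketch of what that could look like with an auto scaling group using a mixed instances policy. The launch template, subnets and instance types are placeholders; the interesting part is the InstancesDistribution block, which keeps a small on-demand base so the environment degrades rather than disappears if spot capacity gets reclaimed:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Placeholder launch template, subnets and instance types. The group runs a
# mix of spot and on-demand, with one guaranteed on-demand instance and 20%
# of anything above that also kept on-demand.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="nonprod-web",
    MinSize=2,
    MaxSize=10,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "nonprod-web",
                "Version": "$Latest",
            },
            # Several interchangeable instance types deepen the spot pool.
            "Overrides": [
                {"InstanceType": "m5.large"},
                {"InstanceType": "m5a.large"},
                {"InstanceType": "m4.large"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 1,
            "OnDemandPercentageAboveBaseCapacity": 20,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```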

Scale your Staging/NonProd environments down when not in use

One of the biggest fallacies in modern software development is that you need a staging environment the size of production all the time, when in fact you need a staging environment that’s architecturally the same as production, and only the same size when you’re doing stress testing. At any other time, you only need it as big as your testing userbase. So allow it to scale down and shut down overnight and you’ll see massive savings there. Bonus points for seeing how your scalability works under increased stress and how fast it scales once your performance testing suite starts hitting those load balancers. If this last sentence is foreign to you, feel free to reach out and I’ll happily point you in the direction of someone who’ll provide their time for a reasonable fee to help you scale your environments.
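A sketch of the overnight scale-down, assuming a hypothetical “staging-app” auto scaling group and using scheduled scaling actions (the times are UTC and entirely up to your team’s working hours):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Hypothetical "staging-app" group: shrink to zero every weekday evening and
# bring a small footprint back before the team starts in the morning.
# Friday's scale-down naturally keeps it off over the weekend.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="staging-app",
    ScheduledActionName="staging-scale-to-zero",
    Recurrence="0 19 * * MON-FRI",   # 19:00 UTC, weekdays
    MinSize=0, MaxSize=0, DesiredCapacity=0,
)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="staging-app",
    ScheduledActionName="staging-scale-up",
    Recurrence="0 7 * * MON-FRI",    # 07:00 UTC, weekdays
    MinSize=1, MaxSize=4, DesiredCapacity=2,
)
```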

Implement dynamic scaling in Production environments

Source: https://www.slideshare.net/AmazonWebServices/ent101-embracing-the-cloud-final

If you’ve successfully implemented scaling in non-prod environments, then it’s time to tackle production. Unless you’re Google or Amazon and you operate a 24/7 service, your usage pattern most likely resembles this image. This is a perfect opportunity to leverage scheduled and monitoring-based scaling. Understand your usage patterns and adapt to them.
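For the monitoring-based half, a target tracking policy is usually the simplest starting point. A minimal sketch, assuming a hypothetical “prod-web” auto scaling group and a 50% average CPU target:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Hypothetical "prod-web" group: a target-tracking policy keeps average CPU
# around 50%, adding instances as traffic ramps up during the day and
# removing them again as it tails off overnight.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="prod-web",
    PolicyName="prod-web-cpu-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,
    },
)
```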

The (never-ending) end — Processes, Architecture and Reservations

So now that you’ve addressed most of the quick wins, it’s time to tackle the systemic issues. In this section, I will focus on AWS for the most part due to some idiosyncrasies of their billing model, but rest assured other cloud providers have their own constraints. The reasons that caused you to start this exercise most likely stem from the following sources:

  • Less than adequate Architectural Design
  • Incorrect usage of High Availability capabilities
  • Lack of Platform hygiene practices and processes
  • Lack of Adequate Knowledge

Let’s deep dive on some of these topics:

Architectural Design Review

One of the most common designs when starting out in the cloud involves logically isolating environments (dev-test-stage-prod) or stages of environments (dev-nonprod-prod) in different VPCs. These VPCs, if designed securely, will require NAT instances or NAT Gateways (times the number of AZs), Internet Gateways, separate clusters or autoscaling groups (which usually don’t benefit from economies of scale in small environments) and many other components. If you’re then required to connect environments via private routes rather than using publicly exposed endpoints, you’ll also have to consider VPC Peering or other forms of connection. All of which will cost you a considerable amount of money.

Compare this with oversizing your non-production VPC subnets and setting up your automation to allow for side-by-side deployments of multiple environments within it. Need to connect two logical components? Security groups can easily reference one another. Need secure private connections between multiple environments? Internal load balancers are readily available and easy to configure. The benefits go on and on. Just ensure that you’re tagging and enforcing naming conventions so your inventory doesn’t get out of hand, and you’ll be fine. (Obviously, this does not apply to highly regulated environments which need to follow pre-defined standards and reference architectures.)
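As an example of that security group referencing, here’s a small boto3 sketch with placeholder group IDs: the app tier’s security group accepts traffic from anything carrying the web tier’s security group, regardless of which environment the instance belongs to:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Placeholder group IDs. Instead of maintaining CIDR whitelists per
# environment, the app tier's security group simply trusts traffic coming
# from instances that carry the web tier's security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0aaaa1111bbbb2222",          # app tier security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        "UserIdGroupPairs": [{"GroupId": "sg-0cccc3333dddd4444"}],  # web tier SG
    }],
)
```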

Many other small changes can have big impacts on how efficiently you use the resources you pay for.

High Availability Capabilities = Expensive Network Traffic

Corey Quinn put it best with this tweet:

It takes skill to do a spit-take without a drink! 🥃

I won’t spend too much time explaining how AZ network charges on AWS work (I’ll leave that to Corey’s amazing blog here), but suffice to say that traffic is billed on both sides of the zone boundary, so it costs twice as much as most people assume every time data goes from Zone A to Zone B, and four times as much if you then respond with a payload from B to A, as every modern system does. You can see how easy it is for a poor multi-AZ design to get expensive quickly.
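A quick back-of-the-envelope illustration, assuming the commonly quoted $0.01/GB cross-AZ rate, which is charged on both sides of the boundary (the traffic figures are made up):

```python
# Rough illustration of the cross-AZ billing gotcha, assuming the commonly
# quoted $0.01/GB, charged on BOTH sides (egress and ingress) of the AZ
# boundary. Traffic figures are hypothetical.
RATE = 0.01                      # $/GB, each direction of the crossing

request_gb = 500                 # AZ-A -> AZ-B per month
response_gb = 500                # AZ-B -> AZ-A per month

naive = request_gb * RATE                        # "it's a cent per gig, right?"
actual = (request_gb + response_gb) * RATE * 2   # billed out of one AZ and into the other

print(f"naive estimate: ${naive:.2f}")           # $5.00
print(f"actual charge:  ${actual:.2f}")          # $20.00, i.e. four times as much
```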

Unless you need your dev environments to survive a full AZ outage, they probably don’t need to have their web tier in AZ-A, app tier in AZ-B and DB in AZ-C. The name of the game for non-critical workloads is Availability Zone Affinity.

Does your dev1 “inventory” service have an identifiable data flow? Can that flow be isolated from other services? If the answers to those questions are yes, then you can probably pick one AZ at random in your favourite region, drop your database + containers/servers in the subnets assigned to that AZ, set the right affinity constraints (if you are using ECS, for example, here’s the doc; for EC2, placement groups have you covered) and watch your cross-AZ network traffic charges drop to zero for that data flow (you may even see latency improve). Apply this methodology to all the non-critical components you can, and you’ll see a massive reduction in your AWS bill.
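For the ECS case, a minimal sketch of what that affinity constraint could look like, with hypothetical cluster, service and AZ names:

```python
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# Hypothetical dev "inventory" service: pin every task to the same AZ as its
# database so none of its chatter crosses an availability zone boundary.
ecs.create_service(
    cluster="dev1",
    serviceName="inventory",
    taskDefinition="inventory:1",
    desiredCount=2,
    placementConstraints=[{
        "type": "memberOf",
        "expression": "attribute:ecs.availability-zone == eu-west-1a",
    }],
)
```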

Process and automation are kings

Now that you’ve tagged all your infrastructure, which is now living neatly together, you can go ahead and begin automating the shutdown and termination of unruly resources. Fortunately, there’s an open-source product that can get you started on this journey. Cloud Custodian is an amazing Cloud Security, Governance, and Management tool that, in their own words:

(…) enables users to be well managed in the cloud. The simple YAML DSL allows you to easily define rules to enable a well-managed cloud infrastructure, that’s both secure and cost optimized. It consolidates many of the ad-hoc scripts organizations have into a lightweight and flexible tool, with unified metrics and reporting. Custodian supports managing AWS, Azure, and GCP public cloud environments.

Do note that Cloud Custodian does not require you to use Terraform to manage your infrastructure, nor does it care whether you use ARM Templates or the gcloud alpha CLI to spin up your clusters. All it cares about is that the infrastructure that should be up from 9–5 has a certain set of tags and meets the right filters. Anything that doesn’t meet those gates can be “actioned” on to achieve the desired outcome. You can choose to shut it down, email the relevant team, or even outright terminate an instance a few minutes after it was launched.

There are many other products, and even custom-built solutions based on trusty bash scripts or Jenkins jobs, that can achieve a similar result, but the mindset shouldn’t change. If you’re paying for resources on an on-demand basis, then only keep them up for as long as you need them. Everything else should be brought back to a useful state through automation when it’s needed again.
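As an illustration of what one of those custom-built solutions could look like, here’s a small boto3 sketch that stops anything carrying a hypothetical schedule=office-hours tag; you’d run it from a cron job or a scheduled Lambda after hours:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Hypothetical tag scheme: anything tagged schedule=office-hours gets stopped
# when this runs after hours. Untagged resources are left alone until the
# team agrees on a policy for them.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:schedule", "Values": ["office-hours"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print(f"Stopped {len(instance_ids)} instances: {instance_ids}")
```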

Education is the silver bullet

That’s it, that’s the message. Educate your users and your engineers to operate with a “cost-aware” mentality and they’ll be the force for the change you want to see in your landscape.

Closing

Source: https://pixabay.com/illustrations/idea-light-bulb-enlightenment-1296144/

The TL;DR for this post is rather simple: your cloud costs are a direct result of the design decisions and operating procedures you have. Download your AWS bills, understand your usage patterns, iterate on improving your design and procedures, and your costs will start declining. If you need help, feel free to reach out and I’ll link you to people who know their stuff.

--


Luis Sousa

DevOps by day, nerd by night — I’m a self-taught cloud platform engineer who loves to tinker with new technologies and build things!