AWS Cost Optimization
What Is Cost Optimization and Why Do We Need It?
AWS has built the Well-Architected Framework for builders like us. It is built upon six pillars, and Cost Optimization is one of them. Cost optimization frees up budget that can be redirected toward scaling the systems that actually need it, cost-effectively.
The pace of innovation and the number of products or features we build every day keep adding resources to our AWS accounts. Once the monthly bill grows significantly, this starts raising questions like these:
- How much spend is going towards a particular project/application/environment?
- Are there any particular applications that seem too expensive compared to others?
- Is a particular feature/service/resource too costly?
- Should we allocate resources to look into optimizations or alternatives to drive down cost?
- Are there unexplained spikes in usage or cost? For which resources? Who do they belong to?
Let’s go over a few to-dos, tips, and services we leveraged for AWS cost optimization/governance that helped us answer these questions!
Tagging
We developed a tagging strategy that gets applied to all of our AWS resources. Tags are key/value pairs; ours look like this:
- Name: {Prefix}-{Product_Name/Project_Name}-{Environment_Name} (e.g., BU-xyz-dev)
- Product: Product1/Product2/Product3
- Project: Project1/Project2/Project3/Project4
- Owner: Mayank Patel (Technical Lead of the Project who can take ownership of resources)
- ManagedBy: Terraform/CodeCatalyst/Person’s Name (This represents how the resource is deployed: either the automation tool or the person who manually provisioned it.)
- Environment: dev/staging/rc/uat/onboard/prod (Standard Environment Name across all business unit products)
- Description: AWS Aurora Database for Project1 (High-level short human-readable description)
- Unit: Business Unit Name
Tags are not only useful for cost analysis; they also make it easy to identify the group of resources belonging to a specific project or product, and other services can use them to drive actions or automation for management purposes.
Adding these tags to resources does not automatically make them filterable in Cost Explorer. You need to work with the admin of the billing account to activate some of these tags as user-defined cost allocation tags; only then can you filter by them in Cost Explorer. Reach out to your admins to add tags as needed.
Have a standard naming convention that helps identify resources easily and quickly. Here are some examples of BA resource naming conventions: ba-oi-staging/rc/uat/onboard/prod, ba-miq-dev/uat/prod, ba-ms-dev/staging/uat/prod, etc.
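Once the tags are activated, the same breakdown is available programmatically as well. Here is a minimal boto3 sketch (the dates and tag key are examples) that groups a month of spend by the Project tag:

```python
import boto3

# Minimal sketch: one month's cost grouped by the "Project" cost allocation tag.
# Assumes "Project" has already been activated as a cost allocation tag in the
# billing account; the date range is a placeholder.
ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-01-01", "End": "2023-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "Project"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    # Group keys come back as "Project$<tag value>".
    amount = group["Metrics"]["UnblendedCost"]["Amount"]
    print(f'{group["Keys"][0]}: ${float(amount):.2f}')
```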
Budget Allocation and Alerts
We developed an AWS Lambda function that keeps track of the allocated budget for a given project or product and gathers the actual amount spent. It then triggers a notification to Microsoft Teams using SNS, as shown in the sample screenshot below. As part of any new project/initiative, we allocate budget thresholds; to onboard a project, we simply add it as a reference in Terraform and start receiving alerts.
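For illustration, a stripped-down stand-in for such a function might look like the sketch below. The budget map, SNS topic ARN, and tag key are placeholders, not our real configuration (our actual thresholds come in through Terraform):

```python
import boto3
from datetime import date

# Hypothetical budget thresholds (USD) per Project tag value.
BUDGETS = {"Project1": 500.0, "Project2": 1200.0}
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:cost-alerts"  # placeholder

def handler(event, context):
    ce = boto3.client("ce")
    sns = boto3.client("sns")
    today = date.today()
    start = today.replace(day=1).isoformat()  # month-to-date (skip on the 1st)

    result = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": today.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "Project"}],
    )

    for group in result["ResultsByTime"][0]["Groups"]:
        # Group keys come back as "Project$<tag value>".
        project = group["Keys"][0].split("$", 1)[-1]
        actual = float(group["Metrics"]["UnblendedCost"]["Amount"])
        budget = BUDGETS.get(project)
        if budget and actual > budget:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject=f"Budget alert: {project}",
                Message=f"{project} spend ${actual:.2f} exceeds budget ${budget:.2f}",
            )
```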
Trusted Advisor
Trusted Advisor is another very useful service. It provides recommendations across all of the Well-Architected pillars; for now, though, we will focus on Cost Optimization alone. Trusted Advisor analyzes CloudWatch metrics for different AWS resources against their provisioned capacity and makes recommendations based on that.
The service reports overall possible savings, broken down by service, along with design suggestions. Trusted Advisor offers many checks; I will list a few we follow (a sketch of pulling these checks programmatically follows the list):
- Idle database instances: if an instance has been idle for some amount of time and is not being used, that is an opportunity to downscale, stop, or eliminate it.
- Over-provisioned / low-utilization resources → right-size the resource specification. For example, if the team requested 2xlarge instances but utilization stays below a certain percentage even at peak time, right-size to xlarge and observe the usage.
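Assuming a Business or Enterprise support plan (the Support API requires one), the cost checks can be pulled programmatically, roughly like this:

```python
import boto3

# Sketch: list Trusted Advisor cost-optimization checks and how many resources
# each one flags. The Support API is only available in us-east-1.
support = boto3.client("support", region_name="us-east-1")

checks = support.describe_trusted_advisor_checks(language="en")["checks"]
for check in checks:
    if check["category"] != "cost_optimizing":
        continue
    result = support.describe_trusted_advisor_check_result(checkId=check["id"])["result"]
    flagged = result.get("flaggedResources", [])
    print(f'{check["name"]}: {len(flagged)} flagged resources')
```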
Spot Instances
We leverage Spot Instances to get up to a 90% discount over On-Demand instances for fault-tolerant, stateless applications, as well as for flexible, time-insensitive workloads like batch jobs or data pipelines. Spot Instances are identical to On-Demand instances in performance and infrastructure platform: Amazon has a fleet of resources, and spare, unutilized capacity is sold as Spot Instances.
We request Spot Instances by setting the maximum price we are willing to pay, and we get the capacity whenever the current Spot price is below that mark. This often provides great cost-saving opportunities and lets us leverage large instances at a very cheap price.
Spot pricing also depends on the Region, the Availability Zones, and overall demand from Amazon customers for compute resources.
Things to know:
- If spare capacity is reclaimed for other customers, Spot requests can fail to launch and running instances can be interrupted.
- If the Spot price rises above the maximum price we set, instance launches fail as well.
- Make sure your architecture is flexible, stateless, and loosely coupled.
Our use cases where we leverage Spot Instances:
- Data Pipeline
- Runner Instances for CI/CD
- High-performance compute for machine learning
So make sure to design your workload appropriately.
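If it helps to see the mechanics, here is a minimal boto3 sketch of launching a one-time Spot worker; the AMI ID, instance type, and max price are placeholders:

```python
import boto3

# Minimal sketch: launch a fault-tolerant worker on Spot capacity.
ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="c5.2xlarge",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "0.15",  # highest hourly price we will pay (USD)
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```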
Reserved Instances
Reserved Instances (RIs) can offer up to a 70% discount over On-Demand instances. If you are leveraging very heavy compute, 8xlarge, 12xlarge, or higher, and you plan to use it for a year or more, that becomes a good case to reserve those instances and capture the discount.
RDS database instances generate the most cost in our accounts, so we reserved them for a 1-year period. This brings x% savings on RDS instance costs; added up over a year, that becomes a very large dollar figure.
Things to know:
- Plan for growth and future compute needs, making sure you will not need major architecture changes during the reservation term.
- In a multi-account setup, if one account reserves an instance and does not use it, another account can benefit from the reservation, provided it uses matching instance families and classes.
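To see how the dollars add up, here is a back-of-the-envelope sketch with illustrative, made-up rates; the actual discount depends on instance class, term, and payment option:

```python
# Back-of-the-envelope RI math with illustrative (not actual) prices.
on_demand_hourly = 1.00     # hypothetical on-demand rate, $/hour
ri_effective_hourly = 0.62  # hypothetical 1-year no-upfront reserved rate, $/hour

hours_per_year = 24 * 365
on_demand_annual = on_demand_hourly * hours_per_year
ri_annual = ri_effective_hourly * hours_per_year

savings = on_demand_annual - ri_annual
print(f"Annual savings: ${savings:,.0f} ({savings / on_demand_annual:.0%})")
# With these example rates: $3,329 per instance per year, a 38% discount.
```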
Savings Plans
Savings Plans in most cases offer the same level of discount as Reserved Instances, but with more flexibility.
Have your admin work with your AWS account team to review Savings Plans against your usage and see if they make sense.
Serverless
AWS allows builders to leverage purpose-built services in their designs, and serverless is definitely at the top of that list. We have been gradually increasing our usage of services like Lambda, Step Functions, etc. This offers good savings over compute resources that are provisioned all the time but utilized very little.
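As a rough illustration of why: the comparison below uses approximate us-east-1 list prices (verify current pricing) and made-up workload numbers for a low-traffic service on Lambda versus an always-on t3.small:

```python
# Illustrative only: prices and workload figures are approximate assumptions.
lambda_gb_second = 0.0000166667  # $ per GB-second
lambda_per_request = 0.20 / 1e6  # $ per request

requests_per_month = 1_000_000
avg_duration_s = 0.2
memory_gb = 0.5

lambda_cost = (requests_per_month * avg_duration_s * memory_gb * lambda_gb_second
               + requests_per_month * lambda_per_request)

t3_small_hourly = 0.0208  # $ per hour, on-demand
ec2_cost = t3_small_hourly * 24 * 30

print(f"Lambda:   ${lambda_cost:.2f}/month")  # ~ $1.87
print(f"t3.small: ${ec2_cost:.2f}/month")     # ~ $14.98
```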
Here is a very big list of Serverless patterns using different services:
Graviton Instances
Graviton is an ARM-based processor, as opposed to the x86 platform. Graviton instances generally provide better price performance than their x86 counterparts, as well as over previous-generation Graviton instances.
Quick example: the M6g instance family provides almost the same spec as M5 instances, yet leveraging M6g Graviton instances over M5 comes at roughly 20% lower cost.
We are proud to share that 95% of our RDS (database compute) usage has been running on Graviton instances for about a year now. The remaining 5% of x86 usage will move to Graviton once our project development team can get through a database upgrade.
We also leverage only Graviton-based instances for all of our Data Lake platform workloads.
AWS has 3 generations of Graviton Instances.
Cost optimization is not the only reason we use Graviton instances. One of my personal goals is to have the majority of our compute on Graviton, since that supports the sustainability pillar of the Well-Architected Framework and uses GREEN ENERGY.
I highly recommend that all builders designing solutions on AWS consider Graviton instances, so we can leverage compute that uses green energy.
Things to Know:
- Graviton being ARM-based, you will sometimes run into the major issue of packages or container images not being available for the arm64 architecture.
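When evaluating a migration, it helps to know what arm64 options exist. Here is a small boto3 sketch that lists current-generation Graviton-capable instance types (the filter names come from the EC2 API):

```python
import boto3

# Sketch: enumerate current-generation instance types that run on arm64
# (Graviton) to see what is available for a migration.
ec2 = boto3.client("ec2")

paginator = ec2.get_paginator("describe_instance_types")
pages = paginator.paginate(
    Filters=[
        {"Name": "processor-info.supported-architecture", "Values": ["arm64"]},
        {"Name": "current-generation", "Values": ["true"]},
    ]
)

arm_types = sorted(it["InstanceType"] for page in pages for it in page["InstanceTypes"])
print(f"{len(arm_types)} arm64 instance types, e.g. {arm_types[:5]}")
```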
GP3 Storage
We have also strived to move all of our block storage to the General Purpose (gp3) volume type. gp3 offers the lowest-cost SSD for a wide variety of workloads.
AWS constantly works to optimize storage types, and keeping up with the newest types gives us better prices as well as better performance.
Things to Know:
- Clean up unattached volumes (the sketch below finds those as well)
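Here is a boto3 sketch of both steps: finding unattached volumes (cleanup candidates) and migrating gp2 volumes to gp3. Review the output before deleting or modifying anything in a real account.

```python
import boto3

ec2 = boto3.client("ec2")

# Unattached ("available") volumes are candidates for cleanup.
unattached = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]
for vol in unattached:
    print(f'Unattached: {vol["VolumeId"]} ({vol["Size"]} GiB)')

# Migrate remaining gp2 volumes to gp3; the gp3 baseline
# (3000 IOPS / 125 MiB/s) covers most gp2 workloads at lower cost.
gp2_volumes = ec2.describe_volumes(
    Filters=[{"Name": "volume-type", "Values": ["gp2"]}]
)["Volumes"]
for vol in gp2_volumes:
    ec2.modify_volume(VolumeId=vol["VolumeId"], VolumeType="gp3")
```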
Set Up Automated Instance Scale-Up and Scale-Down Based on Usage
One of our teams has an ML pipeline that runs on a fixed schedule and uses an RDS Postgres database. The team set up a scheduled Terraform pipeline that scales the instance up before pipeline execution and scales it back down once the run completes. This lets the team leverage a large instance during the run and keep it at the lowest level afterward for read operations. Designing solutions like this is an effective way to optimize AWS cost.
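The team's actual mechanism is a scheduled Terraform pipeline; purely as an illustration of the same idea, here is a boto3 sketch (the instance identifier and classes are placeholders):

```python
import boto3

rds = boto3.client("rds")

def scale_rds(instance_id: str, instance_class: str) -> None:
    # Resize the RDS instance; expect a brief availability impact.
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        DBInstanceClass=instance_class,
        ApplyImmediately=True,
    )

# Before the ML pipeline run: scale up. After it completes: scale back down.
scale_rds("project1-postgres", "db.r6g.2xlarge")
# ... run pipeline ...
scale_rds("project1-postgres", "db.r6g.large")
```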
Auto Scaling
For EC2 usage we rely on Auto Scaling, which allows us to increase or decrease capacity as needed rather than wasting compute resources. Auto Scaling also lets us change between instance types/families with minimal interruption for end users, since it uses techniques like connection draining.
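A common setup is a target-tracking policy; here is a minimal sketch (the Auto Scaling group name is a placeholder):

```python
import boto3

# Sketch: a target-tracking policy that keeps average CPU near 50%,
# letting the group grow and shrink with demand.
autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="ba-oi-prod-asg",  # placeholder
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```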
S3 Buckets
S3 is the cheapest object store. However, when we blindly write data to the Standard class and make multiple copies of terabytes of data, the S3 bill jumps. Instead, we should use it efficiently by leveraging the appropriate storage classes. AWS keeps reducing retrieval times across storage classes, which lets us move objects to cheaper classes while still getting satisfactory response times.
Our team regularly moves all archival data to Glacier Deep Archive and plans to move some large buckets to the Intelligent-Tiering class in the near term. One feature we leverage heavily is lifecycle rules, which keep buckets clean by deleting temporary data and moving objects between storage classes appropriately.
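Here is a sketch of what such a lifecycle configuration might look like with boto3; the bucket name, prefixes, and day counts are placeholders:

```python
import boto3

# Sketch: expire temporary objects after a week and move archival data
# to Glacier Deep Archive after 90 days.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="ba-oi-prod-data",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temp-data",
                "Filter": {"Prefix": "tmp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
            {
                "ID": "archive-old-data",
                "Filter": {"Prefix": "archive/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
            },
        ]
    },
)
```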
Do not jeopardize Disaster Recovery (DR) or Business Continuity Planning (BCP) just for the sake of cost optimization; make sure to establish standard backup policies to protect and preserve data.
Cost optimization is not a one-and-done deal. It is something that has to be considered consciously from the get-go of any project.
Keep up with modern instance families and classes; this provides better cost optimization as well as performance improvements.