Maximising Cost Efficiency in the Cloud: Strategies for Optimal Cloud Cost Optimization
In today’s cloud-driven tech landscape, one of the biggest challenges an organisation faces when moving to the cloud is controlling its cost. Cloud cost optimisation is an essential pillar of tech architecture. It is commonly said that cloud cost optimisation techniques can save up to 30-40% of cloud infrastructure cost. Governance around this can keep cloud spend under control and significantly reduce expenditure while maintaining the performance and reliability that our business and consumer stakeholders depend on. This blog post explores various approaches, techniques, considerations and practices to control cloud costs and enable businesses to maximise their cloud investment.
We will look at cloud cost optimisation from 31 different aspects:
Compute Optimization
With the cloud, it is easy to provision a new compute instance on the fly, but with this power comes the responsibility of choosing it right. Cost optimisation for compute can be approached from the different perspectives mentioned below:
Compute and memory optimisation
We need to choose right-sized instances based on actual need, selecting CPU and memory configurations appropriate to the application’s or software’s requirements. This means avoiding over-provisioning or provisioning for future needs and providing exactly what is right for today.
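As a starting point, the provider's monitoring data can highlight over-provisioned instances. Here is a minimal sketch, assuming AWS with boto3 and CloudWatch metrics; the 20% CPU threshold and 14-day window are illustrative choices, not recommendations.

```python
# Sketch: flag potentially over-provisioned EC2 instances by average CPU.
# Assumes AWS credentials are configured; pagination is omitted for brevity.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

LOW_CPU_THRESHOLD = 20.0  # percent, tune per workload
LOOKBACK_DAYS = 14

now = datetime.now(timezone.utc)
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        datapoints = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
            StartTime=now - timedelta(days=LOOKBACK_DAYS),
            EndTime=now,
            Period=3600,
            Statistics=["Average"],
        )["Datapoints"]
        if datapoints:
            avg_cpu = sum(p["Average"] for p in datapoints) / len(datapoints)
            if avg_cpu < LOW_CPU_THRESHOLD:
                print(f"{instance['InstanceId']} ({instance['InstanceType']}): "
                      f"avg CPU {avg_cpu:.1f}% -> candidate for right-sizing")
```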
Instance type selection
We should choose different instance types based on the application’s needs and the nature of the workload it has to handle. To take this decision, we should benchmark the services at the current workload x as well as at projected future workloads of 2x and 5x. Based on this load analysis and the nature of the application, we can decide which instance type to choose.
We should also consider alternative instance families, such as ARM-based Graviton processors, which provide cost-effective options for specific workloads. When choosing an alternative family, we should first run a POC on a non-prod workload, and only after detailed testing and analysis should we use it for production workloads.
Auto-scaling
We should always prefer horizontal scaling over vertical scaling to cater to increasing application demand. To achieve this, we should implement auto-scaling policies so that new instances are added dynamically when demand increases and detached automatically when the consumer load on the application/service goes down. This helps ensure optimal performance during peak times while scaling down during periods of lower utilisation. Scaling can be driven by attributes like:
- compute
- memory
- concurrency
- startup time
The application’s start-up time should drive how aggressively these attributes are set. For an application with a high start-up time, auto-scaling trigger attributes such as memory utilisation should be kept lower than for an application with a low start-up time, so that new instances are ready before the existing ones are overwhelmed.
We should also schedule the auto-scaling min/max instance counts based on:
- known peak load pattern/period
- planned promotion
- unknown or unplanned, but expected, load
Doing so helps us control cost leakage due to performance degradation; otherwise, instances will keep getting added whenever there is a performance issue.
We should re-evaluate the scaling parameters and the min/max settings from time to time as the consumer load changes.
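To make the above concrete, here is a minimal sketch assuming AWS EC2 Auto Scaling and boto3; the group name web-asg, the 60% CPU target and the Friday-evening window are hypothetical placeholders. It combines a target-tracking policy for dynamic scaling with scheduled actions that raise and then revert the min/max around a known peak period.

```python
# Sketch: dynamic + scheduled scaling for an Auto Scaling group (AWS, boto3).
# "web-asg" and all numbers are illustrative placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

# 1. Target-tracking policy: add/remove instances to keep average CPU near 60%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)

# 2. Scheduled actions: raise min/max ahead of a known Friday-evening peak,
#    so capacity is ready before demand, then revert once the peak is over.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",
    ScheduledActionName="friday-evening-peak",
    Recurrence="0 18 * * 5",  # 18:00 UTC every Friday
    MinSize=4,
    MaxSize=12,
)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",
    ScheduledActionName="friday-peak-revert",
    Recurrence="0 23 * * 5",  # 23:00 UTC every Friday
    MinSize=2,
    MaxSize=6,
)
```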
Scaling considerations
We should design applications to be auto-scalable. This means that as the load on the application increases, new servers should be added automatically, and server instances should scale back down as the load decreases.
A few important considerations so that auto-scaling does not lead to cost leakage:
- Auto-scaling works on attributes like compute, memory, concurrency, etc. We should configure the attribute values so that scale-up and scale-down do not counteract each other; if they do, scale-up and scale-down can get into a recursive loop.
- Min and max instances allowed for scale should always be configured.
- For a non-prod environment, the min and max should be configured at 1. Only the load-testing environment should be configured similarly to prod.
Auto shutdown
We should have automated scripts to shut down non-prod resources at night or during non-productive hours and automatically bring them back up at the start of the working day.
As a general rule, we should do the same for weekends.
We should also consider putting tags in place for holidays, e.g. holiday-2023-06-06, so that instances can shut down automatically on holidays.
Another scenario is that we might require certain resources only for a specific time period, so it is ideal to have configuration that shuts them down once that period has passed.
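One way to implement this is a small scheduled job (for example, a Lambda function on a cron trigger) that stops every non-prod instance carrying an opt-in tag. The sketch below assumes AWS and boto3; the auto-shutdown and env tag names are conventions we would define ourselves, and a matching job would start the instances again in the morning.

```python
# Sketch: stop non-prod instances tagged for auto-shutdown (AWS, boto3).
# Intended to run on an evening schedule; a mirror-image job would call
# start_instances in the morning. Tag keys/values are our own convention.
import boto3

ec2 = boto3.client("ec2")

def stop_tagged_instances():
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:auto-shutdown", "Values": ["true"]},
            {"Name": "tag:env", "Values": ["dev", "qa", "staging"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped: {instance_ids}")

if __name__ == "__main__":
    stop_tagged_instances()
```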
Using spot instances
We can take advantage of AWS Spot Instances or Google Cloud Preemptible VMs for specific types of application requirements that are non-real-time and are not impacted if their work is completed with some delay or is paused for a while.
Doing so can offer substantial cost savings for non-time-critical workloads.
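As a rough sketch of how such an instance is requested (AWS and boto3 assumed; the AMI ID and instance type are placeholders), Spot capacity can be asked for directly in run_instances via the market options:

```python
# Sketch: launch an EC2 Spot instance for an interruption-tolerant batch job.
# The AMI ID and instance type are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # The instance may be terminated on interruption, so the workload
            # must be able to checkpoint its progress and be retried later.
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
```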
Using Lambda
We should do a cost analysis and identify a break-even RPM (requests per minute) below which it is beneficial to move a workload to Lambda, and above which it is better to keep it on instances.
Lambda is a perfect fit for infrequent events and non-near-real-time workloads.
To achieve this, the code should run equally well on Lambda and EC2.
While writing Lambda code, we should ensure it is optimal and not a long-running process, as that can become costlier.
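A back-of-the-envelope calculation can locate that break-even point. The sketch below is purely illustrative: the prices, the 512 MB memory size and the 200 ms duration are assumptions to be replaced with real figures from the provider's pricing page.

```python
# Sketch: rough Lambda vs. always-on instance break-even by request volume.
# All prices and workload numbers are illustrative assumptions, not current
# list prices - plug in real figures from your provider's calculator.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667   # assumed
LAMBDA_PRICE_PER_REQUEST = 0.0000002        # assumed
INSTANCE_MONTHLY_COST = 30.0                # assumed small always-on instance

MEMORY_GB = 0.5          # 512 MB function (assumed)
DURATION_SECONDS = 0.2   # 200 ms per invocation (assumed)
MINUTES_PER_MONTH = 30 * 24 * 60

def lambda_monthly_cost(requests_per_minute: float) -> float:
    requests = requests_per_minute * MINUTES_PER_MONTH
    compute = requests * MEMORY_GB * DURATION_SECONDS * LAMBDA_PRICE_PER_GB_SECOND
    return compute + requests * LAMBDA_PRICE_PER_REQUEST

for rpm in (1, 10, 50, 100, 500, 1000):
    cost = lambda_monthly_cost(rpm)
    cheaper = "lambda" if cost < INSTANCE_MONTHLY_COST else "instance"
    print(f"{rpm:>5} RPM -> lambda ~${cost:,.2f}/month ({cheaper} is cheaper)")
```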
Storage Optimization
Disk size optimisation
Every compute instance has storage associated with it. We should analyse our disk usage patterns for different applications and resize the disks accordingly, as disk requirements vary from application to application. Doing so helps us avoid oversized disks that incur unnecessary expenses.
Volume type
We should also choose cautiously which volume type to use for which microservice: based on the business use case and the service’s technical requirements, we can often pick a less costly volume type and save money.
Choose appropriate storage options
Considering business and consumer requirements, we should choose the storage option that suits the application’s needs. For example, for infrequently accessed data that requires long-term retention, we can opt for a more cost-efficient storage class such as Amazon S3 Glacier rather than keeping it in standard S3.
Retention guidelines
Every cloud service generates a lot of data, which it keeps adding to some central storage such as S3 or a file system. While configuring a service, we should always define retention policies to determine how long the data should be stored, considering regulatory requirements and consumer and business needs. If data must be kept for regulatory reasons but is not actively required by consumers or the business, moving it to a low-cost storage service via a retention policy is a viable option.
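Retention and tiering rules like this can be codified as an S3 lifecycle configuration. The sketch below assumes AWS and boto3; the bucket name, prefix and day counts are illustrative and should be aligned with the actual regulatory retention period.

```python
# Sketch: S3 lifecycle rule - move audit logs to Glacier, then expire them.
# Bucket name, prefix and day counts are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-audit-logs",          # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "audit/"},
                # Rarely accessed after 90 days -> cheaper archival tier.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # Delete once the assumed ~7-year retention period is over.
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```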
Unused data cleanup
Data cleanup is an ongoing process. We should regularly undertake tasks to identify and clean unused data or resources that are no longer needed. By doing this, we can free up storage space and reduce costs.
We should also consider whether a default retention policy should be applied to the data store, based on business and compliance requirements.
We can also have metrics in place to track, alert and suggest candidates for cleanup.
Snapshot/Image backup and cleanup
We should have a regular review process to determine whether snapshots and image backups are still required, and periodically delete unnecessary ones, retaining only the essential copies to reduce storage costs. We should also decide at the design phase which retention policy to apply.
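A periodic job along these lines can enforce that retention decision automatically. The sketch below assumes AWS and boto3; the 30-day window and the keep=true exemption tag are assumptions of this sketch, not an established convention.

```python
# Sketch: delete our own EBS snapshots older than a retention window,
# skipping any snapshot tagged keep=true (tag name is our own convention).
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")

RETENTION_DAYS = 30  # illustrative window
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

paginator = ec2.get_paginator("describe_snapshots")
for page in paginator.paginate(OwnerIds=["self"]):
    for snapshot in page["Snapshots"]:
        tags = {t["Key"]: t["Value"] for t in snapshot.get("Tags", [])}
        if snapshot["StartTime"] < cutoff and tags.get("keep") != "true":
            print(f"Deleting {snapshot['SnapshotId']} "
                  f"from {snapshot['StartTime']:%Y-%m-%d}")
            ec2.delete_snapshot(SnapshotId=snapshot["SnapshotId"])
```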
Differentiate production and non-production workloads
We should use different disk types for production and non-production environments to ensure better performance for highly concurrent production loads. The disk/storage type for non-production workloads should be low cost, as developers and testers mostly use it for internal testing.
The load-testing environment’s storage type should ideally be similar to production.
Database Cleanup
Purge and archive
We should regularly purge and archive outdated or irrelevant (from a business and consumer perspective) data from the database. This helps us reduce storage requirements and gain better performance for the application using those data stores.
Unused tables
As the application grows, we keep adding new features and removing unused or unwanted ones. While doing so, we should also remove the now-unused tables and columns from the main database. If required, we can move those tables to a backup database before doing the cleanup.
Backup instance type and size
Delayed replication and backup instances are the least used but are important for the application stack.
We can keep the instance type and size as minimal as possible so that it does not impact the purpose it is held for while not adding extra overhead from a cost perspective.
Non-prod DB instance type and size
Non-prod database instances used for functional testing and development work can be small and cost-effective, as we don’t expect them to cater to a production-type load.
We can also have a load-testing database that we procure on demand, since its size and configuration are expected to match the production instance, and once the load-testing work is done we can bring that instance down. In this case, we only pay for the period in which load testing is actually happening.
Serverless DB instance
We can check if switching to a serverless DB instance can be more cost-effective for our use case and the workload.
It can be a good option for:
- early-stage startups
- non-prod workloads
- less-utilised DB instances in the microservice ecosystem
Indexing
Indexing requirements keep changing due to the enhancements we make in the application, which requires us to add new indexes to the database. With new indexes coming into the picture, or due to changes in business logic, old indexes sometimes become stale. We should do periodic cleanup to remove these unused indexes and free up the space they consume. This also leads to enhanced performance and extra cost savings.
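For PostgreSQL (an assumption here; other engines expose similar statistics), the pg_stat_user_indexes view makes these candidates easy to find. The sketch below uses psycopg2 and a connection string from the environment; treat the output as a review list, since indexes backing constraints or rare batch jobs may legitimately show zero scans.

```python
# Sketch: list never-scanned (candidate unused) indexes in PostgreSQL.
# Assumes psycopg2 and a DSN in the DATABASE_URL environment variable.
# Review every candidate manually before dropping anything.
import os
import psycopg2

QUERY = """
SELECT schemaname, relname AS table_name, indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
"""

with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for schema, table, index, size in cur.fetchall():
            print(f"{schema}.{table}: {index} ({size}) has never been scanned")
```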
Application Code Optimization
Efficient code practices
We should encourage developers to write performant and optimised code. People sometimes think that increased application load can be handled simply by adding more server instances, but that is not a good solution: it leads to cost leakage and a non-performant system.
Instead, we should focus on paying down technical debt based on priority, and on tracking, releasing and benchmarking performance.
By doing so, we can absorb an increase in user load on the application with little or no increase in resource/server cost.
Release bench-marking
We should benchmark each release on CPU and memory utilisation to ensure that our CPU and memory requirements do not go up with a new release for the same user load.
If we see a breach, we should do an RCA (root cause analysis) and fix it.
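One lightweight way to do this is to compare the average CPU of the same service over comparable windows before and after the release. The sketch below assumes AWS CloudWatch and boto3; the Auto Scaling group name web-asg, the dates and the 10% regression threshold are placeholders.

```python
# Sketch: compare average ASG CPU before and after a release window.
# Group name, dates and the 10% threshold are illustrative assumptions.
from datetime import datetime, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

def average_cpu(start: datetime, end: datetime) -> float:
    datapoints = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-asg"}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Average"],
    )["Datapoints"]
    return sum(p["Average"] for p in datapoints) / max(len(datapoints), 1)

before = average_cpu(datetime(2023, 6, 1, tzinfo=timezone.utc),
                     datetime(2023, 6, 7, tzinfo=timezone.utc))
after = average_cpu(datetime(2023, 6, 8, tzinfo=timezone.utc),
                    datetime(2023, 6, 14, tzinfo=timezone.utc))

if after > before * 1.10:   # more than 10% higher for a similar user load
    print(f"CPU regression: {before:.1f}% -> {after:.1f}% - trigger an RCA")
else:
    print(f"Within budget: {before:.1f}% -> {after:.1f}%")
```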
Performance monitoring and profiling
We should always have monitoring, profiling and alerting tools for the application stack in all environments (production and non-production).
They help us identify performance bottlenecks and, in turn, optimise code for better resource utilisation.
This should be treated as an ongoing part of the operational workload.
Network Optimization
Cost-aware ingress/egress
While designing an application architecture, we should always consider how much data is expected to move inward and outward, as this has its own associated cost. If designed and tracked properly, the organisation can save a significant amount of money here.
We can do so by optimising network traffic and keeping track of ingress and egress data across network boundaries.
VPC Endpoint
We should use VPC endpoints for resources and scenarios wherever possible, so that we don’t end up paying extra for data ingress/egress over the internet.
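For example, a gateway endpoint keeps S3 traffic on the provider's network instead of routing it through a NAT gateway or the internet. A minimal sketch with boto3 follows; the VPC ID, route table ID and region are placeholders.

```python
# Sketch: create a gateway VPC endpoint for S3 so traffic avoids NAT/internet
# data-processing charges. All IDs and the region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",              # placeholder VPC
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],    # placeholder route table
)
```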
Regular Cleanup
Periodic resource review
Every resource in the cloud should be tagged with a business unit (BU). Every BU should have a cloud budget, and we should regularly monitor the resources that are up and running.
We should have rules to monitor, alarm and decommission unused resources.
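A simple recurring check can support these rules by flagging resources that lack the expected cost-allocation tags. The sketch below assumes AWS and boto3, and a required tag key of business-unit, which is purely our own naming convention; it prints running EC2 instances that are missing it.

```python
# Sketch: flag running EC2 instances missing the cost-allocation tag.
# The required tag key "business-unit" is our own convention.
import boto3

ec2 = boto3.client("ec2")
REQUIRED_TAG = "business-unit"

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        tag_keys = {t["Key"] for t in instance.get("Tags", [])}
        if REQUIRED_TAG not in tag_keys:
            print(f"{instance['InstanceId']} has no '{REQUIRED_TAG}' tag - "
                  f"owner unknown, flag for review/decommission")
```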
POC resource tear-down
For scenarios where we must invest in resources for POC purposes, we should have processes/tags in place to mark them as short-term so they can be decommissioned.
We should also monitor resources that are underutilised or not in use and plan their cleanup/tear-down via an approval cycle.
By doing so, we can also track budgets aligned for initiatives or POCs.
Resource tagging
All this is possible only if we define different types of tags for different purposes and tag every resource with one or more of them as needed.
Doing so can also help us track costs around different business units and cost centres.
License Cost
We should consider releasing licenses when they are not in use, or reusing the same license across different instances if they are expected to be in use at different times.
Cautious Selection of Services
Evaluate service options
We should select cloud services based on business need and refrain from using similar or overlapping services, as that can lead to cost leakage. If we have to do so during a POC or migration, we should decommission or tear down the redundant resources in a planned way.
Vendor comparison
We should consider a hybrid or multi-cloud architecture. By doing so, we can leverage the best from different cloud providers.
At times, a service is better or cheaper in one cloud than in another, but we can take advantage of that only if we have integration with both.
Disaster Recovery Strategy
Evaluate DR requirements
We should assess our disaster recovery needs based on business criticality and consumer impact, and based on this define our Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
We should define these at the resource and service levels to ensure our DR strategy is cost-effective without compromising resilience.
Conclusion
This article highlights the importance of cloud cost optimisation in today’s tech landscape and suggests incorporating various strategies, approaches, techniques, and considerations to control cloud costs effectively.
The article covers 31 aspects of cloud cost optimisation, emphasising that businesses can save up to 30-40% of their cloud infrastructure costs through cost optimisation practices.
Key factors include optimisation around computing, storage, network, database, application code, regular resource review, careful service selection, and disaster recovery strategy evaluation. By implementing these optimisation strategies, businesses can reduce cloud expenditure while maintaining performance and reliability.