Cloud expenditure optimization for cost efficiency

The roadmap and execution for cutting down cloud spending across Coupang engineering organizations

Coupang Engineering
Coupang Engineering Blog
5 min read · Mar 21, 2023

By Luke Travers & Amit Arora

This post is also available in Korean.

In this post, we share how the finance and engineering teams at Coupang have partnered over the past few quarters to build a roadmap for managing and optimizing cloud expenditure. We also detail how multiple engineering teams formed a Central team to further optimize our on-demand cloud spending.

The Central team’s efforts were narrowed down to the following three principles:

  • Budget allocation and its compliance
  • Savings as the goal
  • Focus areas of credibility, sustainability, and control in line with the company’s leadership principle of “Hate Waste”

Table of Contents

· Background & Challenges
· Stage 1: Forming a Central team
· Stage 2: Spending Less & Paying Less
    · Instance generation alignment
    · EMR
    · Storage
· Conclusion

Background & Challenges

As a company we were in a classic situation where:

  1. engineering teams were spending more than they needed on cloud, with little understanding of cloud efficiency.
  2. finance teams were struggling to understand what teams were spending on, and how to curb expenditures without impacting business growth.
  3. the leadership team did not have enough analytics on cloud spending.

To bring financial accountability to the cloud’s variable spending model, the leadership team supported the engineering teams by engaging the right people to find opportunities for efficiency and cost savings.

Stage 1: Forming a Central team

Our cloud infrastructure engineers and technical program managers collaborated as a Central team to identify a few initiatives for cloud spending efficiency.

The Central team collaborated with each domain team and helped them understand that while they are the owners of cloud usage, they must also take advantage of the cloud’s variable cost models. For instance, we helped one of our domain teams understand their data stored in Amazon S3 and how the storage structure could be optimized at rest. We also shed light on how we could use options such as AWS Spot Instances and ARM-based AWS Graviton, which dramatically reduced the cost of storing and processing data. The Central team made sure that the right analytics were available, helping teams make data-driven decisions based on the value of cloud.

The Central team understood the importance of the right analytics, tools, and processes in establishing cloud efficiency as a culture across the domain teams. To that end, we created custom dashboards using Amazon CloudWatch data processed through Amazon Athena, and we also used our BI (Business Intelligence) dashboards to process AWS CUR (Cost and Usage Reports) data. The finance team partnered with us from the other end and reinforced the importance of keeping the domain teams within their assigned monthly and quarterly budgets.
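To illustrate the kind of budget-compliance view described above, here is a minimal sketch in Python. It is not our actual dashboard logic: the row fields `team` and `unblended_cost`, the sample figures, and the budget values are all hypothetical stand-ins for rows that would come from a CUR query.

```python
from collections import defaultdict

def spend_by_team(rows):
    """Sum cost per team from CUR-like rows (plain dicts).

    The keys "team" and "unblended_cost" are hypothetical column names.
    """
    totals = defaultdict(float)
    for row in rows:
        team = row.get("team") or "untagged"  # group untagged spend together
        totals[team] += float(row["unblended_cost"])
    return dict(totals)

def budget_report(totals, budgets):
    """Return {team: (spend, budget, over_budget)}.

    Teams without an assigned budget get budget=None and are never flagged.
    """
    report = {}
    for team, spend in sorted(totals.items()):
        budget = budgets.get(team)
        over = budget is not None and spend > budget
        report[team] = (round(spend, 2), budget, over)
    return report

# Hypothetical sample data, for illustration only.
rows = [
    {"team": "search", "unblended_cost": "1200.50"},
    {"team": "search", "unblended_cost": "300.25"},
    {"team": "payments", "unblended_cost": "800.00"},
    {"team": "", "unblended_cost": "45.10"},
]
budgets = {"search": 1400.0, "payments": 1000.0}

report = budget_report(spend_by_team(rows), budgets)
for team, (spend, budget, over) in report.items():
    print(team, spend, budget, "OVER" if over else "ok")
```

Grouping untagged spend into its own bucket is a deliberate choice in this sketch: it makes unattributed cost visible instead of silently dropping it.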

Stage 2: Spending Less & Paying Less

The Central team, equipped with the right analytics and tools, focused on the following two interrelated yet contrasting optimization methods:

  1. Spending Less (Expenditure Reduction by Using Less): Automating the launch of AWS resources in non-production environments on an as-needed basis. This helped the company save 25% in costs on non-prod environments.
  2. Paying Less (Usage Reduction by Rightsizing): With the right data to analyze the usage patterns, the Central team worked closely with the domain teams across the company to manually eliminate unutilized and underutilized EC2 resources.
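The two methods above could be sketched roughly as follows. This is an illustration under stated assumptions, not our production tooling: the working-hours window (08:00 to 20:00 on weekdays) and the CPU thresholds (5% and 20%) are hypothetical values chosen only for the example.

```python
from datetime import datetime
from statistics import mean

def should_run(now, start_hour=8, end_hour=20):
    """Return True if a non-prod resource should be up at `now`.

    Assumed policy: weekdays only, within [start_hour, end_hour).
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

def classify_instance(cpu_samples, idle_below=5.0, under_below=20.0):
    """Label an instance from its CPU utilization samples (percent).

    The thresholds are hypothetical; real rightsizing would also look at
    memory, network, and disk before acting.
    """
    if not cpu_samples:
        return "no-data"
    avg = mean(cpu_samples)
    if avg < idle_below:
        return "unutilized"
    if avg < under_below:
        return "underutilized"
    return "ok"

# Example usage with made-up inputs.
print(should_run(datetime(2023, 3, 21, 10)))   # a Tuesday morning
print(classify_instance([1.0, 2.0, 3.0]))
print(classify_instance([50.0, 70.0]))
```

A scheduler would call something like `should_run` periodically to decide whether to start or stop non-production resources, while the classification output would feed a review with the owning domain team before anything is removed.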

With usage optimization and cost savings as our main goals, the following initiatives helped save millions of dollars (On-Demand cost) in 2021 on AWS Cloud.

The optimization techniques we adopted not only helped us save on costs, but also gave us access to more efficient cloud resources. Based on the best practices and recommendations from AWS, we implemented the following initiatives.

Instance generation alignment

We wanted to bring every single instance in Coupang up to the current generation for improved performance, lower cost, and higher availability. This required extensive collaboration with the domain teams, who tested each and every instance type for us, as well as exploring different chip architectures such as AMD and ARM. After this arduous testing process, we successfully moved entire families of internal products onto AMD CPUs, gaining 20% better price performance compared to the older generations.
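For readers unfamiliar with the metric, price performance can be expressed as work done per dollar. The sketch below shows one way to compute the relative gain between two instance generations. The throughput and hourly price figures are hypothetical and were chosen only so that the arithmetic lands near 20%; they are not our measured results.

```python
def price_performance(throughput, hourly_price):
    """Work done per dollar, e.g. requests per second per USD per hour."""
    return throughput / hourly_price

def improvement_pct(old, new):
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0

# Hypothetical benchmark numbers, for illustration only.
old_pp = price_performance(throughput=1000.0, hourly_price=0.10)
new_pp = price_performance(throughput=1080.0, hourly_price=0.09)

gain = improvement_pct(old_pp, new_pp)
print(f"{gain:.1f}% better price performance")
```

Note that the gain combines two effects: a modest throughput increase and a lower hourly price. Either one alone would produce a smaller improvement.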

EMR

We love using Spot Instances for EMR, but as we all know, it can be difficult to get ideal capacity at peak times using Spot. To address this, we upgraded our software versions to take advantage of the instance fleets feature of Amazon EMR. Because our EMR systems are intricate and tightly integrated, we had to carefully ensure that each part of our toolchain was updated without interrupting our service. This upgrade reduced our total EMR costs by 25%.
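As a rough illustration of what an instance fleet looks like, the sketch below builds a task fleet definition in the shape accepted by the EMR API (for example, as an entry in `Instances={"InstanceFleets": [...]}` when calling boto3's `run_job_flow`). The instance types, capacities, timeout, and the helper name `task_fleet` are assumptions made for the example, not our actual configuration.

```python
def task_fleet(instance_types, spot_units, on_demand_units=0, timeout_minutes=10):
    """Build an EMR task instance fleet definition as a plain dict.

    Listing several instance types lets EMR fulfil Spot capacity from
    whichever pools are available instead of depending on a single type.
    """
    return {
        "Name": "task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": on_demand_units,
        "TargetSpotCapacity": spot_units,
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # If Spot capacity is not provisioned in time, fall back
                # to On-Demand rather than failing the cluster launch.
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }

# Hypothetical instance types, for illustration only.
fleet = task_fleet(["m5.xlarge", "m5a.xlarge", "m4.xlarge"], spot_units=8)
print(fleet["TargetSpotCapacity"], len(fleet["InstanceTypeConfigs"]))
```

Building the definition as a plain dict keeps the example runnable without AWS credentials; an actual cluster launch would pass it to the EMR API alongside master and core fleets.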

Storage

In this section, we discuss how we managed to cut our AWS storage costs for EBS and S3.

  • Amazon EBS: We found that extensive testing was required to gain our internal customers’ trust that gp3, the newer generation of EBS volumes, would continue to meet our needs. To gain this trust, we ran extensive performance tests using various tools. With all tests done in our development account first, we found that we could comfortably migrate 500–1000 live volumes at a time in parallel without any tangible impact.
  • Amazon S3: We moved more than 50 PB to Intelligent-Tiering (IT). During the process, we learned the hard way that not all workloads work well with IT, and you need to be very careful with object size. If the average object size is too small and you have multiple billions of objects, you can end up drastically increasing your overall S3 costs for that workload. In that case, S3 lifecycle filters are required to tune the policy. This is not a ‘one and done’ process but an ongoing cycle that requires extensive time, care, and attention to avoid falling afoul of S3’s complex billing patterns.

Figure 1. Crossing trend of increasing Amazon S3 usage versus decreasing AWS cost per size. The result of Coupang’s optimization effort shows AWS cost falling as Amazon S3 usage increases.
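To make the object-size caveat for Intelligent-Tiering concrete, here is a simplified sketch. It models only two effects: a per-object monitoring fee and a per-GB saving for data that moves to a cheaper tier. Every price and ratio below is a placeholder assumption, not current AWS pricing, and the model ignores provider-specific rules for very small objects. The lifecycle rule uses the `ObjectSizeGreaterThan` filter from the S3 lifecycle configuration; the rule ID and the 128 KiB threshold are illustrative.

```python
def monthly_it_delta_usd(object_count, avg_object_bytes,
                         monitoring_fee_per_1000=0.0025,  # placeholder price
                         savings_per_gb=0.01,             # placeholder price
                         share_tiered_down=0.5):          # assumed cold share
    """Estimated monthly cost change from Intelligent-Tiering.

    Positive means the workload costs MORE after the move.
    """
    monitoring = object_count / 1000 * monitoring_fee_per_1000
    total_gb = object_count * avg_object_bytes / 1024**3
    savings = total_gb * share_tiered_down * savings_per_gb
    return monitoring - savings

def size_filtered_rule(min_bytes=128 * 1024):
    """Lifecycle rule that only transitions objects above `min_bytes`."""
    return {
        "ID": "intelligent-tiering-large-objects",  # hypothetical rule ID
        "Status": "Enabled",
        "Filter": {"ObjectSizeGreaterThan": min_bytes},
        "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
    }

# Billions of tiny objects: the per-object fee dominates.
small = monthly_it_delta_usd(5_000_000_000, 16 * 1024)
# Fewer, larger objects: the storage saving dominates.
large = monthly_it_delta_usd(5_000_000, 64 * 1024**2)
print(f"small objects: {small:+.0f} USD/month, large objects: {large:+.0f} USD/month")
```

The rule dict is in the shape expected inside `LifecycleConfiguration={"Rules": [...]}` for boto3's `put_bucket_lifecycle_configuration`. In practice the size threshold would be tuned per workload using real pricing and access patterns.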

Conclusion

By adopting the methods above, we were able to reduce our AWS costs by millions of dollars (On-Demand) in 2021. Much of the process was manual, focusing on identifying the right optimization areas and acting on them. We are still working hard to identify additional areas for cost savings and improved efficiency.

Although we managed to save the company millions in cloud costs, we are not done yet. As next steps, we plan to invest in more sophisticated analytics tools to drive the Cloud FinOps mindset at Coupang. Additionally, we will be automating some of the monitoring and analytics processes required for cost optimization in the cloud.

If you are a passionate engineer with a deep understanding of cloud optimization and cost efficiency, join our talented team.
