Cloud Journey: Taming The Cost Dragon

Saurabh Shashank
Walmart Global Tech Blog
4 min read · Sep 20, 2023

Cloud cost must be one of the most talked-about subjects in tech corridors. As organisations invest heavily in cloud infrastructure, cloud cost has started haunting operational budgets. Experimentation, resources that are easy to spin up and use, improper IAM (Identity & Access Management), and a lack of monitoring are just the tip of the iceberg that leads to excessive cost. The organisation with a cloud roadmap will not win the cloud race; the one with a cloud cost roadmap will.

Our journey with the cloud has been no different. As the International Data Team (IDT), we have hundreds of clusters spinning up, churning, and storing terabytes of insights every hour.

We went into war mode and talked about various plans of attack. We broadly classified these strategies as:

  • Identify and Patch Leaks
  • Clean and Decommission
  • Inhouse Tool Development
  • Initialization Scripts

The plan was to execute the above strategies across the three tracks below, in parallel.

  1. Compute Track
  2. Storage Track
  3. Orchestrator Track

All strategy implementations described here are for Google Cloud.

Long Running Cluster Report (LRCR): an in-house tool to identify and report cluster runtime, running jobs, cost, and other details.

LRCR is written in Python using Airflow Google operators and the Google Cloud Python SDK (Software Development Kit), and is scheduled in Airflow. Its Cluster Util step lists clusters with all KPIs (Key Performance Indicators) for the provided Google team space.
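The post doesn't include the LRCR source, so here is a minimal sketch of what the Cluster Util step could look like, assuming the google-cloud-dataproc client and a hypothetical project/region pair:

```python
from google.cloud import dataproc_v1


def list_cluster_kpis(project_id: str, region: str):
    """List Dataproc clusters in a project/region with a few basic KPIs."""
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    kpis = []
    for cluster in client.list_clusters(request={"project_id": project_id, "region": region}):
        config = cluster.config
        kpis.append({
            "name": cluster.cluster_name,
            "state": cluster.status.state.name,
            "state_since": cluster.status.state_start_time,
            "master_machine": config.master_config.machine_type_uri,
            "workers": config.worker_config.num_instances,
            "secondary_workers": config.secondary_worker_config.num_instances,
        })
    return kpis
```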

Estimate Cost is a basic arithmetic logic block that computes cost as a function of the master, worker, and secondary worker nodes.
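The exact formula isn't shared in the post; a sketch of such an arithmetic block might look like the following, with hourly rates that are placeholders rather than real GCP pricing:

```python
# Placeholder hourly rates (USD) per machine type; real pricing should come
# from the GCP pricing API or a billing export. These values are illustrative.
HOURLY_RATE = {"n1-standard-8": 0.38, "n2d-standard-8": 0.34, "e2-standard-4": 0.13}


def estimate_cluster_cost(master_type, num_masters, worker_type, num_workers,
                          num_secondary_workers, hours_running):
    """Estimate cluster cost as a function of master, worker, and secondary workers."""
    hourly = (HOURLY_RATE.get(master_type, 0.0) * num_masters
              + HOURLY_RATE.get(worker_type, 0.0) * (num_workers + num_secondary_workers))
    return hourly * hours_running
```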

Finally, an HTML table is generated for the mailer report.
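One way to turn the KPI rows into an HTML table for the mailer body (a sketch; the actual report columns and styling are not shown in the post):

```python
def to_html_table(rows):
    """Render a list of KPI dicts as a simple HTML table for the mailer body."""
    if not rows:
        return "<p>No long-running clusters found.</p>"
    headers = "".join(f"<th>{col}</th>" for col in rows[0])
    body = "".join(
        "<tr>" + "".join(f"<td>{row[col]}</td>" for col in row) + "</tr>" for row in rows
    )
    return f"<table border='1'><tr>{headers}</tr>{body}</table>"
```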

We have implemented the kill-switch version of LRCR in all our Dev projects. The kill switch auto-deletes clusters that keep running idle beyond a certain lifetime.
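A minimal sketch of what such a kill switch could look like, reusing the listing logic above; the actual lifetime threshold and any exclusion rules the team applies are not described in the post:

```python
from datetime import datetime, timedelta, timezone

from google.cloud import dataproc_v1

MAX_LIFETIME = timedelta(hours=8)  # illustrative threshold, not the team's actual value


def kill_long_running_clusters(project_id: str, region: str):
    """Delete Dev clusters whose current state has lasted longer than MAX_LIFETIME."""
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    now = datetime.now(timezone.utc)
    for cluster in client.list_clusters(request={"project_id": project_id, "region": region}):
        age = now - cluster.status.state_start_time
        if age > MAX_LIFETIME:
            client.delete_cluster(request={
                "project_id": project_id,
                "region": region,
                "cluster_name": cluster.cluster_name,
            })
```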

House in order: We utilized various in-house tools and GCP (Google Cloud Platform) Monitoring to identify dormant projects, storage, compute, and VMs (Virtual Machines). A decommissioning and clean-up drive was kicked off: tens of VMs, a single-digit number of projects, and more than 300 TB of storage were deleted. We also downgraded VMs, archived files, and optimized Dataproc clusters, guided by a handful of helpful cloud metrics.
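The post doesn't show the exact queries used, but dormant VMs can be flagged with the Cloud Monitoring API using the standard CPU-utilization metric; the look-back window and threshold below are illustrative:

```python
import time

from google.cloud import monitoring_v3


def find_idle_instances(project_id: str, days: int = 30, cpu_threshold: float = 0.05):
    """Return instance IDs whose daily mean CPU utilization stayed below the threshold."""
    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - days * 86400}, "end_time": {"seconds": now}}
    )
    aggregation = monitoring_v3.Aggregation(
        {"alignment_period": {"seconds": 86400},
         "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_MEAN}
    )
    series = client.list_time_series(request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "aggregation": aggregation,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    })
    idle = []
    for ts in series:
        if all(point.value.double_value < cpu_threshold for point in ts.points):
            idle.append(ts.resource.labels.get("instance_id"))
    return idle
```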

Orchestrator: Our team schedules all jobs with Airflow. We identified that task instantiation was taking around 2 minutes, which meant compute was billed for those 2 minutes of idle time. The configuration was changed to bring the instantiation time down by half. Clean-up drives also helped us remove scheduled jobs in development deployments that were leaking cost on an hourly basis.
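The post doesn't say which settings were changed; the snippet below only illustrates the kind of Airflow scheduler options commonly tuned to reduce task start latency. The option names are real [scheduler] settings, but the values are examples, not the team's actual configuration:

```ini
[scheduler]
# How often the scheduler loop runs; lower values pick up queued tasks sooner.
scheduler_heartbeat_sec = 5
# How often (in seconds) each DAG file is re-parsed for changes.
min_file_process_interval = 30
# How many parallel processes the scheduler uses to parse DAG files.
parsing_processes = 4
```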

Effective computing: The team introduced the practice of creating cost-effective compute for every future workload. An initialization script was created with the following defaults (a sketch of such a creation step follows this list):

Cluster Scheduled Deletion
Setting an idle-cluster delete time, together with the LRCR kill switch, helped us achieve significant cost savings on Dataproc compute clusters.

Machine type for specific cases
The decision was to move away from high-cost N1 machines to comparatively cheaper N2D and E2 machines, based on specific use cases.

Effective storage for the respective job families (Extract, Transform & Load)
The default boot disk size was reduced from 500 GB to 200 GB.

High Availability Mode
Development, non-SLA, and batch jobs were moved off high-availability clusters onto single-master-node clusters.
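Putting those defaults together, here is a sketch of what such a cost-effective creation step might look like with the Dataproc Python client. The cluster name, machine shapes, and TTL values are illustrative examples of the settings listed above, not the team's exact script:

```python
from google.cloud import dataproc_v1


def create_cost_effective_cluster(project_id: str, region: str, name: str):
    """Create a single-master Dataproc cluster with idle/scheduled deletion and smaller boot disks."""
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "cluster_name": name,
        "config": {
            # Single master (no high availability) for dev / non-SLA / batch jobs.
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": "e2-standard-4",
                "disk_config": {"boot_disk_size_gb": 200},  # reduced from the 500 GB default
            },
            "worker_config": {
                "num_instances": 2,
                "machine_type_uri": "n2d-standard-8",  # cheaper alternative to N1
                "disk_config": {"boot_disk_size_gb": 200},
            },
            # Scheduled deletion: auto-delete after a maximum age, or sooner when idle.
            "lifecycle_config": {
                "idle_delete_ttl": {"seconds": 2 * 3600},
                "auto_delete_ttl": {"seconds": 12 * 3600},
            },
        },
    }
    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    return operation.result()
```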

Conclusion

All the above strategies, implemented across all three tracks, have helped bring down the cost graph and keep it stable. As a team, we succeeded in turning the graph from a fire-breathing dragon into a sleeping one. Cost saving is a continuous journey with timely milestones; we have hit a couple of goals and continue to work on cost-saving 2.0.

Happy Reading

Saurabh Shashank
Staff Data Engineer, International Data @Walmart Global Tech