Optimise Cloud Composer 1 Costs — From $188/day to $11/day

Yugen.ai
Yugen.ai Technology Blog
8 min read · Feb 14, 2024

By Dharmateja Y, Rishu Roshan and Akshay Singh

Introduction

One of our teams at Yugen.ai was involved in building a data warehouse to store usage and telemetry data for ~4M users daily. Our pipeline, at a very high level, looked like the following -

  1. Extract telemetry and usage data from different APIs and persist it to object storage (Google Cloud Storage)
  2. Apply various transformations and clean-ups to the data, then ingest it into Google BigQuery
  3. Run aggregation pipelines to create P0 and P1 metrics that power dashboards consumed regularly by VPs and Managers.

In our exploration of Cloud Composer deployments, we uncovered several key insights into optimizing costs, ranging from version selection and control plane management to efficient data transfer strategies. Here, we share recommendations based on our experiences for new deployments aiming to achieve cost-effectiveness in Cloud Composer environments.

The content below applies to Cloud Composer 1. Some observations may carry over to Cloud Composer 2 as well.

Challenges

Our cost estimate during the system design phase was between $40 and $60/day. A couple of weeks post deployment, the actual costs exceeded $200/day. The biggest contributor was Cloud Composer, which averaged around $188/day. We decided to take a look at the cost breakdown.

Cloud Composer 1 cost breakdown (initial deployment)

~95% of the costs came from network data transfers. Notably, this data transfer cost did not include the volume of data being processed by the live DAGs.

Solutions

Phase 1 — Avoid Multi-region Buckets

Our initial hypothesis, based on the description of the line item, was that some DAGs were writing data to GCS buckets that were in a different region, or were multi-region buckets. This was something we could easily verify and fix, and it also seemed the most likely culprit based on other resources and Google documentation we went through. We moved our infrastructure to a same-region configuration (e.g., buckets were changed from multi-region to single-region, write endpoints were moved into the same region). After making this change, we monitored the system for a week and concluded that it reduced the average Cloud Composer cost per day from $188 to $96 (a reduction of ~49%).
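As a quick way to spot offending buckets, a simple heuristic can help. The sketch below is plain Python with made-up bucket names; in practice the locations would come from `gcloud storage buckets describe` or the GCS client library. It relies on the convention that regional GCS locations contain a dash (e.g. `us-central1`), while multi-region (`US`, `EU`) and dual-region (`NAM4`) codes do not.

```python
# Sketch: flag buckets whose location code suggests multi-/dual-region storage.
# Bucket names and the dash heuristic are illustrative assumptions, not an API.

def is_single_region(location: str) -> bool:
    """Regional GCS locations contain a dash, e.g. 'us-central1'."""
    return "-" in location

buckets = {
    "telemetry-raw": "US",                 # multi-region -> migration candidate
    "telemetry-curated": "us-central1",    # regional -> fine
}

to_migrate = [name for name, loc in buckets.items() if not is_single_region(loc)]
print(to_migrate)  # buckets worth moving into the pipeline's region
```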

Cloud Composer 1 cost breakdown (moving to single region bucket)

Phase 2 — Optimize Airflow Metadata DB reads

We still had some more ground to cover in terms of bringing our costs down. Further analysis showed that the network transfer out of the region coincided with Airflow’s metadata database reads. The metadata DB is a core component of Airflow; it stores information such as the configuration of your Airflow environment’s roles and permissions, as well as all metadata for past and present DAG and task runs. Here’s the ERD schema of the DB.

We started looking into read patterns on the metadata DB (where reads came from, and how often). By examining how our DAGs and their schedules were defined, we found that part of the database read spikes coincided with a specific DAG.

In the case of this DAG, there were:

  • 6 instances of the DAG running in parallel
  • Multiple tasks per instance, with a list of ~150k hex strings passed to downstream tasks. Though XComs can be used to share information between tasks, they are best used sparingly.
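To get a feel for why this matters, here is a rough, Airflow-free illustration of the payload sizes involved. The ~150k hex strings are simulated, and the GCS URI is a made-up example.

```python
import json
import secrets

# Simulate the XCom payload we were pushing: ~150k 32-char hex strings,
# serialized the way an XCom value would be (JSON).
hex_ids = [secrets.token_hex(16) for _ in range(150_000)]
payload_bytes = len(json.dumps(hex_ids).encode())

# The alternative: push only a pointer to the data (illustrative path).
gcs_uri = "gs://example-bucket/dag_runs/2024-02-14/ids.json"
uri_bytes = len(gcs_uri.encode())

print(f"list payload: ~{payload_bytes / 1e6:.1f} MB, URI: {uri_bytes} bytes")
```

Multiplied across parallel DAG instances and repeated task reads, the difference between megabytes and a few dozen bytes per XCom shows up directly in metadata DB traffic.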

By changing the application logic to read data from files written to GCS, and passing only the location of the file between tasks, we significantly reduced the amount of data being written to and read from the metadata database.
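The pattern itself can be sketched in plain Python, with no Airflow dependency. A temporary local file stands in for the GCS object, and the function names are illustrative, not our actual task code.

```python
import json
import tempfile
from pathlib import Path

# Upstream task: write the ID list to storage, return only its location.
def produce_ids(storage_dir: str) -> str:
    ids = ["a1b2", "c3d4", "e5f6"]           # stand-in for ~150k hex strings
    path = Path(storage_dir) / "ids.json"
    path.write_text(json.dumps(ids))
    return str(path)                          # only a short path crosses tasks

# Downstream task: load the data on demand from the location it was given.
def consume_ids(path: str) -> int:
    ids = json.loads(Path(path).read_text())
    return len(ids)

with tempfile.TemporaryDirectory() as d:
    location = produce_ids(d)                 # the "XCom" is now a short string
    print(consume_ids(location))
```

In the real DAG, `produce_ids` would write to a GCS path and `consume_ids` would read it back with the GCS client; the XCom carries only the URI.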

After this change our cost breakdown looked like this -

Cloud Composer 1 cost breakdown (optimised Airflow Metadata DB reads)

And finally, we were able to see the chart we’d been hoping for 🎉 🎈

The entire cost optimisation journey

Cost Optimisation Guidelines

Based on our experience, we have some basic recommendations for Cloud Composer setups as well as DAGs running on Cloud Composer. For new deployments of Cloud Composer, consider the following cost-saving recommendations:

  1. Control Plane Optimization: Recognize that control plane components often present challenges in terms of optimization. Focus on aspects where you have control, such as configuration choices and DAG execution patterns, to effectively manage and reduce costs.
  2. Single-Region Deployment: Opt for a single-region deployment over a multi-region setup if costs matter more than fault tolerance.
  3. Data Transfer Considerations: Be mindful of data transfer costs, especially associated with egress when moving data between regions or services. Minimize unnecessary data movement and assess whether it is essential to transfer information across tasks within a DAG.
  4. DAG Design Optimization: When designing DAGs, focus on efficiency. Considerations such as parallel execution of DAG instances and the nature of data exchanged between tasks can significantly impact costs. Minimize unnecessary data transfers within DAGs using Xcoms.
  5. Utilize Google Cloud Storage (GCS): To reduce costs related to data transfer, especially via Xcoms, store information in GCS and pass only the location details to downstream tasks. This approach can substantially cut down on egress costs and overall data transfer expenses.
  6. Cost Monitoring and Adjustment: Regularly monitor costs and analyze billing details. Identify any unexpected cost drivers and adjust configurations accordingly. Implementing cost-aware practices can lead to ongoing savings and improved resource utilization.

Focusing on storage costs and logs usually yields quick wins for such pipelines. Some suggestions are mentioned below -

Optimise Object Storage Costs

Optimizing storage costs is crucial for maintaining a cost-effective cloud infrastructure. In our case, we made several optimizations specifically to reduce storage costs.

  1. Choosing the right region: GCP offers different storage location types, e.g. single-region, dual-region, or multi-region storage. Each type has different trade-offs; for example, multi-region storage has higher availability than single-region storage, but the costs are higher, especially as network egress out of the region adds a significant amount.
  2. Choosing the right storage class: Similarly, there are different storage classes available, e.g. Standard, Nearline, Coldline, and Archive, each with its own trade-offs. For example, Standard storage has no minimum storage duration or retrieval fees, while the other classes do.

In our specific case, roughly 10 TB of new data was being added per month. Multi-region storage was not an option: network egress costs would have been very high, on top of replication costs. The replication charge was $0.02 per GB, which alone would have added ~$200 of incremental cost per month. Single-region storage was therefore the better alternative, especially since the availability of a single-region bucket is only 0.05% lower than that of multi-region.
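The arithmetic behind that estimate, using the assumed $0.02/GB replication rate on ~10 TB of new data per month:

```python
# Back-of-the-envelope check of the replication cost cited above.
new_data_gb = 10 * 1024          # ~10 TB/month, in GB (binary convention)
replication_per_gb = 0.02        # USD/GB, assumed multi-region replication rate
monthly_cost = new_data_gb * replication_per_gb
print(f"${monthly_cost:.0f}/month")  # roughly the $200/month quoted above
```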

Similarly, we evaluated the different storage classes, but weighing the cost savings against the trade-offs in performance and restrictions, Standard storage was a good enough solution. If we ever need to cut costs further, the plan is to move most of the old data (> 1 year) from Standard to Archive storage.

For storage consider the following high level recommendations:

Choose the Right Storage Class

GCP offers various storage classes, each designed for different use cases. Evaluate your data access patterns and choose the appropriate storage class.

Implement Lifecycle Management

  • Leverage object lifecycle management to automatically transition data to lower-cost storage classes or delete unnecessary data.
  • If the data is being used for reporting purposes mostly, just retaining the aggregated data and deleting raw data will be helpful.
  • Set up time-based rules to move data from Standard to Nearline or Coldline Storage based on its age and access patterns.
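As an illustration, such time-based rules can be expressed as a lifecycle configuration and applied with `gsutil lifecycle set config.json gs://BUCKET`. The ages, classes, and `raw/` prefix below are hypothetical examples, not our exact policy:

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}
    },
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 365}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 730, "matchesPrefix": ["raw/"]}
    }
  ]
}
```

The last rule pairs with the point above about raw vs. aggregated data: once only the aggregates are needed for reporting, the raw objects can be deleted automatically.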

Utilise Multi-Regional or Regional Storage

  • Choose the appropriate storage replication option based on your availability and durability requirements.
  • Multi-Regional Storage provides redundancy across multiple geographic locations, while Regional Storage replicates data within a single region.
  • On the flip side single region storage is cheaper and does not involve network egress costs.

Logs & Cloud monitoring

Logging and monitoring are an important part of the pipeline. However, overdoing it by tracking and logging every small event in the code can also run up the bill, since GCP charges for the volume of log data ingested. Limiting the logs to important events and processes, and removing redundant information or summarizing events, can help you manage costs without compromising much on monitoring and debugging. We’ve found the following considerations to be helpful -

  1. How new is the pipeline? — If the pipeline is relatively new and still a work in progress, with modifications and improvements ongoing, then keeping detailed logs helps. Once the pipeline has stabilized and the room for changes is limited, logging should be cut back.
  2. How frequently do outages/silent errors occur? — If the code breaks frequently, a detailed log can save time in debugging, especially for silent errors. If failures are rare, keeping a detailed log serves little purpose.
  3. Handling Redundant Information — In our case, each task executed roughly 150 batches sequentially, with every batch going through the same process. Instead of logging every step of the process, we logged important checkpoints, and we logged summary statistics of the process or the data rather than the raw information. For example, instead of logging the start and end time of each batch, log just the runtime.

Conclusion

For organizations navigating the complexities of cloud cost management, especially with managed Airflow solutions, it can be tempting to ship and release your pipelines quickly. You may not have as much flexibility and visibility as with self-hosted Airflow (say, in your GKE or EKS clusters) backed by a managed production-grade Postgres as the metadata DB. Therefore, thoughtful design and regular review of your costs can offer avenues to become operationally efficient.

At Yugen.ai, we work on creating large-scale, reliable and cost-effective data and ML pipelines for our customers. We help our clients’ engineering teams build:
1. Pre-processing data pipelines
2. Online & Offline feature stores
3. Real-time ML feature ingestion pipelines
4. Infra management
amongst many other challenging problem statements. To connect with us, reach out to us at foundersdesk[at]yugen[dot]ai or connect with us on LinkedIn — Aayush Agrawal, Akshay Singh, Kumar Sanjog, Soumanta Das.
