Cut Down the Cost of Running Google Cloud Dataflow

Reduce the cost of running a batch/streaming Cloud Dataflow job with these simple tricks

R.Rakesh
Analytics Vidhya
3 min read · Nov 27, 2019


Google Cloud Dataflow is one of the standout products in the big data stack and one of the most powerful processing engines available. It is based on the open-source Apache Beam framework and supports processing of both batch and streaming data at scale.

It is a fully managed service for big data processing at scale, so you never have to manage the infrastructure the pipelines run on. However, we do have configuration options at our disposal to alter the infrastructure provisioned for a specific batch/streaming job, and these can help us reduce the cost significantly.

So let’s look at some of them here.

Keeping GCP Services within the same Region

This is a very common mistake we all make while creating GCP services. Try to keep all the services a pipeline touches in the same region to avoid network egress charges.

For example, the source files may sit in a Cloud Storage bucket located in a different region from the one where the Dataflow job is running. This adds network transfer charges to the bill; by making sure that all services are in the same region you avoid them entirely, as data transfer within the same region is free in almost all GCP regions.

By default, Dataflow jobs are submitted to and executed in the us-central1 region if no region is specified in the pipeline configuration.

To set the region while deploying the Dataflow pipeline, add the parameter below, e.g. --region=us-east1

Parameter detail for region
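
As an illustration, here is a minimal sketch of setting the region through the Apache Beam Python SDK's pipeline options; the project id, bucket and file paths are placeholders of my own, not from the original post.

    # Minimal sketch (Apache Beam Python SDK). Project id, bucket and
    # paths are placeholders; keep the bucket and the job in the same
    # region to avoid cross-region transfer charges.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",            # placeholder project id
        region="us-east1",                   # same region as the data bucket
        temp_location="gs://my-bucket/tmp",  # bucket located in us-east1 as well
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
         | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result"))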

Disk Size

The default disk size is 250 GB per worker for a batch Dataflow pipeline and 400 GB per worker for a streaming pipeline. In most cases the data is not stored on the workers at all; it resides in a GCS bucket for batch jobs or arrives via Pub/Sub for streaming events, which makes the storage attached to the workers a wasted resource that you still pay for.

Reduce this to the recommended minimum size of 30 GB. With this configuration change you will be able to save roughly $8–10/month/worker on batch pipelines and $15–20/month/worker on streaming pipelines.

Batch Dataflow pipeline estimate (250 GB vs 30 GB persistent disk)

To set the disk size while deploying the Dataflow pipeline, add the parameter below, e.g. --disk_size_gb=30

Parameter detail for disk size
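
Along the same lines, a sketch of how the worker disk size could be capped at 30 GB via the Python SDK options; the project id and bucket are again placeholders.

    # Sketch: cap the boot disk of each Dataflow worker at 30 GB.
    # Placeholder project id and bucket as before.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",
        region="us-east1",
        temp_location="gs://my-bucket/tmp",
        disk_size_gb=30,  # equivalent to passing --disk_size_gb=30
    )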

Disable Public IPs

By default, the Dataflow service assigns the workers of your pipeline both public and private IP addresses, just as happens when you create a Compute Engine VM.

Each public IP address in use adds to the network cost and increases your monthly bill.

If there is no requirement to reach the pipeline workers from outside Google Cloud, you can disable public IPs while deploying the pipeline and save a few bucks on network costs.

To disable public IPs while deploying the Dataflow pipeline, add the flag below, e.g. --no_use_public_ips

Parameter detail for Public IPs
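
A sketch of the corresponding Python SDK options follows; note that workers without public IPs need a subnetwork with Private Google Access enabled so they can still reach GCP APIs, and the subnetwork path below is a placeholder.

    # Sketch: run Dataflow workers with private IPs only. The subnetwork
    # must have Private Google Access enabled; names are placeholders.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",
        region="us-east1",
        temp_location="gs://my-bucket/tmp",
        subnetwork="regions/us-east1/subnetworks/my-subnet",  # placeholder
        use_public_ips=False,  # equivalent to passing --no_use_public_ips
    )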

That’s all for now! Please follow these tricks and cut down on your Dataflow costs.

If this post was helpful, please leave a comment below and share it to help others find it.
