Cut Down the Cost of Running Google Cloud Dataflow
Reduce the cost of running a batch or streaming Cloud Dataflow job with these simple tricks
Google Cloud Dataflow is one of the standout products in the big data stack and one of the most powerful processing engines available. It is based on the open-source Apache Beam framework and supports processing both batch and streaming data at scale.
It's a fully managed service, so you never need to manage the infrastructure your pipelines run on. However, we do have configuration options at our disposal to alter the infrastructure provisioned for a specific batch or streaming job, and these can help reduce the cost significantly.
So let’s look at some of them here.
Keeping GCP Services Within the Same Region
This is a very common mistake we all make when creating GCP services. Try to keep them in the same region to avoid ingress/egress costs.
For example, if the source files live in a bucket in a different region from the one where the Dataflow job runs, you will pay additional network transfer charges. By making sure that all services are in the same region you can avoid these charges entirely, since transfer within the same region is free in almost all GCP regions.
By default, Dataflow jobs are submitted and executed in the us-central1 region if no region is specified in the pipeline configuration.
To set the region while deploying the Dataflow pipeline, add the following parameter, e.g. --region=us-east1
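As a minimal sketch, assuming a Beam Python pipeline in my_pipeline.py with a project and staging bucket of your own (all names below are placeholders), the launch command pins the region like this:

```shell
# Launch a Beam Python pipeline on Dataflow, pinning the region so the
# workers run in the same region as the data. The script, project, and
# bucket names are placeholders.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-east1 \
  --temp_location=gs://my-bucket/tmp
```

Pick the region where your source bucket (or Pub/Sub-consuming services) already live, not the other way around.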
Disk Size
The default disk size is 250 GB for a batch Dataflow pipeline and 400 GB for a streaming one. In most cases the data files are not stored on the cluster at all; they reside in a GCS bucket for batch jobs or arrive via Pub/Sub for streaming jobs, which makes the storage attached to the cluster a wasted resource with a cost attached to it.
Reduce this to the recommended minimum size of 30 GB. This one configuration change saves roughly $8–10/month/worker on batch pipelines and $15–20/month/worker on streaming pipelines.
To set the disk size while deploying the Dataflow pipeline, add the following parameter, e.g. --disk_size_gb=30
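A sketch of the same kind of launch command with the worker boot disk shrunk to the 30 GB minimum (script, project, and bucket names are placeholders):

```shell
# Launch with a 30 GB worker boot disk instead of the 250/400 GB default.
# All names below are placeholders.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-east1 \
  --disk_size_gb=30 \
  --temp_location=gs://my-bucket/tmp
```

If your pipeline does heavy shuffling on worker disks, test before committing to the minimum; for pipelines that read from GCS or Pub/Sub and keep working data in memory, 30 GB is usually plenty.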
Disable Public IPs
By default, the Dataflow service assigns each worker VM in your pipeline both a public and a private IP address, the same as when you create a Compute Engine VM.
Reserving a public IP address adds to your network cost and inflates the monthly bill.
If there is no requirement to reach these workers from outside Google Cloud, you can disable the public IPs while deploying the pipeline and save a few bucks on network costs. Keep in mind that the workers still need a route to Google APIs, for example via Private Google Access on the subnetwork they run in.
To disable the public IPs while deploying the Dataflow pipeline, add the following flag, e.g. --no_use_public_ips
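A sketch of a launch command with public IPs disabled, assuming a subnetwork that lets the workers reach Google APIs privately (all names are placeholders):

```shell
# Launch with public IPs disabled; workers get only private IPs.
# The subnetwork below is a placeholder and must allow the workers
# to reach Google APIs (e.g. via Private Google Access).
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-east1 \
  --no_use_public_ips \
  --subnetwork=regions/us-east1/subnetworks/my-subnet \
  --temp_location=gs://my-bucket/tmp
```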
That's all for now! Follow these tricks and cut down on your Dataflow costs.