Trimming down the cost of running Google Cloud Dataflow at scale

Harshit Dwivedi
Nov 25 · 5 min read

Google Cloud Dataflow is one of the products offered by Google Cloud Platform that helps you ingest and transform data coming from a streaming or a batch data source.

At Roobits, we extensively use Dataflow pipelines to ingest events and transform them into desirable data that is to be used by our customers.

Dataflow is also serverless and auto-scales based on the input load, which is an added bonus to the flexibility it already provides.

Dataflow essentially requires you to write the logic that’s to be performed on the incoming events from a source (which could be PubSub, Apache Kafka, or even a file!) and then deploy that logic on Google’s servers.

Dataflow allows you to write this logic either in Java, Kotlin or Python.

A very simple example of a Dataflow pipeline that takes an input paragraph and counts the words in it is as follows:
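
The snippet below is a minimal Java sketch based on Apache Beam's standard word-count example rather than the exact code from our pipelines; the Cloud Storage paths are placeholders.

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    // Parse --runner, --project, etc. from the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read the input text; the bucket path is a placeholder.
        .apply("ReadLines", TextIO.read().from("gs://your-bucket/input.txt"))
        // Split each line into individual words.
        .apply("ExtractWords", FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        // Drop the empty strings produced by the split.
        .apply("RemoveEmptyWords", Filter.by((String word) -> !word.isEmpty()))
        // Count how many times each word appears.
        .apply("CountWords", Count.perElement())
        // Format each (word, count) pair as a line of text.
        .apply("FormatResults", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> wordCount) ->
                wordCount.getKey() + ": " + wordCount.getValue()))
        // Write the results back out; again, a placeholder path.
        .apply("WriteCounts", TextIO.write().to("gs://your-bucket/output"));

    pipeline.run();
  }
}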

While the code here might look complicated, you can go to the Apache Beam documentation to learn more about what's happening here.

To deploy this code on your Google Cloud project, you can do so as follows:

java -jar wordcount.jar \
--runner=DataflowRunner \
--project=<YOUR_GCP_PROJECT_ID>

While this looks good, there are a few pricing concerns to keep in mind as you plan on scaling this pipeline as it is.

Let’s look at them one by one.

Reducing the Disk size

By default, the disk size for a Dataflow pipeline is set to 250GB for a batch pipeline and 400GB for a streaming pipeline.

If you are processing the incoming events in memory, this is mostly a wasted resource, so I'd suggest reducing this parameter to 30GB or less (the minimum recommended value is 30GB, but we faced no issues running the pipeline with 9–10GB of Persistent Disk).

You can do so by specifying the disk size as follows while deploying your pipeline:

--diskSizeGb=30

Looking at the Google Cloud Pricing Calculator, reducing this value saves us around $20 per month per worker.


Micro Batching your streaming pipeline

Micro-batching a streaming pipeline helped us cut down on the number of writes our Dataflow pipeline made to BigQuery, thereby reducing the cost of BigQuery writes.
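
As a rough illustration, here's a minimal Beam sketch of one common way to micro-batch BigQuery writes (not necessarily the exact approach from the article linked below): switch from streaming inserts to periodic load jobs so rows are flushed in batches. The table spec is a placeholder, and the table is assumed to already exist.

import com.google.api.services.bigquery.model.TableRow;

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class MicroBatchedWrite {
  // Writes a streaming PCollection of TableRows to BigQuery in periodic batches
  // (load jobs every few minutes) instead of one streaming insert per element.
  static void writeInBatches(PCollection<TableRow> rows) {
    rows.apply("WriteToBigQuery", BigQueryIO.writeTableRows()
        .to("your-project:your_dataset.events")               // placeholder table spec
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)        // batch load jobs, not streaming inserts
        .withTriggeringFrequency(Duration.standardMinutes(5))  // flush a batch roughly every 5 minutes
        .withNumFileShards(1)                                  // required when a triggering frequency is set
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)  // table assumed to exist
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}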

You can look at the article below for more insights on how to do this:


Specifying a custom machine type

By default, Dataflow supports the n1 machine types for the pipeline, and while these machines cover a variety of use cases, you might often want to use a custom machine of your own with either a more powerful CPU or more RAM.

The prebuilt machine types supported by Dataflow

To do this, you can add the following parameter while deploying the pipeline:

--workerMachineType=custom-8-7424

The value above corresponds to 8 cores and 7424 MB of memory, and you can tweak this to fit your workload instead of being locked into the presets.
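
(For context, and this comes from the Compute Engine custom machine type format rather than anything Dataflow-specific: the pattern is custom-<vCPUs>-<memory in MB>, so custom-8-7424 means 8 vCPUs with 7,424 MB ≈ 7.25 GB of RAM, i.e. 928 MB per vCPU, and the memory figure has to be a multiple of 256 MB: 7,424 = 29 × 256.)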


Enabling Dataflow Streaming Engine

Streaming Engine is a new addition to the Dataflow family and has several benefits over a traditional pipeline, some of them being:

  1. A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs
  2. More responsive autoscaling in response to variations in incoming data volume
  3. Improved supportability, since you don’t need to redeploy your pipelines to apply service updates

As of now, Streaming Engine is only available in the regions mentioned in the list here, but more regions will be added as the service matures.

To enable Streaming Engine, just pass the following flag to your pipeline execution and that’s it!

--enableStreamingEngine

Disabling public IPs

By default, the Dataflow service assigns your pipeline both public and private IP addresses.

If your workers don't need to be reachable from the public internet, it's a good idea to disable public IPs: not only does that make your pipeline more secure, but it might also save you a few bucks on network costs.

Adding the following flag to the pipeline execution disables public IPs:

--usePublicIps=false
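
One caveat worth noting (standard GCP networking behaviour, not something specific to our setup): with public IPs disabled, the workers can only reach Google APIs and services if Private Google Access is enabled on the subnetwork they run in. If the pipeline should run inside a specific VPC, you can point it there explicitly; the network and subnetwork names below are placeholders:

--usePublicIps=false \
--network=my-private-network \
--subnetwork=regions/europe-west1/subnetworks/my-subnet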

Keeping your GCP services in the same region

While it might be a no-brainer for some, I see a lot of people (including myself) paying extra for data transferred between GCP services just because those services are not in the same region.

For instance, we ended up paying around $500 in a week on one of our projects because the Dataflow pipeline and the source App Engine app were in different locations (US and Europe).

And it's not just App Engine and Dataflow: a lot of GCP services have free ingress/egress within the same region!

To set the region while deploying your Dataflow pipeline, you can add the following execution parameter:

--region=europe-west1

The regions supported by Cloud Dataflow are listed here:


And that’s it!
Using a combination of the tips mentioned above, we were able to save a substantial amount on our Dataflow spending.
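
For reference, a deploy command that pulls together the flags discussed above might look roughly like this; treat it as a sketch, since the project ID and the machine/region values are placeholders and not every flag applies to every pipeline:

java -jar wordcount.jar \
--runner=DataflowRunner \
--project=<YOUR_GCP_PROJECT_ID> \
--region=europe-west1 \
--diskSizeGb=30 \
--workerMachineType=custom-8-7424 \
--enableStreamingEngine \
--usePublicIps=false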

You can visit my Medium profile to read more blogs around Dataflow and Google Cloud, starting with this one that I wrote last week!

Thanks for reading! If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment 💬 below.

Have feedback? Let’s connect on Twitter.
