Google Cloud Dataflow is a Google Cloud Platform product that helps you ingest and transform data coming from streaming or batch sources.
At Roobits, we use Dataflow pipelines extensively to ingest events and transform them into the data our customers need.
Dataflow is also serverless and auto-scales based on the input load, which is an added bonus to the flexibility it already provides.
Dataflow essentially requires you to write the logic that’s to be performed on the incoming events from a source (which could be PubSub, Apache Kafka, or even a file!) and then deploy that logic on Google’s servers.
Dataflow allows you to write this logic in Java, Kotlin, or Python.
A very simple example of a Dataflow pipeline is one that takes an input paragraph and counts the words in it: Apache Beam's classic WordCount example.
While the code for it might look complicated at first, the Apache Beam documentation explains what's happening at each step.
To deploy this code on your Google Cloud project, run a command like the following:
java -jar wordcount.jar \
  --runner=DataflowRunner \
  --project=<your-project-id> \
  --tempLocation=gs://<your-bucket>/temp
While this works, there are certain pricing concerns to keep in mind as you plan on scaling this pipeline as-is.
Let’s look at them one by one.
Reducing the Disk size
By default, the disk size for a Dataflow pipeline is set to 250GB for a batch pipeline and 400GB for a streaming pipeline.
If you process the incoming events in memory, most of this disk goes unused, so I'd suggest reducing the parameter to 30GB or less (30GB is the minimum recommended value, but we ran our pipeline on 9–10GB of Persistent Disk without any issues).
You can do so by specifying the disk size as follows while deploying your pipeline:
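With the Java SDK the relevant option is `--diskSizeGb` (the Python SDK calls it `--disk_size_gb`). A minimal sketch of the deploy command, reusing the wordcount jar from earlier as a placeholder:

```shell
# Cap each worker's Persistent Disk at 30GB instead of the
# default 250GB (batch) / 400GB (streaming).
java -jar wordcount.jar \
  --runner=DataflowRunner \
  --diskSizeGb=30
```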
According to the Google Cloud pricing calculator, reducing this value saves around $20 per month per worker.
Micro Batching your streaming pipeline
Micro-batching a streaming pipeline helped us cut down on the number of writes our Dataflow pipeline made to BigQuery, thereby reducing our BigQuery write costs.
I've covered how to do this in more detail in a separate article.
Specifying a custom machine type
By default, Dataflow runs pipelines on n1 machine types. While these cover a variety of use cases, you might often want a custom machine of your own with either a more powerful CPU or more RAM.
To do this, you can add the following parameter while deploying the pipeline:
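With the Java SDK the option is `--workerMachineType`, and custom machine types follow Compute Engine's `custom-<vCPUs>-<memoryMB>` naming. A sketch, again with the wordcount jar as a placeholder:

```shell
# 8 vCPUs and 7424MB of RAM, instead of a predefined n1 machine type.
java -jar wordcount.jar \
  --runner=DataflowRunner \
  --workerMachineType=custom-8-7424
```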
The value above corresponds to 8 cores and 7424 MB of memory, and you can tweak these numbers to your needs instead of being locked into the presets.
Enabling Dataflow Streaming Engine
Streaming Engine is a new addition to the Dataflow family and has several benefits over a traditional pipeline, some of them being:
- A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs
- More responsive autoscaling in response to variations in incoming data volume
- Improved supportability, since you don’t need to redeploy your pipelines to apply service updates
As of now, Streaming Engine is only available in a limited set of regions, listed in the Dataflow documentation, but more regions will be added as the service matures.
To enable Streaming Engine, just pass the following flag to your pipeline execution and that’s it!
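With the Java SDK, that flag is `--enableStreamingEngine` (the Python SDK uses `--enable_streaming_engine`). A sketch:

```shell
# Offload shuffle and state storage from the worker VMs to the
# Dataflow service backend.
java -jar wordcount.jar \
  --runner=DataflowRunner \
  --enableStreamingEngine=true
```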
Disabling public IPs
By default, the Dataflow service assigns your pipeline both public and private IP addresses.
Now, if you don't want your workers exposed to the public internet, it's a good idea to disable public IPs: that not only makes your pipeline more secure but might also save you a few bucks on network costs.
Adding the following flag to the pipeline execution disables public IPs:
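With the Java SDK, the flag is `--usePublicIps=false` (the Python SDK uses `--no_use_public_ips`). Note that the workers then need Private Google Access enabled on their subnetwork to keep reaching Google APIs. A sketch:

```shell
# Workers receive only private IP addresses; make sure Private Google
# Access is enabled on the subnetwork they run in.
java -jar wordcount.jar \
  --runner=DataflowRunner \
  --usePublicIps=false
```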
Keeping your GCP services in the same region
While it might be a no-brainer for some, I see a lot of people (myself included) paying extra for data transferred between GCP services simply because those services are not in the same region.
For instance, we ended up paying around $500 in a single week on one of our projects because the Dataflow pipeline and the source App Engine service were in different locations (the US and Europe).
Not just App Engine and Dataflow: many GCP services offer free ingress and egress within the same region!
To set the region while deploying your Dataflow pipeline, you can add the following execution parameter:
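With the Java SDK this is the `--region` option. A sketch that pins the workers to `europe-west1`; swap in whichever region your other services live in:

```shell
# Keep Dataflow workers in the same region as the services they
# talk to, to avoid cross-region egress charges.
java -jar wordcount.jar \
  --runner=DataflowRunner \
  --region=europe-west1
```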
The regions supported by Cloud Dataflow are listed in the Dataflow documentation.
And that’s it!
Using a combination of the tips mentioned above, we were able to save a substantial amount on our Dataflow spending.
You can visit my Medium profile to read more blogs around Dataflow and Google Cloud; starting with this one that I wrote last week!
Overcoming the pitfalls of Google App Engine Cron
When the prebuilt services just won’t cut it!
Thanks for reading! If you enjoyed this story, please click the 👏 button and share to help others find it! Feel free to leave a comment 💬 below.
Have feedback? Let’s connect on Twitter.