Into Google Cloud Dataflow auto-scaling: Max Number of Workers

Sigal Shaharabani
Published in Israeli Tech Radar
3 min read · Sep 1, 2022

One day the IT team in our organization came to us and asked: you have a lot of disks, do you really need all of them?

I looked into the list and realized that we had hundreds of 400 GB disks attached to Dataflow Compute Engine instances. Yikes. But how could this be if we only had 11 Dataflow jobs?
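
One way to see the scale of the problem is to list the disks with gcloud; the name filter below is an assumption about how Dataflow names its streaming worker resources in our project:

# List the 400 GB disks attached to Dataflow workers (name pattern is an assumption)
gcloud compute disks list \
  --filter="sizeGb=400 AND name~harness" \
  --format="table(name,sizeGb,zone.basename())"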

Hard disk
Photo by Kina on Unsplash

Google Cloud Dataflow

Unified stream and batch data processing that’s serverless, fast, and cost-effective.

We use Dataflow for pipelines that direct data from Google Pub/Sub to Google BigQuery and Google Cloud Storage. The system in question is used for research purposes, which is one of the reasons we preferred a serverless platform over maintaining one ourselves.
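
This is not our actual pipeline code, but to give a sense of the shape of such a job, a similar Pub/Sub to BigQuery path can be launched from one of Google's provided Dataflow templates; the job name, subscription, and table below are placeholders:

gcloud dataflow jobs run pubsub-to-bq-example \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates-us-central1/latest/PubSub_Subscription_to_BigQuery \
  --parameters=inputSubscription=projects/my-project/subscriptions/my-subscription,outputTableSpec=my-project:my_dataset.my_table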

When starting a job, Dataflow deploys the pipeline to Compute Engine instances (workers). The number of workers is determined by how you choose to horizontally scale your application.

Horizontal scaling

The first step is to choose whether or not you want to enable auto-scaling. We started without auto-scaling; in this case you choose how many workers (i.e. Compute Engine instances) the job will use. We added the following options to the pipeline start command:

--autoscalingAlgorithm=NONE --numWorkers=3
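
These flags are appended to the job's launch command. With the Java SDK, a fixed-size launch might look roughly like this (the main class, project, and region are placeholders, and launching through Maven is an assumption about the build setup):

# Launch a fixed-size streaming job on Dataflow (placeholder names throughout)
mvn compile exec:java \
  -Dexec.mainClass=com.example.pipelines.PubSubToBigQuery \
  -Dexec.args="--runner=DataflowRunner \
    --project=my-project \
    --region=us-central1 \
    --streaming=true \
    --autoscalingAlgorithm=NONE \
    --numWorkers=3"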

After a few weeks we started experiencing sporadic peaks of data that we could not predict from the usage pattern, so we decided to change 2 of the jobs to use autoscaling:

--autoscalingAlgorithm=THROUGHPUT_BASED

By changing the autoscaling algorithm, we enabled scaling that could deal with the peaks. With autoscaling, --numWorkers is no longer needed, but Dataflow now requires setting the maximum number of workers. Confidently, I set it to 100.

--maxNumWorkers=100
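
Putting the two flags together, the scaling-related part of the launch command for those jobs became:

--autoscalingAlgorithm=THROUGHPUT_BASED \
--maxNumWorkers=100
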
How does all of this connect?
Photo by Tachina Lee on Unsplash

How does all of this relate to Compute Engine disks?

I didn’t think it was related to disks, but when looking into the pipeline deployment documentation (https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline) we noticed the following paragraph:

Streaming pipelines are deployed with a fixed pool of Persistent Disks, equal in number to --maxNumWorkers. Take this into account when you specify --maxNumWorkers, and ensure this value is a sufficient number of disks for your pipeline.

By setting the maximum number of workers to 100 just in case, for 2 jobs (and later 3 jobs), we created hundreds of disks, each 400 GB in size: 100 Persistent Disks per job, or roughly 40 TB of provisioned storage per job and about 120 TB across the three jobs, regardless of how many workers were actually running.

At this point we reevaluated our scale: in most cases we are okay with the autoscaling minimum of 7 workers. After months of improving the pipeline and analyzing the code, we realized that peaks now occur mainly when the job has been down for a while and needs to catch up, so we set the maximum number of workers to 10 for the 3 jobs, and we will continue to monitor the pipelines.
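
The scaling flags for those jobs now look like this:

--autoscalingAlgorithm=THROUGHPUT_BASED \
--maxNumWorkers=10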
