Input source reading patterns in Google Cloud Dataflow (part 2)

Pavan Kumar Kattamuri · Published in Analytics Vidhya · Mar 29, 2020

This is a continuation of my previous blog, explaining a few more patterns for Dataflow pipelines that read from Google Cloud Storage or Cloud Pub/Sub. If you are new to Dataflow, or if you haven’t already checked out part 1, please do so first. If you have already, thank you!!

Here I have listed a few of the less common source reading patterns, with an emphasis on Cloud Storage and Pub/Sub. This post does not focus on using different GCP services as sources/sinks, nor on doing complex transformations; I repeat, it is only about source reading patterns.

Note: For the patterns below, --experiments=allow_non_updatable_job should also be passed as an argument when invoking the Dataflow job, otherwise you will receive the following error: “Because of the shape of your pipeline, the Cloud Dataflow job optimizer produced a job graph that is not updatable using the --update pipeline option”.
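If you launch the job from Python, the flag can simply be appended to the pipeline options. A minimal sketch (the project, region and bucket values are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket values; the last flag is the one that matters here
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/temp',
    '--experiments=allow_non_updatable_job',
])
```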

Full code can be found in my GitHub repo.

Read multiple GCS sources separately

Initially this might seem like two different Dataflow jobs, but to your surprise it is not. The List GCS & Create option explained in the previous blog is useful when all the input files share the same structure/schema and can undergo similar transformations, since it creates one single PCollection of elements after ReadAllFromText. However, when you want separate PCollections for whatever reason, you can still do that with Dataflow. The approach below assumes you have a finite list of input sources to be transformed.
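Here is a minimal sketch of the pattern; the bucket paths, transforms and output locations are illustrative placeholders, not the exact code from my repo.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    # First source: its own PCollection, its own transforms, its own sink
    (p
     | 'read source 1' >> beam.io.ReadFromText('gs://bucket-1/input/*.csv')
     | 'transform source 1' >> beam.Map(lambda line: line.upper())
     | 'write source 1' >> beam.io.WriteToText('gs://bucket-1/output/result'))

    # Second source: read and processed completely independently,
    # yet still part of the same Dataflow job
    (p
     | 'read source 2' >> beam.io.ReadFromText('gs://bucket-2/input/*.csv')
     | 'transform source 2' >> beam.Map(lambda line: line.lower())
     | 'write source 2' >> beam.io.WriteToText('gs://bucket-2/output/result'))
```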

Read multiple PubSub sources separately

Similar to the above pattern, you can do the same with multiple Pub/Sub sources, processing each of them separately.
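A sketch of the streaming equivalent, assuming two existing subscriptions (the subscription names here are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    # Each subscription is read into its own PCollection
    sub1_msgs = (p
        | 'read sub 1' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/sub-1')
        | 'decode sub 1' >> beam.Map(lambda msg: msg.decode('utf-8')))

    sub2_msgs = (p
        | 'read sub 2' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/sub-2')
        | 'decode sub 2' >> beam.Map(lambda msg: msg.decode('utf-8')))
    # ...each PCollection then gets its own downstream transforms and sinks...
```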

Read from a variable number of sources

Using a GCS match to retrieve the list of files that match a prefix pattern, we can dynamically create a PCollection per file. Since each step in a Dataflow pipeline must have a unique step name, the labels for each step are also generated dynamically.
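One way to sketch this, assuming Beam’s FileSystems.match is used for the listing (the match pattern and output paths are placeholders):

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions

# List the files matching the prefix pattern before the pipeline is built
match_result = FileSystems.match(['gs://my-bucket/input/sales_*.csv'])[0]
paths = [metadata.path for metadata in match_result.metadata_list]

with beam.Pipeline(options=PipelineOptions()) as p:
    for i, path in enumerate(paths):
        # Step labels must be unique, so they are generated per file
        (p
         | 'read file {}'.format(i) >> beam.io.ReadFromText(path)
         | 'process file {}'.format(i) >> beam.Map(lambda line: line.strip())
         | 'write file {}'.format(i) >> beam.io.WriteToText(
             'gs://my-bucket/output/file_{}'.format(i)))
```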

When only two files match the pattern, the pipeline creates two PCollections; when five files match, it creates five.

Cloud Function-like processing

Cloud Functions, along with Cloud Storage triggers, help you achieve asynchronous, serverless & lightweight processing. For example: a file is uploaded to bucket 1 > triggers cloud function 1 > does some processing > uploads the processed file to bucket 2 > cloud function 2 triggers > does some processing again > uploads the processed file to bucket 3. But what if you want to do some heavy processing that isn’t feasible on Cloud Functions due to memory & storage limitations? Here comes the saviour: Dataflow.

Along the same lines as the above approaches, this can be implemented using Pub/Sub notifications on Cloud Storage buckets, or plain Pub/Sub for that matter. You can go with either of the implementations below: the first approach merges & splits Pub/Sub messages from different topics, while the second processes the messages from each topic separately. The second approach gives a cleaner & better visualisation of the data processing, eliminating the need for merge & split stages.
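A rough sketch of the first (merge & split) approach, assuming the Cloud Storage notifications use the JSON payload format, so each message carries the source bucket and object name; topic and bucket names are placeholders:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    topic1 = p | 'read topic 1' >> beam.io.ReadFromPubSub(
        topic='projects/my-project/topics/bucket1-notifications')
    topic2 = p | 'read topic 2' >> beam.io.ReadFromPubSub(
        topic='projects/my-project/topics/bucket2-notifications')

    # Merge the two notification streams into a single PCollection
    merged = (topic1, topic2) | 'merge' >> beam.Flatten()

    # Split the merged stream back out by the bucket named in the payload
    bucket1_files, bucket2_files = (merged
        | 'split by bucket' >> beam.Partition(
            lambda msg, n: 0 if json.loads(msg)['bucket'] == 'bucket-1' else 1, 2))
    # ...bucket1_files and bucket2_files then get their own processing steps...
```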

Implementation: enable bucket 1 notifications to Pub/Sub topic 1 and bucket 2 notifications to Pub/Sub topic 2. Here’s how it goes with Dataflow: a file is uploaded to bucket 1 > the notification arrives at the read from topic 1 step > process bucket1 files does some processing > write to bucket2 writes the processed file to bucket 2 > the resulting notification arrives at the read from topic 2 step > process bucket2 files does some processing again > write to bucket3 writes the processed file to bucket 3.
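And a sketch of this second approach, again assuming JSON-format notifications; the topic/bucket names are placeholders and the “heavy processing” is left as a comment:

```python
import json

import apache_beam as beam
from apache_beam.io.gcp.gcsio import GcsIO
from apache_beam.options.pipeline_options import PipelineOptions


def process_file(message, dest_bucket):
    """Reads the object named in the notification, processes it and writes it on."""
    attrs = json.loads(message)
    gcs = GcsIO()
    with gcs.open('gs://{}/{}'.format(attrs['bucket'], attrs['name'])) as f:
        data = f.read()
    # ...the heavy processing that wouldn't fit in a Cloud Function goes here...
    with gcs.open('gs://{}/{}'.format(dest_bucket, attrs['name']), 'w') as f:
        f.write(data)


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'read from topic 1' >> beam.io.ReadFromPubSub(
         topic='projects/my-project/topics/bucket1-notifications')
     | 'process bucket1 files' >> beam.Map(process_file, dest_bucket='bucket-2'))

    (p
     | 'read from topic 2' >> beam.io.ReadFromPubSub(
         topic='projects/my-project/topics/bucket2-notifications')
     | 'process bucket2 files' >> beam.Map(process_file, dest_bucket='bucket-3'))
```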

The same thing, but better :)
