Restarting/Updating Cloud Dataflow in-flight

Alex Van Boxel
Google Cloud - Community
3 min read · Mar 2, 2017

I’m not easily impressed by magic… you get used to magic when working daily with the Google Cloud Platform. But today I was impressed by the Cloud Dataflow service and its ability to update your job without stopping it.

Cloud Dataflow is Google’s managed execution engine for running data processing workflows written with Apache Beam. Our main reason for using the Beam model is the ability to write a pipeline once and run it either streaming or batch. But when you dive deeper into the model, the real power lies in its ability to work with event time and different types of windows.

An example is the “Session Window”: with a single line in your pipeline you can create sessions from your data. The snippet below groups your data into sessions with gaps of 20 minutes.
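(The original embedded snippet didn’t survive the page, so here is a minimal sketch using the Beam Java SDK; the input collection “events” and its element type are placeholders, and only the 20-minute gap comes from the text.)

import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// "events" is a keyed, timestamped PCollection; the single Window.into
// line is what groups it into sessions that close after 20 minutes of inactivity.
PCollection<KV<String, String>> sessioned = events.apply(
    Window.<KV<String, String>>into(
        Sessions.withGapDuration(Duration.standardMinutes(20))));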

Photo: https://unsplash.com/search/dam?photo=DMfSpSrhF-k

With such a powerful model for defining windows, you can imagine that it’s not trivial to update a running “streaming” dataflow, or even to recover from failures. You just can’t pull the plug and restart… what do you do with all the in-flight data that is already acknowledged but not written yet?! Well, it turns out that Cloud Dataflow supports updates.

But today we used this mechanism to give a failing dataflow a kick so it could recover from a glitch caused by a BigQuery table that wasn’t created in time. The dataflow had about 8 hours of data in its buffers that we didn’t want to lose. The events were acknowledged, so they were gone from PubSub, but they hadn’t been written to BigQuery either: they were sitting in the dataflow pipeline.

java -jar pipeline/build/libs/pipeline-ingress-1.0.jar \
--project=my-project \
--zone=us-central1-f \
--streaming=true \
--stagingLocation=gs://my-bucket/tmp/dataflow/staging/ \
--runner=DataflowPipelineRunner \
--numWorkers=1 \
--workerMachineType=n1-standard-2 \
--jobName=ingresspipeline-1216151020 \
--update

Using the --update flag on the dataflow switches the provisioning to a special mode that makes sure no data is lost. It stops the running dataflow, makes sure that the data already acknowledged and in the pipeline is persisted to disk (e.g. sessions that are not yet complete), re-attaches the disks to the new dataflow, and continues running.
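The same flags can also be set programmatically on the pipeline options before launching. A minimal sketch, assuming the option names of the current Beam Java SDK (the post itself used the Dataflow 1.x SDK and its DataflowPipelineRunner):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Programmatic equivalent of the command-line flags above.
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
options.setStreaming(true);
options.setJobName("ingresspipeline-1216151020"); // must match the running job being replaced
options.setUpdate(true); // replace the running job in place instead of starting a new one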

Nothing like a real emergency with a good outcome to raise confidence in the product. Only downside: the uptime counter was reset from 200+ days to 0. But that’s a small price to pay for not losing any data.
