Restarting/Updating Cloud Dataflow in-flight

Alex Van Boxel
Google Cloud - Community
3 min read · Mar 2, 2017

I’m not easily impressed by magic… you get used to magic when working daily with the Google Cloud Platform. But today I was impressed by the Cloud Dataflow service and its ability to update your job without stopping it.

Cloud Dataflow is Google’s managed execution engine for running data processing workflows written with Apache Beam. Our main reason for using the Beam model is the ability to write a pipeline once and run it either streaming or batch. But when you dive deeper into the model, the real power lies in its ability to work with event time and different types of windows.

An example is the “Session Window”: with a single line in your pipeline you can create sessions from your data. The snippet below groups your data into sessions with gaps of 20 minutes.
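(The original embedded snippet didn’t survive the page, so here is a minimal sketch using the Beam Java SDK; the input collection “events” and its element type are placeholders, and only the 20-minute gap comes from the text.)

import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// "events" is a keyed, timestamped PCollection; the single Window.into
// line is what groups it into sessions that close after 20 minutes of inactivity.
PCollection<KV<String, String>> sessioned = events.apply(
    Window.<KV<String, String>>into(
        Sessions.withGapDuration(Duration.standardMinutes(20))));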

Photo: https://unsplash.com/search/dam?photo=DMfSpSrhF-k

With such a powerful model for defining windows, you can imagine that it’s not trivial to update a running “streaming” dataflow, or even to recover from failures. You just can’t pull the plug and restart… what do you do with all the in-flight data that is already acknowledged but not written yet?! Well, it turns out that Cloud Dataflow supports updates.

But today we used this mechanism to give a failing dataflow a kick so it could recover from a glitch caused by a BigQuery table that wasn’t created in time. The dataflow had about 8 hours of data in its buffers that we didn’t want to lose. The events were acknowledged, so they were gone from PubSub, but they hadn’t been written to BigQuery either: they were sitting in the dataflow pipeline.

java -jar pipeline/build/libs/pipeline-ingress-1.0.jar \
--project=my-project \
--zone=us-central1-f \
--streaming=true \
--stagingLocation=gs://my-bucket/tmp/dataflow/staging/ \
--runner=DataflowPipelineRunner \
--numWorkers=1 \
--workerMachineType=n1-standard-2 \
--jobName=ingresspipeline-1216151020 \
--update

Using the --update flag on the dataflow switches the provisioning to a special mode that makes sure no data is lost. It stops the running dataflow, makes sure that the data already acknowledged and in the pipeline is persisted to disk (e.g. sessions that are not yet complete), re-attaches the disks to the new dataflow, and continues running.
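The same flags can also be set programmatically on the pipeline options before launching. A minimal sketch, assuming the option names of the current Beam Java SDK (the post itself used the Dataflow 1.x SDK and its DataflowPipelineRunner):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Programmatic equivalent of the command-line flags above.
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
options.setStreaming(true);
options.setJobName("ingresspipeline-1216151020"); // must match the running job being replaced
options.setUpdate(true); // replace the running job in place instead of starting a new one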

Nothing like a real emergency with a good outcome to raise confidence in the product. Only downside: the uptime counter was reset from 200+ days to 0. But that’s a small price to pay for not losing any data.
