Dataflow: Disaster Recovery

Timothy Jabez
Google Cloud - Community
3 min read · May 16, 2023

Dataflow is a fully managed, serverless streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing. Outages in Dataflow can occur for various reasons, including:

  • Hardware failures: Dataflow runs on Google’s infrastructure, which is designed to be highly reliable, but hardware can still fail, for example through a power outage or equipment failure at a data center.
  • Software bugs: A bug in the Dataflow service itself, for example in the job scheduler, could cause jobs to fail.
  • Network issues: A network outage could prevent Dataflow from reading source data or delivering results to users.
  • DDoS attacks: A DDoS attack is a malicious attempt to overwhelm a website or service with traffic, which can make the service unavailable.
  • Human error: Human error, such as a configuration change that breaks Dataflow, can also cause outages.

These outages can be classified into the four categories below:

  • Compute Engine VM failure (Dataflow worker failure)
  • Zonal failure
  • Regional failure
  • Accidental deletion of a critical Dataflow component

Let's explore how to deal with each of those scenarios.

Compute Engine VM failure (Dataflow worker failure)

Set up an alert in Cloud Monitoring to notify you when a Dataflow worker goes down. When a VM fails, the Dataflow service should automatically handle the loss of a machine. After the job either completes or fails, the Dataflow service automatically shuts down and cleans up the VM instances.
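As a hedged illustration (not part of the original post), the sketch below uses the google-cloud-monitoring Python client to alert on the dataflow.googleapis.com/job/system_lag metric as a rough proxy for lost or unhealthy workers; the project ID, threshold, and choice of metric are assumptions you would tune for your own jobs:

```python
# Hypothetical sketch: alert when a streaming job's system lag stays high,
# which can indicate lost or unhealthy Dataflow workers.
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder project ID

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Dataflow job system lag too high",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="system_lag > 5 minutes for 5 minutes",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'resource.type = "dataflow_job" AND '
                    'metric.type = "dataflow.googleapis.com/job/system_lag"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=300,        # seconds of lag before alerting
                duration={"seconds": 300},  # condition must hold for 5 minutes
            ),
        )
    ],
)

created = client.create_alert_policy(
    name=f"projects/{PROJECT_ID}", alert_policy=policy
)
print(f"Created alert policy: {created.name}")
```

In practice you would also attach notification channels (email, PagerDuty, etc.) to the policy so the on-call engineer is actually paged.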

Zonal failure

Use regional endpoints, which store and handle metadata about your Dataflow job and deploy and control the workers. Leverage features such as regional placement and automatic zone placement to create a highly resilient system. With a regional endpoint, the Dataflow worker pool can use all available zones within the region, and the zone is selected for each worker at creation time, which optimizes resource acquisition and makes use of unused reservations. Automatic zone selection ensures that workers run in the best zone for your job, based on the zone capacity available at the time of the job creation request.
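A minimal sketch of this setup, assuming the Apache Beam Python SDK and placeholder project, bucket, topic, and subscription names, looks like the following; note that only a region is set, never a zone:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Specify only a region (no --worker_zone); Dataflow's automatic zone
# placement then selects a zone for each worker at creation time.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder project ID
    region="us-central1",                 # regional endpoint; no zone pinned
    temp_location="gs://my-bucket/temp",  # placeholder staging bucket
    streaming=True,
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        | "Process" >> beam.Map(str.upper)   # placeholder transform
        | "Encode" >> beam.Map(lambda s: s.encode("utf-8"))
        | "Publish" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/my-output-topic")
    )
```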

For streaming pipelines, use Dataflow snapshots to save the state of the pipeline; a snapshot sketch follows the list below. However, the following limitations apply when using snapshots:

  1. Snapshots are created in the same region as the job. If the job’s worker location is different from the job’s region, snapshot creation fails.
  2. Dataflow snapshots support only Cloud Pub/Sub source snapshots.
  3. The snapshot expiration timeframe can only be set through the Google Cloud CLI.
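Below is a hedged sketch of snapshot creation and restore from Python: it shells out to the gcloud CLI to create the snapshot (since the expiration can only be set there) and assumes the Beam SDK exposes a create_from_snapshot pipeline option for the replacement job; the job ID, region, and TTL are placeholders, so verify the exact flags against the current gcloud and Beam documentation.

```python
# Hypothetical sketch: create a snapshot of a running streaming job via the
# gcloud CLI (the only place the expiration/TTL can be set), then start a
# replacement job from that snapshot.
import subprocess

JOB_ID = "2023-05-16_00_00_00-1234567890"  # placeholder job ID
REGION = "us-central1"                     # snapshots live in the job's region

# Create the snapshot, including the Pub/Sub source snapshot, with a 7-day TTL.
subprocess.run(
    [
        "gcloud", "dataflow", "snapshots", "create",
        f"--job-id={JOB_ID}",
        f"--region={REGION}",
        "--snapshot-sources=true",
        "--snapshot-ttl=7d",
    ],
    check=True,
)

# Later, a replacement streaming job can be launched from the snapshot by
# passing the (assumed) create_from_snapshot pipeline option, e.g.:
#   options = PipelineOptions(..., region=REGION,
#                             create_from_snapshot="SNAPSHOT_ID")
```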

If the Dataflow job is pinned to a single zone with the --worker_zone flag, the features mentioned above cannot be used, and the job must be manually resubmitted to a different zone. Use the following steps (a resubmission sketch follows this list):

  1. Set up alerts to get notified when a zone goes down.
  2. Configure the source (e.g., Pub/Sub or Kafka) to replay messages that were processed by the Dataflow job but not yet written to persistent storage.
  3. Note that acknowledging a message back to the source before it has been written to persistent storage is poor application logic, because such messages cannot be replayed after a failure.
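For illustration (zone and resource names are placeholders, not from the original post), resubmitting a zone-pinned job amounts to launching the same pipeline again with a different worker_zone value:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Same pipeline as before, but explicitly pinned to a zone. After a zonal
# outage, resubmit it with a different zone in the same region.
def make_options(worker_zone: str) -> PipelineOptions:
    return PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # placeholder project ID
        region="us-central1",
        worker_zone=worker_zone,              # zone the workers are pinned to
        temp_location="gs://my-bucket/temp",  # placeholder staging bucket
        streaming=True,
    )

# Original submission:
#   options = make_options("us-central1-a")
# Resubmission after us-central1-a goes down:
options = make_options("us-central1-b")
# ...then construct and run the same beam.Pipeline(options=options) as usual;
# the source replays the messages that were never written to persistent storage.
```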

Regional failure

Regional failures are rare events that can cause all your resources in a given region to become inaccessible or fail, making all jobs and clusters in that region unavailable. Configure an alert to fire when a regional failure occurs, and resubmit the job in the disaster recovery (DR) region as a new Dataflow job.
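As a rough sketch (project, bucket, and region names are placeholders and not from the original post), the failover submission can be parameterized on the target region so the same pipeline code runs in either the primary or the DR region:

```python
import os
from apache_beam.options.pipeline_options import PipelineOptions

PRIMARY_REGION = "us-central1"   # placeholder primary region
DR_REGION = "us-east1"           # placeholder disaster-recovery region

# Pick the target region at submission time, e.g. from an environment
# variable flipped by the on-call engineer or by failover automation.
target_region = os.environ.get("DATAFLOW_REGION", PRIMARY_REGION)

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                                   # placeholder project ID
    region=target_region,
    temp_location=f"gs://my-bucket-{target_region}/temp",   # per-region staging bucket
    streaming=True,
)
# ...then build and run the same beam.Pipeline(options=options) as in the
# primary region.
```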

Prerequisites:

  1. All networking prerequisites should be in place and accessible in the DR region (subnets, DNS, load balancers, Private Service Connect endpoints, etc.); a pre-flight check sketch follows this list.
  2. Source resources must be accessible from the DR region (check firewall rules and any other restrictive policies).
  3. There should be no organization policy with a resource location constraint that prevents resource creation in the DR region.
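As one example of such a pre-flight check (names are placeholders, and the check covers only the subnet prerequisite), the google-cloud-compute Python client can verify that the DR region has the subnetwork the job expects:

```python
from google.cloud import compute_v1

def subnet_exists(project: str, region: str, subnet_name: str) -> bool:
    """Return True if the named subnetwork exists in the given region."""
    client = compute_v1.SubnetworksClient()
    for subnet in client.list(project=project, region=region):
        if subnet.name == subnet_name:
            return True
    return False

# Placeholder names; verify the DR region has the subnet the job expects
# before attempting a failover submission.
if not subnet_exists("my-project", "us-east1", "dataflow-subnet"):
    raise RuntimeError("DR region is missing the Dataflow subnetwork")
```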

Accidental deletion of a critical Dataflow component

For streaming jobs, use Dataflow snapshots to back up the pipeline state to persistent storage, depending on the criticality of the application. For batch jobs, implement logical checkpoints after which the data is backed up as a raw copy, as sketched below.
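As a hedged illustration of the batch case (bucket paths and the transform are placeholders), the pipeline below writes a raw copy of its input to a backup bucket before applying any business logic:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/temp",  # placeholder staging bucket
)

with beam.Pipeline(options=options) as p:
    raw = p | "ReadSource" >> beam.io.ReadFromText("gs://my-source-bucket/input/*.csv")

    # Raw copy kept for recovery in case a critical component is deleted.
    raw | "BackupRawCopy" >> beam.io.WriteToText("gs://my-backup-bucket/raw/input")

    (
        raw
        | "Transform" >> beam.Map(str.upper)   # placeholder business logic
        | "WriteOutput" >> beam.io.WriteToText("gs://my-output-bucket/processed/out")
    )
```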
