Cloud Dataflow: A Unified Model for Batch and Streaming Data Processing

Vivek Naskar · Published in The Startup · Jul 23, 2020

Dataflow is a fully managed service for executing data pipelines within the Google Cloud Platform ecosystem, dedicated to transforming and enriching data in both stream (real-time) and batch (historical) modes.

It takes a serverless approach: users focus on programming instead of managing server clusters. Dataflow also integrates with Stackdriver, which lets you monitor and troubleshoot pipelines while they are running.
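To make the programming model concrete, here is a minimal sketch of a word-count pipeline written with the Apache Beam Python SDK (the bucket paths are placeholders and the options shown are illustrative, not taken from the article). The same code runs locally for testing, or on Dataflow by selecting the DataflowRunner:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative options: to run on Dataflow, pass flags such as
# --runner=DataflowRunner --project=<id> --region=<region> --temp_location=gs://...
options = PipelineOptions()

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")  # placeholder path
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/counts")  # placeholder path
    )
```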

Dataflow also acts as a convenient integration point where TensorFlow machine learning models can be added to data processing pipelines.
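As one sketch of that integration point (the model path and feature shapes here are hypothetical), a TensorFlow SavedModel can be loaded once per worker in a DoFn's setup method and then applied to every element flowing through the pipeline:

```python
import apache_beam as beam
import tensorflow as tf

class PredictDoFn(beam.DoFn):
    """Runs a TensorFlow SavedModel over each element of a PCollection."""

    def __init__(self, model_dir):
        self._model_dir = model_dir
        self._model = None

    def setup(self):
        # Called once per worker: load the model here, not once per element.
        self._model = tf.saved_model.load(self._model_dir)

    def process(self, features):
        # Assumes the SavedModel exposes a callable that accepts a batch of
        # feature vectors; adapt this to your model's actual signature.
        yield self._model(tf.constant([features])).numpy()

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([[1.0, 2.0], [3.0, 4.0]])  # toy feature vectors
        | beam.ParDo(PredictDoFn("gs://my-bucket/model"))  # hypothetical model path
        | beam.Map(print)
    )
```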

History

Google Cloud Dataflow was announced in June 2014 and released to the public as an open beta in April 2015.

In January 2016 Google donated the underlying SDK, the implementation of a local runner, and a set of IOs (data connectors) to access Google Cloud Platform data services to the Apache Software Foundation.

The donated code formed the original basis for the Apache Beam project.

Overview

Here’s an overview of what we know about Dataflow:

  • It’s multifunctional: As a generalization, most database technologies have one speciality, like batch… (see the streaming sketch after this list)
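To illustrate the unified batch-and-streaming model, the sketch below (again assuming the Apache Beam Python SDK, with a hypothetical Pub/Sub topic) reuses the same counting transforms as the batch example, swapping the bounded text source for an unbounded one and adding fixed windows:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # run in streaming mode

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Hypothetical topic; yields an unbounded PCollection of bytes.
        | "ReadStream" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/words")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second fixed windows
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```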
