Cloud Dataflow: A Unified Model for Batch and Streaming Data Processing

Vivek Naskar · Published in The Startup · Jul 23, 2020

Dataflow is a fully managed service for executing data pipelines within the Google Cloud Platform ecosystem, dedicated to transforming and enriching data in both stream (real-time) and batch (historical) modes.

It takes a serverless approach: users focus on programming instead of managing server clusters. Dataflow also integrates with Stackdriver, which lets you monitor and troubleshoot pipelines while they are running.
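To make the programming model concrete, here is a minimal sketch of a word-count pipeline written with the Apache Beam Python SDK (the bucket paths are placeholders and the options shown are illustrative, not taken from the article). The same code runs locally for testing, or on Dataflow by selecting the DataflowRunner:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative options: to run on Dataflow, pass flags such as
# --runner=DataflowRunner --project=<id> --region=<region> --temp_location=gs://...
options = PipelineOptions()

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")  # placeholder path
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/counts")  # placeholder path
    )
```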

Dataflow also acts as a convenient integration point where TensorFlow machine learning models can be added to data processing pipelines.
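As one sketch of that integration point (the model path and feature shapes here are hypothetical), a TensorFlow SavedModel can be loaded once per worker in a DoFn's setup method and then applied to every element flowing through the pipeline:

```python
import apache_beam as beam
import tensorflow as tf

class PredictDoFn(beam.DoFn):
    """Runs a TensorFlow SavedModel over each element of a PCollection."""

    def __init__(self, model_dir):
        self._model_dir = model_dir
        self._model = None

    def setup(self):
        # Called once per worker: load the model here, not once per element.
        self._model = tf.saved_model.load(self._model_dir)

    def process(self, features):
        # Assumes the SavedModel exposes a callable that accepts a batch of
        # feature vectors; adapt this to your model's actual signature.
        yield self._model(tf.constant([features])).numpy()

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([[1.0, 2.0], [3.0, 4.0]])  # toy feature vectors
        | beam.ParDo(PredictDoFn("gs://my-bucket/model"))  # hypothetical model path
        | beam.Map(print)
    )
```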

History

Google Cloud Dataflow was announced in June 2014 and released to the public as an open beta in April 2015.

In January 2016 Google donated the underlying SDK, the implementation of a local runner, and a set of IOs (data connectors) to access Google Cloud Platform data services to the Apache Software Foundation.

The donated code formed the original basis for the Apache Beam project.

Overview

Here’s an overview of what we know about Dataflow:

  • It’s multifunctional: As a generalization, most database technologies have one speciality, like batch… (see the streaming sketch after this list)
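To illustrate the unified batch-and-streaming model, the sketch below (again assuming the Apache Beam Python SDK, with a hypothetical Pub/Sub topic) reuses the same counting transforms as the batch example, swapping the bounded text source for an unbounded one and adding fixed windows:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # run in streaming mode

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Hypothetical topic; yields an unbounded PCollection of bytes.
        | "ReadStream" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/words")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second fixed windows
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```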
