Cloud Dataflow: A Unified Model for Batch and Streaming Data Processing
Dataflow is a fully managed service for executing pipelines within the Google Cloud Platform ecosystem, dedicated to transforming and enriching data in both stream (real-time) and batch (historical) modes.
It takes a serverless approach, so users can focus on programming instead of managing server clusters. It also integrates with Stackdriver, which lets you monitor and troubleshoot pipelines while they are running.
Dataflow also acts as a convenient integration point where TensorFlow machine learning models can be added to data processing pipelines.
History
Google Cloud Dataflow was announced in June 2014 and released to the public as an open beta in April 2015.
In January 2016 Google donated the underlying SDK, the implementation of a local runner, and a set of IOs (data connectors) to access Google Cloud Platform data services to the Apache Software Foundation.
The donated code formed the original basis for the Apache Beam project.
Overview
Here’s an overview of what we know about Dataflow:
- It’s multifunctional: As a generalization, most database technologies have one speciality, like batch…