Basic Streaming Data Enrichment on Google Cloud with Dataflow SQL

Antonio Cachuan
Google Cloud - Community
6 min read · Oct 20, 2020


Many technologies exist for data enrichment, but few let you work with a simple language like SQL while supporting both batch and streaming processing. One of them is Dataflow on Google Cloud.

What is Apache Beam?

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, Google Cloud Dataflow and Hazelcast Jet. [GitHub, Apache Beam]

This time we’ll be using Google Cloud Dataflow as a Runner.

Apache Beam Model

What is Dataflow?

Dataflow is a managed service for executing a wide variety of data processing patterns [Dataflow doc]. When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform [Beam doc].

This means you could execute a batch pipeline with the same code as a streaming pipeline (unbounded data source or sink) without worrying about…
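As a sketch of what streaming enrichment looks like in Dataflow SQL, the query below joins a Pub/Sub stream against a static BigQuery lookup table. The project, topic, dataset and column names are all hypothetical; Dataflow SQL's `pubsub.topic.…` and `bigquery.table.…` qualified names are used to address each source.

```sql
-- Hypothetical enrichment job: attach a zone name from a BigQuery
-- reference table to every taxi-ride event arriving on Pub/Sub.
SELECT
  rides.ride_id,
  rides.meter_reading,
  zones.zone_name  -- enrichment column from the static table
FROM pubsub.topic.`my-project`.`taxirides` AS rides
INNER JOIN bigquery.table.`my-project`.`reference`.`zones` AS zones
  ON rides.zone_id = zones.zone_id
```

The same join would work over a bounded source as well, which is exactly the batch/streaming symmetry described above.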

Antonio Cachuan
Google Cloud - Community

Google Cloud Professional Data Engineer (2x GCP). When code meets data, success is assured 🧡. Happy to share code and ideas 💡 linkedin.com/in/antoniocachuan/