Basic Streaming Data Enrichment on Google Cloud with Dataflow SQL

Antonio Cachuan
Google Cloud - Community
6 min read · Oct 20, 2020


Many technologies exist for data enrichment, but few let you work with a simple language like SQL while supporting both batch and streaming processing. One of them is Dataflow on Google Cloud.

What is Apache Beam?

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, Google Cloud Dataflow and Hazelcast Jet. [GitHub, Apache Beam]

This time we’ll be using Google Cloud Dataflow as a Runner.

Apache Beam Model

What is Dataflow?

Dataflow is a managed service for executing a wide variety of data processing patterns [Dataflow doc]. When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform [Beam doc].

This means you could execute a batch pipeline with the same code as a streaming pipeline (unbounded data source or sink) without worrying about…
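As a sketch of what streaming enrichment looks like in Dataflow SQL, the query below joins a Pub/Sub stream against a static BigQuery lookup table. The project, topic, dataset and column names are all hypothetical; Dataflow SQL's `pubsub.topic.…` and `bigquery.table.…` qualified names are used to address each source.

```sql
-- Hypothetical enrichment job: attach a zone name from a BigQuery
-- reference table to every taxi-ride event arriving on Pub/Sub.
SELECT
  rides.ride_id,
  rides.meter_reading,
  zones.zone_name  -- enrichment column from the static table
FROM pubsub.topic.`my-project`.`taxirides` AS rides
INNER JOIN bigquery.table.`my-project`.`reference`.`zones` AS zones
  ON rides.zone_id = zones.zone_id
```

The same join would work over a bounded source as well, which is exactly the batch/streaming symmetry described above.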

Antonio Cachuan
Google Cloud - Community

Google Cloud Professional Data Engineer (2x GCP). When code meets data, success is assured 🧡. Happy to share code and ideas 💡 linkedin.com/in/antoniocachuan/