Reading Apache Beam Programming Guide — 1. Overview

Chengzhi Zhao
Data Engineering Space
5 min read · May 15, 2019


Data streaming is getting more attention. More and more use cases require low-latency data to solve problems. There are many open-source projects for building reliable data infrastructure, including Kafka and Pulsar for durable data ingestion, and Kafka Streams, Spark, Flink, Apex, and Storm for stream processing. Debates about data infrastructure are sometimes not only about how to solve the problem but also about which tool to use.

Apache Beam is a unique project among these stream processing frameworks. It is an abstraction layer on top of the popular streaming engines, which Beam calls runners. You write the code once, and then you have the “power” to run the same code on different streaming engines without being locked into a single framework. You can pick whichever framework best supports your business logic. You may not get the newest features of those streaming engines right away, but they are likely on Beam’s roadmap.
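Concretely, here is a minimal sketch of that idea (my own example, not from the guide): the file paths are made up, and the only thing that changes when you move between engines is the --runner flag (for example --runner=FlinkRunner or --runner=SparkRunner, with the matching runner dependency on the classpath).

```java
// A minimal Beam pipeline in Java: the same code runs on different
// engines depending on the --runner option passed at launch time.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;

public class PortablePipeline {
  public static void main(String[] args) {
    // Parses standard options from the command line, e.g.
    // --runner=DirectRunner (default), --runner=FlinkRunner, --runner=SparkRunner.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("input.txt"))      // made-up input path
     .apply("DropEmpty", Filter.by((String line) -> !line.isEmpty()))
     .apply("WriteLines", TextIO.write().to("output"));        // made-up output prefix

    p.run().waitUntilFinish();
  }
}
```

Nothing in the pipeline body references a specific engine; the runner is purely a deployment-time choice.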

To learn more about Apache Beam, I am writing down what I have learned from the Apache Beam Programming Guide, along with some code in Java, to help understand the framework better. If you want to learn Apache Beam, I hope these posts can help you.

[Image source: Google Cloud Platform]

Apache Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines.
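To make that “unified model” concrete, here is another hedged sketch (again my own illustration, with made-up file names): a small count-per-element pipeline over a bounded file source. The same Count transform would apply unchanged to an unbounded source such as Kafka; only the read step and the windowing would differ.

```java
// The same processing logic works on bounded (batch) and unbounded
// (streaming) PCollections; here it runs over a bounded text file.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class UnifiedModel {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.read().from("events.txt"))                 // bounded source (made-up path)
     .apply(MapElements.into(TypeDescriptors.strings())
                       .via((String line) -> line.trim()))
     .apply(Count.perElement())                               // yields KV<String, Long> counts
     .apply(MapElements.into(TypeDescriptors.strings())
                       .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply(TextIO.write().to("counts"));                     // made-up output prefix

    p.run().waitUntilFinish();
  }
}
```

Swapping the TextIO read for an unbounded source (for example Beam's KafkaIO connector) would turn this into a streaming pipeline while the counting logic stays the same.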
