Reading Apache Beam Programming Guide — 4. Transforms (Part 1)

Chengzhi Zhao
Data Engineering Space
8 min read · Jul 19, 2019


This blog post is part of the Reading Apache Beam Programming Guide series. We are going to discuss one of the most substantial topics: Transforms! I will break this topic into three parts, following the structure of the documentation.

Part 1. Transforms Overview, Core Beam Transforms (ParDo, GroupByKey, CoGroupByKey)

Part 2. Core Beam Transforms Continued (Combine, Flatten, Partition)

Part 3. Side input, additional output, and composite transforms

GitHub repository for this post: https://github.com/ChengzhiZhao/read-apache-beam-programming-guide/tree/master/src/main/java/com/beam/programming/guide/chapter4

Transforms Overview

Transforms are the core of a Beam application: they let you express the business logic that is applied to your data.

Before writing transforms, keep in mind the following requirements, which every transform function you write in Apache Beam must satisfy:

  1. Serializable: because Beam runs on a distributed system, your function objects are distributed to multiple worker machines, so they must be serialized and sent across the wire to remote workers.
  2. Thread-compatible: on a worker, a function object is accessed by a single thread at a time, unless you explicitly create your own threads — in that case, you must provide your own synchronization.
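To make the serializability requirement concrete, here is a minimal sketch in plain Java (no Beam dependency; the class and method names are illustrative, not part of the Beam API). A Beam runner ships your function object to remote workers in much the same way this round-trip serializes it to bytes and reconstructs it — which is why a transform function that is not `Serializable` fails at pipeline submission time:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class SerializableFnDemo {

    // A function object that implements Serializable,
    // analogous to the requirement on Beam's DoFn subclasses.
    static class UpperCaseFn implements Function<String, String>, Serializable {
        @Override
        public String apply(String input) {
            return input.toUpperCase();
        }
    }

    // Serialize the function object to bytes and back,
    // simulating the transfer to a remote worker.
    @SuppressWarnings("unchecked")
    static Function<String, String> roundTrip(Function<String, String> fn) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(fn);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Function<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // The "remote" copy behaves exactly like the original.
        Function<String, String> remoteCopy = roundTrip(new UpperCaseFn());
        System.out.println(remoteCopy.apply("beam")); // prints BEAM
    }
}
```

If `UpperCaseFn` captured a non-serializable field (say, an open database connection), `writeObject` would throw a `NotSerializableException` — the same class of error Beam reports when a transform closes over non-serializable state.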

Data Engineer | Data Content Creator | Contributor of Airflow, Flink | Blog chengzhizhao.com