Writing ETL Pipelines In Apache Beam — Part 1

Arsalan Mehmood
6 min read · Aug 16, 2020


Many of you might not be familiar with Apache Beam, but trust me, it’s worth learning about. In this blog post, I will take you on a journey to understand Beam, build your first ETL pipeline, branch it, and run it locally.

So, buckle up and let’s start the journey…!

What is Apache Beam?

Apache Beam is a unified programming model that can be used to build portable data pipelines

Apache Beam is just another programming model for building big data pipelines, but what makes it unique are two keywords: Unified and Portable.

Unified Nature

The most prominent data processing use cases nowadays are batch processing and stream processing, in which we handle historical data and real-time data streams respectively. Existing programming models like Spark and Flink have different APIs for the two use cases, which ultimately leads to writing the logic twice. Apache Beam, on the other hand, handles batch and streaming data in the same way: a single, unified Beam API covers both kinds of workload, so we don’t need to write separate logic for each.
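
As a quick, hedged illustration of this unified nature (the file name, Pub/Sub topic and column positions below are made-up placeholders), the exact same transform chain can be attached to either a batch (bounded) or a streaming (unbounded) source; only the read step changes:

```python
import apache_beam as beam

def attendance_per_department(rows):
    # The same chain of transforms works for both batch and streaming inputs.
    return (rows
            | 'SplitRow' >> beam.Map(lambda line: line.split(','))
            | 'PairDepartment' >> beam.Map(lambda fields: (fields[3], 1))
            | 'CountPerDepartment' >> beam.CombinePerKey(sum))

with beam.Pipeline() as p:
    # Batch: read bounded data from a file (placeholder name).
    bounded = p | 'ReadFile' >> beam.io.ReadFromText('employee_attendance.csv')
    counts = attendance_per_department(bounded)

    # Streaming would only swap the read step, e.g.:
    # unbounded = p | 'ReadTopic' >> beam.io.ReadFromPubSub(
    #     topic='projects/<project>/topics/<topic>')
    # counts = attendance_per_department(unbounded)  # plus windowing, covered in Part 2
```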

In this blog post, our focus will be on historical (batch) data. The second part will focus on streaming data challenges and how Beam handles them.

Portability

In Apache Beam, there is a clear separation between the runtime layer and the programming layer. As a result, the code you develop does not depend on the underlying execution engine, which means that once written, it can be migrated to any of the execution engines supported by Apache Beam.

It is important to mention that a comparison between Beam, Spark and Flink is not really valid, as Beam is a programming model while the other two are execution engines. Having said that, we can also deduce that the performance of an Apache Beam pipeline is directly tied to the performance of the underlying execution engine.
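
To make that concrete, here is a minimal sketch (the Flink master address is a made-up placeholder): the pipeline code stays exactly the same, and only the pipeline options decide which engine runs it.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Develop and test locally on the DirectRunner...
local_options = PipelineOptions(['--runner=DirectRunner'])

# ...then point the same code at another supported engine, e.g. Flink
# (or Spark, Dataflow, ...), by changing only the options:
# flink_options = PipelineOptions([
#     '--runner=FlinkRunner',
#     '--flink_master=localhost:8081',
# ])

with beam.Pipeline(options=local_options) as p:
    (p
     | 'Create' >> beam.Create(['beam', 'is', 'portable'])
     | 'Print' >> beam.Map(print))
```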

Architectural Overview:

High-level architecture of Apache Beam

At a high level, Apache Beam performs the following steps:

  • We write the code in any of the supported SDKs.
  • The Beam Runner API converts our code into a language-agnostic format; any language-specific primitives, such as user-defined functions, are resolved by the corresponding SDK worker.
  • The final code can be executed on any of the supported runners.

Basic Terminology:

Let’s discuss some of the very basic terms:

Pipeline: It is an encapsulation of the entire data processing task, right from reading data from source systems, through applying transformations, to writing the results to target systems.

Pipeline Encapsulation

P-Collection: It is the distributed dataset that our Beam pipeline operates on, and can be seen as the equivalent of an RDD in Spark. P-Collections can hold bounded (historical) as well as unbounded (streaming) data.

It is important to discuss the core properties of a P-Collection at this point:

  • Just like Spark RDDs, P-Collections in Beam are immutable, which means applying a transformation to a P-Collection won’t modify it. Instead, it creates a new P-Collection.
  • The elements of a P-Collection can be of any type, but all elements within a single P-Collection must be of the same type.
  • A P-Collection does not support fine-grained operations, which means we cannot apply a transformation to only some specific elements; a transform is applied to every element of the P-Collection.
  • Each element in a P-Collection has a timestamp associated with it. For bounded data, the timestamp can be set explicitly by the user or implicitly by Beam. For unbounded data, it is typically assigned by the source.

P-Transform: It represents a data processing operation performed on a P-Collection.
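
To tie these three terms together, here is a minimal sketch (the element values are made up): the Pipeline is the with block, every | step is a P-Transform, and each P-Transform produces a new, immutable P-Collection.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:   # Pipeline: encapsulates the entire task
    numbers = (pipeline
               | 'CreateInput' >> beam.Create([1, 2, 3]))   # a P-Collection of ints

    # Applying a P-Transform never modifies 'numbers'; it returns a NEW P-Collection.
    doubled = numbers | 'Double' >> beam.Map(lambda x: x * 2)

    # All elements of a given P-Collection share one type (ints here), and the
    # transform is applied to every element, not a hand-picked subset.
    doubled | 'Print' >> beam.Map(print)
```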

Installation:

You can install Apache Beam on your local system, but for that you have to install Python and the Beam-related libraries and set up some paths, which can be a hassle for many people. So, I decided to leverage Google Colab, an interactive environment that lets you write and execute Python code in the cloud. You will need a Google account for it.

Initial setup in Google Colab
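
Since the setup commands above are shown as a screenshot, here is roughly what such a Colab cell might contain (a minimal sketch; you may need additional packages depending on your use case):

```python
# Run inside a Google Colab cell; the leading '!' executes a shell command.
!pip install --quiet apache-beam

# Quick sanity check that the SDK is importable.
import apache_beam as beam
print(beam.__version__)
```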

It is important to remember that if the session expires, you have to execute all of the above commands again, because in a new session you might not get the same virtual machine as in the previous one.

Awesome, so we are all set to write our ETL pipeline now…!

Writing ETL Pipeline:

Let’s say we have source data in a file containing the employee id, employee name, department number, department name, and present date. From this data, we want to calculate the attendance count for each employee in the Accounts department.

attendance count for accounts department employees
  • Line 3, creating a Pipeline object to control the life cycle of the pipeline.
  • Line 5–7, creating a P-Collection and reading bounded data from the file into it.
  • Line 8–11, applying P-Transforms.
  • Line 12–15, writing the output to a file and running the pipeline locally.
  • Line 17, printing the contents of the output file.

There are a few other things we need to talk about at this stage (a sketch of the full pipeline follows this list):

  • The | operator applied after each P-Transform corresponds to the generic .apply() method in the Beam SDK.
  • After each | we specify a label, like 'LABEL' >>, which gives a hint of what a particular P-Transform does. This is optional, but if you choose to label P-Transforms, the labels MUST BE UNIQUE.
  • A lambda function is a small, anonymous, one-line function which can take any number of parameters but can have only one expression. I usually recall it with this general form: result = lambda x, y, z: x * y * z.
  • 00000-of-00001 on Line 17 is the shard numbering. WriteToText() has a num_shards parameter which specifies the number of output files written. Beam can set this value automatically, or we can set it explicitly.
  • Map() takes one element as input and emits one element as output.
  • CombinePerKey() groups the elements by key and, in our case, sums their values.
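
Because the code above is shown as a screenshot (the line numbers in the notes refer to that screenshot), here is a hedged sketch of what such a pipeline might look like. The input file name dept_data.txt, the output path, and the column positions are assumptions based on the description of the data.

```python
import apache_beam as beam

p1 = beam.Pipeline()   # controls the life cycle of the pipeline

(p1
 | 'ReadInput' >> beam.io.ReadFromText('dept_data.txt')                  # bounded data from a file (assumed name)
 | 'SplitRow' >> beam.Map(lambda record: record.split(','))              # csv row -> list of fields
 | 'FilterAccounts' >> beam.Filter(lambda fields: fields[3] == 'Accounts')
 | 'PairEmployee' >> beam.Map(lambda fields: (fields[1], 1))             # (employee name, 1)
 | 'CountAttendance' >> beam.CombinePerKey(sum)                          # group by key and sum
 | 'WriteOutput' >> beam.io.WriteToText('data/attendance_output'))       # shards named like -00000-of-00001

p1.run()

# In Colab, the single output shard can then be inspected with:
# !head -n 20 data/attendance_output-00000-of-00001
```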

Branching Pipeline:

Let’s enhance our previous scenario and calculate the attendance count for each employee in the HR department as well.

Branching Pipelines

We are effectively branching our pipeline: the same P-Collection can be used as input to multiple P-Transforms. After reading the data into a P-Collection, the pipeline applies multiple P-Transforms to it. In our case, Transform A and Transform B execute in parallel, each filtering the data (Accounts/HR) and calculating its attendance count. Lastly, we have the option to merge the results of both branches, or to emit their results to separate files.

Let’s extend our previous code; we will go with merging the results (a sketch of the branched pipeline follows the notes below).

attendance count for accounts & HR department employees
  • Line 4, we initiate the Pipeline() object. This is another way to create the Pipeline object, but we have to take care of indentation in this case.
  • Line 18–22, same as discussed above; only the filtering criterion has changed.
  • Line 25–29, Flatten() merges multiple P-Collections into a single logical P-Collection, and then we write the combined results to a file.
  • Line 34–51, the output of the code, shown as comments.
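
Again hedged (same assumed input file and column layout as before), the branched-and-merged version could look roughly like this:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # One shared P-Collection feeds both branches.
    rows = (p
            | 'ReadInput' >> beam.io.ReadFromText('dept_data.txt')
            | 'SplitRow' >> beam.Map(lambda record: record.split(',')))

    # Branch A: Accounts department.
    accounts = (rows
                | 'FilterAccounts' >> beam.Filter(lambda fields: fields[3] == 'Accounts')
                | 'PairAccounts' >> beam.Map(lambda fields: (fields[1], 1))
                | 'CountAccounts' >> beam.CombinePerKey(sum))

    # Branch B: HR department.
    hr = (rows
          | 'FilterHR' >> beam.Filter(lambda fields: fields[3] == 'HR')
          | 'PairHR' >> beam.Map(lambda fields: (fields[1], 1))
          | 'CountHR' >> beam.CombinePerKey(sum))

    # Merge both branches into one logical P-Collection and write it out.
    ((accounts, hr)
     | 'MergeBranches' >> beam.Flatten()
     | 'WriteOutput' >> beam.io.WriteToText('data/combined_output'))
```
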
Good Work So Far…!

Before wrapping up this first part of the blog post, I would like to talk about one more P-Transform that we will be using in Part 2:

ParDo: It takes each element of the input P-Collection, performs a processing function on it, and emits zero, one, or multiple elements. It is analogous to the Map phase of the Map/Shuffle/Reduce paradigm.

DoFn: A Beam class that defines the distributed processing function. You override its process method, which contains the processing logic that runs in parallel.

ParDo P-Transform
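
As a small, hedged preview (the sample row below is made up), the row-splitting step from earlier could equally be written with ParDo and a DoFn instead of Map:

```python
import apache_beam as beam

class SplitRow(beam.DoFn):
    # The overridden process() method holds the distributed processing logic.
    def process(self, element):
        # A DoFn may emit zero, one, or many elements; here we yield exactly one.
        yield element.split(',')

with beam.Pipeline() as p:
    (p
     | 'CreateRow' >> beam.Create(['E101,Alice,10,Accounts,01-08-2020'])
     | 'SplitRow' >> beam.ParDo(SplitRow())
     | 'Print' >> beam.Map(print))
```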

In Part 2 of this blog post, we will talk about the following:

  • Streaming data processing challenges
  • How Apache Beam solves them

STAY TUNED…!


Arsalan Mehmood

Consultant — Data Analytics | GCP Certified Data Engineer, Azure Fundamentals Certified | Constantly Learning & Evolving