Building a Data Warehouse using Apache Beam and Dataflow — Part I: Building Your First Pipeline

Ahmed El.Hussaini
5 min read · Dec 15, 2018


In this series of posts, I’m going to go through the process of building a modern data warehouse using Apache Beam’s Java SDK and Google Dataflow. The series will be divided into four parts.

Note: In this article, I’ll be using Java 8, Gradle 4.10.2, and Apache Beam’s SDK 2.9.0

What is Apache Beam?

According to the definition on Beam’s homepage:

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Let’s break down this definition into multiple pieces that we will cover throughout this series.

Unified model

You use a single programming model, consisting of a unified set of classes, to build data pipelines that can process data in either of two modes: batch or streaming.

Beam’s SDK

As of this writing, Beam’s SDK is available for Java, Go, and Python.
Depending on the programming language, the SDK provides all or a subset of the following:

- Core APIs required to build your pipelines
- IO classes to read from different data sources, for example Google Cloud Storage, BigQuery, MongoDB, text files, …
- Connectors & Transformation APIs

Batch & streaming data-parallel processing pipelines

In its simplest form, an Apache Beam pipeline does the following:

- Read data from some source.
- Perform some form of processing on the data.
- Write the output to a different data source.

So what’s the difference between batch & streaming?

Batch pipelines are used when you’re dealing with a bounded source of data. Examples:

- Reading a list of records from a relational database.
- Reading lines from a text file.

Streaming pipelines, on the other hand, are used when the source of data is unbounded. Examples:

- A stream of events sent from a mobile app to the backend server.
- A stream of events fired by the backend when certain actions occur (a new user being created, for instance).

Backends (aka Runners)

Once you’ve designed your data pipeline, you obviously need to test it, i.e. run it. Since Beam is unified by nature, the same pipeline can run on multiple execution engines and will return the same output.

Here are some examples of the runners that support Apache Beam pipelines:

- Apache Apex
- Apache Flink
- Apache Spark
- Google Dataflow
- Apache Gearpump
- Apache Samza
- Direct Runner (used for testing your pipelines locally).

Now that we’ve got that out of the way, let’s design and run our first Apache Beam batch pipeline. When designing an Apache Beam batch pipeline, the first things you need to consider and plan are the following:

- Where am I reading the data from?
- What am I going to do with the data?

For our example data pipeline, we’re going to read a CSV file containing dummy user review data and count the number of thumbs-up and thumbs-down ratings. Then we’re going to write that output to a different CSV file.

Here is a sample of the data stored in that file:

This movie is awesome,up
Exceeded my expectations,up
Terrible and insulting at the same time,down
Blown Away,up
Badly written and poorly executed,down
I enjoyed it,up

And the output of the pipeline will be the following:

up 4
down 2

Let’s get started. The code for this part of the series is available on GitHub here. Please clone the repo and follow the installation instructions.

Here is the pipeline:
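In case you’d rather read it inline, here is a minimal sketch of such a pipeline, consistent with the description in this article (the class name RatingsPipeline and the option handling are illustrative; the file names reviews.csv and ratings_results match the example below, and the repo is the source of truth):

import java.util.Collections;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class RatingsPipeline {
  public static void main(String[] args) {
    // Build pipeline options (runner, etc.) from the command-line arguments.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // Read every line of the CSV file into a PCollection<String>.
    PCollection<String> lines =
        pipeline.apply("Read from CSV", TextIO.read().from("reviews.csv"));

    lines
        // Keep only the rating column ("up" or "down") of each row.
        .apply("Extract ratings", FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String line) -> Collections.singletonList(line.split(",")[1])))
        // Count how many times each rating occurs -> KV<String, Long>.
        .apply("Count ratings", Count.perElement())
        // Format each key/value pair as a human-readable line, e.g. "up 4".
        .apply("Format results", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + " " + kv.getValue()))
        // Write the formatted lines to sharded files named ratings_results-*-of-*.csv.
        .apply("Write to CSV", TextIO.write().to("ratings_results").withSuffix(".csv"));

    pipeline.run().waitUntilFinish();
  }
}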

To run the above pipeline, open a terminal, cd to the directory where you cloned the code example repo and run the following command:

gradle run

You should see several lines of logs and then, at the end, output similar to the following:

BUILD SUCCESSFUL in 3s
2 actionable tasks: 1 executed, 1 up-to-date

You will also find one or more generated files with names matching a pattern like ratings_results-00001-of-00002.csv. Based on the data from our sample movie reviews, the generated files will contain the following:

up 4
down 2

Now let’s break the code above into smaller pieces and understand what the heck just went down.

This is the starting point, where you create an instance of a Beam pipeline. Using this instance, we define the steps to be executed when the pipeline runs.
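In the sketch above, that corresponds to these two lines:

// Create the pipeline from options parsed off the command line.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline pipeline = Pipeline.create(options);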

The read step looks really simple, but it encapsulates a lot of Beam’s core APIs. It reads the contents of the CSV file reviews.csv using the TextIO class, which is part of Beam’s SDK. This is the first step of the pipeline, and it’s given the label “Read from CSV”.

This step is added to the pipeline using the apply method. apply is how you connect steps, or nodes, in your pipeline; here we’re adding our root node.
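In the sketch above, the root node is this read step, with the result captured in a variable to make its type explicit:

// Read every line of reviews.csv into a PCollection<String>.
PCollection<String> lines =
    pipeline.apply("Read from CSV", TextIO.read().from("reviews.csv"));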

Each node in a Beam pipeline returns a PCollection. A PCollection<T> is an immutable collection of values of type T.

A PCollection can contain either a bounded or an unbounded number of elements. In our case, the PCollection contains strings representing each line in our CSV file.

The next part is a bit more involved; it’s actually two steps, or nodes. The first takes the CSV rows coming from the root node and uses one of the built-in transformation classes, FlatMapElements, to extract just the rating from each row.

The second step counts the number of occurrences of each rating and returns the result as values of type KV<String, Long>. KV is an immutable key/value pair type provided by Beam’s SDK.
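These are the corresponding two nodes in the sketch above:

// Keep only the rating column ("up" or "down") of each CSV row.
.apply("Extract ratings", FlatMapElements
    .into(TypeDescriptors.strings())
    .via((String line) -> Collections.singletonList(line.split(",")[1])))
// Count the occurrences of each rating -> KV<String, Long>.
.apply("Count ratings", Count.perElement())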

Now that we have the count of each rating as key/value pairs, the last two steps format this data into human-readable lines and write the results to a CSV file.
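In the sketch above, these are the final two nodes:

// Format each key/value pair as a human-readable line, e.g. "up 4".
.apply("Format results", MapElements
    .into(TypeDescriptors.strings())
    .via((KV<String, Long> kv) -> kv.getKey() + " " + kv.getValue()))
// Write the formatted lines to sharded files such as ratings_results-00000-of-00002.csv.
.apply("Write to CSV", TextIO.write().to("ratings_results").withSuffix(".csv"));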

That’s it, folks. Stay tuned for Part II, where we start laying the groundwork for building our data warehouse using Apache Beam and Google’s Dataflow.
