Intro to Apache Beam

Sharathkumar hegde
Apr 23 · 3 min read

Apache Beam is an open-source data processing framework which provides a unified model for both batch and streaming data pipelines.

Beam is useful for parallel data processing tasks in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. It is also well suited to ETL tasks.

Apache Beam's core is implemented in Java, and it provides Java, Python and Go SDKs. The Beam SDKs offer a unified programming model which can be used for both bounded (batch) and unbounded (streaming) data sets.

Using one of the open-source Beam SDKs, we build a program that defines the pipeline. A Beam runner then translates this pipeline into the API of the distributed processing back-end of our choice.

Beam currently supports the Direct runner, Apache Flink runner, Apache Spark runner, Google Cloud Dataflow runner, Apache Nemo runner, Apache Samza runner, Hazelcast Jet runner and Twister2 runner.

The Direct runner is used for local testing and debugging.

Components of Apache Beam

  • PCollection — represents a data set, which can be a fixed batch or a stream of data.
  • PTransform — a data processing operation that takes one or more PCollections and outputs zero or more PCollections.
  • Pipeline — represents a directed acyclic graph (DAG) of PCollections and PTransforms, and hence encapsulates the entire data processing job.
  • I/O Transforms — PTransforms that read data from or write data to external storage.
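The relationship between these pieces can be sketched in plain Python, with lists standing in for PCollections and functions for PTransforms. This is only an analogy to illustrate the roles, not the Beam API:

```python
# Plain-Python analogy: a list plays the PCollection, a function the
# PTransform, and chaining the functions plays the Pipeline's DAG.
def split_words(lines):          # a PTransform: one input, many outputs
    return [word for line in lines for word in line.split()]

def pair_with_one(words):        # another PTransform in the chain
    return [(word, 1) for word in words]

source = ['hello beam', 'hello world']   # the initial "PCollection"
words = split_words(source)
pairs = pair_with_one(words)
```

In real Beam code the runner, not the Python interpreter, decides when and where each transform executes.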

How Beam works

  • Create an initial PCollection of pipeline data.
  • Apply PTransforms to each PCollection.
  • Use I/O transforms to write the final transformed PCollection(s) to an external sink.
  • Run the pipeline using the designated pipeline runner.

Sample Python code

import re

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


def format_result(word, count):
    return '%s: %d' % (word, count)


with beam.Pipeline(options=PipelineOptions()) as p:
    file = '../data/kinglear.txt'
    output_file = '../data/output.txt'

    # Read the text file[pattern] into a PCollection.
    lines = p | 'Read' >> ReadFromText(file)

    split_lines = lines | 'Split' >> beam.FlatMap(
        lambda x: re.findall(r"[A-Za-z']+", x))

    counts = (split_lines
              | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
              | beam.CombinePerKey(sum))

    output = counts | 'Format' >> beam.MapTuple(format_result)

    # Write the output using a "Write" transform that has side effects.
    output | 'Write' >> WriteToText(output_file)
  1. We first create the pipeline with pipeline options; all the code from start to finish lives inside that with block.
with beam.Pipeline(options=PipelineOptions()) as p:

If no runner is specified, the Direct runner is chosen by default.

2. Next, we read the input file and create the lines PCollection.

lines = p | 'Read' >> ReadFromText(file)
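ReadFromText yields one element per line of the input file. The plain-file equivalent of that element stream can be sketched as follows (the throwaway file stands in for ../data/kinglear.txt):

```python
import tempfile

# Plain-Python equivalent of what ReadFromText produces: one element
# per line of the file, with the trailing newline stripped.
def read_lines(path):
    with open(path) as f:
        return [line.rstrip('\n') for line in f]

# Demo with a throwaway file standing in for the real input text.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('The quick brown fox\njumps over the lazy dog\n')

lines = read_lines(tmp.name)
```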

3. This transform splits each line into words, producing a PCollection<String> in which each element is an individual word from the text.

split_lines = lines | 'Split' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
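FlatMap differs from Map in that each iterable the function returns is flattened into the output. The tokenizer's behaviour, and the flattening, can be sketched in plain Python:

```python
import re

# The same tokenizer the 'Split' step uses: runs of letters, with the
# apostrophe kept so contractions like "King's" stay one token.
def tokenize(line):
    return re.findall(r"[A-Za-z']+", line)

# Map would yield one list per line; FlatMap flattens those lists into a
# single stream of words, which the list comprehension mimics here.
lines = ['to be or', 'not to be']
words = [word for line in lines for word in tokenize(line)]
```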

4. Next, two transforms are applied: one pairs each word with the count 1, and the other sums the counts for each word per key.

counts = split_lines | 'PairWithOne' >> beam.Map(lambda x: (x, 1)) | beam.CombinePerKey(sum)
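What 'PairWithOne' followed by CombinePerKey(sum) computes can be sketched in plain Python: each word becomes a (word, 1) pair, then the ones are summed per key.

```python
from collections import defaultdict

# Plain-Python sketch of pair-with-one followed by a per-key sum.
words = ['to', 'be', 'or', 'not', 'to', 'be']
pairs = [(word, 1) for word in words]      # 'PairWithOne'

counts = defaultdict(int)                  # CombinePerKey(sum)
for word, one in pairs:
    counts[word] += one
```

In Beam the per-key sum is a combiner, so partial sums can be computed on each worker before the final merge.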

5. Finally, we perform an I/O transform that takes each (word, count) tuple and writes it to the output text file.

output | 'Write' >> WriteToText(output_file)

The output files will be available at the destination path. Note that WriteToText shards its output, so the file name carries a suffix such as output.txt-00000-of-00001.

Advantages of Beam

  • Portability across runtimes — if Beam tasks are initially running on the Spark runner, switching to the Google Cloud Dataflow runner is extremely simple.
  • APIs that raise the level of abstraction — we focus on our logic rather than the underlying details.

The code for this is available over on GitHub.

Happy Learning!

Nerd For Tech

From Confusion to Clarification

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit https://www.nerdfortech.org/. Don’t forget to check out Ask-NFT, a mentorship ecosystem we’ve started

