Apache beam is an open-source data processing tool which provides unifying model for both batch and streaming data-pipelines.
Beam is useful for parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. It can also be used in ETL tasks.
Apache beam is built on Scala but it supports Java, Python and GO SDKs. The beam SDKs provide unified programming model which can be used for both bounded(batch) or unbounded(streaming) data set.
Using one of the open source Beam SDKs, we can build a program that defines the pipeline. This pipeline is then translated by Beam runners into the API compatible with the distributed processing back-end of our choice.
Beam currently supports Direct runner, Apache Flink runner, Apache Spark runner, Google Cloud Data Flow runner, Apache Nemo runner, Apache Samza runner, Hazlecast Jet runner and Twister2 runner.
Direct runner is used for local testing and debugging purposes.
Components of Apache Beam
- PCollection — represents a data set which can be a fixed batch or a stream data.
- PTransform — a data processing operation that takes one or more PCollections and outputs zero or more PCollections.
- Pipeline — represent a directed acyclic graph of PCollection and Transform, and hence encapsulates the entire data processing job.
- I/O Transforms — PTransforms that read or write data.
How beam works?
- Create a pipeline object ans set the pipeline execution options, including the pipeline runner.
- Create an initial PCollection for pipeline data.
- Apply PTransforms to each PCollection
- Use IOs to write the final transformed PCollection(s) to an external source.
- Run the pipeline using the designated pipeline runner.
Sample python code
Below is the sample python code which counts the words in a text.
with beam.Pipeline(options=PipelineOptions()) as p:file = '../data/kinglear.txt'output_file = '../data/output.txt'# Read the text file[pattern] into a PCollection.lines = p | 'Read' >> ReadFromText(file)split_lines = lines | 'Split' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))counts = split_lines | 'PairWithOne' >> beam.Map(lambda x: (x, 1)) | beam.CombinePerKey(sum)output = counts | 'Format' >> beam.MapTuple(format_result)# Write the output using a "Write" transform that has side effects.# pylint: disable=expression-not-assignedoutput | 'Write' >> WriteToText(output_file)
- We are first creating the pipeline with pipeline options and all code from start to finish are mentioned inside that.
with beam.Pipeline(options=PipelineOptions()) as p:
If no runner is mentioned then by default direct runner is chosen.
2. Next, we read the input file and create
lines = p | 'Read' >> ReadFromText(file)
3. This transform splits the lines in
PCollection<String>, where each element is an individual word in a text.
split_lines = lines | 'Split' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
4. Next, two transforms are applied, one combines each word and other counts the each word per key.
counts = split_lines | 'PairWithOne' >> beam.Map(lambda x: (x, 1)) | beam.CombinePerKey(sum)
5. Finally, we perform IO transform where we take tuple of word and its count and store in the output text.
output | 'Write' >> WriteToText(output_file)
The output files will be available in the destination path.
Advantages of Beam
- Unifying both batch and streaming API under single API — With minimal code changes, we make same code work for both streaming and batch data-pipeline.
- Portability across runtimes — Initially, if beam tasks are running in Spark runner then switching to Google Data Flow runner is extremely simple.
- APIs the raise the level of abstraction — focus on our logic rather than the underlying details.
The code for this is available over on GitHub.