Apache Beam is a unified programming model to create Batch and Stream data processing pipelines. Simplifying a bit, it's a Java SDK that we can use to develop analytics pipelines, such as for counting users or events, extracting trending topics, or analyzing user sessions.
In this post, we are going to see how to launch a demo of Apache Beam in minutes, thanks to a Docker image with Apache Flink and Beam pre-packaged and ready to use. As a bonus, the repo used to create this demo is also available on GitHub to get started creating Beam pipelines.
Deploy Flink & Beam with Docker
Let's get started and deploy a Flink cluster with Docker Compose.
First get the latest repo — in fact, all we need is the docker-compose.yml file:
git clone https://github.com/ecesena/docker-beam-flink.git
Run the cluster (we can also scale it up or down):
docker-compose up -d
docker-compose scale taskmanager=2
That's it! We now have a running Flink cluster:
docker psCONTAINER ID IMAGE ... NAMES
3d59d952d152 beam-flink ... dockerbeamflink_taskmanager_2
4cce6219be80 beam-flink ... dockerbeamflink_taskmanager_1
3b7b6b32b4de beam-flink ... dockerbeamflink_jobmanager_1
The launch script also uploads a custom JAR with Beam pre-packaged and ready to use. For more technical details on the cluster, refer to the repo.
Run HelloWo — ehm, WordCount
Let's now run our first Beam pipeline.
Open Flink web UI, exposed on port 48080. For example, on Mac or Windows:
open http://$(docker-machine ip default):48080
Then follow these steps:
- Click “Submit new Job” in the left menu — we'll see beam-starter-0.1.jar pre-uploaded
- Flag the checkbox near beam-starter-0.1.jar
- Click on “Submit” (or “Show Plan”). No additional parameter is needed.
Congratulations, we have now run our first Beam pipeline!
Here's what happened under the hood. The JAR is built from this starter repo. By default it runs the class WordCount, with input file /tmp/kinglear.txt and output file /tmp/output.txt. The input file is also pre-loaded in the Docker image, so all Flink task managers can read it.
We can check the result of WordCount by connecting to a task manager and looking at the content of the output file. Note that the output file name starts with /tmp/output.txt as multiple files may be created, depending on the pipeline:
docker exec -it dockerbeamflink_taskmanager_1 /bin/bashcat /tmp/output.txt*
Build a Beam Pipeline
Finally, let's look at how to build our own JAR file.
The repo used for this WordCount demo is based on Beam documentation for the Flink runner, with minor changes to overcome some imprecision. This repo is a good starter to build any Beam pipeline.
Just clone the starter repo and build it:
git clone https://github.com/ecesena/beam-starter
mvn clean package
The last command will create a JAR file within the target/ directory. We can upload the JAR via the Flink web UI and run it as we saw above.
From here, it should be easy to tweak the file WordCount.java and create other Beam pipelines.
As a concrete example, I've used Beam to analyze trending topics on Twitter during the Oscars. And you? Are you planning a project with Beam, perhaps by using this starter repo or Docker image? I'd love to hear about it!