Flink How To: A Demo of Apache Flink with Docker

This guide explains how to run a Flink application on the BDE platform.

Apache Flink is an open-source platform for distributed stream and batch processing.

In this post, we are going to see how to launch a Flink demo app in minutes, thanks to the Apache Flink Docker image that comes prepackaged and ready to use within the BDE platform.

A typical Flink cluster consists of a Flink master and one or several Flink workers. A user runs an application on the cluster by submitting its executable using the run command. Such an application can be written in Java, Scala or Python. Once the program has been implemented, we’ll see how the source code should be packaged and submitted to the Flink cluster.
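
For reference, this is roughly what a manual submission with the Flink CLI looks like (the class and JAR names below are placeholders, not part of this demo):

# Submit a packaged application to a running Flink cluster.
# -c selects the entry class; arguments after the JAR are passed to the program.
bin/flink run -c org.example.WordCount target/your-app.jar --input <path> --output <path>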

The BDE platform provides a Flink master and a Flink worker image to get a Flink cluster up and running. Both are extensions of the Flink base image, a simple image that just includes Flink. Next, there is a Flink submit image that allows you to submit an application to the Flink cluster. To ease the job for the developer, we went one step further and extended the Flink submit image with a Flink Maven template.

Fig.1 below illustrates the hierarchy of Flink images provided by the BDE platform.

Fig.1 Flink Docker image hierarchy.

In order to run this demo, we need Docker and Docker Compose installed. Let’s get started and deploy a Flink cluster with Docker Compose. First, we need to get the latest repo.

git clone https://github.com/big-data-europe/docker-flink.git
cd docker-flink

Run the cluster:

docker-compose up -d

Now we have a running Flink cluster:

docker ps
CONTAINER ID IMAGE … NAMES
e35d8af4ba02 bde2020/flink-worker … dockerflink_flinkworker_1
b6e22370cad9 bde2020/flink-master … dockerflink_flinkmaster_1

Run an example application

You can build and launch your application on a Flink cluster by extending the Flink Maven template image with your sources. The template uses Maven as its build tool, so make sure you have a pom.xml file specifying all the dependencies of your application.
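
As a rough sketch, the dependency section of such a pom.xml could look like the following (the Flink version and Scala suffix are assumptions here; match them to the template and your cluster):

<dependencies>
  <!-- Flink batch (DataSet) API -->
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.1.3</version>
  </dependency>
  <!-- Required to submit programs to a cluster -->
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>1.1.3</version>
  </dependency>
</dependencies>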

The Flink Maven template allows the user to package and submit their application with as little effort as possible. The user just needs to implement their Flink algorithm in Java or Scala and include a Dockerfile in the root folder of the project containing the following lines:

FROM bde2020/flink-maven-template:latest
MAINTAINER Gezim Sejdiu <g.sejdiu@gmail.com>
ENV FLINK_APPLICATION_JAR_NAME flink-starter-0.0.1-SNAPSHOT
ENV FLINK_APPLICATION_ARGS "--input <path> --output <path>"

The only lines the user needs to configure are the name of the resulting JAR (FLINK_APPLICATION_JAR_NAME) and the application arguments (FLINK_APPLICATION_ARGS), if there are any.

Finally, by building and running this image, the Docker container will automatically package the application source code into a JAR and submit it to the Flink cluster.

Fig.2 Apache Flink Dashboard.

Let’s run an example application. First, get the latest repo of flink-starter and execute these commands:

git clone https://github.com/GezimSejdiu/flink-starter.git 
cd flink-starter

Build and run this image:

docker build --rm=true -t bde/flink-starter .
docker run --name flink-starter-app --link dockerflink_flinkmaster_1:flink-master -e ENABLE_INIT_DAEMON=false -d bde/flink-starter
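
The container first builds the JAR from your sources and then submits it, which can take a moment. To follow the progress, tail the container logs:

docker logs -f flink-starter-app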

By opening the Flink Dashboard UI, exposed on port 8081 (as specified by jobmanager.web.port), we can see our first Flink application running on the cluster, as shown in Fig.2.
The process is as follows: the JAR is built from your flink-starter application. By default it runs the main class, in our case WordCount, with input <inputpath> if the --input parameter is specified; otherwise it uses a simple collection of words and the output file /usr/local/flink/output.txt (for more details of the pipeline, have a look at Fig.3).
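
You can also inspect the job from the command line with the Flink CLI inside the master container (assuming Flink is installed under /usr/local/flink in the image, which the output path above suggests):

docker exec -it dockerflink_flinkmaster_1 /usr/local/flink/bin/flink list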

We can check the output of flink-starter by connecting to a Flink worker (task manager) and viewing the content of the output file.

docker exec -it dockerflink_flinkworker_1 /bin/bash
cat /usr/local/flink/output.txt

arrows 1
be 2
fortune 1
slings 1
take 1
the 3
to 4

Fig.3 Scala WordCount example execution plan.

The Flink Maven template greatly simplifies the user’s task: it allows them to run a Flink algorithm without needing to know all the technical details of the Flink run command.

Flink/HDFS Workbench using Docker

A Big Data pipeline typically consists of multiple components connected together into one smooth-running system. Given that the pipeline components are provided as Docker containers, Docker Compose offers a good way to describe such a pipeline: a Docker Compose file lists all the Docker containers that should run and specifies how they are linked together. In our example we are going to use a Flink/HDFS workbench Docker Compose file, which contains an HDFS setup (one namenode and two datanodes), a Flink setup (one master and one worker) and HUE as an HDFS file browser to easily upload files into HDFS. This workbench then serves as the environment in which our flink-starter application performs its computations.
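
As an illustration, here is a heavily abbreviated sketch of what such a docker-compose.yml can look like; the actual file in the flink-starter repository is authoritative, and the tags and settings below merely mirror the container listing further down:

version: "2"
services:
  namenode:                            # HDFS master
    image: bde2020/hadoop-namenode:1.0.0
    env_file:
      - ./hadoop.env
  datanode1:                           # HDFS storage node (datanode2 is analogous)
    image: bde2020/hadoop-datanode:1.0.0
    env_file:
      - ./hadoop.env
  flink-master:                        # Flink JobManager
    image: bde2020/flink-master
  flink-worker:                        # Flink TaskManager
    image: bde2020/flink-worker
# The HUE hdfs-filebrowser service is omitted here for brevity.
networks:
  default:
    external:
      name: hadoop                     # the network created with "docker network create hadoop"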

Let’s get started and deploy our pipeline with Docker Compose. First, we need to get the latest repo.

git clone https://github.com/GezimSejdiu/flink-starter.git 
cd flink-starter

Run the pipeline:

docker network create hadoop
docker-compose up -d

When the Docker Compose application starts, the services defined in docker-compose.yml come up and form the pipeline.

docker ps
CONTAINER ID IMAGE … NAMES
9977b7b17a44 bde2020/hadoop-namenode:1.0.0 … namenode
6487e8dd205c bde2020/hadoop-datanode:1.0.0 … datanode1
9d15ab4a41e3 bde2020/hadoop-datanode:1.0.0 … datanode2
a78e4da9254e bde2020/hdfs-filebrowser:3.9 … hdfsfb
1a7e793bdb5b bde2020/flink-master … flink-master
13346765691c bde2020/flink-worker … flink-worker

Let’s run our previous example application, but with data loaded from HDFS. First, let’s put some data into HDFS using the Hue FileBrowser running in our network. To perform these actions, navigate to http://your.docker.host:8088/home. Use the “hue” username with any password to log into the FileBrowser (the “hue” user is set up as a proxy user for HDFS; see hadoop.env for the configuration parameters). Click on “File Browser” in the upper right corner of the screen and use the GUI to create the /user/root/input and /user/root/output folders and to upload the data file into the /input folder.
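
If you prefer the command line over the GUI, the folders and the upload can also be handled with the HDFS client inside the namenode container (assuming hdfs is on the container’s PATH; yourfile is a placeholder):

# Create the input and output folders in HDFS.
docker exec -it namenode hdfs dfs -mkdir -p /user/root/input /user/root/output
# Copy the local data file into the container, then into HDFS, and verify.
docker cp yourfile namenode:/tmp/yourfile
docker exec -it namenode hdfs dfs -put /tmp/yourfile /user/root/input/
docker exec -it namenode hdfs dfs -ls /user/root/input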

Go to http://your.docker.host:50070 and check that the file exists under the path /user/root/input/yourfile.

After we have all the configuration needed for our example, let’s rebuild flink-starter by changing FLINK_APPLICATION_ARGS in the Dockerfile to:

ENV FLINK_APPLICATION_ARGS "--input hdfs://namenode:8020/user/root/input/yourfile --output hdfs://namenode:8020/user/root/output/result.txt"

Build and run this image:

docker build --rm=true -t bde/flink-starter .
docker run --name flink-starter-app --net hadoop --link flink-master:flink-master \
-e ENABLE_INIT_DAEMON=false \
-e FLINK_MASTER_PORT_6123_TCP_ADDR=flink-master \
-e FLINK_MASTER_PORT_6123_TCP_PORT=6123 \
-d bde/flink-starter

We can check the output of flink-starter by connecting to the Hue File Browser and viewing the content of the output file, as presented in Fig.4.

Fig.4 Flink WordCount Example output.
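
Alternatively, the result can be read straight from HDFS on the command line:

docker exec -it namenode hdfs dfs -cat /user/root/output/result.txt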

That’s it. Do you have any questions or suggestions, or are you planning a project with Flink, perhaps using this flink-starter repo and Docker image? We would love to hear about it!