Say Hello to Flink
Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
While I’m writing this, Apache Flink community released Flink® 1.3.2. I suspect they knew I write this blog today! :-P
No more talks! Let’s get our hands dirty.
Setup Your Environment
Working Java 7.x (or higher) installation
- Download a binary from the downloads page. You can pick any Hadoop/Scala combination you like. If you plan to just use the local file system, any Hadoop version will work fine.
- Go to the download directory.
- Unpack the downloaded archive.
$ cd ~/Downloads # Go to download directory
$ tar xzf flink-*.tgz # Unpack the downloaded archive
$ cd flink-1.3.0
For MacOS X users, Flink can be installed through Homebrew.
Go to Flink installed directory and enter following on your command line.
Open your browser and go to http://localhost:8081. If you get JobManager’s web front-end view like below, Congratulations! you passed the first step!
Write Your First Streaming App
Open your favorite Java IDE and create a new maven project. In pom.xml, add the following dependency.
Implement SocketWindowWordCount class as below.
You can find the complete source code for this SocketWindowWordCount example in java on GitHub.
Now build the JAR using maven.
$ mvn package
Test Your App
I will tell you how to run this example application before explaining what the heck you just wrote.
First, use netcat to start local server via
$ nc -l 9000
Then you can follow either of following methods to submit the compiled JAR as a Flink Job.
Method 1: Use Flink CLI to Submit the App
Open a new command-line interface and enter the following syntax.
$ ./<Flink_Directory>/bin/flinkrun -c SocketWindowWordCount <Path_to_Jar>.jar --host localhost --port 9000
<Path_to_Jar> according to your system.
Method 2: Use Flink JobManager’s Web UI to Submit the App
Open your browser and go to http://localhost:8081.
Click on ‘Submit new Job’ and then ‘Add New+’ to upload the JAR.
Select the checkbox of the uploaded JAR and Input
SocketWindowWordCount as Entry Class and
--host localhost --port 9000 as Program Arguments.
Click on ‘Submit’ button.
If everything was done properly you could open your browser and go to http://localhost:8081 and see a similar interface like below.
Go to the directory where you installed Flink and type the following command to monitor the output.
$ tail -f log/flink-*-jobmanager-*.out
flink : 1
is : 1
awesome : 4
Note: You need to write some text in
nc (input is sent to Flink line by line after hitting ).
What Did I Just Do?
Unless you already familiar with Flink or you are God, you have no idea what you just implemented and run. Here’s a glance on what happened.
Read text from a socket (port: 9000) and once every 5 seconds print the number of occurrences of each distinct word during the previous 5 seconds, i.e. a tumbling window of processing time, as long as words are floating in.
Now I will describe you the code we wrote.
ParameterTool provides a set of predefined static methods for reading the configuration. The tool is internally expecting a
Map<String, String>, so it’s very easy to integrate it with our own configuration style. We use this tool to get hostname and port from args.
StreamExecutionEnvironment is the context in which a streaming program is executed.
Inside the stream environment we create a
DataStream object (text) in order listen and acquire input words from netcat local server.
We use another
DataStream object, windowCounts to transform text to required form which is a stream of
WordWithCount objects. Computing windowCounts is the most important part of the program where the magic happens. So let’s discuss it further.
test undergoes three transformations to become windowCounts.
- flatMap: Takes one element and produces zero, one, or more elements. In our example, it takes each word we input and create a
- keyBy: Logically partitions a stream into disjoint partitions, each partition containing elements of the same key. Internally, this is implemented with hash partitioning. We use this transformation to group different words into different partitions.
- reduce: A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value. We use this transformation to count the number words belonging to the same group (here the same group means one particular word). This is done every 5 seconds interval.
Finally, we print the windowCounts (completed 3 transformations on
Hope this helps to get you some understanding on how Flink works! Now it is time to comment your feedback ;-)