Streamr integration with Apache Flink

Santeri Juslenius · Published in Streamr · Nov 27, 2019

It’s clearer now than ever before that the value of data doesn’t expire. Because of this, companies hoard data if there’s even the slightest chance it can be utilised in the future. Even so, data is at its most valuable the closer it is to its point of creation. For example, when dealing with monetary transactions, it’s far better to prevent fraud than to merely detect it after the fact.

This means that you’d most likely want to process each new transaction as its own real-time event instead of undertaking micro-batch processing or even waiting for the data to reach a database after the transaction. Nevertheless, you wouldn’t want to throw away important data analysis tools such as windowing. It would be even better if you could customise the windows based on time, number of events, or even some combination of the two. Apache Flink allows you to do all of this with horizontal scalability, to fit any of your needs.
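To make the windowing idea concrete, here is a minimal sketch using Flink’s DataStream API. The stream of (accountId, amount) pairs, the 10-second window and the 100-event window are made-up example values for illustration, not anything specific to Streamr:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical stream of (accountId, transactionAmount) pairs.
        DataStream<Tuple2<String, Double>> transactions = env.fromElements(
                Tuple2.of("acc-1", 12.50), Tuple2.of("acc-2", 99.90), Tuple2.of("acc-1", 5.00));

        // Time-based window: sum the amounts per account over 10-second tumbling windows.
        transactions
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .sum(1)
                .print();

        // Count-based window: sum the amounts per account over every 100 events.
        transactions
                .keyBy(t -> t.f0)
                .countWindow(100)
                .sum(1)
                .print();

        env.execute("Windowing sketch");
    }
}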

Earlier this year, we released a blog post and a repository for Streamr integrations to Apache Spark. The Spark integration can now also be found as an out-of-the-box integration library in Maven Central. In that blog post, I touched on why Apache Flink is a better real-time streaming data processing engine than Spark.

Apache Spark vs Flink

Here I’ll go into a little more detail. Apache Spark and Flink are both Big Data processing systems: distributed, in-memory engines that run computations very quickly because most of the working data is kept in RAM. Both guarantee low latencies, high throughput and ‘exactly-once’ processing (as opposed to at-least-once or at-most-once semantics). Both also have libraries for graph processing and machine learning, although Spark’s libraries are more comprehensive due to its larger contributor base. However, this might change in the near future, as the Chinese giant Alibaba has widely adopted Flink and has already dedicated teams to further develop its open-source codebase.

Most differences between Spark Streaming and Flink are because streaming functionalities were added to Spark as an afterthought, whereas Flink was built for streaming from day one.

Because of this, Spark Streaming is only capable of doing its computations in micro-batches whereas Flink is real-time event-based. So if you require true real-time results from your data processing engine, picking Flink over Spark is a fairly obvious choice.

But it isn’t all in Flink’s favour. In many scenarios, Spark is simply faster than Flink, especially with larger volumes of data. In terms of data ingestion, Flink is faster but due to Spark being able to process the windows in parallel, the overall processing time tends to favour Spark. You can find a nice visualisation and more detailed explanations about the differences in this blog post.

Chart source: The Beam Model, via Rafael Martinez

In most cases, you get more accurate results with Flink’s event-based windowing when compared to Spark’s time-based only windowing. Spark is also faster when processing large graphs. Conversely, Flink is faster when processing smaller graphs.

Flink also seems to scale better with the number of nodes in a cluster. You can check out more performance comparisons between Spark and Flink in this or this more recent paper.

An additional benefit of Flink is its support for durable saves of application state. This means you can rerun your data processing from a saved state if it turns out that your machine learning model needs tweaking. You can also use these state snapshots in other parts of your architecture if needed.
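As a rough illustration (this is not part of the Streamr templates, and the interval and checkpoint path are arbitrary example values), Flink’s periodic state snapshots can be switched on with a couple of lines:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Snapshot the application state every 10 seconds.
env.enableCheckpointing(10_000);
// Persist the snapshots to a durable location so they survive restarts.
env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));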

Flink also allows you to do iterations and delta iterations, which are often used in machine learning and graph processing; you can feed an iteration’s result into the next round or check it against a termination criterion. With Spark, you have to implement such loops outside the engine to achieve similar results. You can also deploy Apache Storm topologies on older versions of Flink (1.7 and under).
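To give a feel for how iterations look, below is a sketch in the spirit of the bulk iteration example in Flink’s DataSet documentation, estimating π by counting random points that fall inside the unit circle; the 10,000 rounds are an arbitrary choice:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class IterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Start the iteration from a single counter initialised to 0 and run 10,000 rounds.
        IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10_000);

        // Each round samples one random point and increments the counter if it lands in the unit circle.
        DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer i) {
                double x = Math.random();
                double y = Math.random();
                return i + ((x * x + y * y < 1) ? 1 : 0);
            }
        });

        // Close the loop and turn the hit count into an estimate of pi.
        initial.closeWith(iteration)
                .map(new MapFunction<Integer, Double>() {
                    @Override
                    public Double map(Integer count) {
                        return 4.0 * count / 10_000;
                    }
                })
                .print();
    }
}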

Streamr Integration with Apache Flink

Streamr Labs has created Java and Scala integration templates between Apache Flink and Streamr. The templates use the streamr_flink integration library created by Streamr Labs. The capability to pull historical data from Streamr, based on timestamps or message counts, will be added to the library in the future.

If you are new to Apache Flink, you might want to start with the integration template repository. You should also check out Flink’s documentation to gain a deeper understanding of how Apache Flink works. Comprehensive guides on how to set up Flink locally on Linux, Mac or Windows can be found in the setup section of Flink’s documentation.

The templates are preconfigured to be buildable and runnable with IntelliJ IDEA. So you can simply clone the repository and open the template with your preferred programming language in IDEA. You only need to add your Streamr API key and your subscribe and publish stream IDs to get the template code running. The Streamr API key and stream IDs can be found in Streamr’s Core app.

To use the Helsinki tram demo Marketplace stream that’s used in the templates, go to its Marketplace page and add it to your purchases (it’s free of charge). You should now be able to see the “Public Transport Demo” stream on your streams page. Next, check that the stream is receiving data by looking at the “Preview” section inside the tram stream’s details page (the stream might not receive any data between 10 PM and 2 AM UTC because the trams don’t run at night).

Tram stream real-time data preview

After the tram stream is set up, you should create a new stream for publishing the data coming out of Flink. Use the tram stream’s ID to subscribe to data and the new empty stream’s ID to publish the data.

After Streamr data is flowing through the template successfully, you can start playing around with different aggregations or even Flink’s ML libraries. The streamr_flink integration library hands you the Streamr data as Java maps. This makes handling the data easy; you can simply use the names of the data fields as keys in the map. When publishing data back to Streamr, the data should also be in Map<String, Object> format.
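For example, a transformation step could read one field from the incoming map and publish a derived field back out. Here is a small sketch, assuming a numeric field called “speed” exists in the subscribed stream (substitute a field name that actually appears in your own stream):

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.MapFunction;

MapFunction<Map<String, Object>, Map<String, Object>> kmhToMs =
        new MapFunction<Map<String, Object>, Map<String, Object>>() {
            @Override
            public Map<String, Object> map(Map<String, Object> event) {
                Map<String, Object> out = new HashMap<>();
                // "speed" is a hypothetical field name; data fields are looked up by name.
                double speedKmh = ((Number) event.get("speed")).doubleValue();
                out.put("speedMs", speedKmh / 3.6);
                return out;
            }
        };

You could then apply such a function between the Streamr source and sink shown below, e.g. tramDataSource.map(kmhToMs).addSink(streamrPub);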

If you are already familiar with Flink and have your own preference on how to set up and run your projects, you can easily include the Streamr Integration to your project by adding this snippet to your pom.xml file:

<dependency>
    <groupId>com.streamr.labs</groupId>
    <artifactId>streamr_flink</artifactId>
    <version>0.1</version>
</dependency>

Then you can start using the connectors by adding these lines to your Flink program:

import java.util.Map;
import com.streamr.labs.streamr_flink.StreamrPublish;
import com.streamr.labs.streamr_flink.StreamrSubscribe;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamrSubscribe streamrSub = new StreamrSubscribe(
        "YOUR_STREAMR_API_KEY", "YOUR_SUB_STREAM_ID");
DataStreamSource<Map<String, Object>> tramDataSource = env.addSource(streamrSub);
StreamrPublish streamrPub = new StreamrPublish(
        "YOUR_STREAMR_API_KEY", "YOUR_PUB_STREAM_ID");
tramDataSource.addSink(streamrPub);

After everything is set up, you can start doing data aggregations, data analysis, or training machine learning models on your own Streamr data, or on any streams that you have purchased access to in the Marketplace.
