Integrating Streamr with Apache Spark
Decentralization of real-time data is a huge part of the Streamr project. As global data production grows every day, especially with the wider adoption of IoT, more peer-to-peer solutions for big data may become necessary to increase throughput. Until mature P2P MapReduce-style solutions are available, we will have to fall back on centralized solutions for tasks like training ML models for the foreseeable future. This is why we have created Streamr integration templates for one of the most widely used big data processing tools today: Apache Spark.
When building data architectures that need to handle terabytes or even petabytes of data, multiple tools are often required. It's very important that these tools perform well and scale horizontally. You might need a tool such as Apache NiFi to move data around the architecture effortlessly, and tools such as Apache Cassandra and HDFS to store the data for later use in ways that allow fast recovery.
You might also need tools to do computation on the data at various points of the architecture. For example, when using machine learning you need to train ML models on massive amounts of stored data, and then use the trained models in real time on newly arriving data. What if you also need windowed anomaly detection on top of the machine learning? You could learn and set up a different tool for each of these cases, or you could just use Apache Spark.
Apache Spark is an open source unified analytics engine for large-scale data processing. Originally designed as a faster alternative to Hadoop MapReduce, it claims to be up to 100x faster. The main cause of this performance increase is Spark's Resilient Distributed Datasets (RDDs), which make it possible to do distributed computation in memory. In reality, since fitting all of the data in memory is unlikely, such a speed-up is unrealistic in most cases. However, RDDs spill any data that doesn't fit in memory to disk in a fault-tolerant way, so you don't have to worry about memory management. This also means that performance decreases in proportion to how much of the data sits on disk.
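As a small illustration of the spill-to-disk behaviour, here is a minimal Scala sketch that caches an RDD with the MEMORY_AND_DISK storage level. The input path and app name are just placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RddSpillExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-spill-example")
      .master("local[*]")
      .getOrCreate()

    // Build an RDD from a (hypothetical) large text file on HDFS.
    val lines = spark.sparkContext.textFile("hdfs:///data/events.txt")

    // MEMORY_AND_DISK keeps as many partitions in memory as fit and
    // spills the remainder to disk instead of failing or recomputing.
    val cached = lines.persist(StorageLevel.MEMORY_AND_DISK)

    // Both actions below reuse the cached partitions.
    println(s"line count: ${cached.count()}")
    println(s"distinct count: ${cached.distinct().count()}")

    spark.stop()
  }
}
```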
On top of doing computation over large data sets from HDFS, Cassandra and the like, Spark also provides libraries for Spark SQL (structured data), Spark Streaming, Spark MLlib (machine learning) and Spark GraphX (graph processing). Importantly for Streamr, Spark SQL also includes Structured Streaming functionality. With Structured Streaming you can run real-time computations on your JSON-formatted Streamr data in micro batches or larger windows.
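To give a taste of what this looks like, here is a minimal Structured Streaming sketch in Scala that computes windowed averages over JSON events. The schema, field names and input directory are assumptions for illustration; real Streamr data would arrive via a receiver or a connector:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object StructuredStreamingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streamr-structured-streaming")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical schema for a Streamr stream of sensor readings.
    val schema = new StructType()
      .add("sensor", StringType)
      .add("temperature", DoubleType)
      .add("timestamp", TimestampType)

    // Treat a directory of JSON files as an unbounded streaming source.
    val events = spark.readStream
      .schema(schema)
      .json("/tmp/streamr-data")

    // A 10-second tumbling-window average per sensor.
    val averages = events
      .withWatermark("timestamp", "1 minute")
      .groupBy(window($"timestamp", "10 seconds"), $"sensor")
      .agg(avg($"temperature").as("avgTemp"))

    // Each micro batch of results is printed to the console.
    val query = averages.writeStream
      .outputMode("update")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```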
Streamr integration templates are provided for Java and Scala. For Java there is only a Spark Streaming direct integration template using Spark's custom receiver abstraction. Currently, there are no examples for publishing data directly, because Streamr's Java client does not work outside of Spark's executor. If you wish to publish data directly from Spark to Streamr, you need to make calls to Streamr's Data API yourself.
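A call to the Data API is just an authenticated HTTP POST per event. The sketch below shows the general shape in Scala; note that the endpoint path and the authorization header format are assumptions here, so check the Data API documentation for the actual values:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object StreamrDataApiPublisher {
  // NOTE: the endpoint path and auth scheme are assumptions;
  // consult Streamr's Data API docs for the real values.
  private val endpoint = "https://www.streamr.com/api/v1/streams/STREAM_ID/data"
  private val apiKey   = sys.env.getOrElse("STREAMR_API_KEY", "")

  def publish(json: String): Unit = {
    val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
    try {
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setRequestProperty("Authorization", s"token $apiKey")
      conn.setDoOutput(true)
      val out = conn.getOutputStream
      out.write(json.getBytes(StandardCharsets.UTF_8))
      out.close()
      // Any 2xx status means the event was accepted.
      require(conn.getResponseCode / 100 == 2,
        s"publish failed: HTTP ${conn.getResponseCode}")
    } finally conn.disconnect()
  }
}
```

From a Spark job you would typically call something like this from inside foreachPartition, so the HTTP connections are made on the executors rather than funnelled through the driver.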
For Scala there are more integration examples. Direct integration examples, similar to the Java template, are provided for both Structured Streaming and Spark Streaming. Below is a sketch of what a Streamr subscription custom receiver for Spark can look like in Scala:
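The Spark side of this (the Receiver subclass, onStart/onStop, store) is standard Spark Streaming API; the Streamr client interface is stubbed out with a placeholder trait, because the real class and its method names live in the integration repository:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Placeholder for the Streamr client. The real class ships with the
// integration template; the method names here are assumptions.
trait StreamrClient extends Serializable {
  def subscribe(streamId: String, handler: String => Unit): Unit
  def disconnect(): Unit
}

// Custom receiver: Spark runs onStart() on an executor, and every
// call to store() hands one received event over to Spark Streaming.
class StreamrReceiver(makeClient: () => StreamrClient, streamId: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @transient private var client: StreamrClient = _

  def onStart(): Unit = {
    // Connect and subscribe on a background thread so onStart()
    // returns immediately, as the Receiver contract requires.
    new Thread("Streamr Receiver") {
      override def run(): Unit = {
        client = makeClient()
        // Push every incoming event into Spark's buffers.
        client.subscribe(streamId, message => store(message))
      }
    }.start()
  }

  def onStop(): Unit = {
    if (client != null) client.disconnect()
  }
}
```

The receiver then plugs into a StreamingContext with ssc.receiverStream(new StreamrReceiver(...)), after which the Streamr events behave like any other DStream.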
There is also a Node.js script that lets you pull historical and real-time data from Streamr into the file system, and a Scala template showing how to process the resulting JSON files with Spark SQL. An example of using Apache NiFi to pull data from Streamr and push it via Kafka to Spark for analysis is also provided. You can also run the process in reverse and publish the analysed data back to Streamr. It would even be possible to use PySpark, Spark's Python API, to pull data into Spark directly from NiFi with this out-of-the-box NiFi processor.
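As an illustration of the batch side, reading such JSON files with Spark SQL could look like the sketch below. The path and field names are assumptions; they depend on what the pull script wrote out:

```scala
import org.apache.spark.sql.SparkSession

object JsonBatchExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streamr-json-batch")
      .master("local[*]")
      .getOrCreate()

    // Load the JSON files written by the pull script (path assumed).
    val df = spark.read.json("/tmp/streamr-history/*.json")

    // Register the data as a SQL view and query it.
    df.createOrReplaceTempView("events")
    spark.sql(
      """SELECT sensor, AVG(temperature) AS avg_temp
        |FROM events
        |GROUP BY sensor""".stripMargin
    ).show()

    spark.stop()
  }
}
```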
Once Streamr's Java client is in a more stable state, we may publish Streamr's custom Spark receivers and sinks to Maven. This would make integrating Streamr with Apache Spark a lot easier, as you would only need to declare the Maven library as a dependency in Spark's start-up script, without any workarounds. For now, the steps required to start using Streamr data with Apache Spark in Java or Scala are documented in the integrations repository.
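If and when that happens, pulling the integration into an sbt build could look something like this. The coordinates are placeholders, since nothing has been published yet:

```scala
// build.sbt: the streamr-spark coordinates below are hypothetical.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.4.0" % "provided",
  "com.streamr"      %% "streamr-spark"   % "0.1.0"  // placeholder
)
```

The same coordinates could then be passed to spark-submit via its --packages option, which resolves them from Maven at launch, which is what makes the workaround-free start-up script possible.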
One glaring hole in Spark's streaming libraries is that processing isn't actually done in real time for single events in a stream. Instead, Spark Streaming and Structured Streaming do their computation in micro batches. This means that you will often see several outputs arrive at once, because a micro batch tends to contain more than one data point. This is where Apache Flink comes in as the better candidate for real-time data processing: Flink processes each arriving event individually, exactly once, in its own window. We have also built an integration for Apache Flink, which will be published soon.
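For completeness, the micro-batch interval in Structured Streaming is controlled with a trigger. The fragment below, reusing the averages query from the Structured Streaming sketch above, makes the batching explicit:

```scala
import org.apache.spark.sql.streaming.Trigger

// Every 5 seconds Spark processes whatever arrived since the previous
// batch, so a single batch usually holds several events at once.
val query = averages.writeStream
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .outputMode("update")
  .format("console")
  .start()
```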
If you're a dev interested in the Streamr stack or have some integration ideas, you can join our community-run dev forum here.