In today’s big data environment, data is produced and consumed constantly. Because it arrives in real time from different systems and in varying formats, messaging tools with high throughput and low latency are required. One of the most popular tools for building real-time data pipelines is Apache Kafka.
Thousands of companies use Kafka in cases that require high throughput of data, such as user activity tracking and IoT. Many data processing tools such as Apache Spark and Flink also provide out-of-the-box connectors for Kafka. Kafka is even used in the current centralized architecture of Streamr.
At its heart, Kafka is a scalable and fault-tolerant publish-subscribe messaging system. The main reason behind Kafka’s wide adoption is that it makes messaging in distributed systems easy to handle while maintaining great performance. In practice, adopting Kafka in your distributed system stack saves the considerable time and money it would otherwise take to build your own distributed messaging system.
Even after the release of Streamr’s decentralized Network, Kafka will still be needed to integrate Streamr with the many centralized systems that already exist. This is why Streamr Labs has created an example integration with Apache Kafka using Kafka Connect. We have also created example data flows to Kafka Streams using the integration. The code repository with a detailed guide on how to set up the Streamr integrations with Apache Kafka is available here.
If you have an existing Kafka cluster, you can now use Kafka Connect to publish your data to the Streamr Marketplace for data monetisation. You can also purchase a stream from the Marketplace and use Kafka to distribute the data to your systems. If you haven’t set up a Kafka cluster yet, you can create one locally by following the guide in the integration code repository. Apache Kafka documentation can be found here.
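To make the Kafka Connect step more concrete, here is a minimal sketch of what a sink connector configuration could look like, built as Java `Properties`. The keys (`name`, `connector.class`, `tasks.max`, `topics` and the converter settings) are standard Kafka Connect settings. The connector class name, connector name and topic name are assumptions made up for illustration, and any Streamr-specific settings are left out on purpose; the integration repository defines the real ones.

```java
import java.util.Properties;

/** Sketch of a Kafka Connect sink connector configuration (names are illustrative). */
final class StreamrSinkConfigSketch {

    // Hypothetical class name: the real one is defined in the integration repository.
    static final String CONNECTOR_CLASS = "com.example.streamr.StreamrSinkConnector";

    static Properties build(String connectorName, String topics) {
        Properties props = new Properties();
        // Standard Kafka Connect connector settings.
        props.setProperty("name", connectorName);
        props.setProperty("connector.class", CONNECTOR_CLASS);
        props.setProperty("tasks.max", "1");
        // A sink connector reads from the comma-separated Kafka topics listed here.
        props.setProperty("topics", topics);
        // Treat record keys as strings and record values as schemaless JSON.
        props.setProperty("key.converter", "org.apache.kafka.connect.storage.StringConverter");
        props.setProperty("value.converter", "org.apache.kafka.connect.json.JsonConverter");
        props.setProperty("value.converter.schemas.enable", "false");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings in the key=value form used by connector .properties files.
        Properties props = build("streamr-sink", "sensor-readings");
        props.stringPropertyNames().stream().sorted()
             .forEach(key -> System.out.println(key + "=" + props.getProperty(key)));
    }
}
```

The printed lines can be saved as a `.properties` file for a standalone Connect worker, once the placeholder class name has been replaced with the connector from the repository.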
With this integration, you can also build applications driven by Streamr data in Java or Scala using the Kafka Streams library. Note: Kafka Streams does not run inside the Kafka cluster itself. It is a library that you deploy in a separate application, running in its own JVM, which reads from and writes to Kafka topics.
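Because a Kafka Streams application is a separate process, it has to be told where the cluster is. The sketch below shows the kind of settings such an application might use, built with plain Java `Properties` so that it runs without any Kafka dependency. The configuration keys are standard Kafka Streams and consumer settings; the application ID and broker address are assumptions chosen for illustration.

```java
import java.util.Properties;

/** Sketch of the settings a standalone Kafka Streams application might use. */
final class StreamrStreamsAppConfigSketch {

    static final String STRING_SERDE =
            "org.apache.kafka.common.serialization.Serdes$StringSerde";

    static Properties build(String bootstrapServers, String applicationId) {
        Properties props = new Properties();
        // Identifies this application; Kafka Streams also uses it as the consumer group ID.
        props.setProperty("application.id", applicationId);
        // Address of the Kafka cluster that the separate JVM connects to.
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Read and write record keys and values as strings unless overridden.
        props.setProperty("default.key.serde", STRING_SERDE);
        props.setProperty("default.value.serde", STRING_SERDE);
        // Start from the oldest available records when no committed offset exists.
        props.setProperty("auto.offset.reset", "earliest");
        return props;
    }

    public static void main(String[] args) {
        // Illustrative values: a local broker and a made-up application ID.
        Properties props = build("localhost:9092", "streamr-demo-app");
        props.stringPropertyNames().stream().sorted()
             .forEach(key -> System.out.println(key + "=" + props.getProperty(key)));
        // In a real application, these properties would be passed to
        // new KafkaStreams(topology, props) from the kafka-streams library.
    }
}
```

Keeping the configuration in one place like this makes it straightforward to point the same application at a local cluster during development and at a production cluster later.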
In general, setting up Kafka to integrate Streamr with your centralized systems is a great way to ensure that your integration is fault-tolerant and scalable, with high throughput and millisecond latencies. This saves a lot of time, because you do not have to build a custom integration that provides all of this yourself. The Kafka cluster can even be used to distribute the data directly to multiple systems in your architecture. For example, you could use this integration to connect to Apache Druid for data visualization, Apache Flink or Spark for real-time data analytics, and Apache Cassandra for data storage.
If you’re a dev interested in the Streamr stack or have some integration ideas, you can join our community-run dev forum here. Also read my previous blog post on Apache Spark and check out the recently released Maven library for the Spark integration.