Comparing Scala APIs for Apache Kafka

With the growing demand for processing large amounts of data, Apache Kafka has become a standard message queue in the big data world. Apache Kafka is publish-subscribe messaging rethought as a distributed, partitioned, replicated commit log service, and it has convenient APIs for many languages.

In this article, I would like to share my experience with using Kafka’s APIs for multiple purposes, from consuming data and writing it to streams to a more reactive approach with Akka. All examples in this tutorial are written in Scala. If you use another programming language, you can easily port the code from Scala.

First of all, you need to install Kafka. For this, I use a Docker image:

Of course, you can use another image or launch Kafka manually; it’s up to you.

Integrating Spark Streaming and Kafka is incredibly easy. Your middleware, backend (proxy-like), or IoT devices can send millions of records per second to Kafka, which handles them efficiently. Spark Streaming provides simple parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata. First, we need to pass Kafka’s parameters to Spark: the host, port, offset-committing strategy, and so on.
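A minimal sketch of such a parameter map, assuming the spark-streaming-kafka-0-10 integration. The broker address and group id are placeholder assumptions, not values from the original setup:

```scala
import org.apache.kafka.common.serialization.StringDeserializer

object KafkaParamsExample {
  // Consumer settings handed to Spark. All values are illustrative assumptions.
  val kafkaParams: Map[String, Object] = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",             // broker host and port
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "example-group",                       // consumer group (hypothetical name)
    "auto.offset.reset" -> "latest",                     // where to start when no offset is stored
    "enable.auto.commit" -> (false: java.lang.Boolean)   // offsets are committed explicitly
  )
}
```

Auto-commit is disabled here so that the job decides when offsets are committed.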

After setting the necessary configuration, we can work with the direct stream. All the logic for creating streams is located in the KafkaUtils class:
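A sketch of creating a direct stream, assuming the spark-streaming-kafka-0-10 artifact, a broker at localhost:9092, and a hypothetical topic called topicA:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DirectStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-stream-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // One Spark partition is created per Kafka partition of the subscribed topics.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("topicA"), kafkaParams))

    // ConsumerRecord is not serializable, so extract plain values before printing.
    stream.map(record => (record.key, record.value)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```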

Note: The code above uses the Spark Streaming API, which we will discuss below.

Spark operates on RDDs (the RDD is the basic abstraction in Spark; it represents an immutable, partitioned collection of elements that can be operated on in parallel). The RDD for each batch (the batch length is the batch interval given to the StreamingContext, not the Kafka consumer setting heartbeat.interval.ms) can be processed with the foreachRDD method. In this example, we simply print the topic name, offset, and partition. You can also use more complicated logic, like retrieving data from a stream of tweets.
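A sketch of printing the topic, partition, and offsets with foreachRDD. The helper name printOffsets is hypothetical, and the stream is assumed to come from KafkaUtils.createDirectStream:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.TaskContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

object OffsetPrinter {
  def printOffsets(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
    stream.foreachRDD { rdd =>
      // The cast only works on the RDD produced directly by the Kafka stream,
      // before any transformation that changes the partitioning.
      val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { _ =>
        // Spark partition i corresponds to offsetRanges(i).
        val o = offsetRanges(TaskContext.get.partitionId)
        println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
      }
    }
}
```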

There are three different ways to guarantee fault-tolerant behavior when Spark consumes from Kafka. The first is checkpointing. The Spark documentation says:

“A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures. There are two types of data that are checkpointed.”

In code, it looks as follows:
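A sketch of a checkpointed job, assuming a local checkpoint directory (in production this would be a fault-tolerant store such as HDFS) and the same hypothetical topic and broker as elsewhere in this article:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object CheckpointExample {
  val checkpointDir = "/tmp/spark-checkpoint" // assumption: any reachable directory

  // Builds a fresh context. Called only when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group"
    )
    // The whole stream definition must live inside this function so that it
    // can be restored from the checkpoint.
    KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("topicA"), kafkaParams))
      .map(_.value)
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if present, otherwise create a new context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that a checkpoint cannot be recovered after the application code changes, and outputs may be repeated after a failure, so they should be idempotent.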

I call the second strategy the “Kafka itself” strategy. Kafka has an offset commit API that stores offsets in a special Kafka topic. By default, the new consumer periodically auto-commits offsets, which means offsets may be committed before the corresponding output has been produced. Instead, once the streaming job has stored its output, you can commit the offsets to Kafka yourself using the commitAsync API. Kafka is not transactional, so your outputs must still be idempotent. In code, it looks as follows:
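A sketch of committing offsets back to Kafka after the output step. The helper name processAndCommit is hypothetical, and println stands in for a real, idempotent output:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

object CommitToKafka {
  def processAndCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
    stream.foreachRDD { rdd =>
      // Capture the offsets of this batch before doing any work.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Placeholder output step; a real job would write to a sink here.
      rdd.map(_.value).foreach(println)

      // Commit only after the output has completed. The cast must be applied
      // to the original stream, not to a transformed one.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
}
```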

The last strategy is to use your own data store. You can use storage such as an RDBMS or ZooKeeper for storing offsets; this is a very popular solution. If the store supports transactions, saving the offsets in the same transaction as the results gives the equivalent of exactly-once semantics. This strategy is especially useful when it is hard to make the output idempotent, for example with complicated aggregations:
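A sketch of this strategy. The OffsetStore trait is a hypothetical interface over your own database, not part of Spark or Kafka; a real implementation would write results and offsets in a single transaction:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Hypothetical abstraction over a transactional store.
trait OffsetStore {
  def load(): Map[TopicPartition, Long]
  def saveAtomically(results: Seq[String], offsets: Seq[OffsetRange]): Unit
}

object OwnStoreExample {
  def run(ssc: StreamingContext, kafkaParams: Map[String, Object], store: OffsetStore): Unit = {
    // Start each partition from the offset recorded in the store.
    val fromOffsets = store.load()
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // Assumes the per-batch result is small enough to collect on the driver.
      val results = rdd.map(_.value).collect()
      // Results and offsets are stored together, so they cannot drift apart.
      store.saveAtomically(results.toSeq, offsetRanges.toSeq)
    }
  }
}
```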

If you want to read data between certain offsets, you can simply obtain RDDs that represent content in this range in the topic:
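A sketch of a batch read over a fixed offset range, assuming the spark-streaming-kafka-0-10 artifact and a hypothetical topic topicA with at least two partitions. In this API createRDD expects a java.util.Map for the Kafka parameters:

```scala
import java.{util => ju}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka010.{KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object OffsetRangeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("offset-range").setMaster("local[2]"))

    val kafkaParams = new ju.HashMap[String, Object]()
    kafkaParams.put("bootstrap.servers", "localhost:9092")
    kafkaParams.put("key.deserializer", classOf[StringDeserializer])
    kafkaParams.put("value.deserializer", classOf[StringDeserializer])
    kafkaParams.put("group.id", "example-group")

    // topic, partition, inclusive starting offset, exclusive ending offset
    val offsetRanges = Array(
      OffsetRange("topicA", 0, 0L, 100L),
      OffsetRange("topicA", 1, 0L, 100L)
    )

    val rdd = KafkaUtils.createRDD[String, String](sc, kafkaParams, offsetRanges, PreferConsistent)
    rdd.map(_.value).take(10).foreach(println)
    sc.stop()
  }
}
```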

Pay attention to the difference between createStream and createDirectStream. You can read more about the differences between them here. This is an important concept; you must be able to distinguish the use cases for each.

Kafka also integrates well with binary formats like Avro and Protobuf. The integration of Apache Spark with Kafka and Avro can be organized in a separate module that is included on demand (Twitter’s bijection library simplifies the conversion code):
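A sketch of an Avro round trip with bijection, assuming the bijection-avro and avro libraries are on the classpath. The User schema is a made-up example, and the Spark wiring is left out to keep the focus on the conversion:

```scala
import com.twitter.bijection.Injection
import com.twitter.bijection.avro.GenericAvroCodecs
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

object AvroRoundTrip {
  // Hypothetical schema used only for illustration.
  val schemaJson: String =
    """{"type":"record","name":"User","fields":[
      |  {"name":"name","type":"string"},
      |  {"name":"age","type":"int"}
      |]}""".stripMargin

  val schema: Schema = new Schema.Parser().parse(schemaJson)

  // Injection converts a record to bytes and (possibly failing) back again.
  val injection: Injection[GenericRecord, Array[Byte]] =
    GenericAvroCodecs.toBinary[GenericRecord](schema)

  def main(args: Array[String]): Unit = {
    val user = new GenericData.Record(schema)
    user.put("name", "alice")
    user.put("age", Int.box(30))

    val bytes: Array[Byte] = injection(user)                 // payload to send to Kafka
    val decoded: GenericRecord = injection.invert(bytes).get // invert returns a Try
    println(s"${decoded.get("name")} ${decoded.get("age")}")
  }
}
```

On the consuming side, the same injection can be applied to the Array[Byte] values read from Kafka when the value deserializer is a byte-array deserializer.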

Spark Structured Streaming API

Spark Structured Streaming is one of the most exciting ideas presented in Apache Spark. It’s the next step in the evolution of Spark Streaming. Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. Structured Streaming makes it possible to build ETL pipelines in a very clear way. As a result, the code is similar to the Java 8 Stream API, so if you don’t know Scala but know Java, it will not be difficult for you to understand what is happening:
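A sketch of a streaming word count over a Kafka topic, assuming the spark-sql-kafka-0-10 artifact, a broker at localhost:9092, and a hypothetical topic topicA:

```scala
import org.apache.spark.sql.SparkSession

object StructuredKafkaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("structured-kafka")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Kafka rows carry binary key and value columns; cast the value to a string.
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topicA")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    // The same operations you would apply to a static Dataset.
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy("value")
      .count()

    val query = wordCounts.writeStream
      .outputMode("complete") // re-emit the full counts table on every trigger
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```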

Writing Data to Kafka Stream

There is a third-party library for this not-so-standard task. It provides a simple API for writing data to the stream. The next example shows how to read data from a socket and write it to the stream:
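The third-party library is not named in this text, so the sketch below swaps in a different technique: a plain KafkaProducer created per partition inside foreachPartition. The socket port 9999 and the topic name topicB are assumptions:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketToKafka {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("socket-to-kafka").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Text lines arriving on a local socket (for example, fed by `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // KafkaProducer is not serializable, so it is created on the executor.
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", classOf[StringSerializer].getName)
        props.put("value.serializer", classOf[StringSerializer].getName)
        val producer = new KafkaProducer[String, String](props)
        try records.foreach(line => producer.send(new ProducerRecord[String, String]("topicB", line)))
        finally producer.close() // close() waits for pending sends to finish
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Creating a producer for every partition of every batch is simple but costly; a pooled or lazily initialized producer per executor is a common refinement.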

Akka Streams

Akka Streams Kafka, also known as Reactive Kafka, is an Akka Streams connector to Apache Kafka. Akka Streams allows you to write data to Kafka topics via a Sink API:
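A sketch of the Sink API, assuming Alpakka Kafka (akka-stream-kafka) with Akka 2.6 or later, where the implicit ActorSystem also provides the materializer. The topic name is hypothetical:

```scala
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

object SinkExample extends App {
  implicit val system: ActorSystem = ActorSystem("sink-example")
  import system.dispatcher

  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  // Each element becomes one record; the sink completes when the source does.
  val done = Source(1 to 100)
    .map(n => new ProducerRecord[String, String]("topicA", n.toString))
    .runWith(Producer.plainSink(producerSettings))

  done.onComplete(_ => system.terminate())
}
```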

And exactly the same via a Flow API:
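A sketch of the Flow variant, assuming Alpakka Kafka 1.0 or later, where Producer.flexiFlow is available (older releases exposed a similar Producer.flow). Unlike the sink, the flow emits a result per record, so the stream can continue downstream:

```scala
import akka.actor.ActorSystem
import akka.kafka.{ProducerMessage, ProducerSettings}
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.{Sink, Source}
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

object FlowExample extends App {
  implicit val system: ActorSystem = ActorSystem("flow-example")
  import system.dispatcher

  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  val done = Source(1 to 100)
    // The second argument is a pass-through value returned with the result.
    .map(n => ProducerMessage.single(new ProducerRecord[String, String]("topicA", n.toString), n))
    .via(Producer.flexiFlow[String, String, Int](producerSettings))
    .map {
      case ProducerMessage.Result(metadata, message) =>
        s"partition=${metadata.partition} offset=${metadata.offset} passThrough=${message.passThrough}"
      case other => other.toString
    }
    .runWith(Sink.foreach(println))

  done.onComplete(_ => system.terminate())
}
```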

Consuming data with Akka Streams is very clear. You can build sophisticated data flows with the Graph DSL, where Kafka is one part of the graph:

As in the example with Apache Spark, you can save offsets in a database or in ZooKeeper:
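A sketch of a Kafka source inside a Graph DSL graph, assuming Alpakka Kafka with Akka 2.6 or later. The graph shape (one source broadcast to two printing branches) is an illustrative choice, and the topic and group names are hypothetical:

```scala
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink}
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer

object GraphExample extends App {
  implicit val system: ActorSystem = ActorSystem("graph-example")

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("example-group")
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
    import GraphDSL.Implicits._

    val source = Consumer.plainSource(consumerSettings, Subscriptions.topics("topicA"))
    val broadcast = builder.add(Broadcast[ConsumerRecord[String, String]](2))

    // Branch 1 prints payloads, branch 2 prints positions in the log.
    val values = Flow[ConsumerRecord[String, String]].map(r => s"value=${r.value}")
    val positions = Flow[ConsumerRecord[String, String]]
      .map(r => s"partition=${r.partition} offset=${r.offset}")

    source ~> broadcast
    broadcast ~> values ~> Sink.foreach[String](println)
    broadcast ~> positions ~> Sink.foreach[String](println)
    ClosedShape
  })

  graph.run()
}
```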
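A sketch of keeping offsets outside Kafka with Akka Streams. InMemoryOffsetStore is a hypothetical stand-in for a database or ZooKeeper client, and the topic name is an assumption:

```scala
import java.util.concurrent.atomic.AtomicLong
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.Future

// Hypothetical store; a real one would persist the offset durably.
class InMemoryOffsetStore {
  private val nextOffset = new AtomicLong(0L)

  def loadOffset(): Future[Long] = Future.successful(nextOffset.get)

  def processAndStore(record: ConsumerRecord[String, String]): Future[Done] = {
    println(s"processing ${record.value}")
    nextOffset.set(record.offset + 1) // position to resume from after a restart
    Future.successful(Done)
  }
}

object ExternalOffsetExample extends App {
  implicit val system: ActorSystem = ActorSystem("external-offsets")
  import system.dispatcher

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("example-group")

  val store = new InMemoryOffsetStore
  val partition = new TopicPartition("topicA", 0)

  // Assign the partition manually and start from the stored offset.
  store.loadOffset().foreach { fromOffset =>
    Consumer
      .plainSource(consumerSettings, Subscriptions.assignmentWithOffset(Map(partition -> fromOffset)))
      .mapAsync(1)(store.processAndStore) // parallelism 1 keeps offsets in order
      .runWith(Sink.ignore)
  }
}
```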

Akka Actors

Akka gives you the opportunity to implement the logic for producing and consuming Kafka messages with the Actor model. This is very convenient if actors are widely used in your code, and it significantly simplifies building data pipelines with actors. For example, you might have an Akka Cluster in which one part crawls web pages and another part indexes them and sends the indexed data to Kafka. The consumer can then aggregate the results. Producing data to Kafka looks as follows:

Consuming messages is straightforward: you set a supervisor strategy for handling failures and write the logic for each incoming record in the receive method:
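The actor library used for this is not named in the text, so the sketch below uses only a classic Akka actor wrapping the standard KafkaProducer. The IndexedPage message, the actor name, and the topic are all hypothetical:

```scala
import java.util.Properties
import akka.actor.{Actor, ActorSystem, Props}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Hypothetical message emitted by the indexing part of the cluster.
final case class IndexedPage(url: String, content: String)

class PageProducerActor(topic: String, bootstrapServers: String) extends Actor {
  private val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    new KafkaProducer[String, String](props)
  }

  def receive: Receive = {
    // send() is asynchronous, so the actor is not blocked by the broker.
    case IndexedPage(url, content) =>
      producer.send(new ProducerRecord[String, String](topic, url, content))
  }

  // Release network resources when the actor stops or restarts.
  override def postStop(): Unit = producer.close()
}

object ProducerActorExample extends App {
  val system = ActorSystem("producer-actor-example")
  val producerActor =
    system.actorOf(Props(new PageProducerActor("indexed-pages", "localhost:9092")), "page-producer")

  producerActor ! IndexedPage("https://example.com", "indexed content")
}
```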
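A sketch of the consuming side under the same assumption (plain classic Akka plus the standard KafkaConsumer from Kafka clients 2.0 or later, not a dedicated actor library). A supervisor restarts the polling actor on failure, and the record logic sits in receive:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.Restart
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.JavaConverters._

case object Poll // hypothetical self-message that drives the poll loop

class PollingConsumerActor(topic: String) extends Actor {
  private val consumer: KafkaConsumer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "example-group")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)
    val c = new KafkaConsumer[String, String](props)
    c.subscribe(Collections.singletonList(topic))
    c
  }

  override def preStart(): Unit = self ! Poll

  def receive: Receive = {
    case Poll =>
      // Logic for incoming records goes here.
      consumer.poll(Duration.ofMillis(500)).asScala.foreach { record =>
        println(s"offset=${record.offset} value=${record.value}")
      }
      self ! Poll // schedule the next poll
  }

  override def postStop(): Unit = consumer.close()
}

class ConsumerSupervisor extends Actor {
  // Restart the child on any exception, at most three times.
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 3) { case _: Exception => Restart }

  context.actorOf(Props(new PollingConsumerActor("indexed-pages")), "consumer")

  def receive: Receive = Actor.emptyBehavior
}

object ConsumerActorExample extends App {
  val system = ActorSystem("consumer-actor-example")
  system.actorOf(Props(new ConsumerSupervisor), "consumer-supervisor")
}
```

Because poll blocks for up to 500 ms, an actor like this is usually given a dedicated dispatcher so that it does not starve other actors.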

Kafka 0.10 supports SSL/TLS, and I strongly recommend using encryption everywhere in a production environment. Configuring keys and certificates in multiple locations is a routine task, so I collected all the necessary scripts and configuration for it here.

That’s all. You can find the full source code in my GitHub repository.
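A sketch of the client-side settings involved, expressed as extra consumer parameters. The property names are standard Kafka client settings; the file paths and passwords are placeholders, not values from the scripts mentioned above:

```scala
object SslParamsExample {
  // Merge these into the Kafka parameters passed to the consumer or producer.
  val sslParams: Map[String, Object] = Map[String, Object](
    "security.protocol" -> "SSL",
    "ssl.truststore.location" -> "/path/to/kafka.client.truststore.jks", // placeholder path
    "ssl.truststore.password" -> "changeit",                             // placeholder secret
    "ssl.keystore.location" -> "/path/to/kafka.client.keystore.jks",     // placeholder path
    "ssl.keystore.password" -> "changeit",                               // placeholder secret
    "ssl.key.password" -> "changeit"                                     // placeholder secret
  )
}
```

The keystore entries are needed only when the broker requires client authentication. In Spark this encrypts only the traffic between Spark and the Kafka brokers; communication between Spark nodes has to be secured separately.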