Connecting Spark Streaming to Apache Kafka using Java

NaiveTech
2 min read · Apr 5, 2020


Kafka and Spark

What to expect?

I will try to explain how to connect to Kafka from Spark and get the streams of data ready to manipulate.

Assumptions and prerequisites

  1. You have a brief understanding of Spark Streaming and Apache Kafka.
  2. You are reasonably comfortable with Java.
  3. These are the Maven dependencies used (a sample follows below).
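The dependency list in the original post was an image, so here is a minimal sketch of what the pom.xml entries might look like. The artifact names are the standard Spark ones; the versions (Spark 2.4.5, Scala 2.11) are my assumptions, so match them to your own cluster:

```xml
<!-- Hypothetical versions; align these with your Spark/Scala/Kafka setup -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
```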

Let's get started, shall we? 😃

Let's look at it this way:
Kafka is a big library, and your job is to go in and find the book you want to read.
There are three main requirements to connect to Kafka:

Broker ID, topic names, and group ID

  • The broker ID is like the IP of the Kafka server, i.e. the address of the library.
  • Topic names are the books you want to read.
  • And the group ID is like a section in the library (say... the SCI-FI section): it identifies your consumer group, so Kafka can keep track of what you have already read.

Lighting the Spark 😝

Spark Config

In your main class you need to create a SparkConf, specifying just the name of the application and its master.
My setMaster() is "local" here because I was running this in local mode; you can point setMaster() at your cluster instead.

  • Durations.seconds(STREAM_INTERVAL_SEC) means that Spark will read data from Kafka every STREAM_INTERVAL_SEC seconds (60 in my case), which you can set to whatever interval you want. A sketch follows below.
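The original showed this step as a screenshot; here is a minimal sketch of what the setup might look like. The class name, app name, and the 60-second interval are my placeholders:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class KafkaSparkDemo {

    // Placeholder: micro-batch interval in seconds
    private static final int STREAM_INTERVAL_SEC = 60;

    public static void main(String[] args) throws InterruptedException {
        // "local[*]" runs Spark locally on all cores; point this at your cluster instead
        SparkConf conf = new SparkConf()
                .setAppName("kafka-spark-demo")
                .setMaster("local[*]");

        // Spark pulls a new micro-batch from Kafka every STREAM_INTERVAL_SEC seconds
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(STREAM_INTERVAL_SEC));

        // ... the Kafka parameters and the stream itself are wired up below
    }
}
```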

Kafka parameters


Here you just set up all the necessary Kafka values.
ConsumerConfig is a built-in class from the Kafka clients dependency that provides the standard Kafka configuration keys.
The main point to note here is that data in Kafka is stored as serialized key-value pairs, and Spark needs those pairs deserialized, so you also provide the KEY DESERIALIZER and VALUE DESERIALIZER.
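Again, the original was an image; a minimal sketch of the parameter map, continuing inside the main method from above, might look like this. The broker address and group ID are placeholders, and I'm assuming string keys and values:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

Map<String, Object> kafkaParams = new HashMap<>();
// The broker address -- the "address of the library" (placeholder)
kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// The consumer group -- the "section of the library" (placeholder)
kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
// Kafka stores records as serialized bytes; tell the consumer how to turn them back into objects
kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
// Where to start reading when there is no committed offset yet
kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
```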

Shake hands already!!!

Spark Kafka connection

This is where Spark and Kafka connect: Kafka returns data to Spark periodically, and Spark turns it into a stream of ConsumerRecord objects (a built-in class from the Kafka clients library, not from Spark) via KafkaUtils.createDirectStream(). See the sketch after the list below.

  • topicSet simply contains the set of all topics to connect to.
  • What LocationStrategies.PreferConsistent() does is more of a technical thing: simply put, Spark runs multiple executors for the same application (kind of like threads), and this strategy distributes the Kafka partitions evenly across all of them.
  • And lastly, ConsumerStrategies.Subscribe() provides all the Kafka details needed to connect.
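Putting it together, here is a minimal sketch of the connection itself, continuing the hypothetical main method above. The topic name is a placeholder, and the print-out is just the simplest way to touch each record:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

// The "books" we want to read (placeholder topic name)
Set<String> topicSet = new HashSet<>(Arrays.asList("my-topic"));

// Kafka hands Spark a fresh batch of ConsumerRecords every STREAM_INTERVAL_SEC seconds
JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topicSet, kafkaParams));

// Each micro-batch arrives as an RDD of records, ready to manipulate
stream.foreachRDD(rdd ->
        rdd.foreach(record ->
                System.out.println(record.key() + " -> " + record.value())));

jssc.start();
jssc.awaitTermination();
```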

FIN 😄


NaiveTech

We are a bunch of enthusiasts who just want to share our small difficulties/experiences/victory stories with everyone.