Spark Chapter 12: Spark with Apache Kafka

Kalpan Shah
Plumbers Of Data Science
9 min read · Aug 4, 2023

Spark Structured Streaming with Apache Kafka

Previous blog/Context:

In an earlier blog, we discussed Spark ETL with Lakehouse and Lakehouse optimization. Please see the blog post below for more details if you have not checked it yet.

Introduction:

Today, we will discuss the points below.

  • What is Apache Kafka?
  • Basic concepts of Apache Kafka (Publisher and Subscriber)
  • Publish and subscribe messages from the command line
  • What is Spark Structured Streaming?
  • Spark Structured Streaming using Apache Kafka

In this blog, we will first understand the basic concepts of Apache Kafka and Spark Structured Streaming, and then, with practical examples, we will do Spark Structured Streaming using Apache Kafka.

Spark ETL with Apache Kafka (Image by Author)

We will be learning all of the above concepts by doing the below hands-on.

  1. Create Apache Kafka Publisher, create the topic, and publish messages
  2. Create Apache Kafka Consumer, subscribe to the topic, and receive messages
  3. Create a Spark session & install the required libraries for Apache Kafka
  4. From the Spark session, subscribe to the earlier created topic
  5. Stream messages into the console
  6. Write streaming messages into the files (CSV or JSON or Delta format)
  7. Write streaming messages to the database (MySQL or PostgreSQL or MongoDB)

First, clone the GitHub repo below, where we have all the required sample files and the solution.

If you don’t have a Spark instance set up, follow the earlier blog for setting up Data Engineering tools on your system. (The Data Engineering suite will set up Spark, MySQL, PostgreSQL, MongoDB, and Apache Kafka on your system.) In that Spark instance, we already have packages installed for Azure Blob Storage and Azure Data Lake Storage.

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform developed by the Apache Software Foundation. It is designed to handle high-throughput, real-time data streams and provides a scalable, fault-tolerant, and reliable infrastructure for building data-driven applications.

The main concepts of Apache Kafka include:

  1. Topics: Kafka organizes data into topics, which can be thought of as feeds or categories to which records are published. Topics are partitioned to allow for parallel processing and high throughput.
  2. Producers: Producers are applications that produce (write) data into Kafka topics. They send messages or records to Kafka, which are then stored in the topic.
  3. Consumers: Consumers are applications that read data from Kafka topics. They subscribe to one or more topics and receive messages from those topics.
  4. Brokers: Brokers are the servers that manage and store data. When a producer sends data to a topic, the broker is responsible for storing it. When a consumer reads data from a topic, it pulls data from the broker.
  5. Partitions: Topics can be divided into multiple partitions to achieve scalability and parallelism. Each partition is an ordered, immutable sequence of records.
  6. Consumer Groups: Consumers reading from a topic can be organized into consumer groups. Each message within a partition can only be consumed by one consumer within a group, allowing for load balancing and parallel processing.

Kafka’s architecture is designed to be highly scalable and fault-tolerant. Data is persisted to disk and replicated across multiple brokers to ensure data durability and availability. This design makes Kafka suitable for various use cases, such as real-time event processing, log aggregation, data integration, and streaming analytics.

For more details, follow the links below from the web.

What is Spark Structured Streaming?

Spark Structured Streaming is a component of Apache Spark that enables real-time stream processing with a high-level, declarative API. It allows you to process data streams in a manner similar to batch processing, making it easier for developers familiar with Spark’s batch processing API to work with real-time data streams.

Structured Streaming provides an abstraction called “DataFrames” and “Datasets,” which are distributed collections of data organized into named columns. Under the hood, these DataFrames and Datasets represent an unbounded stream of data, which means they can continuously receive data as it arrives.

The overall architecture of Structured Streaming is based on micro-batch processing. It periodically groups the incoming data into small batches, processes them as Spark batch jobs, and then updates the output sink. This abstraction simplifies stream processing by leveraging Spark’s batch-processing capabilities.

For more details on Spark Structured Streaming, follow the links below.

Now that we have a basic understanding of Apache Kafka and Spark Structured Streaming, we will do the hands-on exercises using the scenarios below.

Create Apache Kafka Publisher, create the topic, and publish messages

Once you have cloned the GitHub repo for the Data Engineering suite, build the image and start the Docker containers using docker-compose. We will see the containers below running.
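
As a minimal sketch (assuming the repository’s docker-compose.yml is in your current directory), starting the suite looks like this:

    # build the images and start all containers in the background
    docker-compose up -d --build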

Data Engineering Suite Docker Containers (Image by Author)

We will go inside the Kafka container. (You can use the Docker terminal or open it in an external terminal.)
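
A minimal sketch of opening a shell inside the container; the container name "kafka" is an assumption, so check the actual name with docker ps first:

    # find the Kafka container name, then open a shell inside it
    docker ps
    docker exec -it kafka bash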

I am opening it in an external terminal; you will see the screen below.

Inside Kafka Docker container (Image by Author)

We will use the command below to check whether any topics already exist.
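
A sketch of the list command, assuming the broker listens on localhost:9092 inside the container (on some images the script is named kafka-topics instead of kafka-topics.sh):

    # list all existing topics
    kafka-topics.sh --bootstrap-server localhost:9092 --list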

The list will be empty, as we have not created any topic yet.

List Kafka topics (Image by Author)

Now, we will create our first topic for publishing messages. (In the future, whoever subscribes to our topic will receive a message immediately whenever we publish on that topic.) To understand this concept, we will consider our publisher as a news company and the topic as a particular news category.

For example, our news company is “XYZ” and the news category is “technology”. In the below example, we are creating our topic with the name “News_XYZ_Technology”
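
A sketch of the create command, with the same broker-address assumption as above:

    # create the technology news topic
    kafka-topics.sh --bootstrap-server localhost:9092 --create --topic News_XYZ_Technology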

Once we execute this command, we will see a message that the topic is created.

Create new Kafka topic (Image by Author)

Now our news company has one more category, fashion, and for that the topic name is “News_XYZ_Fashion”. This time, we will create the topic with 3 partitions to improve performance when sending messages to all the subscribers.
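
A sketch of the same command with the partition count added:

    # create the fashion news topic with 3 partitions
    kafka-topics.sh --bootstrap-server localhost:9092 --create --topic News_XYZ_Fashion --partitions 3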

Now, we will list the topics again.

List of topics (Image by Author)

To get detailed information about a topic, we will use the command below.
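
A sketch of the describe command (shown here for the fashion topic as an example):

    # show partition, leader, and replica details for a topic
    kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic News_XYZ_Fashion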

It will display the details below.

Create Kafka Topic with extra properties (Image by Author)

Now that we have our topics created, we will publish messages. For publishing messages, we will use the command below.
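
A sketch of the console producer, again assuming the broker is at localhost:9092:

    # start a console producer on the technology topic;
    # every line typed at the ">" prompt is published as one message
    kafka-console-producer.sh --bootstrap-server localhost:9092 --topic News_XYZ_Technology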

Once we execute this command, it will prompt the publisher for the messages to be published.

Publish Messages on Kafka Topic (Image by Author)

When someone subscribes to our topic in the future (in our terms, news related to technology), they will immediately receive our messages.

Create Apache Kafka Consumer, subscribe to the topic, and receive messages

We will use the command below to subscribe to the above topic.

With this command, we will also ask to receive messages from the beginning.
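
A sketch of the console consumer with the from-beginning option:

    # subscribe to the topic and read all messages from the beginning
    kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic News_XYZ_Technology --from-beginning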

Consume subscribed messages (Image by Author)

Now, we will open multiple terminals: one publisher publishing messages and three consumers subscribed to the same topic. When the publisher publishes messages, all consumers will receive them immediately.

The top-left one is the producer and the rest are consumers.

Live Publish and Consume messages (Best image captured by author till now :-p)

Create a Spark session & install the required libraries for Apache Kafka

We will first start the Spark session and load the required packages/libraries for Apache Kafka (and for MySQL, which we will use later for loading data from Spark Structured Streaming into MySQL).

We will use the code below to create the Spark session.
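
A minimal PySpark sketch; the application name and package versions are assumptions and should match your Spark build:

    from pyspark.sql import SparkSession

    # Kafka source for Structured Streaming plus the MySQL JDBC driver
    spark = (
        SparkSession.builder
        .appName("KafkaStructuredStreaming")
        .config(
            "spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
            "mysql:mysql-connector-java:8.0.33",
        )
        .getOrCreate()
    )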

All the required libraries are loaded and the session is started, with the Spark UI available on port 4040.

Starting Spark Session (Image by Author)

Now, we will register our Spark application as a consumer using Spark Structured Streaming. We will use the code below to connect to the Kafka server and topic.
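
A minimal sketch of the Kafka source; the bootstrap server address kafka:9092 is an assumption based on the Docker setup:

    # subscribe to the topic as an unbounded streaming DataFrame
    df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "News_XYZ_Technology")
        .option("startingOffsets", "earliest")  # read published messages from the start
        .load()
    )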

To print messages on the console, we will use the below code.
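
A minimal console-sink sketch, using the streaming DataFrame df from the previous step:

    # print the raw Kafka records (binary key/value plus metadata) to the console
    query = (
        df.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )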

With this code, it will print all the published messages from the start, along with their metadata.

Printing streaming binary messages on console (Image by Author)

This shows the data in binary format. To show the data as strings, we need to cast the value column to a string. We will use the code below for the same.
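
A sketch of the cast, again building on the df defined above:

    # cast the binary key and value columns to strings before printing
    messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    query = (
        messages.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )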

It will display messages in string format as below.

Print messages in string format on console (Image by Author)

If we check the Spark UI, we can also see when a new message is published.

Spark UI (Image by Author)

From the Spark Streaming tab, we can see all the active Structured Streaming queries.

Spark Structured Streaming active session (Image by author)

We can also use the awaitTermination function so our streaming query will not stop until it is force-stopped.

For more details read this -> awaitTermination

We will add this to our code now and execute it again.
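
A one-line sketch appended to the previous query:

    # block until the streaming query is stopped manually
    query.awaitTermination()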

Once we execute this code, that particular cell will keep running continuously and wait for messages.

Streaming messages (Image by Author)

Write streaming messages into the files (CSV or JSON or Delta format)

Earlier, we printed streaming messages into the console. Now we will store all the messages in a CSV file.

We will use the below code for the same.
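
A sketch of the file sink; the output and checkpoint paths are illustrative placeholders:

    # write each micro-batch of messages to CSV files
    query = (
        messages.writeStream
        .format("csv")  # could also be json, parquet, delta, or avro
        .option("path", "data/kafka_messages")
        .option("checkpointLocation", "data/kafka_checkpoint")
        .outputMode("append")
        .start()
    )

    query.awaitTermination()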

In format, we can pass CSV, JSON, Parquet, Delta, or Avro. We will pass CSV, as we want to store messages in CSV format.

We will also need to provide the path where we want to create the CSV files, and we need to pass the checkpoint location. The checkpoint stores which messages have already been received, so the next time the streaming job starts, it will resume from where it stopped.

Once we execute the code above, it will create two folders: one for data and one for the checkpoint.

Writing streaming messages into CSV files (Image by Author)

In the checkpoint folder, we have metadata files that store all the commit, offset, and metadata details.

Checkpoint folder (Image by Author)

And the data folder has CSV files with the messages received from the publisher.

CSV files created by Spark structured streaming (Image by Author)

Write streaming messages to the database (MySQL or PostgreSQL or MongoDB)

Now, we will write the same data into a database. Here, we will store the data in a MySQL database. With our Data Engineering setup, we already have a MySQL database. We will first create a schema in the database.

Please use the script below to create a schema and table in MySQL Server.
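
A sketch of such a script; the schema name and column definitions are assumptions, and only the table name StreamMessagesKafka comes from this blog:

    -- create a schema and a simple key/value table for the streamed messages
    CREATE SCHEMA IF NOT EXISTS kafka_stream;

    CREATE TABLE IF NOT EXISTS kafka_stream.StreamMessagesKafka (
        `key`   VARCHAR(255),
        `value` TEXT
    );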

Now, we will again subscribe to the same topic and store messages directly in the SQL table. We will use the code below for the same.
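
A sketch of the foreachBatch approach; the JDBC URL, credentials, and schema name are assumptions for illustration:

    # write each micro-batch to MySQL via JDBC
    def foreach_batch_function(batch_df, batch_id):
        (
            batch_df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
            .write
            .format("jdbc")
            .option("url", "jdbc:mysql://mysql:3306/kafka_stream")
            .option("driver", "com.mysql.cj.jdbc.Driver")
            .option("dbtable", "StreamMessagesKafka")
            .option("user", "root")
            .option("password", "<your-password>")
            .mode("append")
            .save()
        )

    query = (
        df.writeStream
        .foreachBatch(foreach_batch_function)
        .start()
    )

    query.awaitTermination()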

Once we execute this code, whenever a message is published it will call the function “foreach_batch_function”, which writes the data into the SQL server. So now, if we check the SQL table “StreamMessagesKafka”, we will see all the messages published so far.

SQL Table and messages created by Spark Structured streaming (Image by Author)

Conclusion:

  • In this blog, we learned about Apache Kafka and Spark Structured Streaming.
  • We also learned how to do streaming in Spark using Apache Kafka.
  • We learned how to create Kafka topics and publish messages, and also how to subscribe to topics.
  • We also learned how to print published messages on the console and how to store the same messages in CSV files or a MySQL database.

If you want to learn more about Data Engineering and Data Science follow me on LinkedIn.

LinkedIn -> https://www.linkedin.com/in/shahkalpan09/

