Multiple Sinks In Spark Structured Streaming

Mithlesh Vishwakarma · Globant · Jul 13, 2023
ref: https://qimia.io/en/blog/developing-streaming-applications-spark-structured-streaming/

While building a data pipeline with near real-time execution, I ran into an interesting scenario that spans the full path from reading the source and transforming complex data to writing it into multiple sinks/writes. While looking for solutions, I went through many articles explaining individual concepts of pipeline creation and execution; I believe this is a scenario where most developers struggle with the constraints of streaming during real-time execution.

In this article, I will cover the behavior of Spark Structured Streaming in the single-input, multiple-output scenario and go into the details of its execution.

Problem Statement

When we read data from a single source and write it to multiple sinks/writes, multiple jobs are created, which causes the same data to be read multiple times from the source.

Spark Behavior when Splitting a Stream into multiple sinks

To reproduce this scenario, we consume data from Kafka using Structured Streaming and write the processed dataset to S3, using multiple writers in a single job.

When writing a dataset created from a Kafka input source, the intuitive expectation is a single job with a single input and multiple outputs. Still, when I checked the Spark UI, a separate job was created for each output.

Since we were forking the output to multiple sinks, we had to provide a checkpoint location for each sink/write, and this forking caused the creation of multiple streaming queries (jobs).

Because of Spark's parallelism, these streaming queries (jobs) are scheduled independently and each maintains its own checkpoint location. As a result, the data is read from the input source independently (once per sink/write), and this is exactly what the Spark UI reflects.

In conclusion, the data is read multiple times from the input topic even when the same dataset is created from that source and then written to multiple sinks/writes.

In the above image, the timeline shows exactly this behavior: each sink has its own job ID and repeatedly starts reading data from the source on its own.

The above screenshot gives a better idea of how Spark Structured Streaming internally works when executing multiple writers over a single read.

For the same batch, it creates multiple jobs with the same query plan, hence showing multiple read attempts while writing to multiple sinks.

If we have N sinks, we have N streaming queries. Each query has its own checkpoint location and fetches data from the source independently.

Each sink/write creates a new stream that runs independently, with offsets tracked in its own checkpoint location. A different consumer group ID is created for each stream, as if you were running separate read + write streams.
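As a quick sanity check (a minimal sketch, assuming a SparkSession named spark and that the multi-sink job shown further below is already running), the streaming query manager lists one active query per sink:

# Each started sink shows up as its own StreamingQuery, with its own id and
# run id, confirming the one-query-per-sink behavior described above.
for query in spark.streams.active:
    print(query.name, query.id, query.runId)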

Cons: This causes performance degradation when the source/input holds a massive dataset. The above use case was tested with 51K messages in the Kafka topic, and each individual write stream took about 4 seconds.

Below is the implementation of multiple sinks/writes.
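The original code is not embedded here, so the following is a minimal PySpark sketch of the multi-sink pattern described above; the broker address, topic name, and S3 paths are placeholder assumptions, not the author's actual values.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-sink-demo").getOrCreate()

# Read once from Kafka (placeholder broker/topic).
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "input-topic")
    .load()
)

events = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# One writeStream per sink: each .start() creates an independent streaming
# query with its own checkpoint location, so each one re-reads from Kafka.
for name in ["sink_a", "sink_b", "sink_c"]:
    (
        events.writeStream
        .format("parquet")
        .option("path", f"s3a://my-bucket/{name}/")
        .option("checkpointLocation", f"s3a://my-bucket/checkpoints/{name}/")
        .queryName(name)
        .start()
    )

spark.streams.awaitAnyTermination()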

Solution: Achieving writes to multiple sinks using the foreachBatch writer

We can prevent multiple read attempts while writing to multiple sinks by using foreachBatch.

In this approach, we consume data from Kafka using Structured Streaming and write the processed dataset to S3, using multiple writers within a single job via foreachBatch.

ref: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-MicroBatchExecution.html

The difference from the normal execution is that with foreachBatch there is a single streaming query, and each micro-batch is handed to the foreachBatch function for processing.

Inside foreachBatch, the micro-batch is exposed as a regular batch DataFrame, which we can write to multiple destinations.

Because there is a single streaming query, we maintain a single checkpoint location.

With the foreachBatch writer, we also get the opportunity to cache the DataFrame.

The above image shows the performance improvement when writing to multiple sinks without reading from the source multiple times.

In the initial job, the data is read from the source once; with foreachBatch, the batch DataFrame can be cached and the cached data reused for the rest of the job.

Cons: This approach only supports sequential execution of the writes, which can lead to heavy compute consumption within the single query.

Below is the code implementation using foreachBatch:

https://gist.github.com/mithsv20/c94fd629b359f7798041926de22d102f

This code reads data from a Kafka stream and writes it to three different Parquet locations. It uses a single streaming query that processes the data in micro-batches and calls a function via foreachBatch to write the Parquet files, which prevents multiple reads from the source. The code also uses a checkpoint location to ensure data is not lost if the Spark application fails.
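For reference, here is a minimal sketch of that approach rather than the exact gist contents; the broker address, topic name, and output paths are placeholder assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachbatch-demo").getOrCreate()

# Read once from Kafka (placeholder broker/topic).
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "input-topic")
    .load()
)

events = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

def write_to_sinks(batch_df, batch_id):
    # The micro-batch arrives as a regular DataFrame; cache it so the three
    # writes below reuse the same data instead of recomputing it.
    batch_df.persist()
    for name in ["sink_a", "sink_b", "sink_c"]:
        batch_df.write.mode("append").parquet(f"s3a://my-bucket/{name}/")
    batch_df.unpersist()

# A single streaming query with a single checkpoint location; Kafka is read
# once per micro-batch regardless of how many sinks are written inside the
# function.
query = (
    events.writeStream
    .foreachBatch(write_to_sinks)
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/foreachbatch/")
    .start()
)

query.awaitTermination()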

Conclusion

In this article, we discussed the behavior of Spark Structured Streaming when writing data to multiple sinks. We saw that, by default, Spark creates a separate streaming query for each sink, which can lead to performance degradation and multiple reads of the same data from the source. To avoid this, we can use the foreachBatch writer, which allows us to write to multiple sinks within a single streaming query. The foreachBatch writer also allows us to cache the DataFrame, which can further improve performance.

Here are some of the key takeaways from this article:

  • Spark Structured Streaming creates a separate streaming query for each sink by default.
  • This can lead to performance degradation and multiple reads of the same data from the source.
  • To avoid this, we can use the foreachBatch writer.
  • The foreachBatch writer allows us to write to multiple sinks within a single streaming query.
  • It also allows us to cache the DataFrame, further improving performance.

I hope you found this article helpful. If you have any questions, please feel free to comment below.

Feel free to follow me on LinkedIn for more articles.
