Apache Spark Streaming vs Structured Streaming

Anand Satheesh
May 22, 2024


Apache Spark, a unified analytics engine for large-scale data processing, offers two primary ways to handle real-time data: Spark Streaming and Structured Streaming. Both approaches enable the processing of streaming data, but they have significant differences in their architecture, APIs, and capabilities.

Overview

Spark Streaming:

  • Introduced as an extension of the core Spark API.
  • Uses DStreams (Discretized Streams), which are micro-batch abstractions.
  • Designed for ease of integration with the Spark ecosystem.
  • Offers good scalability and fault tolerance.

Structured Streaming:

  • Introduced as part of Spark 2.0.
  • Built on top of Spark SQL engine.
  • Uses DataFrames and Datasets.
  • Provides end-to-end exactly-once semantics.
  • Focuses on providing a unified model for both batch and stream processing.

Architecture

Spark Streaming Architecture:

  • Based on the concept of micro-batching.
  • Divides the stream of data into small batches (typically 1 second to several seconds).
  • Each batch is processed as a Resilient Distributed Dataset (RDD).
  • Uses DStreams, which represent a continuous series of RDDs.
  • Fault tolerance achieved through lineage information.
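
Because a DStream is simply a sequence of RDDs, each micro-batch can be inspected directly. A minimal sketch, assuming an existing StreamingContext named ssc:
val lines = ssc.socketTextStream("localhost", 9999)
// foreachRDD exposes the underlying RDD of each micro-batch
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}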

Structured Streaming Architecture:

  • Treats a live data stream as an unbounded table that is continuously appended to.
  • Uses the same DataFrame and Dataset API used for batch processing.
  • Streaming queries are expressed like batch queries and executed incrementally as new data arrives.
  • Built on the Catalyst optimizer and Tungsten execution engine.
  • Supports both micro-batching and continuous processing modes.
  • End-to-end exactly-once fault tolerance.
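
The processing mode is chosen per query through a trigger. A brief sketch, assuming an existing streaming DataFrame named df and Spark 2.3 or later:
import org.apache.spark.sql.streaming.Trigger

// Default micro-batch mode, with a new batch attempted every 5 seconds
df.writeStream.trigger(Trigger.ProcessingTime("5 seconds")).format("console").start()

// Experimental continuous mode with a 1-second checkpoint interval
// (only a restricted set of sources, sinks, and operations is supported)
df.writeStream.trigger(Trigger.Continuous("1 second")).format("console").start()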

APIs and Programming Model

Spark Streaming APIs:

  • Primary abstraction: DStream.
  • Operations are similar to RDD transformations and actions.
  • Provides a rich set of built-in sources and sinks (e.g., Kafka, HDFS, TCP sockets).
  • Example of a simple Spark Streaming application:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("SparkStreamingExample")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
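
To try this locally, run nc -lk 9999 in another terminal and type some words; the counts for each one-second batch are printed to the console.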

Structured Streaming APIs:

  • Primary abstractions: DataFrame and Dataset.
  • Uses the same high-level APIs for both batch and streaming computations.
  • Supports various output modes: Append, Update, and Complete.
  • Built-in support for event-time processing and watermarks.
  • Example of a simple Structured Streaming application:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()
import spark.implicits._  // required for .as[String] and Dataset operations

val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
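
Event-time aggregation with a watermark is a small addition on top of this. A minimal sketch, assuming a hypothetical streaming DataFrame events with an event-time column timestamp and a word column:
import org.apache.spark.sql.functions.{col, window}

// Discard state for events arriving more than 10 minutes behind the
// latest observed event time, and count words per 5-minute window
val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
  .count()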

Fault Tolerance and Checkpointing

Spark Streaming:

  • Fault tolerance is based on the lineage of RDDs.
  • Uses checkpointing to recover stateful transformations.
  • Checkpointing needs to be enabled and configured by the user.
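
A sketch of driver-recoverable checkpointing, assuming the conf from the earlier example, a hypothetical checkpoint path, and a factory function that defines the DStream logic:
val checkpointDir = "hdfs:///checkpoints/my-app"  // hypothetical path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)  // enables metadata and state checkpointing
  // ... define sources and transformations here ...
  ssc
}

// Recover from an existing checkpoint, or build a fresh context
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)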

Structured Streaming:

  • Fault tolerance is built into the engine.
  • Automatically tracks the progress and state of the streaming query.
  • Supports checkpointing and event-time processing with watermarks.
  • The user supplies only a checkpoint location; offset tracking and state snapshots are managed by the engine.
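
Setting it is a single option on the writer. A minimal sketch, reusing wordCounts from the example above with a hypothetical path:
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/word-counts")  // hypothetical path
  .start()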

State Management

Spark Streaming:

  • State management is achieved through stateful transformations like updateStateByKey.
  • Requires manual handling of state persistence and recovery.
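
A sketch of a running word count with updateStateByKey; checkpointing must be enabled first, and the update function folds each batch's counts into the stored total:
ssc.checkpoint("/tmp/checkpoints")  // required for stateful transformations

// Merge the new counts from this batch into the running total
val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

// pairs is a hypothetical DStream[(String, Int)], e.g. words.map(x => (x, 1))
val runningCounts = pairs.updateStateByKey[Int](updateCount)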

Structured Streaming:

  • State management is integrated into the API.
  • Supports stateful operations using mapGroupsWithState and flatMapGroupsWithState.
  • Offers built-in support for session windows and event-time processing.
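
A hedged sketch of the same running word count with mapGroupsWithState, assuming a Dataset[String] named words and that spark.implicits._ is in scope:
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Keep one Long of state (the running count) per word
def countWord(word: String, values: Iterator[String],
              state: GroupState[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + values.size
  state.update(newCount)  // persisted across triggers by the engine
  (word, newCount)
}

val runningCounts = words
  .groupByKey(identity)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(countWord)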

Performance and Scalability

Spark Streaming:

  • Performance depends on the batch interval: end-to-end latency cannot drop below it.
  • Suited to applications that tolerate latencies of a second or more.
  • Scales well with increasing data volume and cluster size.

Structured Streaming:

  • In micro-batch mode, achieves end-to-end latencies as low as roughly 100 ms; the experimental continuous mode targets millisecond-level latencies.
  • Leverages the Catalyst optimizer for query optimization.
  • Can handle both high-frequency data streams and complex stateful computations efficiently.

Use Cases

Spark Streaming:

  • Log processing.
  • Real-time analytics dashboards.
  • Simple event detection.

Structured Streaming:

  • Complex event processing.
  • Real-time data pipelines with exactly-once semantics.
  • Real-time ETL and continuous applications.
  • Scenarios requiring unified batch and stream processing.

DStreams (Discretized Streams) are considered deprecated in favor of Structured Streaming in Apache Spark. The shift from DStreams to Structured Streaming represents an evolution in Spark’s approach to stream processing, aiming to provide a more unified and efficient model.

Reasons for Deprecation

  1. Unified API: Structured Streaming uses the same DataFrame and Dataset API as batch processing, offering a more consistent and easier-to-use API.
  2. Fault Tolerance and Exactly-Once Semantics: Structured Streaming supports exactly-once processing semantics, which is more reliable and robust compared to the at-least-once semantics typically associated with DStreams.
  3. Optimized Execution: Structured Streaming leverages Spark’s Catalyst optimizer and Tungsten execution engine for more efficient query planning and execution.
  4. Ease of Use: Structured Streaming simplifies the development of streaming applications with built-in support for event-time processing, watermarks, and stateful operations.
  5. Community and Support: The Spark community and developers are focusing on enhancing Structured Streaming, making it the preferred choice for new streaming applications.

Transition from DStreams to Structured Streaming

If you have existing Spark Streaming applications using DStreams, it is recommended to migrate to Structured Streaming to take advantage of the improved features and future support. The migration involves rewriting the streaming logic using the DataFrame and Dataset API. Here’s a brief guide on transitioning from DStreams to Structured Streaming:

  1. Input Sources: Use the appropriate DataFrame readers for input sources.
  • DStreams: ssc.socketTextStream("localhost", 9999)
  • Structured Streaming: spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

  2. Transformations: Replace RDD transformations with DataFrame/Dataset transformations.

  • DStreams: lines.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
  • Structured Streaming: lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

  3. Output Sinks: Use DataFrame writers for output sinks.

  • DStreams: wordCounts.print()
  • Structured Streaming: wordCounts.writeStream.outputMode("complete").format("console").start()

  4. Stateful Operations: Replace updateStateByKey with mapGroupsWithState or flatMapGroupsWithState (see the sketch in the State Management section above).

Conclusion

Both Spark Streaming and Structured Streaming provide powerful capabilities for processing real-time data, but they cater to different needs and use cases. Spark Streaming, with its micro-batching model and DStream abstraction, is suitable for simpler applications and environments where the latency requirements are less stringent. On the other hand, Structured Streaming, with its continuous processing model, DataFrame and Dataset API, and support for exactly-once semantics, is ideal for more complex and demanding real-time data processing scenarios.

When deciding between Spark Streaming and Structured Streaming, consider the specific requirements of your application, including latency, fault tolerance, state management, and the complexity of the processing logic. Structured Streaming is generally recommended for new projects due to its advanced features and seamless integration with the Spark ecosystem. However, Spark Streaming remains a viable option for existing projects and simpler use cases.
