Structured Streaming

Introduction

Arun Jijo
DataKare Solutions
Feb 26, 2019


As streaming frameworks mature, they let developers concentrate on business challenges rather than on the underlying issues of streaming analytics. Structured Streaming is part of the Apache Spark project and is built on top of the Spark SQL engine. It is fault tolerant and scalable, and above all it hides the streaming complexities from the developer, which makes development much easier.

Structured Streaming was introduced in Spark 2.0 as a micro-batch stream processing engine. It was marked stable in Spark 2.2, which made it the standard stream processing engine in Spark and relegated DStreams to legacy status. This tutorial series will walk you through Structured Streaming and the recently introduced continuous processing mode.

Structured Streaming operates on top of Spark's SQL engine. Transformations, aggregations and event-time windows all work on DataFrames/Datasets: the incoming data is treated as an unbounded table that grows with every micro-batch. As a result, the same logic written for batch processing can be reused for streaming analytics.
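For instance, a transformation written against a batch DataFrame can be applied unchanged to a streaming DataFrame; only the read and write calls differ. The following is just a rough sketch: the column names level and service and the schema logSchema are made up for illustration.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// identical logic for batch and streaming; only the source/sink calls differ
def errorCounts(df: DataFrame): DataFrame =
  df.filter(col("level") === "ERROR").groupBy(col("service")).count()

// batch:     errorCounts(spark.read.json("logs/"))
// streaming: errorCounts(spark.readStream.schema(logSchema).json("logs/"))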

Advantages of Structured Streaming:

Event Time Processing

In Structured Streaming it is possible to process data based on event time, i.e. when the data was generated rather than when it arrives at the processing engine. This is achieved by using a time field carried in the data itself, and it allows the engine to produce correct, ordered results even when records reach it out of order.
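As a rough sketch, assuming a streaming DataFrame named events with a timestamp column eventTime (both names are illustrative), a count per five-minute window of event time could look like this:

import org.apache.spark.sql.functions.{col, window}

// count events per 5-minute window of the time they were generated,
// regardless of when they reached the engine
val windowedCounts = events
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()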

Watermarks

Structured Streaming supports watermarks, i.e. a threshold on how late data is allowed to arrive. The engine keeps old state around only for that period, so late records within the threshold are still accounted for while unbounded state growth is avoided.
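Continuing the sketch above (events and eventTime are again assumed names), a watermark of 10 minutes tells the engine to wait that long for late data before dropping the corresponding state:

import org.apache.spark.sql.functions.{col, window}

// state for a window is kept until event time has moved 10 minutes past it
val tolerantCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()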

Stream Processing on SQL engine

Structured Streaming is the first stream processing engine built on top of a SQL engine. The DataFrame/Dataset API is therefore simple and familiar to use, and streaming queries automatically benefit from the optimizations of the Spark SQL engine.

Let’s have a quick glance at an example, then discuss it further.
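Something along these lines, assuming a socket source on localhost:9999 purely for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StructuredStreamingIntro")
  .master("local[*]")
  .getOrCreate()

// readStream returns a DataStreamReader; load() produces a streaming DataFrame
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()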

The above example will be familiar to most of you if you have used Spark before, except for one difference: the “readStream” function instead of the traditional “read” function. readStream returns a “DataStreamReader” instead of the “DataFrameReader” returned by read. SparkSession is the entry point for Structured Streaming as well; there is no need to create a separate streaming context object as with DStreams.

The DataStreamReader supports sources such as Kafka, socket, rate and file sources. Of these, the socket and rate sources are meant for testing, while the file source supports formats like CSV, JSON, ORC, Parquet and text. The load function invoked on the DataStreamReader loads the data from the configured source.
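For instance, a Kafka source is declared with format("kafka"); the broker address and topic name below are placeholders:

// each Kafka record arrives with key, value, topic, partition, offset and timestamp columns
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()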

Up to this point we have only set up the streaming source. To actually receive the data and write it to a sink, Structured Streaming uses the DataStreamWriter.
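Continuing the socket sketch above, the write side could look like this, using the console sink and a 10-second trigger as described below:

import org.apache.spark.sql.streaming.Trigger

// writeStream returns a DataStreamWriter; start() launches the streaming query
val query = lines.writeStream
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

query.awaitTermination()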

The writeStream function returns a DataStreamWriter. The output mode defines how the data should be written to the sink, the format specifies the sink (the console in this case), and the trigger function tells the engine to query the data source every 10 seconds.

We will cover all the sources, sinks, output modes, triggers and other necessary details in depth in the upcoming chapters of this series.


Arun Jijo
DataKare Solutions

Arun Jijo is a data engineer at DataKare Solutions with expertise in Apache NiFi, Kafka and Spark, and a passion for Java.