EXPEDIA GROUP TECHNOLOGY — DATA

Apache Spark Structured Streaming — First Streaming Example (1 of 6)

Using a scalable and fault-tolerant stream processing engine

Neeraj Bhadani
Expedia Group Technology

--


In this blog series, we discuss Apache Spark™️ Structured Streaming. We cover its components and work through examples to understand them.

You may also be interested in some of my other posts on Apache Spark.

Note: We use the Scala API in this blog series. For easy reference, the Scala files are available on GitHub.

Let’s get started with our journey on Apache Spark Structured Streaming.

Overview

Working with streaming data is a little different from working with batch data. With streaming data we never have the complete dataset for analysis, because data is continuously arriving. Apache Spark provides a streaming API that lets us analyze streaming data in much the same way we work with batch data. Apache Spark Structured Streaming is built on top of the Spark SQL API to leverage its optimizations. The Spark Streaming engine processes data in near real time from input sources and writes the output to external storage systems.

Diagram: multiple input sources pipe into the Spark Streaming engine, which writes to multiple persistent stores.

Spark Streaming has three major components: input sources, the streaming engine, and sinks. Input sources, such as Kafka, Flume, and HDFS/S3, generate the data. The Spark Streaming engine processes the incoming data from these sources, and sinks such as HDFS, relational databases, or NoSQL datastores store the processed data.

Let’s conceptualise Spark Streaming data as an unbounded table where new data will always be appended at the end of the table.

Diagram: updates within a stream of data are mapped to rows appended to the bottom of an unbounded table.

Spark processes data in micro-batches, whose cadence is defined by triggers. For example, if we define a trigger of 1 second, Spark creates a micro-batch every second and processes it accordingly. We will discuss triggers in detail in a separate blog.
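
As a hedged preview (triggers get their own post later in the series), a 1-second processing-time trigger would look roughly like this, with streamingDF standing in for any streaming DataFrame:

import org.apache.spark.sql.streaming.Trigger

streamingDF
  .writeStream
  .trigger(Trigger.ProcessingTime("1 second"))  // fire a micro-batch every second
  .format("console")
  .start()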

Output modes

After processing the streaming data, Spark needs to write it to persistent storage. Spark supports several output modes that control which rows are written on each trigger; a short sketch of selecting a mode follows the list below.

  • Append Mode: In this mode, Spark will output only newly processed rows since the last trigger.
  • Update Mode: In this mode, Spark will output only updated rows since the last trigger. If we are not using aggregation on streaming data (meaning previous records can’t be updated) then it will behave similarly to append mode.
  • Complete Mode: In this mode, Spark will output all the rows it has processed so far.
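
As a rough sketch of how a mode is selected (streamingDF here is a hypothetical streaming DataFrame; only the outputMode call is the point):

streamingDF
  .writeStream
  .outputMode("update")   // or "append" / "complete"
  .format("console")
  .start()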

Our first streaming example using rate source

Now let’s get our hands dirty with our first Spark Streaming example using the rate source and the console sink. The rate source auto-generates data, which we then print to the console.

Import libraries

Let’s first import the required libraries.
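
A minimal sketch of the imports this example relies on (just SparkSession and the SQL functions):

// SparkSession entry point and the col() function used below
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._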

Create Spark session

Now let’s create the SparkSession and set the logging level to ERROR to suppress the WARN and INFO logs.
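
A minimal sketch, assuming a local master and an application name of our own choosing:

val spark = SparkSession
  .builder()
  .master("local[*]")                     // assumption: run locally using all cores
  .appName("StructuredStreamingExample")  // hypothetical application name
  .getOrCreate()

// Keep the console output readable by suppressing WARN and INFO logs
spark.sparkContext.setLogLevel("ERROR")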

Create streaming DataFrame

Let’s create our first Spark Streaming DataFrame using the rate source. Here we specify the format as rate and set rowsPerSecond = 1 to generate one row per second, loading the data into the initDF streaming DataFrame. We also check whether initDF is a streaming DataFrame.
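
A sketch of the rate-source read; the variable name initDF matches the text, and the isStreaming check produces the output below:

val initDF = spark
  .readStream
  .format("rate")                // built-in rate source
  .option("rowsPerSecond", "1")  // generate one row per second
  .load()

println("Streaming DataFrame : " + initDF.isStreaming)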

Output should read as follows:

Streaming DataFrame : true

Basic transformation

Let’s perform a basic transformation on initDF to generate another column, result, by adding 1 to the column value.
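
A sketch of the transformation (the name resultDF is our own choice):

// Derive "result" by adding 1 to the auto-generated "value" column
val resultDF = initDF.withColumn("result", col("value") + 1)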

We created the derived column result from the existing column value in much the same way we would on a batch DataFrame. You can find various operations on batch DataFrames in this blog.

Output to console

Let’s print the contents of the streaming DataFrame to the console. Here, we use the append output mode to output only newly generated data, and the console format to print the result to the console.
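
A sketch of the write; the truncate option is our own addition to keep the timestamp column from being shortened:

resultDF
  .writeStream
  .outputMode("append")        // only newly generated rows per micro-batch
  .format("console")           // print each micro-batch to the console
  .option("truncate", false)   // assumption: show full timestamps
  .start()
  .awaitTermination()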

This output should resemble the following:

-------------------------------------------
Batch: 1
-------------------------------------------
+-----------------------+-----+------+
|timestamp              |value|result|
+-----------------------+-----+------+
|2020-12-28 11:40:37.867|0    |1     |
+-----------------------+-----+------+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----------------------+-----+------+
|timestamp              |value|result|
+-----------------------+-----+------+
|2020-12-28 11:40:38.867|1    |2     |
+-----------------------+-----+------+

Here we printed the results of two micro-batches, which contain the columns timestamp and value along with the generated result column. We get only one record per second (check the timestamps above), as specified for initDF, and value was generated automatically. Also, each batch contains only newly generated data, since we used the append output mode.

We have successfully written our first Spark Streaming application!

The next post in this series covers input sources of different types.

I hope you enjoyed this overview of Spark Streaming! For easy reference, you can find the complete code on GitHub.

Here are the other blogs in the Apache Spark Structured Streaming series.


This blog post is provided for educational purposes only. Certain scenarios may utilize fictitious names and test data for illustrative purposes. The content is intended as a contribution to the technological community at large and may not necessarily reflect any technology currently being implemented or planned for implementation at Expedia Group, Inc.
