Introduction to working with(in) Spark

Structured Streaming in Spark 3.0 Using Kafka

Using Docker, Spark 3.0, Kafka and Python

WesleyBos
The Startup


After the previous post, in which we explored Apache Kafka, let us now take a look at Apache Spark. This blog post covers working within Spark’s interactive shell environment, launching applications (including onto a standalone cluster), streaming data and, lastly, structured streaming using Kafka. To get started right away, all of the examples run inside Docker containers.

Spark


Spark was initially developed at UC Berkeley’s AMPLab in 2009 by Matei Zaharia and open-sourced in 2010. In 2013 its codebase was donated to the Apache Software Foundation, and Spark became a top-level Apache project in 2014.

“Apache Spark™ is a unified analytics engine for large-scale data processing”

It offers APIs for Java, Scala, Python and R. Furthermore, it provides the following tools:

  • Spark SQL: used for SQL and structured data processing.
  • MLlib: used for machine learning.
  • GraphX: used for graph processing.
  • Structured Streaming: used for incremental computation and stream processing.

Prerequisites:
This project uses Docker and docker-compose. View this link to find out how to install them for your OS.

Clone my git repo:

git clone https://github.com/Wesley-Bos/spark3.0-examples.git

Note: depending on your pip and Python version, the commands vary a little:

  • pip becomes pip3
  • python becomes python3

Before we begin, create a new Python environment. I use Anaconda for this, but feel free to use any tool you like. Activate the environment and install the required libraries by executing the following command:

pip install -r requirements.txt

Be sure to activate the correct environment in every new terminal you open!

1. Spark interactive shell

Run the following commands to launch Spark:

docker build -t pxl_spark -f Dockerfile .
docker run --rm --network host -it pxl_spark /bin/bash

Code can be executed in Spark either interactively within the shell or by submitting a program file to Spark as an application with the spark-submit command.
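
To illustrate the difference: in the interactive shell a SparkContext is already available as sc, whereas an application has to create its own before you hand the file to spark-submit. The snippet below is my own minimal sketch; the file name and its contents are hypothetical and not part of the repository:

# my_app.py: a minimal, self-contained PySpark application (hypothetical example)
from pyspark import SparkContext

sc = SparkContext(appName="MyApp")
file = sc.textFile("supplementary_files/subjects.csv")
print(file.count())  # for example, print the number of lines in the file
sc.stop()

Such a file would be launched with spark-submit my_app.py.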

To start up the interactive shell, run the command below:

pyspark

This post centres on working with Python. However, if you prefer to work in Scala, use spark-shell instead. Note that the PySpark shell already provides a SparkContext as sc (and a SparkSession as spark), which the examples below rely on.

Try out these two examples to get a feel for the shell environment.

Read a .csv file:

>>> file = sc.textFile("supplementary_files/subjects.csv")
>>> file.collect()
>>> file.take(AMOUNT_OF_SAMPLES)
>>> subjects = file.map(lambda row: row.split(",")[0])
>>> subjects.collect()

Read a text file:

>>> file = sc.textFile("supplementary_files/text.txt")
>>> file.collect()
>>> file.take(2)
>>> wordCount = file.flatMap(lambda text: text.lower().strip().split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda sum_occurences, next_occurence: sum_occurences + next_occurence)
>>> wordCount.collect()

Press Ctrl+d to exit the shell.

2. Spark application — launching on a cluster


Spark applications can be run on their own or on a cluster. The most straightforward approach is deploying Spark on a private (standalone) cluster.

Follow the instructions below to execute an application on a cluster.

Initiate the Spark container:

docker run --rm --network host -it pxl_spark /bin/bash

Start a master:

start-master.sh

Go to the web UI (by default at http://localhost:8080) and copy the URL of the Spark master (it starts with spark://).

Start a worker:

start-slave.sh URL_MASTER

Reload the web UI; a worker should be added.

Launch an example onto the cluster:

spark-submit --master URL_MASTER examples/src/main/python/pi.py 10

View the web UI; the application should now be listed as completed.

Consult the official documentation for more specific information.

3. Spark application — streaming data

The examples above dealt only with static data. The following cases involve streaming data, along with five DStream transformations to explore.

Note that these transformations are only a glimpse of the available options. See the official documentation for additional information on PySpark streaming.

In a separate terminal, run the netcat command on port 8888:

nc -lC 8888

In the Spark container, submit one of the DStream cases. Below is a summary of what each code sample does; a rough sketch of the first one follows the list.

  • reduce_by_key.py: count the occurrence of the word ‘ERROR’, per batch.
  • update_by_key.py: count the occurrence of all the words throughout a stream of data.
  • count_by_window.py: count the number of lines within a window.
  • reduce_by_window.py: calculate the sum of the values within a window.
  • reduce_by_key_and_window.py: count the occurrence of ERROR messages within a window.
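
For reference, such a DStream job looks roughly like the sketch below. This is my own minimal version of the first case, counting the word ‘ERROR’ per batch; the actual reduce_by_key.py in the repository may differ in details such as the batch interval.

# Minimal DStream sketch: count occurrences of the word "ERROR" per batch.
# My own version; the repository's reduce_by_key.py may differ.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="ReduceByKeyExample")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches (assumed interval)

lines = ssc.socketTextStream("localhost", 8888)  # the netcat stream started above
errors = lines.flatMap(lambda line: line.split(" ")) \
              .filter(lambda word: word == "ERROR") \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
errors.pprint()  # print each batch's counts to the console

ssc.start()
ssc.awaitTermination()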

Enter text data in the netcat terminal; the Spark terminal displays the processed data accordingly, as in the example below.

spark-submit python_code_samples/update_by_key.py
Example of update_by_key.py

4. Spark structured streaming

Lastly, there is Structured Streaming. A concise, to-the-point description reads: “Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.”

The objective of this last section is to ingest data into Kafka, access it in Spark and finally write it back to Kafka.


Launch the Kafka environment:

docker-compose -f ./kafka/docker-compose.yml up -d

Produce and consume data:

For convenience, open two terminals side by side.

python kafka/producer.py
python kafka/consumer.py
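
If you want a mental model of what these scripts roughly do, here is a minimal sketch using the kafka-python library; the library choice, topic name and bootstrap server are my assumptions, and the actual producer.py and consumer.py in the repository may look different.

# producer sketch: read lines from the keyboard and send them to a Kafka topic
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
while True:
    line = input("> ")
    producer.send("input_topic", line.encode("utf-8"))  # topic name assumed
    producer.flush()

# consumer sketch: print every message that arrives on the topic
from kafka import KafkaConsumer

consumer = KafkaConsumer("input_topic", bootstrap_servers="localhost:9092")
for message in consumer:
    print(message.value.decode("utf-8"))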

Submit the application to Spark (inside the Spark container):

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 python_code_samples/kafka_structured_stream.py
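
The submitted job reads a stream from one Kafka topic and writes it back out to another, which is the read-transform-write pattern described above. The sketch below is my own rough approximation; the topic names, bootstrap server and checkpoint location are assumptions, and the repository’s kafka_structured_stream.py may differ.

# Sketch of a Kafka-to-Kafka structured streaming job (my approximation).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStructuredStream").getOrCreate()

# Read a stream from the input topic (topic name assumed)
source = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input_topic")
          .load())

# Kafka keys and values arrive as binary, so cast them to strings first
messages = source.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Write the stream back to another topic; a checkpoint location is required
query = (messages.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output_topic")
         .option("checkpointLocation", "/tmp/kafka_checkpoint")
         .start())

query.awaitTermination()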

Open a new terminal and start the new_consumer:

python kafka/new_consumer.py

In the producer terminal, enter data; both consumers will display it. The messages can be seen in the Confluent Control Center as well.

Recap

Throughout this article, we explored the following topics:

  • Reading files within the interactive shell.
  • Launching an application; by itself and on a cluster.
  • Working with streaming data.
  • Working with structured streaming.

An interesting blog post from Databricks gives a more extensive view of structured streaming. That particular post explains how to utilise Spark to consume and transform data streams from Apache Kafka.

Lastly, I want to thank you for reading until the end! Any feedback on where and how I can improve is much appreciated. Feel free to message me.

