Apache Spark in Practice

Gökhan Gürgeç · Published in cloudnesil · 3 min read · Oct 15, 2019

One of my previous stories was a general overview of Apache Spark, and another was a practical example of streaming data with Apache Kafka and Akka Streams.

In this story, using the data from “Streaming data with Apache Kafka and Akka Streams”, I want to demonstrate a practical usage of Apache Spark.

The implementation was done in Java and is available on GitHub.

Simply put, the implementation does this:

It reads books.csv, a Kaggle dataset containing information on about 48,000 Goodreads books. The books are in various languages. Our application classifies the books by language and creates a CSV file for each language.

  1. Dependencies and Configuration:

In this example, we used Java 8 with the latest version of Apache Spark at the time, 2.4.4, and Maven for build management. My Maven version was 3.5.3.

a. pom.xml

pom.xml is simple. We have one dependency. We will package our project as a jar in order to submit our user code to Apache Spark.
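A minimal sketch of that dependency, assuming the spark-sql artifact (Spark 2.4.4 is built against Scala 2.11 by default):

```xml
<!-- Spark SQL brings in SparkSession, DataFrame/Dataset, and the CSV data source -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.4</version>
</dependency>
```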

2. Main Class:

We have one class, and it is the Main class. :)

a. The first thing we should do when writing a Spark application is create a SparkSession.

“SparkSession is a unified entry point of a Spark application from Spark 2.0.”

We create a SparkSession using the builder pattern. We could set further configuration through the builder, but here we only add the appName.
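A minimal sketch of the session setup; the application name is illustrative:

```java
import org.apache.spark.sql.SparkSession;

public class Main {
    public static void main(String[] args) {
        // getOrCreate() returns an existing session or builds a new one
        SparkSession spark = SparkSession.builder()
                .appName("BookClassifier") // illustrative name
                .getOrCreate();
    }
}
```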

After the SparkSession is created, our user code is executed on executors, orchestrated by the driver process that hosts the SparkSession. Here is the architecture of a Spark application:

(Image: Spark: The Definitive Guide: Big Data Processing Made Simple, Bill Chambers and Matei Zaharia)

b. Our next step is to read the CSV file and convert it to a DataFrame:

One of Apache Spark’s strengths is its native support for most common data source types, and CSV is one of them. We use SparkSession’s read method, which is for reading non-streaming data, set a few options, and load our CSV file as a DataFrame.
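A sketch of the load, assuming the file has a header row; the path and option values are illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// header: treat the first line as column names
// inferSchema: let Spark guess the column types
Dataset<Row> bookDataset = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("books.csv"); // illustrative path
```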

You may object that I keep saying DataFrame while the return type is Dataset&lt;Row&gt;. You are right, but they are the same thing: Dataset adds type safety to DataFrame, and in Java a Dataset of type Row is a DataFrame.

c. Our next step is creating a new DataFrame by selecting the language_code column and getting its distinct values.

It is important to understand the notions of transformation and action in Apache Spark. Transformations simply turn one RDD/DataFrame/Dataset into a new RDD/DataFrame/Dataset. They are not executed when they are defined; they only take effect when the chain ends in an action. Here we apply the select and distinct transformations to our bookDataset, and we use two actions: show, which prints the DataFrame to the console, and collectAsList, which converts our DataFrame to a Java List.
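A sketch of this step; the variable names are illustrative:

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// select and distinct are lazy transformations; nothing runs yet
Dataset<Row> languageDataset = bookDataset.select("language_code").distinct();

// show is an action: it triggers execution and prints to the console
languageDataset.show();

// collectAsList is also an action: it brings the rows to the driver
List<Row> languages = languageDataset.collectAsList();
```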

d. The next step is looping over the language list, filtering our DataFrame by language, and creating a CSV file for each language.

We create language-specific book datasets by filtering our main book dataset and write each of them to a CSV file.

Our save mode is Overwrite, so the output folder is replaced if it already exists.
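A sketch of the loop, assuming per-language output folders; the paths are illustrative:

```java
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.functions;

for (Row row : languages) {
    String language = row.getString(0);
    if (language == null) continue; // some books may lack a language code

    bookDataset
            .filter(functions.col("language_code").equalTo(language))
            .write()
            .mode(SaveMode.Overwrite)   // replace the folder if it already exists
            .option("header", "true")
            .csv("output/" + language); // illustrative output path
}
```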

And that’s all. You have classified your books by language.


Gökhan Gürgeç
cloudnesil

IT professional who has worked in various software development roles (test engineer, developer, project manager), passionate about good-quality software development.