Start Your Journey with Apache Spark — Part 1

Understanding Apache Spark and RDD (Resilient Distributed Datasets)

Neeraj Bhadani
Expedia Group Technology
7 min read · Sep 3, 2019


Let's begin our journey with Apache Spark™️. Here at Hotels.com™️ (part of Expedia Group™️) we use Apache Spark for data analysis, data science, and building machine learning capabilities. In this blog series I discuss Apache Spark and its RDD and DataFrame components in detail. This is Part 1 of the series; you can also refer to Part 2 and Part 3. Apache Spark is a general-purpose, in-memory computing engine. It provides high-level APIs in Scala, Java, Python, R, and SQL, and it can be used with Hadoop, YARN, and other Big Data components to harness the power of Spark and improve the performance of your applications.

Spark Architecture

Apache Spark works in a master-slave architecture, where the master is called the "Driver" and the slaves are called "Workers". The starting point of your Spark application is sc, an instance of the SparkContext class, which runs inside the Driver.
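When you use an interactive shell such as pyspark or spark-shell, sc is created for you automatically. Outside a shell you create it yourself; here is a minimal sketch (the app name and the "local[*]" master URL below are just illustrative placeholders, not part of the original post):

from pyspark import SparkConf, SparkContext

# Configure the application; "local[*]" runs Spark locally using all available cores.
# On a real cluster you would point the master at YARN or another cluster manager.
conf = SparkConf().setAppName("rdd-basics").setMaster("local[*]")
sc = SparkContext(conf=conf)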

[Image reference: https://spark.apache.org/]

Let's try to understand the architecture of Apache Spark:

[Image reference: https://spark.apache.org/]
  • Apache Spark: Sometimes also called Spark Core. Spark Core's main abstraction is the RDD (Resilient Distributed Dataset), a collection of data distributed across the nodes of the cluster and processed in parallel.
  • Spark SQL: The abstraction here is the DataFrame, a relational (tabular) representation of the data. It provides functions with SQL-like capabilities, and we can also write SQL-like queries for our data analysis.
  • Spark Streaming: The abstraction provided by this library is the DStream (Discretized Stream). It provides capabilities to process and transform data in near real time.
  • MLlib: This is a Machine Learning library with commonly used algorithms including collaborative filtering, classification, clustering, and regression.
  • GraphX: This library helps us process graphs and solve various problems (like PageRank, Connected Components, etc.) using graph theory.

Let’s dig a little deeper into Apache Spark (Spark Core), starting with RDD.

Note: I will use PySpark, the Python API for Spark, to write the code. However, the concepts remain the same across all of the APIs.

Let's create an RDD

  • Create an RDD from a Python collection.
keywords = ['Books', 'DVD', 'CD', 'PenDrive']
key_rdd = sc.parallelize(keywords)

In the above code, "keywords" is a Python collection (a list), and we create an RDD from it using the "parallelize" method of the SparkContext class.
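A quick way to check the contents is the collect action (covered later in this post):

key_rdd.collect() # output : ['Books', 'DVD', 'CD', 'PenDrive']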

  • Create an RDD from a file
file_rdd = sc.textFile("Path_to_File")
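Each line of the file becomes one element of the RDD. For example, assuming the path points at a readable text file, counting its lines is simply:

file_rdd.count() # number of lines in the file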

There are two types of operations that can be performed on an RDD:

  • Transformations: These operations are lazy. When you apply a transformation to an RDD it is not evaluated immediately; instead it is recorded in a DAG (Directed Acyclic Graph) and only evaluated later, once an Action is executed (see the short sketch after this list). Some common transformations are map, filter, flatMap, groupByKey, reduceByKey, etc.
  • Actions: These operations are executed immediately and trigger the evaluation of the recorded transformations. Some common actions are first, take, count, collect, etc.
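Here is a small sketch of this laziness in action (assuming sc is available, e.g. from the pyspark shell):

nums = sc.parallelize([1, 2, 3, 4, 5])
doubled = nums.map(lambda x: x * 2) # transformation: only recorded in the DAG, nothing runs yet
doubled.count() # action: triggers the actual computation and returns 5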

Tip: RDDs are immutable; you cannot change an RDD once it has been created. However, you can transform one RDD into another using the various Transformations.

Anatomy of a Spark Job

[Image credit: Neeraj Bhadani]
  • Application: When we submit the Spark code to a cluster it creates a Spark Application.
  • Job: The Job is the top-level execution for any Spark application. A Job corresponds to an Action in a Spark application.
  • Stage: Jobs are divided into stages. Because Transformations are lazy, nothing is executed until an Action is called; Spark then breaks the job into stages at shuffle dependencies, so each shuffle introduced by the Transformations marks a stage boundary (see the sketch after this list).
  • Task: Stages will be further divided into various tasks. The task is the most granular unit in Spark applications. Each task represents a local computation on a particular node in the Spark Cluster.
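As a rough illustration (a toy sketch, again assuming sc from the shell; the variable names are mine), the single collect call below creates one Job; the shuffle introduced by reduceByKey splits that Job into two Stages, and each Stage runs as many Tasks as there are partitions:

words = sc.parallelize(["a", "b", "a", "c", "b"], 2) # 2 partitions
pairs = words.map(lambda w: (w, 1)) # narrow transformation: stays in the same stage
counts = pairs.reduceByKey(lambda x, y: x + y) # shuffle dependency: new stage boundary
counts.collect() # action: submits one job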

Now that we have an understanding of Spark, the Spark architecture, RDDs, and the anatomy of a Spark application, let's get our hands dirty with some hands-on exercises.

You can execute your Spark code using a shell (spark-shell or pyspark), Jupyter notebooks, Zeppelin notebooks, spark-submit, etc.

Let's create an RDD and understand some basic transformations.

  • Create an RDD from a collection.
num = [1, 2, 3, 4, 5]
num_rdd = sc.parallelize(num)

Here num_rdd is an RDD based on a Python collection (a list).

Transformations

As we know, Transformations are lazy in nature and will not be executed until an Action is called on the resulting RDD. Let's go through the various available Transformations.

  • map: This will map each input element to an output element by applying the function specified in the map call.

We already have "num_rdd" created. Let's try to double each number in the RDD.

double_rdd = num_rdd.map(lambda x : x * 2)

Note: The expression specified inside the map call is itself a function without a name, known as a lambda (or anonymous) function.
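Collecting the result (collect is an Action, described later) confirms the doubling:

double_rdd.collect() # output : [2, 4, 6, 8, 10]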

  • filter: This filters the data based on a given condition. Let's try to find the even numbers in num_rdd.
even_rdd = num_rdd.filter(lambda x : x % 2 == 0)
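Again, a quick check with collect:

even_rdd.collect() # output : [2, 4]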
  • flatMap: This function is very similar to map, but it can return multiple output elements for each input element in the given RDD.
flat_rdd = num_rdd.flatMap(lambda x : range(1,x))

This produces the numbers from 1 up to x - 1 for each element x of the input RDD (num_rdd) and flattens them into a single RDD.
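For example, with num_rdd = [1, 2, 3, 4, 5]:

flat_rdd.collect() # output : [1, 1, 2, 1, 2, 3, 1, 2, 3, 4]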

  • distinct: This will return distinct elements from an RDD.
rdd1 = sc.parallelize([10, 11, 10, 11, 12, 11])
dist_rdd = rdd1.distinct()
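Collecting dist_rdd gives the unique values (the order may differ between runs):

dist_rdd.collect() # output (order may vary) : [10, 11, 12]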

The above Transformations are single-valued: each element of the RDD holds a single scalar value. Let's now discuss key-value pair RDDs, where each element of the RDD is a (key, value) pair.

  • reduceByKey: This function merges the values for each key using the function passed to reduceByKey. Here's an example.
pairs = [("a", 5), ("b", 7), ("c", 2), ("a", 3), ("b", 1), ("c", 4)]
pair_rdd = sc.parallelize(pairs)

pair_rdd is now a key-value pair RDD.

output = pair_rdd.reduceByKey(lambda x, y : x + y)

The output RDD will contain the pairs:

[("a", 8), ("b", 8), ("c", 6)]

Let's try to understand the contents of the output RDD. We can think of the reduceByKey function as working in two steps.

  1. It collects all the values for each key, so the intermediate output will be as follows:
("a", <5, 3>)
("b", <7, 1>)
("c", <2, 4>)

2. Now that all the values for a particular key are together, the collection of values is reduced (aggregated) using the function passed to reduceByKey. In our case it is the sum function, so we get the sum of all the values for each key. Hence the output is:

[("a", 8), ("b", 8), ("c", 6)]
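You can verify this yourself; since the elements of an RDD are not ordered, sorting the collected result makes the comparison stable:

sorted(output.collect()) # output : [('a', 8), ('b', 8), ('c', 6)]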
  • groupByKey: This is another ByKey function that operates on a (key, value) pair RDD, but it only groups the values by key. In other words, it performs only the first step of reduceByKey.
grp_out = pair_rdd.groupByKey()
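groupByKey returns an iterable of values per key; converting each group to a list makes the grouping visible (a quick sketch; key and value order may vary):

grp_out.mapValues(list).collect() # output (order may vary) : [('a', [5, 3]), ('b', [7, 1]), ('c', [2, 4])]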
  • sortByKey: This function sorts a (key, value) pair RDD by its keys. By default, sorting is done in ascending order.
pairs = [("a", 5), ("d", 7), ("c", 2), ("b", 3)]
raw_rdd = sc.parallelize(pairs)
sortkey_rdd = raw_rdd.sortByKey()

This will sort the pairs based on their keys, so the output will be:

[("a", 5), ("b", 3), ("c", 2), ("d", 7)]

Note: for sorting in descending order, pass "ascending=False".
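For example, reusing raw_rdd from above:

desc_rdd = raw_rdd.sortByKey(ascending=False) # keys sorted in descending order
desc_rdd.collect() # output : [('d', 7), ('c', 2), ('b', 3), ('a', 5)]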

  • sortBy: sortBy is a more generalized function for sorting.
pairs = [("a", 5, 10), ("d", 7, 12), ("c", 2, 11), ("b", 3, 9)]
raw_rdd = sc.parallelize(pairs)

Now we have an RDD of tuples, where each tuple has 3 elements. Let's sort based on the 3rd element of each tuple.

sort_out = raw_rdd.sortBy(lambda x : x[2])

Note: for sorting in descending order, pass "ascending=False".
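Collecting sort_out shows the tuples ordered by their third element:

sort_out.collect() # output : [('b', 3, 9), ('a', 5, 10), ('c', 2, 11), ('d', 7, 12)]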

There are various other Transformations which you can find here.

https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

Actions

Actions are operations on RDDs that execute immediately. While Transformations return another RDD, Actions return native language data structures (plain Python objects in PySpark).

  • count: This will count the number of elements in the given RDD.
num = sc.parallelize([1, 2, 3, 4, 2])
num.count() # output : 5
  • first: This will return the first element from the given RDD.
num.first() # output : 1
  • collect: This will return all the elements of the given RDD.
num.collect() # output : [1, 2, 3, 4, 2]

Note: We should not use the collect operation while working with large datasets, because it returns all of the data, which is distributed across the different workers of your cluster, to the driver. All of that data travels across the network from the workers to the driver, and the driver must hold it all in memory. This can hamper the performance of your application, or even crash the driver.

  • take: This will return the first n elements of the RDD, where n is the number specified.
num.take(3) # output : [1, 2, 3]

There are various other Actions, which you can find here.

https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

In part two we will learn about "Spark SQL/DataFrames and their operations", and in part three we will learn about "Advanced Spark DataFrame Operations and the Catalog API". You can find the notebooks on GitHub here. I would love to hear your feedback on this blog series. I hope you have enjoyed this post. Happy learning!

You may also be interested in some of my other posts on Apache Spark.
