Apache Spark - Resilient Distributed Dataset (RDD)

R RAMYA
May 3, 2022


At the heart of Apache Spark is the concept of the Resilient Distributed Dataset (RDD).

RDD? Sounds different from the things you commonly hear about, right?

Before getting into it, do explore my blog on Spark 💥

Come on! Let's explore it…💥

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs.

RDD is a fault-tolerant collection of elements that can be operated on in parallel.

Operations of RDD:

RDDs support two types of operations:

1. Transformations

→ which create a new dataset from an existing one

2. Actions


→ which return a value to the driver program after running a computation on the dataset.

Transformations and Actions:

Transformations:

There are two types of transformations:

Narrow Transformation:

In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD.

One parent partition ==> one child partition

Narrow transformations are the result of functions such as map() and filter().

Wide Transformation:

In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD.

One parent partition ==> multiple child partitions

Wide transformations are the result of functions such as groupByKey() and reduceByKey().

Functions in RDD transformations:

1. map()

The map() function iterates over every element of the RDD and produces a new RDD.

→ Using the map() transformation, we pass in any function, and that function is applied to every element of the RDD.

→ The element type of the returned RDD can differ from the input type; for example, an RDD of strings can be mapped to an RDD of Booleans.

Example:

x = sc.parallelize(["b", "a", "c"])
y = x.map(lambda z: (z, 1))   # pair each element with the value 1
print(x.collect())
print(y.collect())

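Running this in a PySpark shell should print:

['b', 'a', 'c']
[('b', 1), ('a', 1), ('c', 1)]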

2. flatMap()

flatMap() is a transformation operation. It applies a function to each element of the RDD and returns a new RDD with the flattened results.

→ map() and flatMap() are similar in that they take an element from the input RDD and apply a function to it.

→ The key difference is that map() returns exactly one element per input element, while flatMap() can return a list of zero or more elements per input element.

Example:

x = sc.parallelize([1, 2, 3])
y = x.flatMap(lambda x: (x, x*100, 42))   # three output elements per input element
print(x.collect())
print(y.collect())

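Running this should print:

[1, 2, 3]
[1, 100, 42, 2, 200, 42, 3, 300, 42]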

3. filter():

The Spark RDD filter() function returns a new RDD containing only the elements that satisfy a predicate.

→ It is a narrow operation because it does not shuffle data from one partition to many partitions.

x = sc.parallelize([1,2,3])
y = x.filter(lambda x: x%2 == 1) #keep odd values
print(x.collect())
print(y.collect())

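Running this should print:

[1, 2, 3]
[1, 3]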

4. groupByKey()

When we use groupByKey() on a dataset of (K, V) pairs, the data is shuffled across partitions according to the key K, producing another RDD of grouped values.

→ In this transformation, a lot of unnecessary data can get transferred over the network.

Example:

x = sc.parallelize([('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))

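Running this should print something like the following (the key order in the grouped result may vary):

[('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)]
[('A', [3, 2, 1]), ('B', [5, 4])]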

5. reduceByKey(func, [numTasks])

When we use reduceByKey() on a dataset of (K, V) pairs, the pairs with the same key on the same machine are combined (using the supplied function) before the data is shuffled.
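Example (a minimal sketch, reusing the pair RDD from the groupByKey() example and assuming a live SparkContext sc):

x = sc.parallelize([('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)])
y = x.reduceByKey(lambda a, b: a + b)   # partial sums are computed per partition before the shuffle
print(y.collect())                      # e.g. [('A', 6), ('B', 9)] - key order may vary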

6. mapPartitions(func):

mapPartitions() converts each partition of the source RDD into many elements of the result (possibly none).

→ In mapPartitions(), the supplied function is applied to each partition as a whole; it receives an iterator over the partition's elements.

→ mapPartitions() is like map(), but it runs separately on each partition (block) of the RDD.
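Example (a minimal sketch, assuming a live SparkContext sc; the supplied function receives an iterator over one whole partition):

x = sc.parallelize([1, 2, 3, 4], 2)     # 2 partitions: [1, 2] and [3, 4]

def sum_partition(iterator):            # called once per partition
    yield sum(iterator)

print(x.mapPartitions(sum_partition).collect())   # [3, 7] - one element per partition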

7. coalesce() and repartition():

coalesce(): To avoid a full shuffle of data we use the coalesce() function. coalesce() reuses existing partitions so that less data is shuffled.

→ Using this we can reduce the number of partitions.

repartition(): Returns a new RDD with the number of partitions either increased or decreased; it performs a full shuffle.
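Example (a minimal sketch, assuming a live SparkContext sc):

x = sc.parallelize(range(10), 4)
print(x.getNumPartitions())                  # 4
print(x.coalesce(2).getNumPartitions())      # 2 - merges existing partitions, avoids a full shuffle
print(x.repartition(8).getNumPartitions())   # 8 - full shuffle, can increase or decrease the count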

8. cache() and persist():

Spark cache() and persist() are optimization techniques for iterative and interactive Spark applications that improve the performance of jobs.

Difference:

Both caching and persisting are used to save Spark RDDs, DataFrames, and Datasets.

But the difference is that the RDD cache() method saves the data to memory (MEMORY_ONLY) by default, whereas the persist() method stores it at a user-defined storage level.
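Example (a minimal sketch, assuming a live SparkContext sc):

from pyspark import StorageLevel

x = sc.parallelize(range(1000)).map(lambda n: n * n)
x.cache()                                   # same as persist() with the default MEMORY_ONLY level
# x.persist(StorageLevel.MEMORY_AND_DISK)   # persist() lets you choose the storage level explicitly
print(x.count())                            # the first action materialises and caches the RDD
print(x.count())                            # later actions reuse the cached data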

RDD Actions:

1. collect()

collect() returns the entire content of the RDD to the driver program. It is used in unit testing, where the entire RDD is expected to fit in memory, because it makes it easy to compare the result of the RDD with the expected result.

2. count():

The count() action returns the number of elements in the RDD.

3. reduce():

The reduce() function takes two elements at a time from the RDD and combines them, producing an output of the same type as the input elements.
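Example (a minimal sketch covering all three actions, assuming a live SparkContext sc):

x = sc.parallelize([1, 2, 3, 4])
print(x.collect())                    # [1, 2, 3, 4] - brings the whole RDD back to the driver
print(x.count())                      # 4
print(x.reduce(lambda a, b: a + b))   # 10 - combines the elements pairwise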

Features of RDD:

The key features of RDDs are immutability, partitioning, in-memory computation and fault tolerance.

Till now we saw what RDD is about: transformations, actions, and their operations.

But do you know why RDD is used?

Let's look into it…

Why RDD?

When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as logistic regression, K-means clustering, and PageRank, it is fairly common to reuse or share data among multiple jobs, or to run multiple ad-hoc queries over a shared dataset.

There is an underlying problem with data reuse or data sharing in existing distributed computing systems (such as MapReduce): you need to store intermediate data in a stable distributed store such as HDFS or Amazon S3.

→ This makes the overall computation of jobs slower, since it involves multiple I/O operations, replication, and serialization.

Iterative Processing in MapReduce

RDDs try to solve these problems by enabling fault-tolerant, distributed, in-memory computation.

Iterative Processing in Spark

Now, let's understand what exactly an RDD is and how it achieves fault tolerance.

RDD — Resilient Distributed Datasets

→ RDDs are immutable, partitioned collections of records, which can only be created through coarse-grained operations such as map, filter, groupBy, etc.

→ By coarse-grained operations, we mean operations that are applied to all elements in a dataset.

→ RDDs can only be created by reading data from stable storage such as HDFS or by transformations on existing RDDs.

Now, how does that help with fault tolerance?


Since an RDD is created through a set of transformations, Spark logs those transformations rather than the actual data.

→ The graph of transformations that produces an RDD is called its lineage graph.

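For example, you can inspect an RDD's lineage with toDebugString() (a minimal sketch, assuming a live SparkContext sc):

x = sc.parallelize([1, 2, 3])
y = x.map(lambda n: n * 2).filter(lambda n: n > 2)
print(y.toDebugString())   # prints the RDD's lineage (the chain of transformations it was built from), not the data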

In case we lose some partition of an RDD, we can replay the transformations on that partition from the lineage to recompute it, rather than replicating data across multiple nodes.

→ This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication, and thus achieves faster computation.

Two ways to create RDDs:

1. Parallelizing

→ Parallelizing an existing collection in your driver program.

2. Referencing a dataset


→ Referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
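Example (a minimal sketch, assuming a live SparkContext sc; the HDFS path is hypothetical):

# 1. Parallelizing an existing collection in the driver program
nums = sc.parallelize([1, 2, 3, 4, 5])

# 2. Referencing a dataset in an external storage system (hypothetical path)
lines = sc.textFile("hdfs:///data/input.txt")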

5 reasons on when to use RDDs 💥:

1. You want low-level transformations, actions, and control over your dataset;
2. Your data is unstructured, such as media streams or streams of text;
3. You want to manipulate your data with functional programming constructs rather than domain-specific expressions;
4. You don't care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
5. You can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

Thank You Friends !!!

See you in next blog…💥

Have a Great Day :)

Cheers…

R Ramya…💥

Resources:

DataFlair

Images:

https://techvidvan.com/tutorials/apache-spark-rdd/

