Apache Spark - Resilient Distributed Dataset (RDD)

R RAMYA
May 3, 2022


At the heart of Apache Spark is the concept of the Resilient Distributed Dataset (RDD).

RDD? Sounds different from the things you commonly hear about, right?

Before getting into it, do explore my blog on Spark 💥

Come on! Let's explore it…💥

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs.

RDD is a fault-tolerant collection of elements that can be operated on in parallel.

Operations of RDD:

RDDs support two types of operations:

1. Transformations

→ which create a new dataset from an existing one

2. Actions


→ which return a value to the driver program after running a computation on the dataset.

Transformations and Actions:

Transformations:

There are two types of transformations:

Narrow Transformation:

In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD.

One parent partition ==> one child partition

Narrow transformations are the result of functions such as map() and filter().

Wide Transformation:

In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD.

One parent partition ==> multiple child partitions

Wide transformations are the result of functions such as groupByKey() and reduceByKey().

Functions in RDD transformations:

1. map()

The map() function iterates over every element of the RDD and produces a new RDD.

→ Using the map() transformation, we pass in any function, and that function is applied to every element of the RDD.

→ The element type of the returned RDD can differ from the input type; for example, an RDD of strings can be mapped to an RDD of Booleans.

Example:

x = sc.parallelize(["b", "a", "c"])
y = x.map(lambda z: (z, 1))   # pair each element with the value 1
print(x.collect())
print(y.collect())

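Running this in a PySpark shell should print:

['b', 'a', 'c']
[('b', 1), ('a', 1), ('c', 1)]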

2. flatMap()

flatMap() is a transformation operation. It applies a function to each element of the RDD and returns a new RDD with the flattened results.

→ map() and flatMap() are similar in that they take an element from the input RDD and apply a function to it.

→ The key difference is that map() returns exactly one element per input element, while flatMap() can return a list of zero or more elements per input element.

Example:

x = sc.parallelize([1, 2, 3])
y = x.flatMap(lambda x: (x, x*100, 42))   # three output elements per input element
print(x.collect())
print(y.collect())

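Running this should print:

[1, 2, 3]
[1, 100, 42, 2, 200, 42, 3, 300, 42]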

3. filter():

The Spark RDD filter() function returns a new RDD containing only the elements that satisfy a predicate.

→ It is a narrow operation because it does not shuffle data from one partition to many partitions.

x = sc.parallelize([1,2,3])
y = x.filter(lambda x: x%2 == 1) #keep odd values
print(x.collect())
print(y.collect())

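Running this should print:

[1, 2, 3]
[1, 3]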

4. groupByKey()

When we use groupByKey() on a dataset of (K, V) pairs, the data is shuffled across partitions according to the key K, producing another RDD of grouped values.

→ In this transformation, a lot of unnecessary data can get transferred over the network.

Example:

x = sc.parallelize([('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))

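Running this should print something like the following (the key order in the grouped result may vary):

[('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)]
[('A', [3, 2, 1]), ('B', [5, 4])]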

5. reduceByKey(func, [numTasks])

When we use reduceByKey() on a dataset of (K, V) pairs, the pairs with the same key on the same machine are combined (using the supplied function) before the data is shuffled.
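Example (a minimal sketch, reusing the pair RDD from the groupByKey() example and assuming a live SparkContext sc):

x = sc.parallelize([('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)])
y = x.reduceByKey(lambda a, b: a + b)   # partial sums are computed per partition before the shuffle
print(y.collect())                      # e.g. [('A', 6), ('B', 9)] - key order may vary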

6. mapPartitions(func):

mapPartitions() converts each partition of the source RDD into many elements of the result (possibly none).

→ In mapPartitions(), the supplied function is applied to each partition as a whole; it receives an iterator over the partition's elements.

→ mapPartitions() is like map(), but it runs separately on each partition (block) of the RDD.
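Example (a minimal sketch, assuming a live SparkContext sc; the supplied function receives an iterator over one whole partition):

x = sc.parallelize([1, 2, 3, 4], 2)     # 2 partitions: [1, 2] and [3, 4]

def sum_partition(iterator):            # called once per partition
    yield sum(iterator)

print(x.mapPartitions(sum_partition).collect())   # [3, 7] - one element per partition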

7. coalesce() and repartition():

coalesce(): To avoid a full shuffle of data we use the coalesce() function. coalesce() reuses existing partitions so that less data is shuffled.

→ Using this we can reduce the number of partitions.

repartition(): Returns a new RDD with the number of partitions either increased or decreased; it performs a full shuffle.
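Example (a minimal sketch, assuming a live SparkContext sc):

x = sc.parallelize(range(10), 4)
print(x.getNumPartitions())                  # 4
print(x.coalesce(2).getNumPartitions())      # 2 - merges existing partitions, avoids a full shuffle
print(x.repartition(8).getNumPartitions())   # 8 - full shuffle, can increase or decrease the count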

8. cache() and persist():

Spark cache() and persist() are optimization techniques for iterative and interactive Spark applications that improve the performance of jobs.

Difference:

Both caching and persisting are used to save Spark RDDs, DataFrames, and Datasets.

But the difference is that the RDD cache() method saves the data to memory (MEMORY_ONLY) by default, whereas the persist() method stores it at a user-defined storage level.
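Example (a minimal sketch, assuming a live SparkContext sc):

from pyspark import StorageLevel

x = sc.parallelize(range(1000)).map(lambda n: n * n)
x.cache()                                   # same as persist() with the default MEMORY_ONLY level
# x.persist(StorageLevel.MEMORY_AND_DISK)   # persist() lets you choose the storage level explicitly
print(x.count())                            # the first action materialises and caches the RDD
print(x.count())                            # later actions reuse the cached data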

RDD Actions:

1. collect()

collect() returns the entire content of the RDD to the driver program. It is used in unit testing, where the entire RDD is expected to fit in memory, because it makes it easy to compare the result of the RDD with the expected result.

2. count():

The count() action returns the number of elements in the RDD.

3. reduce():

The reduce() function takes two elements at a time from the RDD and combines them, producing an output of the same type as the input elements.
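Example (a minimal sketch covering all three actions, assuming a live SparkContext sc):

x = sc.parallelize([1, 2, 3, 4])
print(x.collect())                    # [1, 2, 3, 4] - brings the whole RDD back to the driver
print(x.count())                      # 4
print(x.reduce(lambda a, b: a + b))   # 10 - combines the elements pairwise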

Features of RDD:

The key features of RDDs are immutability, partitioning, in-memory computation and fault tolerance.

Till now we saw what RDD is about: transformations, actions, and their operations.

But do you know why RDD is used?

Let's look into it…

Why RDD?

When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as logistic regression, K-means clustering, and PageRank, it is fairly common to reuse or share data among multiple jobs, or to run multiple ad-hoc queries over a shared dataset.

There is an underlying problem with data reuse or data sharing in existing distributed computing systems (such as MapReduce): you need to store intermediate data in a stable distributed store such as HDFS or Amazon S3.

→ This makes the overall computation of jobs slower, since it involves multiple I/O operations, replication, and serialization.

Iterative Processing in MapReduce

RDDs try to solve these problems by enabling fault-tolerant, distributed, in-memory computation.

Iterative Processing in Spark

Now, let's understand what exactly an RDD is and how it achieves fault tolerance.

RDD — Resilient Distributed Datasets

→ RDDs are immutable, partitioned collections of records, which can only be created through coarse-grained operations such as map, filter, groupBy, etc.

→ By coarse-grained operations, we mean operations that are applied to all elements in a dataset.

→ RDDs can only be created by reading data from stable storage such as HDFS or by transformations on existing RDDs.

Now, how does that help with fault tolerance?


Since an RDD is created through a set of transformations, Spark logs those transformations rather than the actual data.

→ The graph of transformations that produces an RDD is called its lineage graph.

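For example, you can inspect an RDD's lineage with toDebugString() (a minimal sketch, assuming a live SparkContext sc):

x = sc.parallelize([1, 2, 3])
y = x.map(lambda n: n * 2).filter(lambda n: n > 2)
print(y.toDebugString())   # prints the RDD's lineage (the chain of transformations it was built from), not the data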

In case we lose some partition of an RDD, we can replay the transformations on that partition from the lineage to recompute it, rather than replicating data across multiple nodes.

→ This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication, and thus achieves faster computation.

Two ways to create RDDs:

1. Parallelizing

→ Parallelizing an existing collection in your driver program.

2. Referencing a dataset


→ Referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
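Example (a minimal sketch, assuming a live SparkContext sc; the HDFS path is hypothetical):

# 1. Parallelizing an existing collection in the driver program
nums = sc.parallelize([1, 2, 3, 4, 5])

# 2. Referencing a dataset in an external storage system (hypothetical path)
lines = sc.textFile("hdfs:///data/input.txt")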

5 reasons on when to use RDDs 💥:

1. You want low-level transformations, actions, and control over your dataset;
2. Your data is unstructured, such as media streams or streams of text;
3. You want to manipulate your data with functional programming constructs rather than domain-specific expressions;
4. You don't care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
5. You can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

Thank You Friends !!!

See you in next blog…💥

Have a Great Day :)

Cheers…

R Ramya…💥

Resources:

DataFlair

Images:

https://techvidvan.com/tutorials/apache-spark-rdd/

