Apache Spark RDD
May 11, 2022
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark.
- It is an immutable distributed collection of objects.
- Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
- RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
There are two ways to create RDDs:
- Parallelizing an existing collection in your driver program, or
- Referencing a dataset in an external storage system, such as a shared file system, HDFS, or HBase (see the sketch below).
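As a rough illustration, here is what both creation paths look like in Scala, assuming a spark-shell session where `sc` is the predefined SparkContext; the HDFS path is a hypothetical placeholder:

```scala
// 1. Parallelizing an existing collection in the driver program:
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Referencing a dataset in external storage (hypothetical path):
val lines = sc.textFile("hdfs:///data/sample.txt")
```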
Three characteristics define an RDD (the first two are inspected in the sketch after this list):
- Dependencies: the list of parent RDDs it was derived from, which tells Spark how to reconstruct it.
- Partitions: the logical splits of the data, which let Spark parallelize the computation across the nodes of the cluster.
- Compute function: the function Spark uses to produce the data of each partition as an iterator over its elements.
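A small sketch, again assuming spark-shell with `sc` predefined (the compute function itself is internal to Spark, so only the first two are shown):

```scala
val rdd = sc.parallelize(1 to 100, numSlices = 4).map(_ * 2)

println(rdd.partitions.length)  // 4 -- the logical partitions
println(rdd.dependencies)       // e.g. List(org.apache.spark.OneToOneDependency@...)
```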
Limitations of RDD
1. No input optimization engine: RDDs cannot take advantage of Spark's Catalyst optimizer or Tungsten execution engine, so optimization has to be done by hand.
2. Not enough memory: performance degrades when an RDD no longer fits in memory and has to be spilled to disk.
3. Runtime type safety: in dynamically typed APIs such as PySpark, type errors in RDD code only surface at run time.
4. Handling structured data: RDDs carry no schema information, so working with structured data is clumsier than with DataFrames or Datasets.
Operations on RDD
There are two different kinds of operations:
1. Transformation
2. Action
1. Transformation
- A transformation is the process of forming new RDDs from existing ones, by applying a user-specified function.
- It is the process of reshaping the current dataset into the dataset we actually want.
- We can create any number of RDDs this way. We never change the current RDD, since RDDs are immutable.
- Instead, we produce more RDDs out of it by applying further computations. Some common transformations supported by Spark are map(func), filter(func), mapPartitions(func), and flatMap(func).
- All transformed RDDs are lazy, in the sense of the familiar term "lazy evaluation".
- That means they do not produce their results instantly; an action is always required to complete the computation.
- To trigger execution, an action is a must. Until that action runs, the data inside the RDD is neither transformed nor available.
- Each transformation incrementally builds the lineage: the chain formed by all the parent RDDs of the final RDD (see the laziness sketch below).
- Once execution ends, the resulting RDDs can be completely different from their parent RDDs.
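A minimal laziness sketch in spark-shell (`sc` predefined): the transformations below only build up lineage; nothing executes until the action fires.

```scala
val base     = sc.parallelize(1 to 10)
val doubled  = base.map(_ * 2)        // no job runs yet
val filtered = doubled.filter(_ > 5)  // still no job

println(filtered.count())             // the action triggers the whole lineage: 8
```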
Transformations fall into two categories:
- Narrow transformations
- Wide transformations
a. Narrow Transformations
- Narrow transformations are the result of operations like map and filter, where each output partition is computed from data in a single parent partition only. In that sense each partition is self-contained.
- Every partition of the output RDD originates from exactly one partition of the parent RDD.
- Apache Spark groups consecutive narrow transformations into a single stage. That process is known as pipelining.
- Pipelining is an implementation mechanism in which multiple operations overlap in execution rather than materializing intermediate results.
- Spark divides the resulting pipeline into stages automatically (see the sketch below).
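A small pipelining sketch, assuming spark-shell: with only narrow transformations in the chain, `toDebugString` shows the whole lineage within a single stage (one indentation level, no shuffle).

```scala
val pipelined = sc.parallelize(1 to 100)
  .map(_ + 1)          // narrow: one parent partition per output partition
  .filter(_ % 2 == 0)  // narrow: pipelined into the same stage as map

println(pipelined.toDebugString)  // lineage with no shuffle boundary
```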
b. Wide Transformations
- Wide transformations are the result of operations like groupByKey and reduceByKey, where the data may reside in many partitions of the parent RDD.
- All of the data required to compute the records in a single output partition may live in many partitions of the parent. Wide transformations are also known as shuffle transformations, although they may or may not actually depend on a shuffle.
- Shuffling means redistributing data across partitions; in other words, shuffling is the process of transferring data between stages (see the sketch below).
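For contrast, a sketch of a wide transformation in spark-shell: the groupByKey below forces a shuffle, which appears in the lineage as a ShuffledRDD and an extra stage.

```scala
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()  // wide: values for a key come from many partitions

println(grouped.toDebugString)    // lineage now contains a ShuffledRDD (stage boundary)
```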
2. Actions
- An action is an operation that triggers execution of the accumulated RDD transformations and returns the result to storage or to the driver program.
- Transformations return new RDDs, while actions return other data types. Actions produce non-RDD values, and they force evaluation of all the transformations needed for the RDD they are called on.
- RDD lineage (also called the RDD operator graph or RDD dependency graph) is the graph of all the parent RDDs of an RDD.
- This graph is built as a result of applying transformations to the RDD, and it forms a logical execution plan.
- The logical execution plan starts with the earliest RDDs and ends with the RDD whose action was called; executing that plan produces the action's result.
- Actions are one of the two ways to send data from executors to the driver. Executors are the agents responsible for executing individual tasks, while the driver coordinates the execution of those tasks.
- Accordingly, an action eliminates the laziness of the RDD and sets the computation in motion (see the sketch below).
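A few representative actions, sketched in spark-shell: each call forces evaluation and hands a plain Scala value back to the driver instead of a new RDD.

```scala
val rdd = sc.parallelize(1 to 5)

println(rdd.count())                   // Long: 5
println(rdd.collect().mkString(", "))  // array to the driver: 1, 2, 3, 4, 5
println(rdd.reduce(_ + _))             // Int: 15
println(rdd.take(2).mkString(", "))    // first two elements: 1, 2
```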
Common Transformation Operations
map:
- The map transformation applies an operation to each element of the collection.
- In Spark, map passes each element of the source through a function and forms a new distributed dataset.
Example:
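A possible map example in spark-shell (the word list is made up for illustration):

```scala
val words   = sc.parallelize(Seq("spark", "rdd", "map"))
val lengths = words.map(w => (w, w.length))  // one output element per input element

lengths.collect().foreach(println)
// (spark,5)
// (rdd,3)
// (map,3)
```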
flatMap:
- flatMap is also a transformation operation. When we apply it, it runs on each element of the RDD and produces a new RDD from the results.
- It is quite similar to map. The difference is that flatMap applies to one element but may give many results for it.
- That means a single input element may yield zero, one, two, or more outputs; in that sense flatMap is one step beyond map.
Example:
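A possible flatMap example in spark-shell (the input lines are made up for illustration):

```scala
val lines = sc.parallelize(Seq("hello spark", "hello rdd"))
val words = lines.flatMap(line => line.split(" "))  // one line may yield many words

println(words.collect().mkString(", "))  // hello, spark, hello, rdd
```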
filter:
- The filter function returns a new dataset formed by selecting those elements of the source on which the function returns true.
- In other words, it retains only the elements that satisfy the given condition.
Example:
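A possible filter example in spark-shell:

```scala
val numbers = sc.parallelize(1 to 10)
val evens   = numbers.filter(_ % 2 == 0)  // keep only elements where the predicate is true

println(evens.collect().mkString(", "))   // 2, 4, 6, 8, 10
```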
groupByKey:
- The groupByKey function is a frequently used transformation operation that performs a shuffle of the data.
- It receives key-value pairs (K, V) as input, groups the values by key, and generates a dataset of (K, Iterable[V]) pairs as output.
Example:
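A possible groupByKey example in spark-shell (the (fruit, count) pairs are made up for illustration; output order may vary):

```scala
val sales   = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))
val grouped = sales.groupByKey().mapValues(_.toList)  // (K, V) -> (K, Iterable[V])

grouped.collect().foreach(println)
// e.g. (banana,List(1))
//      (apple,List(2, 3))
```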
reduceByKey:
- The reduceByKey function is a frequently used transformation operation that performs aggregation of data.
- It receives key-value pairs (K, V) as input, aggregates the values for each key with the supplied function, and generates a dataset of (K, V) pairs as output.
Example:
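A possible reduceByKey example in spark-shell (same made-up pairs as above; output order may vary):

```scala
val sales  = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))
val totals = sales.reduceByKey(_ + _)  // merge values per key: (K, V) -> (K, V)

totals.collect().foreach(println)
// e.g. (banana,1)
//      (apple,5)
```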
Difference between reduceByKey and groupByKey:
- Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation.
- The key difference is that reduceByKey performs a map-side combine (a partial aggregation inside each partition before the shuffle) while groupByKey does not, so reduceByKey usually sends far less data across the network (see the sketch below).
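The same word count both ways, sketched in spark-shell, to make the difference concrete:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

// reduceByKey combines within each partition first, so only partial
// sums cross the network during the shuffle.
val viaReduce = pairs.reduceByKey(_ + _)

// groupByKey ships every raw (key, value) pair across the network
// before any summing happens.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

println(viaReduce.collect().toList)  // e.g. List((a,2), (b,1))
println(viaGroup.collect().toList)   // same result, higher shuffle cost
```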
That’s all for Spark RDD.
Thank you.
See you in the next blog!