Apache Spark RDD
May 11, 2022
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark.
- It is an immutable distributed collection of objects.
- Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
- RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
There are two ways to create RDDs:
- Parallelizing an existing collection in your driver program, or
- Referencing a dataset in an external storage system, such as a shared file system, HDFS, or HBase (see the sketch below).
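As a rough illustration, here is what both creation paths look like in Scala, assuming a spark-shell session where `sc` is the predefined SparkContext; the HDFS path is a hypothetical placeholder:

```scala
// 1. Parallelizing an existing collection in the driver program:
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Referencing a dataset in external storage (hypothetical path):
val lines = sc.textFile("hdfs:///data/sample.txt")
```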
Three characteristics define an RDD (the first two are inspected in the sketch after this list):
- Dependencies: the list of parent RDDs it was derived from, which tells Spark how to reconstruct it.
- Partitions: the logical splits of the data, which let Spark parallelize the computation across the nodes of the cluster.
- Compute function: the function Spark uses to produce the data of each partition as an iterator over its elements.
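A small sketch, again assuming spark-shell with `sc` predefined (the compute function itself is internal to Spark, so only the first two are shown):

```scala
val rdd = sc.parallelize(1 to 100, numSlices = 4).map(_ * 2)

println(rdd.partitions.length)  // 4 -- the logical partitions
println(rdd.dependencies)       // e.g. List(org.apache.spark.OneToOneDependency@...)
```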
Limitations of RDD
1. No input optimization engine: RDDs cannot take advantage of Spark's Catalyst optimizer or Tungsten execution engine, so optimization has to be done by hand.
2. Not enough memory: performance degrades when an RDD no longer fits in memory and has to be spilled to disk.
3. Runtime type safety: in dynamically typed APIs such as PySpark, type errors in RDD code only surface at run time.
4. Handling structured data: RDDs carry no schema information, so working with structured data is clumsier than with DataFrames or Datasets.
Operations on RDD
There are two different kinds of operations:
1. Transformation
2. Action
1. Transformation
- A transformation is the process of forming new RDDs from existing ones, by applying a user-specified function.
- It is the process of reshaping the current dataset into the dataset we actually want.
- We can create any number of RDDs this way. We never change the current RDD, since RDDs are immutable.
- Instead, we produce more RDDs out of it by applying further computations. Some common transformations supported by Spark are map(func), filter(func), mapPartitions(func), and flatMap(func).
- All transformed RDDs are lazy, in the sense of the familiar term "lazy evaluation".
- That means they do not produce their results instantly; an action is always required to complete the computation.
- To trigger execution, an action is a must. Until that action runs, the data inside the RDD is neither transformed nor available.
- Each transformation incrementally builds the lineage: the chain formed by all the parent RDDs of the final RDD (see the laziness sketch below).
- Once execution ends, the resulting RDDs can be completely different from their parent RDDs.
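A minimal laziness sketch in spark-shell (`sc` predefined): the transformations below only build up lineage; nothing executes until the action fires.

```scala
val base     = sc.parallelize(1 to 10)
val doubled  = base.map(_ * 2)        // no job runs yet
val filtered = doubled.filter(_ > 5)  // still no job

println(filtered.count())             // the action triggers the whole lineage: 8
```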
Transformations fall into two categories:
- Narrow transformations
- Wide transformations
a. Narrow Transformations
- Narrow transformations are the result of operations like map and filter, where each output partition is computed from data in a single parent partition only. In that sense each partition is self-contained.
- Every partition of the output RDD originates from exactly one partition of the parent RDD.
- Apache Spark groups consecutive narrow transformations into a single stage. That process is known as pipelining.
- Pipelining is an implementation mechanism in which multiple operations overlap in execution rather than materializing intermediate results.
- Spark divides the resulting pipeline into stages automatically (see the sketch below).
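A small pipelining sketch, assuming spark-shell: with only narrow transformations in the chain, `toDebugString` shows the whole lineage within a single stage (one indentation level, no shuffle).

```scala
val pipelined = sc.parallelize(1 to 100)
  .map(_ + 1)          // narrow: one parent partition per output partition
  .filter(_ % 2 == 0)  // narrow: pipelined into the same stage as map

println(pipelined.toDebugString)  // lineage with no shuffle boundary
```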
b. Wide Transformations
- Wide transformations are the result of operations like groupByKey and reduceByKey, where the data may reside in many partitions of the parent RDD.
- All of the data required to compute the records in a single output partition may live in many partitions of the parent. Wide transformations are also known as shuffle transformations, although they may or may not actually depend on a shuffle.
- Shuffling means redistributing data across partitions; in other words, shuffling is the process of transferring data between stages (see the sketch below).
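For contrast, a sketch of a wide transformation in spark-shell: the groupByKey below forces a shuffle, which appears in the lineage as a ShuffledRDD and an extra stage.

```scala
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()  // wide: values for a key come from many partitions

println(grouped.toDebugString)    // lineage now contains a ShuffledRDD (stage boundary)
```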
2. Actions
- An action is an operation that triggers execution of the accumulated RDD transformations and returns the result to storage or to the driver program.
- Transformations return new RDDs, while actions return other data types. Actions produce non-RDD values, and they force evaluation of all the transformations needed for the RDD they are called on.
- RDD lineage (also called the RDD operator graph or RDD dependency graph) is the graph of all the parent RDDs of an RDD.
- This graph is built as a result of applying transformations to the RDD, and it forms a logical execution plan.
- The logical execution plan starts with the earliest RDDs and ends with the RDD whose action was called; executing that plan produces the action's result.
- Actions are one of the two ways to send data from executors to the driver. Executors are the agents responsible for executing individual tasks, while the driver coordinates the execution of those tasks.
- Accordingly, an action eliminates the laziness of the RDD and sets the computation in motion (see the sketch below).
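A few representative actions, sketched in spark-shell: each call forces evaluation and hands a plain Scala value back to the driver instead of a new RDD.

```scala
val rdd = sc.parallelize(1 to 5)

println(rdd.count())                   // Long: 5
println(rdd.collect().mkString(", "))  // array to the driver: 1, 2, 3, 4, 5
println(rdd.reduce(_ + _))             // Int: 15
println(rdd.take(2).mkString(", "))    // first two elements: 1, 2
```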
Common Transformation Operations
map:
- The map transformation applies an operation to each element of the collection.
- In Spark, map passes each element of the source through a function and forms a new distributed dataset.
Example:
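A possible map example in spark-shell (the word list is made up for illustration):

```scala
val words   = sc.parallelize(Seq("spark", "rdd", "map"))
val lengths = words.map(w => (w, w.length))  // one output element per input element

lengths.collect().foreach(println)
// (spark,5)
// (rdd,3)
// (map,3)
```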
flatMap:
- flatMap is also a transformation operation. When we apply it, it runs on each element of the RDD and produces a new RDD from the results.
- It is quite similar to map. The difference is that flatMap applies to one element but may give many results for it.
- That means a single input element may yield zero, one, two, or more outputs; in that sense flatMap is one step beyond map.
Example:
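A possible flatMap example in spark-shell (the input lines are made up for illustration):

```scala
val lines = sc.parallelize(Seq("hello spark", "hello rdd"))
val words = lines.flatMap(line => line.split(" "))  // one line may yield many words

println(words.collect().mkString(", "))  // hello, spark, hello, rdd
```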
filter:
- The filter function returns a new dataset formed by selecting those elements of the source on which the function returns true.
- In other words, it retains only the elements that satisfy the given condition.
Example:
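A possible filter example in spark-shell:

```scala
val numbers = sc.parallelize(1 to 10)
val evens   = numbers.filter(_ % 2 == 0)  // keep only elements where the predicate is true

println(evens.collect().mkString(", "))   // 2, 4, 6, 8, 10
```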
groupByKey:
- The groupByKey function is a frequently used transformation operation that performs a shuffle of the data.
- It receives key-value pairs (K, V) as input, groups the values by key, and generates a dataset of (K, Iterable[V]) pairs as output.
Example:
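A possible groupByKey example in spark-shell (the (fruit, count) pairs are made up for illustration; output order may vary):

```scala
val sales   = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))
val grouped = sales.groupByKey().mapValues(_.toList)  // (K, V) -> (K, Iterable[V])

grouped.collect().foreach(println)
// e.g. (banana,List(1))
//      (apple,List(2, 3))
```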
reduceByKey:
- The reduceByKey function is a frequently used transformation operation that performs aggregation of data.
- It receives key-value pairs (K, V) as input, aggregates the values for each key with the supplied function, and generates a dataset of (K, V) pairs as output.
Example:
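A possible reduceByKey example in spark-shell (same made-up pairs as above; output order may vary):

```scala
val sales  = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))
val totals = sales.reduceByKey(_ + _)  // merge values per key: (K, V) -> (K, V)

totals.collect().foreach(println)
// e.g. (banana,1)
//      (apple,5)
```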
Difference between reduceByKey and groupByKey:
- Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation.
- The key difference is that reduceByKey performs a map-side combine (a partial aggregation inside each partition before the shuffle) while groupByKey does not, so reduceByKey usually sends far less data across the network (see the sketch below).
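The same word count both ways, sketched in spark-shell, to make the difference concrete:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

// reduceByKey combines within each partition first, so only partial
// sums cross the network during the shuffle.
val viaReduce = pairs.reduceByKey(_ + _)

// groupByKey ships every raw (key, value) pair across the network
// before any summing happens.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

println(viaReduce.collect().toList)  // e.g. List((a,2), (b,1))
println(viaGroup.collect().toList)   // same result, higher shuffle cost
```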
That’s all for Spark RDD.
Thank you.
See you in the next blog!