Things You Must Know About Spark RDDs

Introduction, Need and Features Of RDDs

Sanket Gargate
The Startup
4 min read · Dec 9, 2019


What are RDDs?

The Resilient Distributed Dataset (RDD) is the core element of the Spark API. RDDs are read-only, partitioned collections of records. They are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

Resilient:

Being immutable and fault-tolerant, RDDs are able to recover from failures.

Distributed:

The data is logically partitioned and distributed across the cluster.

Datasets:

It holds data on which different operations can be performed.

Spark’s higher-level structured APIs, like DataFrames and Datasets, compile down to RDDs. RDDs give the programmer complete control because the records are just Java, Scala, or Python objects of the programmer’s choosing. We can store anything we want in these objects, in any format of our choosing.
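To make that concrete, here is a rough Scala sketch; the Sensor case class and the local SparkContext configuration are made up for illustration, and later sketches in this post reuse the same sc:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type -- an RDD can hold any object we define
case class Sensor(id: String, reading: Double)

val conf = new SparkConf().setAppName("rdd-sketches").setMaster("local[*]")
val sc   = new SparkContext(conf)

// The RDD holds plain Scala objects of our choosing, in whatever shape we like
val readings = sc.parallelize(Seq(Sensor("a", 1.2), Sensor("b", 3.4)))
println(readings.map(_.reading).sum())   // operate directly on the objects -> 4.6
```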

RDDs let programmers explicitly persist intermediate results in memory, control their partitioning to optimize data storage, and offer two types of operations to manipulate data in a distributed fashion: transformations and actions.

Transformations construct a new RDD from a previous one and are evaluated lazily, e.g. map, filter, join.

Actions are operations that return a value to the application or export data to a storage system and are evaluated eagerly, e.g. count, collect, save.
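The distinction is easy to see in code. A minimal Scala sketch, assuming the SparkContext sc created earlier and an illustrative output path:

```scala
// Transformations (lazy): nothing runs on the cluster yet
val nums    = sc.parallelize(1 to 1000)
val evens   = nums.filter(_ % 2 == 0)        // transformation
val squared = evens.map(n => n * n)          // transformation

// Actions (eager): each of these triggers actual computation
println(squared.count())                     // action -> 500
println(squared.take(5).mkString(", "))      // action -> 4, 16, 36, 64, 100
squared.saveAsTextFile("/tmp/squares")       // action -> exports data to storage (path is illustrative)
```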

Why Do We Need RDDs?

There are times when higher-level APIs cannot provide the functionality needed to solve the problem at hand. Frameworks like MapReduce lack abstractions for leveraging distributed memory. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools, where keeping data in memory can improve performance drastically. RDDs are a good fit for many parallel applications because these applications naturally apply the same operation to multiple data items, and they enable efficient data reuse in a broad range of applications.

Features Of RDDs:

Immutable

Once created, RDDs never change. Immutability reduces complexity, and having immutable collections allows Spark to provide important fault-tolerance guarantees in a straightforward manner. Immutable data is also safe to share across multiple processes and multiple threads.

Although individual RDDs are immutable, it is possible to implement mutable state by having multiple RDDs represent multiple versions of a dataset.
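A small illustration in Scala (assuming the same sc): each transformation yields a brand-new RDD, so "versions" of a dataset are simply separate RDDs:

```scala
// RDDs are never modified in place; each transformation yields a new RDD,
// so mutable state is modelled as a sequence of dataset versions.
val v1 = sc.parallelize(Seq("alice" -> 10, "bob" -> 20))   // version 1
val v2 = v1.mapValues(_ + 5)                               // version 2: a new RDD, v1 is untouched
val v3 = v2.filter { case (_, score) => score > 20 }       // version 3

println(v1.collect().toList)   // List((alice,10), (bob,20)) -- the original is unchanged
println(v3.collect().toList)   // List((bob,25))
```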

Partitioning

Programmers can control how records are partitioned across machines based on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will be joined together are hash-partitioned in the same way.
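A sketch of co-partitioning two pair RDDs with Spark's HashPartitioner before a join (the keys and values are made up; sc is the same SparkContext as above):

```scala
import org.apache.spark.HashPartitioner

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50), (1, 2.00)))

// Hash-partition both RDDs the same way so the join does not reshuffle them
val partitioner = new HashPartitioner(8)
val usersByKey  = users.partitionBy(partitioner).persist()
val ordersByKey = orders.partitionBy(partitioner).persist()

// Records with the same key now sit in the same partition
val joined = usersByKey.join(ordersByKey)
joined.collect().foreach(println)
```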

Fault-Tolerance

Instead of using data replication, RDDs provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the dataset itself. If a node fails and part of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute it. Hence, only the subset of the dataset that resided on the failed node needs to be recomputed. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.
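We can inspect the lineage Spark records with RDD.toDebugString. A short sketch, with an illustrative HDFS path:

```scala
val base   = sc.textFile("hdfs:///logs/app.log")      // path is illustrative
val errors = base.filter(_.contains("ERROR"))         // lineage step 1
val fields = errors.map(_.split("\t"))                // lineage step 2

// Spark keeps this chain of transformations (the lineage), not a copy of the data;
// a lost partition is rebuilt by replaying the chain on the surviving input blocks.
println(fields.toDebugString)
```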

Persistence

Programmers can call a persist method to indicate which RDDs they want to reuse in future operations. We can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, through flags to persist. When we persist an RDD, each node stores in memory any partitions of it that it computes and reuses them in other actions on that dataset or on datasets derived from it. Once an RDD is computed in an action, it is kept in memory on the nodes, which speeds up future actions (often by more than 10x).
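A short sketch of the persist API and a few of its storage levels (the input path is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

val parsed = sc.textFile("hdfs:///data/events.log").map(_.split(","))

// Keep the computed partitions in memory after the first action runs;
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
parsed.persist(StorageLevel.MEMORY_ONLY)
// Other strategies: disk only, or memory replicated on two nodes
// parsed.persist(StorageLevel.DISK_ONLY)
// parsed.persist(StorageLevel.MEMORY_ONLY_2)

println(parsed.count())   // first action: computes and caches the partitions
println(parsed.count())   // later actions reuse the cached partitions
```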

Creation Of RDDs:

RDDs can be created through deterministic operations on: (1) data in stable storage such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat; (2) other RDDs; or (3) an existing collection in the driver program, by parallelizing it.
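The three routes, sketched in Scala (paths are illustrative):

```scala
// 1. From data in stable storage (a shared filesystem, HDFS, or any Hadoop InputFormat)
val fromStorage = sc.textFile("hdfs:///data/input.txt")

// 2. From another RDD, via a transformation
val fromOtherRdd = fromStorage.flatMap(_.split("\\s+"))

// 3. By parallelizing an existing collection in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4)
```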

Performance:

We can cache an RDD, which results in a performance improvement. RDD APIs are available in Python, Scala, and Java. For Scala and Java the performance is similar, but Python can lose a substantial amount of performance when using RDDs.

Due to RDDs, Spark is up to 20× faster than Hadoop for iterative applications, speeds up a real-world data analytics report by 40×, and can be used interactively to scan a 1 TB dataset with 5–7 s latency.

We should not manually create RDDs unless we have a very, very specific reason for doing so. They are a much lower-level API that provides a lot of power but also lacks a lot of the optimizations that are available in the Structured APIs.
