Beneath RDD (Resilient Distributed Dataset) in Apache Spark

Gangadhar Kadam
Sep 3, 2018 · 7 min read


  • An RDD (Resilient Distributed Dataset) is the primary data abstraction in Apache Spark and the core of Spark that we often refer to as “Spark Core”.
  • An RDD is a resilient and distributed collection of records spread over one or many partitions.
  • One could compare RDDs to collections in Scala, i.e. an RDD is computed on many JVMs while a Scala collection lives on a single JVM.
  • RDDs are a container of instructions on how to materialize big (arrays of) distributed data, and how to split it into partitions so Spark (using executors) can hold some of them.
  • RDDs are the fundamental unit of data in Spark.
  • An RDD belongs to one and only one SparkContext. You cannot share RDDs between contexts.

The features of RDDs (decomposing the name):

  • Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.
  • Distributed with data residing on multiple nodes in a cluster.
  • Dataset is a collection of partitioned data with primitive or compound values, e.g. tuples or other objects (in Python, Scala or Java).

Besides the above traits (that are directly embedded in the name of the data abstraction — RDD) it has the following additional traits:

  • In-Memory, i.e. data inside an RDD is stored in memory as much (size) and as long (time) as possible.
  • Immutable or Read-Only, i.e. it does not change once created and can only be transformed using transformations into new RDDs.
  • Lazy evaluated, i.e. transformations on an RDD don’t get performed immediately. Spark internally records metadata to track the operations. Lazy evaluation reduces the number of passes over the data by grouping the operations (see the sketch after this list).
(Figure: T = transformation, A = action.)
  • Cacheable, i.e. you can hold all the data in a persistent “storage” like memory (default and the most preferred) or disk (the least preferred due to access speed).
  • Parallel, i.e. process data in parallel.
  • Typed — RDD records have types, e.g. Long in RDD[Long] or (Int, String) in RDD[(Int, String)].
  • Partitioned — records are partitioned (split into logical partitions) and distributed across nodes in a cluster.
  • Location-Stickiness — an RDD can define placement preferences to compute partitions (as close to the records as possible).
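A minimal sketch of laziness, immutability and caching, assuming a live SparkContext named sc (e.g. in spark-shell):

val numbers = sc.parallelize(1 to 1000000)   // RDD[Int]; nothing is computed yet
val doubled = numbers.map(_ * 2)             // a new RDD; `numbers` itself is unchanged
val evens = doubled.filter(_ % 4 == 0)       // still no job has run (lazy evaluation)
evens.cache()                                // mark for in-memory persistence
evens.count()                                // first action: triggers the whole chain
evens.take(5)                                // served from the cached partitions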

The motivation to create RDD

The motivation was two types of applications that the computing frameworks of the time handled inefficiently:

  • iterative algorithms in machine learning and graph computations.
  • interactive data mining tools as ad-hoc queries on the same dataset.

The goal is to reuse intermediate in-memory results across multiple data-intensive workloads with no need for copying large amounts of data over the network.

Intrinsic properties of the RDD:

abstract class RDD[T] {
  // compute the records of a given partition
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // the set of partitions the dataset is divided into
  def getPartitions: Array[Partition]
  // the parent RDDs this RDD depends on
  def getDependencies: Seq[Dependency[_]]
  // optional locality hints: preferred hosts for a partition
  def getPreferredLocations(split: Partition): Seq[String] = Nil
  // optional partitioner for key-value RDDs
  val partitioner: Option[Partitioner] = None
}
  • A list of parent RDDs that are the dependencies of the RDD.
  • An array of partitions that the dataset is divided into.
  • A compute function to do a computation on partitions.
  • An optional Partitioner that defines how keys are hashed and how the pairs are partitioned (for key-value RDDs).
  • Optional locality info, i.e. hosts for a partition where the records live or are the closest to read from. (Each of these can be inspected on a concrete RDD, as sketched below.)
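A short sketch of inspecting these properties, assuming a SparkContext named sc:

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")), numSlices = 4)
pairs.partitions.length                        // 4 partitions
pairs.dependencies                             // empty for a source RDD
pairs.partitioner                              // None until e.g. partitionBy or reduceByKey
pairs.preferredLocations(pairs.partitions(0))  // locality hints; Nil for a parallelized collection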

Types of RDDs

These are some of the most interesting types of RDDs:

  • ParallelCollectionRDD — an RDD created from a local Scala collection, e.g. by SparkContext.parallelize or SparkContext.makeRDD.
  • CoGroupedRDD — an RDD that cogroups its parent RDDs, i.e. the result of a cogroup operation.
  • HadoopRDD is an RDD that provides core functionality for reading data stored in HDFS using the older MapReduce API. The most notable use case is the return RDD of SparkContext.textFile.
  • MapPartitionsRDD — a result of calling operations like map, flatMap, filter, mapPartitions, etc.
  • CoalescedRDD — a result of repartition or coalesce transformations.
  • ShuffledRDD — a result of shuffling, e.g. after repartition or coalesce transformations.
  • PipedRDD — an RDD created by piping elements to a forked external process.
  • PairRDD (implicit conversion via PairRDDFunctions) — an RDD of key-value pairs, e.g. the result of groupByKey and join operations.
  • DoubleRDD (implicit conversion via org.apache.spark.rdd.DoubleRDDFunctions) — an RDD of Double values.
  • SequenceFileRDD (implicit conversion via org.apache.spark.rdd.SequenceFileRDDFunctions) — an RDD that can be saved as a SequenceFile.

Appropriate operations for a given RDD type are automatically available on an RDD of the right type, e.g. RDD[(Int, Int)], through implicit conversion in Scala.
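A small sketch of how these types show up in practice, assuming spark-shell with a SparkContext sc and an input path that is only an illustration:

val lines = sc.textFile("/my/directory/*.txt")   // HadoopRDD wrapped in a MapPartitionsRDD
val words = lines.flatMap(_.split(" "))          // MapPartitionsRDD
val pairs = words.map(word => (word, 1))         // MapPartitionsRDD; PairRDDFunctions now apply
val counts = pairs.reduceByKey(_ + _)            // ShuffledRDD
println(counts.toDebugString)                    // prints the lineage with RDD types and partition counts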

RDD Operations

There are two types of operations we can perform on RDDs: transformations and actions.

A transformation

  • is a lazy operation on a RDD that returns another RDD, like map, flatMap, filter, reduceByKey, join, cogroup, etc.
  • There are two kinds of transformations: narrow transformations and wide transformations.

Narrow Transformations:

  • Common narrow transformations are map, flatMap, mapPartitions, filter, sample and union.
  • In narrow transformations the input and output data stay in the same partition, i.e. each partition is self-sufficient.
  • An output RDD has partitions with records that originate from a single partition in the parent RDD.
  • Only a limited subset of partitions is used to calculate the result.
  • Spark groups narrow transformations into a single stage, which is known as pipelining (see the sketch after this list).
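A sketch of a purely narrow pipeline, assuming a SparkContext named sc:

val nums = sc.parallelize(1 to 100, numSlices = 4)
val squared = nums.map(n => n * n)          // narrow: each record stays in its partition
val evens = squared.filter(_ % 2 == 0)      // narrow: still no data movement
println(evens.toDebugString)                // a single pipelined lineage, no shuffle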

Wide Transformations:

  • Common wide transformations are groupByKey, reduceByKey, coalesce, repartition, join and intersection.
  • The data required to compute the records in a single partition may live in many partitions of the parent RDD.
  • Wide transformations are also known as shuffle transformations because a data shuffle is required to process the data (see the sketch after this list).
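A sketch of a wide transformation that forces a shuffle, assuming a SparkContext named sc:

val sales = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 4)
val totals = sales.reduceByKey(_ + _)   // wide: produces a ShuffledRDD behind a stage boundary
println(totals.toDebugString)           // the lineage shows the ShuffledRDD introduced by the shuffle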

An action is an operation that triggers execution of RDD transformations and returns a value (to a Spark driver — the user program).

Common Transformations & Actions
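A few commonly used transformations and actions side by side, in a minimal sketch assuming a SparkContext named sc (the output path is only an illustration):

val rdd = sc.parallelize(Seq("spark", "rdd", "spark", "core"))

// Transformations (lazy, return a new RDD)
val upper = rdd.map(_.toUpperCase)
val unique = rdd.distinct()
val counts = rdd.map(word => (word, 1)).reduceByKey(_ + _)

// Actions (eager, return a value to the driver or write data out)
rdd.count()                                 // 4
rdd.first()                                 // "spark"
counts.collect()                            // e.g. Array((spark,2), (rdd,1), (core,1))
counts.saveAsTextFile("/tmp/word-counts")   // illustrative output path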

Why use RDDs

  1. Offer control and flexibility
  2. Low-level API
  3. Type-safe
  4. Encourage a “how to” (imperative) style of programming

When to use RDDs

  • You want a low-level API and control of the dataset
  • You are dealing with unstructured data
  • You want to manipulate data with lambda functions rather than domain-specific expressions
  • You don’t care about the schema or structure of the data
  • You can sacrifice some of the performance and optimizations available to DataFrames and Datasets

Limitations of RDD

  • They express how to compute a result, but not what result is required.
  • Not optimized — when working with structured data, RDDs cannot take advantage of Spark’s advanced optimizers, including the Catalyst optimizer and the Tungsten execution engine. Developers need to optimize each RDD based on its attributes.
  • No schema inference — unlike DataFrames and Datasets, RDDs don’t infer the schema of the ingested data and require the user to specify it.
  • Slow performance — being in-memory JVM objects, RDDs involve the overhead of garbage collection and Java serialization, which are expensive when data grows.
  • Storage limitation — RDDs degrade when there is not enough memory to store them.
  • Slower for non-JVM languages like Python.

Ways to Create RDD

  1. Using a parallelized collection:
  • Parallelized collections are created by calling one of SparkContext’s methods:
sc.parallelize(coll, numSlices)
sc.makeRDD(coll, numSlices)
sc.range(start, end, step, numSlices)
  • The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
  • To use this method, the entire dataset must first exist on one machine (the driver). Because of this, parallelized collections are rarely used outside of testing and prototyping.

For example,
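assuming a SparkContext named sc (a minimal sketch):

val data = Seq(1, 2, 3, 4, 5)
val rdd1 = sc.parallelize(data, numSlices = 3)            // RDD[Int] split across 3 partitions
val rdd2 = sc.makeRDD(data)                               // equivalent to parallelize
val rdd3 = sc.range(0L, 100L, step = 2, numSlices = 4)    // RDD[Long] of 0, 2, 4, ..., 98
rdd1.reduce(_ + _)                                        // 15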

2. From external datasets:

  • Spark RDDs can be created from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
  • Text file RDDs can be created using the sc.textFile(name, partitions) method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines.
  • An RDD of pairs of a file and its content can be created from a directory using sc.wholeTextFiles(name, partitions). This method lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file (see the sketch after the notes below).

Some notes on reading files with Spark:

  • If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
  • All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
  • The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
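A short sketch of both methods, assuming a SparkContext named sc; the paths are only placeholders:

val lines = sc.textFile("hdfs:///data/logs/*.gz", 8)      // RDD[String], one record per line, at least 8 partitions
val files = sc.wholeTextFiles("/data/small-text-files")   // RDD[(String, String)] of (filename, content) pairs
lines.getNumPartitions
files.keys.take(3)                                        // the first few file names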

3. From existing Apache Spark RDDs:

We can create new RDDs from existing ones. This process of creating another dataset from an existing one is a transformation, and a transformation always produces a new RDD. Because RDDs are immutable, the original never changes once created; this property maintains consistency over the cluster.

Some of the operations performed on RDDs are map, filter, count, distinct, flatMap, etc.
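A minimal sketch of deriving new RDDs from an existing one, assuming a SparkContext named sc:

val words = sc.parallelize(Seq("spark", "rdd", "spark", "lineage"))
val lengths = words.map(_.length)            // a new RDD[Int]; `words` is untouched
val unique = words.distinct()                // another new RDD
val short = words.filter(_.length <= 4)      // yet another new RDD
words.count()                                // 4 — the parent RDD is unchanged
unique.count()                               // 3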

Conclusion

In conclusion, the shortcomings of Hadoop MapReduce were significant, and Spark RDDs overcame them by introducing in-memory processing, immutability, and more. But RDDs have limitations of their own, for example no built-in optimization and the storage and performance constraints described above.

Because of the above-stated limitations of RDDs, and to make Spark more versatile, DataFrames and Datasets evolved.

Reference:

http://spark.apache.org/
