CuriosityX: RDDs — The backbone of Apache Spark

Knoldus Inc.
Knoldus - Technical Insights
6 min read · Jul 30, 2018

In our last blog, we looked at using Spark Streaming to transform and transport data between Kafka topics. After reading it, many readers asked us for a brief description of the RDDs we used there. So, this blog is dedicated entirely to RDDs in Spark.

So let’s start with the most basic question that comes to mind: what is an RDD?
To answer that, let’s start with what the acronym stands for.

RDD — Resilient Distributed Datasets

  • Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and therefore able to recompute missing or damaged partitions caused by node failures.
  • Distributed with data residing on multiple nodes in a cluster.
  • Dataset is a collection of partitioned data.

Now that we know what RDD stands for, let’s try to understand it.

Let’s start with the definition.

Definition says:

A Resilient Distributed Dataset (RDD) is simply a distributed collection of elements and the main programming abstraction in Spark.

The RDD is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

RDDs can be created in 2 ways:

1. Parallelizing an existing collection

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

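The original post showed a screenshot of the code here; below is a minimal sketch of the same idea. The application name, master URL, and sample collection are illustrative, not taken from the original screenshot.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: set up a SparkContext (running locally here) and
// parallelize a plain Scala collection into an RDD.
val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
val sc   = new SparkContext(conf)

val numbers    = Seq(1, 2, 3, 4, 5)      // collection in the driver program
val numbersRDD = sc.parallelize(numbers) // distributed dataset

println(numbersRDD.count()) // 5
```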

2. Loading an external dataset from HDFS (or any other Hadoop-supported storage source)

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

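In place of the original screenshot, here is a minimal sketch of loading an external dataset; the paths are placeholders.

```scala
// Reusing `sc` from the previous sketch. Each element of the resulting RDD
// is one line of the file. The path is a placeholder; HDFS or S3 URIs such
// as "hdfs://namenode:8020/data/input.txt" work the same way.
val linesRDD = sc.textFile("/tmp/input.txt")

println(linesRDD.count()) // number of lines in the file
```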

Terminology Related to RDDs

  • Partition: A partition in Spark is an atomic chunk of data (a logical division of data) stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark, and an RDD is simply a collection of partitions.
  • Executor: An executor is a distributed agent responsible for executing tasks. An executor typically runs for the entire lifetime of a Spark application, which is called static allocation of executors (but you can also opt in to dynamic allocation).
  • Driver: The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
  • SparkContext: To execute any operation in Spark, you first have to create an object of the SparkContext class. A SparkContext represents the connection to an existing Spark cluster and provides the entry point for interacting with Spark; we need a SparkContext instance so that we can interact with Spark and distribute our jobs.
  • Spark RDD Lineage Graph: RDDs maintain a graph of how one RDD is transformed into another, called the lineage graph, which helps Spark recompute any intermediate RDD in case of failure. If we lose a partition of an RDD, we can replay the transformations on that partition from the lineage to achieve the same computation, rather than replicating the data across multiple nodes. This characteristic is the biggest benefit of RDDs: it saves a lot of effort in data management and replication, gives faster computation, and helps Spark stay fault tolerant.
  • Lazy evaluation: Data is not loaded until it is necessary, which avoids tying up memory in advance. Spark uses lazy evaluation to reduce the number of passes it has to take over the data by chaining operations together: transformations are computed only when an action requires a result for the driver program (see the sketch after this list).
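To make the lazy-evaluation point concrete, here is a small sketch, reusing the `sc` from the earlier sketches; the log path and filter condition are made up.

```scala
// A minimal sketch of lazy evaluation.
val lines   = sc.textFile("/tmp/app.log")       // nothing is read yet
val errors  = lines.filter(_.contains("ERROR")) // still nothing executed
val shouted = errors.map(_.toUpperCase)         // only a lineage is recorded

// This action finally triggers the read and the chained transformations.
println(shouted.count())
```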

Why RDDs?


The Answer is Simple. Data Sharing is Slow in MapReduce.


Let’s understand this in a bit more detail:

  • In most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system (e.g., HDFS).
  • Data sharing is slow in MapReduce due to replication, serialization, and disk I/O.
  • In fact, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Iterative Operations on MapReduce

(Figure: iterative operations on MapReduce, where each job writes its intermediate results to stable storage before the next job reads them back.)

Here come RDDs to the rescue :

  • Recognizing this problem, RDDs were designed to support in-memory processing.
  • This means the state is stored in memory as an object across jobs, and that object is shareable between those jobs.
  • Sharing data in memory is 10 to 100 times faster than sharing it over the network or via disk.

Iterative Operations on RDD

(Figure: iterative operations on RDDs, where intermediate results are kept in distributed memory instead of stable storage.)
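As a rough illustration of in-memory data sharing across jobs (the dataset and CSV layout below are made up), an RDD can be cached once and then reused by several actions without re-reading it from stable storage:

```scala
// Sketch only: cache an RDD so repeated jobs reuse the in-memory copy
// instead of re-reading it from disk. The rating is assumed to be the
// third column of the CSV, purely for illustration.
val ratings = sc.textFile("/tmp/ratings.csv")
  .map(line => line.split(",")(2).toDouble)

ratings.cache() // equivalent to persist() with the default MEMORY_ONLY level

// Each action below is a separate job, but they all reuse the cached data.
val count = ratings.count()
val total = ratings.sum()
println(s"average rating: ${total / count}")
```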

RDD Operations

RDDs support two types of operations:

  • Transformations
  • Actions

Transformation Operations

  • A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output.
  • Transformations are lazy in nature, i.e., they are not executed immediately but only when we call an action.

There are two types of transformations:

  • Narrow transformation — In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD; only a limited subset of partitions is needed to calculate the result. map() and filter() are examples of narrow transformations.
  • Wide transformation — In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD. groupByKey() and reduceByKey() are examples of wide transformations (a short sketch of both kinds follows below).
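A small sketch contrasting the two kinds of transformations; the sample words are made up. The map() and filter() calls are narrow, while reduceByKey() triggers a shuffle.

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark", "scala"))

// Narrow: each output partition depends on exactly one parent partition.
val pairs    = words.map(word => (word, 1))
val nonEmpty = pairs.filter { case (word, _) => word.nonEmpty }

// Wide: output partitions may depend on many parent partitions,
// so the data is shuffled across the cluster.
val counts = nonEmpty.reduceByKey(_ + _)

println(counts.collect().mkString(", ")) // e.g. (spark,2), (rdd,1), (scala,1)
```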
Action Operations
  • Actions are operations that return a result to the driver program or write it to storage and kick off a computation, such as count() and first().
  • Actions force the evaluation of the transformations required for the RDD they were called on since they need to actually produce output.
  • An action is one of the ways of sending data from the executors to the driver.

Some of the actions in spark include :

  • reduce()
  • collect()
  • count()
  • first()
  • take()
  • takeSample()
  • countByKey()
  • saveAsTextFile()
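Here is a minimal sketch of a few of these actions on a small sample RDD; the data and output path are illustrative.

```scala
val nums = sc.parallelize(1 to 10)

println(nums.reduce(_ + _))                                  // 55: combine all elements
println(nums.count())                                        // 10: number of elements
println(nums.first())                                        // 1: first element
println(nums.take(3).toList)                                 // List(1, 2, 3)
println(nums.takeSample(withReplacement = false, 2).toList)  // 2 random elements
println(nums.collect().length)                               // 10: all elements on the driver

// Writes one text file per partition; fails if the directory already exists.
nums.saveAsTextFile("/tmp/nums-output")
```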

You can explore the Apache Spark documentation to learn a bit more about the available actions.

I hope now you have an idea about what RDDs are and how we can use them.
Just to conclude, we can say that RDDs are:

  • Immutable
  • Partitioned collections of objects spread across a cluster
  • Stored in RAM or on disk
  • Built through lazy parallel transformations and
  • Automatically rebuilt on failure.

References: Apache Spark Documentation

Hope This Helps. Stay Tuned for More. 🙂

