RDD’s : Building block of Spark

Reuse of intermediate results across multiple computations is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. In most frameworks, the only way to reuse data between computations is to write it to an external stable storage system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times. Keeping data in memory can improve performance by an order of magnitude.

Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are Immutable resilient distributed collection of records. RDD is the primary data abstraction in Apache Spark and the core of Spark (that many often refer to as Spark Core). Spark is built using Scala around the concept of RDD and provides actions and transformations on top of RDD. RDDs are distributed by design and to achieve even data distribution as well as leverage data locality, they are divided into a fixed number of partitions — logical chunks (parts) of data.

A Spark application consists of a single driver program that runs the user’s main function and a set of executor programs scattered across nodes on the cluster. The driver defines one or more RDDs and invokes actions on them. Workers (aka slaves) are running Spark instances where executors live to execute tasks.

Spark Runtime Workflow

Transformations are functions that take a RDD as the input and produce one or many RDDs as the output. By applying transformations you incrementally build a RDD lineage with all the parent RDDs of the final RDD(s). This is a powerful property: in essence, makes RDD fault-tolerant (Resilient). If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. A logical plan, i.e. a DAG, is materialized and executed when SparkContext is requested to run a Spark job. Certain transformations can be pipelined which is an optimization that Spark uses to improve the performance of computations.

Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program. Simply put, an action evaluates the RDD lineage graph. Actions are one of two ways to send data from executors to the driver (the other being accumulators).

Spark manages data using partitions that helps parallelize distributed data processing with minimal network traffic for sending data between executors. RDDs get partitioned automatically without programmer intervention. However, there are times when you’d like to adjust the size and number of partitions or the partitioning scheme according to the needs of your application.

Open up the Spark shell and execute a Spark job. Any action is converted into Job which in turn is again divided into Stages, with each stage having its own set of Tasks.

sc.parallelize(1 to 100).count

When a stage executes, you can see the number of partitions for a given stage in the Spark UI.

localhost:4040/stages

By default, the number of partitions is the number of all available cores, and that’s the reason you see 8-tasks. You can request for the minimum number of partitions, using the second input parameter to many transformations.

sc.parallelize(1 to 100, 2).count

An RDD holds a reference to it’s array of partitions, which you can use to find out how many partitions there are

val randRDD = sc.parallelize(1 to 100, 30)
randRDD.partitions.size

Spark needs to launch one task per partition of the RDD. Its best that each task be sent to the machine have the partition that task is supposed to process. In that case, the task will be able to read the data of the partition from the local machine. Otherwise, the task would have to pull the partition data over the network from a different machine, which is less efficient.