RDD in Spark

Apache Spark is an open source big data processing framework built for high speed and sophisticated analytics. Apache spark provides programmers with application programming interface centered on a data structure called the resilient distributed dataset (RDD).

RDD is a fundamental data structures in spark , it is Resilient because it is fault-tolerant with the help of RDD lineage graph and so able to re-compute missing or damaged partitions due to node failures. distributed means each RDD is divided into multiple partitions, each of the partition can reside in memory or stored disk of different machines in a cluster.

Fig1: RDD divided into multiple partitions

· Data are partitioned into logical chunks, this logical division of data is internal and each partition contains chunks of data.

· These chunks of data are distributed to server machines. Server-machines in which the partitioned or chunks of data are stored are called worker nodes.

· Data Partitioning and distribution allows parallel computing in RDD’s.

Fig2: Dataset Partitioned across various server machines/worker nodes in the cluster

· Each worker node computes or performs operation on respective chunk of data . Since RDD is fault tolerant i.e. each Spark RDD remembers the lineage of the deterministic operation that was used on fault-tolerant input dataset to create it. Thus if due to any worker node failure any partition of data is lost then that partition can be re-computed from original fault-tolerant datasets using lineage operations.

Some of the properties of RDD’s are:

1. Immutable: RDD’s are only read only and does not change once created and can only be transformed.

2. Lazy-Evaluated: Data inside RDD is not available or transformed until an action is executed that triggers the execution.

3. Parallel: It processes data in parallel.

4. Partitioned: Records are partitioned.

5. Cacheable: Can hold data in a persistent storage like memory or disk.

Fig3: Fault Resiliency or Fault tolerance (DAG)

Examples for Creating RDD’s using Parallelize in Python:

The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. We can manually set number of partitions.

  1. a = range(1000)
    data =sc.parallelize(a)

Output: 1000

2. lines = sc.parallelize([“hello world”, “hi”])
 words = lines.flatMap(lambda line: line.split(” “))

Output: “hello

3. nums = sc.parallelize([1, 2, 3, 4])
 squared = nums.map(lambda x: x * x).collect()
 for num in squared:
 print “%i ” % (num)

Output: [1,4,9,16]





Related Blogs:


Show your support

Clapping shows how much you appreciated Amogh Damle’s story.