Spark Basics: RDDs, Stages, Tasks and DAG

saurabh goyal · Sep 4, 2018

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark: immutable, distributed collections of objects.

RDDs

An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of objects. An RDD is a logical reference to a dataset that is partitioned across many machines in the cluster. RDDs are immutable and are automatically recovered in case of failure. An RDD can be created from any data source, e.g. text files, a database via JDBC, etc.

Creating an RDD

// read a text file into an RDD with 3 partitions
val rdd = sc.textFile("/some_file", 3)
// turn a local collection into an RDD
val lines = sc.parallelize(List("this is", "an example"))

The argument ‘3’ in the call to sc.textFile() specifies the number of partitions the RDD should be split into.

Partitions

An RDD is a collection of data; if the data cannot fit on a single node, it is partitioned across several nodes. This means the more partitions an RDD has, the more parallelism is possible. These partitions are distributed across the nodes in the cluster.
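As a small spark-shell sketch (the file path is a placeholder), you can inspect and change an RDD’s partition count:

```scala
// read a file into 3 partitions ("/some_file" is hypothetical)
val rdd = sc.textFile("/some_file", 3)
println(rdd.getNumPartitions)   // 3

// repartition to increase parallelism (this triggers a shuffle)
val more = rdd.repartition(6)
println(more.getNumPartitions)  // 6
```

Increasing partitions lets more tasks run in parallel, at the cost of a shuffle and some per-partition overhead.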

RDDs Operations(Transformations and Actions)

There are two types of operations you can perform on an RDD: transformations and actions. A transformation applies a function to an RDD and creates a new RDD; it does not modify the RDD it is applied to (remember that RDDs are immutable). The new RDD also keeps a pointer to its parent RDD.
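For example, map, flatMap and filter are transformations (lazy, each returning a new RDD while leaving the parent untouched), while count and collect are actions that trigger the actual computation:

```scala
val lines = sc.parallelize(List("this is", "an example"))

// transformations: lazily describe new RDDs; nothing runs yet
val words   = lines.flatMap(_.split(" "))
val longish = words.filter(_.length > 2)

// actions: force evaluation of the lineage on the cluster
println(longish.count())    // 2 — "this" and "example"
println(longish.collect().mkString(", "))
```

Because transformations are lazy, Spark only builds the lineage graph here; work happens when count() or collect() is called.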
