Spark Basics: RDDs, Stages, Tasks and DAG
The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark: an immutable, distributed collection of objects.
RDDs
An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of objects. An RDD is a logical reference to a dataset that is partitioned across many machines in the cluster. RDDs are immutable and are automatically recovered in case of failure. An RDD can be created from any data source, e.g. text files, a database via JDBC, etc.
Creating an RDD
val rdd = sc.textFile("/some_file",3)
val lines = sc.parallelize(List("this is","an example"))
The argument `3` in the call to sc.textFile() specifies the minimum number of partitions for the resulting RDD.
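You can inspect how an RDD was actually partitioned. A minimal sketch, assuming a live SparkContext `sc` (as in a spark-shell session) and that `/some_file` exists:

```scala
val rdd   = sc.textFile("/some_file", 3)               // ask for at least 3 partitions
val lines = sc.parallelize(List("this is", "an example"))

// getNumPartitions reports the actual partition count.
println(rdd.getNumPartitions)     // >= 3; textFile may split the file further
println(lines.getNumPartitions)   // defaults to spark.default.parallelism
```

Note that `3` is a lower bound for textFile: Spark may create more partitions depending on the input splits.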
Partitions
An RDD's data may be too large to fit on a single node, so it is partitioned across multiple nodes. The more partitions an RDD has, the more parallelism is possible, since each partition can be processed independently. These partitions are distributed across the nodes of the cluster.
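Spark also lets you change the number of partitions after an RDD is created. A brief sketch, assuming a live SparkContext `sc` and a readable `/some_file`:

```scala
val rdd = sc.textFile("/some_file", 3)

// repartition performs a full shuffle to redistribute data evenly.
val widened = rdd.repartition(8)
println(widened.getNumPartitions)   // 8

// coalesce merges partitions and avoids a full shuffle when decreasing.
val narrowed = widened.coalesce(2)
println(narrowed.getNumPartitions)  // 2
```

`repartition` is the usual choice when increasing parallelism; `coalesce` is cheaper when reducing it, because it avoids a full shuffle.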
RDD Operations (Transformations and Actions)
There are two types of operations that you can perform on an RDD: transformations and actions. A transformation applies a function to an RDD and creates a new RDD; it does not modify the RDD it is applied to (remember that RDDs are immutable). The new RDD also keeps a pointer to its parent RDD, forming a lineage that Spark uses to recompute lost partitions.
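The distinction matters because transformations are lazy: nothing executes until an action is called. A small sketch, assuming a live SparkContext `sc`:

```scala
val lines = sc.parallelize(List("this is", "an example"))

// Transformations: each returns a new RDD that points back to its parent.
val words = lines.flatMap(_.split(" "))   // lazy, no job runs yet
val upper = words.map(_.toUpperCase)      // lazy, no job runs yet

// toDebugString prints the lineage (the chain of parent RDDs).
println(upper.toDebugString)

// Actions trigger actual computation on the cluster.
println(upper.count())      // 4
println(upper.collect().mkString(", "))   // THIS, IS, AN, EXAMPLE
```

Because `upper` keeps a pointer to `words`, which points to `lines`, Spark can rebuild any lost partition by replaying this chain from the original data.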