Understanding Spark-II — RDD Operations

RDD OPERATIONS

RDDs in Apache Spark support two types of operations:

  • Transformations
  • Actions

i. Transformations

Spark RDD transformations are functions that take an RDD as input and produce one or more RDDs as output. They do not change the input RDD (RDDs are immutable, so they cannot be modified in place), but always produce one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey().

Transformations are lazy operations on an RDD in Apache Spark. A transformation creates one or more new RDDs, but these are computed only when an Action occurs; it creates a new dataset from an existing one.
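
As a small illustration of this laziness, here is a minimal sketch, assuming a PySpark shell where sc (the SparkContext) is already available, as in the examples below:

doubled = sc.parallelize([1, 2, 3]).map(lambda x: x * 2)  # no computation happens yet
doubled.collect()  # the map() is executed only now, when an action is called

o/p: [2, 4, 6]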

Below are some Python scripts for RDD transformations:

  1. map() —

lines = sc.parallelize(["hello world", "hi"])
words = lines.map(lambda line: line.split(" "))
words.collect()

o/p: [['hello', 'world'], ['hi']]

2. flatMap() —

lines = sc.parallelize(["hello world", "hi"])
words = lines.flatMap(lambda line: line.split(" "))
words.collect()

o/p: ['hello', 'world', 'hi']

3. Set Operations-

We can create two RDDs to see the set-style transformation operations:

set1 = sc.parallelize(["coffee", "coffee", "panda", "monkey", "tea"])
set2 = sc.parallelize(["coffee", "money", "kitty"])

a). distinct() - returns the distinct elements of the RDD.

set1.distinct().collect()

o/p: ['panda', 'coffee', 'tea', 'monkey']

b). union() - performs the union of two RDDs, keeping duplicates.

set1.union(set2).collect()

o/p: ['coffee', 'coffee', 'panda', 'monkey', 'tea', 'coffee', 'money', 'kitty']

c). intersection() - performs the intersection of two RDDs; note in the output below that duplicates are removed.

set1.intersection(set2).collect()

o/p : ['coffee']

d). subtract() - returns the elements present in set1 but not in set2.

set1.subtract(set2).collect()

o/p: ['panda', 'tea', 'monkey']

e). filter() - returns only the elements that satisfy a predicate.

inputRDD = sc.parallelize(["message:", "error #523",
                           "cause: seg fault detected",
                           "please check you are not using windows"])
errorsRDD = inputRDD.filter(lambda x: "error" in x)
errorsRDD.collect()

o/p: ['error #523']

ii. Actions

An Action in Spark returns the final result of RDD computations. It triggers execution using the lineage graph: Spark loads the data into the original RDD, carries out all intermediate transformations, and returns the final result to the driver program or writes it out to the file system. The lineage graph is the dependency graph of all the RDDs from which an RDD was derived.

Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program. An Action is one of the ways to send results from the executors to the driver. first(), take(), reduce(), collect(), and count() are some of the Actions in Spark.

Using transformations, one can create an RDD from an existing one, but when we want to work with the actual dataset, we use an Action. Unlike a transformation, an Action does not create a new RDD; it produces a non-RDD value, which is stored either on the driver or in an external storage system. An Action sets the lazy chain of transformations in motion.
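
One way to inspect the lineage graph described above is the RDD's toDebugString() method; a minimal sketch (the exact output format varies by Spark version):

rdd = sc.parallelize([1, 2, 3, 4])
squared = rdd.map(lambda num: num ** 2)
print(squared.toDebugString())  # prints the RDD's dependency chain, i.e. its lineage graph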

Below are some Python scripts for RDD Actions:

  1. reduce()

nums = sc.parallelize([1, 2, 3, 4])
nums.reduce(lambda x, y: x + y)

o/p : 10

2. collect()

nums.map(lambda num: num**2).collect()

o/p : [1, 4, 9, 16]

3. first()-

nums.map(lambda num: num**2).first()

o/p : 1

4. take() -

nums.map(lambda num: num**2).take(3)

o/p : [1, 4, 9]
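
5. count() -

count() is also mentioned in the Actions list above; a minimal example, reusing the same nums RDD:

nums.count()

o/p : 4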

You can refer to the link below for further information:

http://data-flair.training/blogs/apache-spark-rdd-transformations-actions/
