How Apache Spark’s Transformations And Action works…

5 min readJul 12, 2017

In previous blog you know that transformation functions produce a new Resilient Distributed Dataset (RDD). Resilient distributed datasets are Spark’s main programming abstraction and RDDs are automatically parallelized across the cluster.

Note: as you would probably expect when using Python, RDDs can hold objects of multiple types because Python is dynamically typed.

What are the RDD Operations

Transformations
Actions

Transformation

Transformations are kind of operations which will transform your RDD data from one form to another. And when you apply this operation on any RDD, you will get a new RDD with transformed data (RDDs in Spark are immutable). Operations like map, filter, flatMap are transformations.

Now there is a point to be noted here and that is when you apply the transformation on any RDD it will not perform the operation immediately. It will create a DAG(Directed Acyclic Graph) using the applied operation, source RDD and function used for transformation. And it will keep on building this graph using the references till you apply any action operation on the last lined up RDD. That is why the transformation in Spark are lazy.

Spark has certain operations which can be performed on RDD. An operation is a method, which can be applied on a RDD to accomplish certain task. RDD supports two types of operations, which are Action and Transformation. An operation can be something as simple as sorting, filtering and summarizing data.

There are two types of transformations:

Narrow transformation — In Narrow transformation, all the elements that are required to compute the records in single partition live in the single partition of parent RDD. A limited subset of partition is used to calculate the result. Narrow transformations are the result of map(), filter().
Wide transformation — In wide transformation, all the elements that are required to compute the records in the single partition may live in many partitions of parent RDD. The partition may live in many partitions of parent RDD. Wide transformations are the result of groupbyKey and reducebyKey.

Actions

Transformations create RDDs from each other, but when we want to work with the actual dataset, at that point action is performed. When the action is triggered after the result, new RDD is not formed like transformation. Thus, actions are RDD operations that give non-RDD values. The values of action are stored to drivers or to the external storage system. It brings laziness of RDD into motion.

Spark drivers and external storage system store the value of action. It brings laziness of RDD into motion.

An action is one of the ways of sending data from Executer to the driver. Executors are agents that are responsible for executing a task. While the driver is a JVM process that coordinates workers and execution of the task.

Example

MAP(func)

Map transformation returns a new RDD by applying a function to each element of this RDD

>>> baby_names = sc.textFile(“baby_names.csv”)

>>> rows = baby_names.map(lambda line: line.split(“,”))

So, in this transformation example, we’re creating a new RDD called “rows” by splitting every row in the baby_names RDD. We accomplish this by mapping over every element in baby_names and passing in a lambda function to split by commas. From here, we could use Python to access the array

>>> for row in rows.take(rows.count()): print(row[1])

First Name

DAVID

JAYDEN

FLATMAP(func)

flatMap is similar to map, because it applies a function to all elements in a RDD. But, flatMap flattens the results.

Compare flatMap to map in the following

>>> sc.parallelize([2, 3, 4]).flatMap(lambda x: [x,x,x]).collect()

[2, 2, 2, 3, 3, 3, 4, 4, 4]

>>> sc.parallelize([1,2,3]).map(lambda x: [x,x,x]).collect()

[[1, 1, 1], [2, 2, 2], [3, 3, 3]]

This is helpful with nested datasets such as found in JSON.

Adding collect to flatMap and map results was shown for clarity. We can focus on Spark aspect (re: the RDD return type) of the example if we don’t use collect:

>>> sc.parallelize([2, 3, 4]).flatMap(lambda x: [x,x,x])

PythonRDD[36] at RDD at PythonRDD.scala:43

FILTER(func)

Create a new RDD bye returning only the elements that satisfy the search filter. For SQL minded, think where clause.

>>> rows.filter(lambda line: “MICHAEL” in line).collect()

[[u’2013', u’MICHAEL’, u’QUEENS’, u’M’, u’155'],

[u’2013', u’MICHAEL’, u’KINGS’, u’M’, u’146'],

[u’2013', u’MICHAEL’, u’SUFFOLK’, u’M’, u’142']…

count()

Action count() returns the number of elements in RDD.

For example: RDD has values {1, 2, 2, 3, 4, 5, 5, 6} in this RDD “rdd.count()” will give the result 8.

Count() example:

val data = spark.read.textFile("spark_test.txt").rdd

val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value=="spark")

println(mapFile.count())

collect()

The action collect() is the common and simplest operation that returns our entire RDDs content to driver program. The application of collect() is unit testing where the entire RDD is expected to fit in memory. As a result, it makes easy to compare the result of RDD with expected result.

Action Collect() had a constraint that all the data should fit in the machine, and copies to the driver.

Collect() example:

val data = spark.sparkContext.parallelize(Array(('A',1),('b',2),('c',3)))

val data2 =spark.sparkContext.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))

val result = data.join(data2)

println(result.collect().mkString(","))

Conclusion

In conclusion, on applying a transformation to an RDD creates another RDD. As a result of this RDDs are immutable in nature. On the introduction of an action on an RDD, the result gets computed. Thus, this lazy evaluation decreases the overhead of computation and make the system more efficient.

If you have any query about RDD Transformations and Action APIs, So, feel free to share with us. We will be happy to solve them.

Reference

https://www.supergloo.com/fieldnotes/apache-spark-transformations-python-examples/

https://jaceklaskowski.gitbooks.io/mastering-apache-spark-2/spark-rdd.html

http://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf