Apache Spark RDD Operations

We have already discussed Spark RDDs in my blog post Power of Spark in Datascience. In this post we'll look at Spark RDD operations in detail. As we know, a Spark RDD is a distributed collection of data, and it supports two kinds of operations: transformations and actions.


  • Transformations
  • Actions
[Figure: Internal process of an RDD]

Transformation Operations

Transformations are operations that transform your RDD data from one form to another. When you apply a transformation to an RDD, you get back a new RDD with the transformed data (remember, RDDs in Spark are immutable). Operations like map, filter, and flatMap are transformations.

A point worth noting here: when you apply a transformation to an RDD, Spark does not perform the operation immediately. Instead, it builds a DAG (Directed Acyclic Graph) recording the applied operation, the source RDD, and the function used for the transformation, and it keeps extending this graph through the chained references until you apply an action to the last RDD in the line. That is why transformations in Spark are lazy.

Below are examples of transformation operations in Python:

[Figure: Transformation process]

MAP(FUNC)

The map transformation returns a new RDD by applying a function to each element of the source RDD.

[Figure: Python Spark map function example]

So, in this transformation example, we’re creating a new RDD called “rows” by splitting every row in the baby_names RDD. We accomplish this by mapping over every element in baby_names and passing in a lambda function to split by commas.
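A minimal sketch of that map transformation; the file name baby_names.csv is an assumption for illustration, and any text file of comma-separated records would do:

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-operations")

# hypothetical input: a text file of comma-separated baby-name records
baby_names = sc.textFile("baby_names.csv")

# map applies the lambda to every element, producing a new RDD
# where each element is the list of fields from one line
rows = baby_names.map(lambda line: line.split(","))
```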

FLATMAP(FUNC)

FlatMap is similar to map in that it applies a function to all elements of an RDD. But where map produces exactly one output element per input element, flatMap flattens the results: a function that returns a list of values produces an RDD of the individual values.
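A toy sketch contrasting map and flatMap, assuming the SparkContext sc from the map example above:

```python
lines = sc.parallelize(["hello world", "hi"])

# map keeps one output element per input element (here, a list per line)
print(lines.map(lambda l: l.split(" ")).collect())      # [['hello', 'world'], ['hi']]

# flatMap flattens those lists into a single RDD of words
print(lines.flatMap(lambda l: l.split(" ")).collect())  # ['hello', 'world', 'hi']
```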

FILTER(FUNC)

Create a new RDD by returning only the elements that satisfy a predicate. For the SQL-minded, think of a WHERE clause.
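A small sketch, again assuming the SparkContext sc from earlier:

```python
nums = sc.parallelize(range(10))

# keep only the elements for which the predicate returns True
evens = nums.filter(lambda x: x % 2 == 0)
print(evens.collect())  # [0, 2, 4, 6, 8]
```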

UNION(A DIFFERENT RDD)

Simple. Return the union of two RDDs. Note that union keeps duplicates; chain distinct() after it if you want set semantics.
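A toy sketch with two made-up RDDs (sc as before):

```python
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4, 5])

# union concatenates the two RDDs; the shared 3 appears twice
print(a.union(b).collect())  # [1, 2, 3, 3, 4, 5]
```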

INTERSECTION(A DIFFERENT RDD)

Again, simple. Similar to union, but return the intersection of two RDDs. Unlike union, the output of intersection contains no duplicate elements.
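A sketch with the same toy data (sc as before):

```python
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4, 5])

# only elements present in both RDDs survive
print(a.intersection(b).collect())  # [3]
```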

DISTINCT([NUMTASKS])

Another simple one. Return a new RDD containing the distinct elements of the source RDD.
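A quick sketch (sc as before):

```python
nums = sc.parallelize([1, 1, 2, 2, 3])

# duplicates are removed; element order is not guaranteed
print(nums.distinct().collect())  # e.g. [1, 2, 3]
```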

Action Operations

Transformations create RDDs from other RDDs, but when we want to work with the actual dataset, an action is performed. Unlike a transformation, triggering an action does not form a new RDD; actions are RDD operations that return non-RDD values. The resulting values are returned to the driver program or written to an external storage system, and it is the action that sets the lazily built chain of transformations in motion.

[Figure: Overview of actions]

REDUCE(FUNC)

Aggregate the elements of a dataset through func. The function should be commutative and associative so that it can be computed correctly in parallel.
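A minimal sketch (sc as before):

```python
nums = sc.parallelize([1, 2, 3, 4])

# pairwise addition is commutative and associative, so it parallelizes safely
total = nums.reduce(lambda x, y: x + y)
print(total)  # 10
```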

COLLECT()

Collect returns the elements of the RDD back to the driver program.

Collect is often used in previously provided examples, such as Spark Transformation Examples in Python, to show the returned values. PySpark, for example, will print the values of the list back to the console. This can be helpful when debugging programs, but be careful: collect materializes the entire RDD in the driver's memory.
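A tiny sketch (sc as before):

```python
rdd = sc.parallelize([1, 2, 3])

# the whole RDD is pulled back to the driver, so use with care on big data
print(rdd.collect())  # [1, 2, 3]
```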

COUNT()

Return the number of elements in the RDD.
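A one-line sketch (sc as before):

```python
nums = sc.parallelize([10, 20, 30])
print(nums.count())  # 3
```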

FIRST()

Return the first element of the RDD.
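Likewise (sc as before):

```python
nums = sc.parallelize([10, 20, 30])
print(nums.first())  # 10
```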

TAKE(N)

Take the first n elements of the RDD.

Take works by first scanning one partition and then using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.

It can be much more convenient and economical to use take instead of collect when inspecting a very large RDD.
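A sketch on a larger toy RDD (sc as before):

```python
big = sc.parallelize(range(1000000))

# only enough partitions are scanned to satisfy the limit of 5
print(big.take(5))  # [0, 1, 2, 3, 4]
```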

References

  • https://www.supergloo.com/fieldnotes/apache-spark-action-examples-python/
  • https://spark.apache.org/docs/latest/programming-guide.html
  • https://dzone.com/articles/what-is-rdd-in-spark-and-why-do-we-need-it
  • https://intellipaat.com/tutorial/tutorialspark-tutorial/programming-with-rdds/