Apache Spark RDD Operations
We have already discussed Spark RDD in my blog Power of Spark in Data Science. In this blog we'll learn about Spark RDD operations in detail. As we know, a Spark RDD is a distributed collection of data, and it supports two kinds of operations: transformations and actions.
Transformations are operations that transform your RDD data from one form to another. When you apply a transformation to an RDD, you get a new RDD with the transformed data (RDDs in Spark are immutable, remember?). Operations like map, filter and flatMap are transformations.
Now there is a point to be noted here: when you apply a transformation to an RDD, it does not perform the operation immediately. Instead, Spark creates a DAG (Directed Acyclic Graph) from the applied operation, the source RDD and the transformation function, and it keeps building this graph of references until you apply an action to the last RDD in the chain. That is why transformations in Spark are lazy.
Below are examples of transformation operations in Python:
The map transformation returns a new RDD by applying a function to each element of the source RDD.
Python Spark map function example
So, in this transformation example, we're creating a new RDD called "rows" by splitting every row in the baby_names RDD. We accomplish this by mapping over every element in baby_names and passing in a lambda function that splits each line by commas.
FlatMap is similar to map in that it applies a function to all elements of an RDD, but flatMap flattens the results.
Filter creates a new RDD by returning only the elements that satisfy the search filter. For the SQL-minded, think WHERE clause.
UNION(A DIFFERENT RDD)
Simple. Return the union of two RDDs.
INTERSECTION(A DIFFERENT RDD)
Again, simple. Similar to union, but returns the intersection of two RDDs.
Another simple one. Return a new RDD containing the distinct elements of the source RDD.
Transformations create RDDs from each other, but when we want to work with the actual dataset, an action is performed. When an action is triggered, it does not produce a new RDD the way a transformation does. Actions are thus RDD operations that return non-RDD values; those values are returned to the driver or written to an external storage system. Actions set the laziness of RDDs in motion.
Reduce aggregates the elements of a dataset through a function func, which should be commutative and associative so it can be computed correctly in parallel.
Collect returns the elements of the RDD back to the driver program.
Collect is often used in previously provided examples, such as the Spark transformation examples in Python above, in order to show the returned values. PySpark, for example, will print the values of the array back to the console, which can be helpful when debugging programs.
Number of elements in the RDD
Return the first element in the RDD
Take returns the first n elements of the RDD. It works by first scanning one partition and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit. It can be much more convenient and economical to use take instead of collect when inspecting a very large RDD.