I started using Spark a few months ago with little knowledge in Distributed Systems and like many beginners, I succumbed to common pitfalls of Distributed Programming. This is the first post of a series of articles I am planning to write, mostly for consolidating my own knowledge.
May you find interest =)
RDD Partitioning Behavior
Spark internally stores the RDD partitioning information (that is the strategy for assigning individual records to independent parts aka partitions) on the RDD itself. Inside the RDD it is represented as a reference to an instance of a Partitioner class which can be compared to establish whether or not multiple RDDs have the same partitioning.
Understanding and having control over Data Partitioning is required for writing performant Spark programs as all Spark methods operates at partition-level. Things can speed up greatly when data is partitioned the right way but can dramatically slow down when done wrong, especially due the infamous Shuffle operation.
What I wanted to share today is the behavior of RDDs regarding partitioning when being manipulated.
For instance, each time you perform a map operation on an RDD, it “loses” its partitioning information.
Thinking twice, this feels pretty obvious as a map operation applies a “transformation” to some data, so the partitioning is directly affected. But when dealing with a PairRDD (RDD of key/value pairs) on which only values is manipulated, one could assume the partitioning would last; Wrong. Spark cannot guess that by itself.
If like me you didn’t pay much attention to the plethora of available RDD methods at first, you will be happy to know about the mapValues method that specifically preserves partitioning when dealing with values of a PairRDD.
I was also happy to find out that the filter method preserves partitioning for you (feels obvious now…)
Spark source-code is crystal clear on these behaviors.
map/filter operations instantiates a MapPartitionsRDD which has partition persistence set to false by default:
map instantiates a MapPartitionsRDD with default partition persistence:
mapValues instantiates a MapPartitionsRDD with partition persistence set to true:
filter instantiates a MapPartitionsRDD with partition persistence set to true:
I found that reading Spark source-code was the best way to understand how it works. It is spectacularly well written and very easy to read, so I can only encourage others to do the same.
I work at Zenly, a mobile app for geolocating your friends in real-time (and more). Interested in discussing Spark and Distributed Systems ? Ping me on twitter @corentinanjuna, I’ll be so happy to exchange feedbacks =)