Partitioning in Apache Spark
The following post should serve as a guide for those trying to understand the inner workings of Apache Spark. I initially created it to organize my own knowledge and extended it later on. It assumes, however, that you possess some basic knowledge of Spark.
All examples are written in Python 2.7 running with PySpark 2.1, but the rules are very similar for other APIs.
First of all, a few words about the most basic concept, a partition:
Partition — a logical chunk of a large data set.
Very often, the data we are processing can be separated into logical partitions (e.g. payments from the same country, ads displayed for a given cookie, etc.). In Spark, these partitions are distributed among the nodes of the cluster when shuffling occurs.
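As a quick illustration (a minimal sketch with made-up data; the country/amount pairs and partition counts are arbitrary), you can peek at how PySpark splits an RDD into partitions with glom(), and see how a key-based shuffle such as partitionBy() brings records with the same key into the same partition:

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-demo")

    # Toy data set: (country, payment amount) pairs
    payments = [("PL", 100), ("PL", 250), ("US", 30), ("US", 70), ("DE", 15)]

    # Ask for 3 partitions explicitly
    rdd = sc.parallelize(payments, 3)
    print(rdd.getNumPartitions())   # 3
    print(rdd.glom().collect())     # contents of each partition as a separate list

    # partitionBy() triggers a shuffle; records sharing a key end up together
    by_country = rdd.partitionBy(2)
    print(by_country.glom().collect())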
Spark can run one concurrent task for every partition of an RDD (up to the number of cores in the cluster). If your cluster has 20 cores, you should have at least 20 partitions (in practice 2–3 times more). On the other hand, a single partition typically shouldn't contain more than 128 MB, and a single shuffle block cannot be larger than 2 GB (see SPARK-6235).
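A rough sketch of applying that rule of thumb, assuming sc is the SparkContext from the previous snippet (or the one the pyspark shell provides); the HDFS path is just a placeholder and the multiplier of 3 is simply the "2–3x" heuristic above:

    # Size the partition count as a small multiple of the available cores
    num_cores = sc.defaultParallelism      # e.g. 20 on a 20-core cluster
    target_partitions = num_cores * 3      # the "2-3x" heuristic

    # "hdfs:///some/large/dataset" is a placeholder path
    rdd = sc.textFile("hdfs:///some/large/dataset", minPartitions=target_partitions)
    print(rdd.getNumPartitions())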
In general, more partitions allow work to be distributed among more workers, while fewer partitions allow work to be done in larger chunks (and often more quickly).
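Both ends of that trade-off can be adjusted at runtime: repartition() performs a full shuffle to change the partition count in either direction, while coalesce() merges existing partitions without a shuffle, which is cheaper when you only want fewer of them. A minimal sketch (again assuming the sc from above):

    rdd = sc.parallelize(range(1000), 8)

    more = rdd.repartition(100)   # full shuffle: work spread over more tasks
    fewer = rdd.coalesce(4)       # no shuffle: fewer, larger chunks per task

    print(more.getNumPartitions())    # 100
    print(fewer.getNumPartitions())   # 4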
Spark’s partitioning feature is…