A Simple Method to Choose the Number of Partitions in Spark

Tharun Kumar Sekar
Published in Analytics Vidhya · Dec 27, 2019 · 2 min read

By the end of this article, you will be able to analyze your Spark job, identify whether you have the right configuration settings for your Spark environment, and see whether you are utilizing all of your resources.

Whenever you work on a Spark job, you should consider two things:

  • Avoid Spill
  • Maximize Parallelism by utilizing all the cores.

The two go hand in hand, and getting both right lets you use your resources to their fullest.

Spills:
A spill happens during a shuffle, when data has to be moved around and the executor cannot hold all of it in memory, so it has to write the overflow to disk temporarily.
When we don't size partitions correctly, we get spills. Always avoid spills: for a 50 GB read, the spill can grow as high as 500 GB.

Partitions:

Let's start with some basic default and desired spark configuration parameters.

  • Default Spark Shuffle Partitions = 200
  • Desired Partition Size (Target Size) = 100 or 200 MB
  • Number of Partitions = Input Stage Data Size / Target Size (see the sketch below)
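
As a rough Scala sketch of this formula (assuming spark is your active SparkSession; the input-stage size below is an assumed example value that you would normally read off the Spark UI's stage metrics):

    // Sizing rule: partitions = input stage size / target partition size.
    val inputStageSizeMB = 100L * 1024   // e.g. 100 GB expressed in MB (assumed example value)
    val targetSizeMB     = 100L          // desired partition size from the rule above

    val numPartitions = math.ceil(inputStageSizeMB.toDouble / targetSizeMB).toInt
    spark.conf.set("spark.sql.shuffle.partitions", numPartitions)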

Below are examples of how to choose the partition count.

Case 1 :

  • Input Stage Data = 100 GB
  • Target Size = 100 MB
  • Cores = 1000
  • Optimal Count of Partitions = 100,000 MB / 100 MB = 1,000 partitions
  • spark.conf.set("spark.sql.shuffle.partitions", 1000)
  • The partition count should not be less than the number of cores.

Case 2:

  • Input Stage Data = 100 GB
  • Target Size = 100 MB
  • Cores = 96
  • Optimal Count of Partitions = 100,000 MB / 100 MB = 1,000 partitions
  • spark.conf.set("spark.sql.shuffle.partitions", 960)
  • When the partition count is greater than the core count, round it to a multiple of the core count (960 here). Otherwise, some cores will sit idle during the last wave of tasks (see the sketch below).
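
A minimal Scala sketch of this rounding rule, assuming spark is the active session and using defaultParallelism as a stand-in for the total core count:

    // Round the size-based partition count down to a multiple of the core count
    // so the last wave of tasks keeps every core busy. Values mirror Case 2.
    val cores = spark.sparkContext.defaultParallelism   // e.g. 96
    val computedPartitions = 1000                        // from Input Size / Target Size

    // 1000 -> 960 when cores = 96
    val shufflePartitions = (computedPartitions / cores) * cores
    spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions)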

Input:

  • Read the input data with a number of partitions that matches your core count.
  • spark.conf.set("spark.sql.files.maxPartitionBytes", 1024 * 1024 * 128) sets the maximum input partition size to 128 MB.
  • Apply this configuration and then read the source file. Spark will split the file into partitions of at most 128 MB each.
  • Verify with df.rdd.partitions.size.
  • Doing this lets Spark read the file quickly (see the sketch below).
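
Putting the read-side steps together, a sketch assuming spark is the active SparkSession and the Parquet path is hypothetical:

    // Cap each input partition at 128 MB before reading the source.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 1024 * 1024 * 128)

    // Hypothetical source path, for illustration only.
    val df = spark.read.parquet("/data/input/events")

    // Verify how many partitions Spark created for the read.
    println(df.rdd.partitions.size)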

Output:

  • When writing the data out, try to utilize all the cores.
  • If the number of partitions matches the core count or is a multiple of the core count, we achieve full parallelism, which in turn reduces the write time (see the sketch below).
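
A hedged sketch of the write side in Scala, assuming df is the DataFrame from the read above and the output path is hypothetical:

    // Repartition to the core count (or a multiple of it) before writing so that
    // every core takes part in the final write stage.
    val cores = spark.sparkContext.defaultParallelism

    df.repartition(cores)
      .write
      .mode("overwrite")
      .parquet("/data/output/events")   // hypothetical output path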

Takeaways:

  • Too few partitions will lead to less concurrency.
  • Too many partitions will lead to many tiny tasks and extra scheduling and shuffle overhead.
  • The partition count commonly lies between 100 and 10,000.
  • Lower bound: at least ~2x the number of cores in the cluster.
  • Upper bound: ensure tasks take at least 100 ms.
