Spark: Repartition vs Coalesce, and when you should use which

Vikas kumar
3 min read · Feb 13, 2022


If you work in data engineering and use Spark, you have almost certainly come across repartition and coalesce. Both change the number of partitions in a DataFrame, so they can look interchangeable, but they differ in how they move data, and that difference determines when to use each.

Repartition:

Repartition is a Spark method that performs a full shuffle of the data and creates new partitions according to the arguments you pass.

Repartition can be called in two ways, which can also be combined:

  1. Number of partitions: pass an integer to set how many partitions you want.
  2. Column: pass one or more columns, and rows with the same values in those columns end up in the same partition.

df = df.repartition(3)            # number of partitions
df = df.repartition('column')     # column
df = df.repartition(3, 'column')  # both

Coalesce:

Coalesce is another method that changes the partitioning of a DataFrame. It is mainly used to reduce the number of partitions, and it does so by merging existing partitions instead of doing a full shuffle.

df = df.coalesce(2)

Difference:

  1. Repartition does a full shuffle of the data; coalesce does not, so coalesce is the cheaper operation of the two.
  2. Repartition can increase or decrease the number of partitions; coalesce can only decrease it. Asking coalesce for more partitions than the DataFrame already has leaves the count unchanged.
  3. Repartition creates new partitions and does a full shuffle to redistribute the data. Coalesce merges existing partitions: when going from 5 partitions to 2, it keeps the data of 2 partitions where it is and moves the data of the remaining 3 into them, avoiding a full shuffle.
  4. For this reason, repartition(n) produces partitions of roughly equal size, while the partitions produced by coalesce can vary in size and end up skewed.
  5. Because coalesce avoids most of the shuffle, it is usually the faster operation itself, though not always the faster choice for the job as a whole.
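The mechanics behind points 3 and 4 can be illustrated with a toy model in plain Python. This is not Spark's actual algorithm: `coalesce_toy` and `repartition_toy` are hypothetical helpers that only mimic the idea of merging whole partitions versus dealing every row out again.

```python
def coalesce_toy(partitions, n):
    """Merge whole partitions into n groups; no partition is ever split."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions land in the same group.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition_toy(partitions, n):
    """Deal every row out round-robin, the way a full shuffle redistributes data."""
    rows = [row for part in partitions for row in part]
    return [rows[i::n] for i in range(n)]


# Five uneven partitions holding 50, 10, 5, 20 and 15 rows.
sizes = [50, 10, 5, 20, 15]
partitions, next_id = [], 0
for size in sizes:
    partitions.append(list(range(next_id, next_id + size)))
    next_id += size

coalesced_sizes = [len(p) for p in coalesce_toy(partitions, 2)]
repartitioned_sizes = [len(p) for p in repartition_toy(partitions, 2)]

print(coalesced_sizes)      # [65, 35]
print(repartitioned_sizes)  # [50, 50]
```

Merging keeps whatever imbalance the input had, while redistributing every row evens it out at the cost of touching all the data.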

Use Cases

Coalesce may seem better than repartition because it avoids the shuffle, but in many cases you will see better performance with repartition. This is because coalesce can leave the partitions heavily skewed (unevenly distributed), and skewed partitions hurt performance: the largest partition determines how long the stage takes.

So, it is important to understand the requirement before using either of these methods. Below are some points which can help in this decision.

Use Repartition when:

  1. You want your data to be evenly distributed across partitions.
  2. You have to increase the number of partitions.
  3. You have filtered the DataFrame heavily and the remaining partitions are uneven or mostly empty.
  4. Your partitions are skewed.

Use Coalesce when:

  1. You want to decrease the number of partitions.
  2. You want to avoid a full shuffle of the data.

Under the hood, the two are essentially the same function in the RDD API: coalesce takes a shuffle flag that defaults to false, and repartition(n) is equivalent to coalesce(n, shuffle = true). The DataFrame coalesce method does not expose this flag.

Conclusion:

To conclude, use coalesce when you want to decrease the number of partitions and can tolerate uneven partition sizes, and use repartition when you want to increase the number of partitions or need an even distribution.
