Managing Spark Partitions with Coalesce and Repartition

Matthew Powers
Nov 29, 2016 · 5 min read

Intro to partitions

val x = (1 to 10).toList
val numbersDf = x.toDF(“number”)
numbersDf.rdd.partitions.size // => 4
numbersDf.write.csv(“/Users/powers/Desktop/spark_output/numbers”)
Partition A: 1, 2
Partition B: 3, 4, 5
Partition C: 6, 7
Partition D: 8, 9, 10

coalesce

val numbersDf2 = numbersDf.coalesce(2)
numbersDf2.rdd.partitions.size // => 2
numbersDf2.write.csv(“/Users/powers/Desktop/spark_output/numbers2”)
Partition A: 1, 2, 3, 4, 5
Partition C: 6, 7, 8, 9, 10

Increasing partitions

val numbersDf3 = numbersDf.coalesce(6)
numbersDf3.rdd.partitions.size // => 4

repartition

val homerDf = numbersDf.repartition(2)
homerDf.rdd.partitions.size // => 2
Partition ABC: 1, 3, 5, 6, 8, 10
Partition XYZ: 2, 4, 7, 9

Increasing partitions

val bartDf = numbersDf.repartition(6)
bartDf.rdd.partitions.size // => 6
Partition 00000: 5, 7
Partition 00001: 1
Partition 00002: 2
Partition 00003: 8
Partition 00004: 3, 9
Partition 00005: 4, 6, 10

Differences between coalesce and repartition

repartition by column

+-----+-------+
| age | color |
+-----+-------+
| 10 | blue |
| 13 | red |
| 15 | blue |
| 99 | red |
| 67 | blue |
+-----+-------+
val people = List(
(10, "blue"),
(13, "red"),
(15, "blue"),
(99, "red"),
(67, "blue")
)
val peopleDf = people.toDF("age", "color")
colorDf = peopleDf.repartition($"color")
Partition 00091
13,red
99,red
Partition 00168
10,blue
15,blue
67,blue

Real World Example

val dataPuddle = dataLake.sample(true, 0.000001)
dataPuddle.write.parquet("s3a://my_bucket/puddle/")
val dataPuddle = dataLake.sample(true, 0.000001)
val goodPuddle = dataPuddle.repartition(4)
goodPuddle.write.parquet("s3a://my_bucket/puddle/")
number_of_partitions = number_of_cpus * 4

You probably need to think about partitions


Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade