Spark optimizations. Part I. Partitioning

Roman Krivtsov
Thinkport Technology Blog
4 min read · Sep 2, 2021

This is a series of posts about Apache Spark for data engineers who are already familiar with its basics and want to learn more about its pitfalls, performance tricks, and best practices.

In this first part we'll look at partitioning, the key to good performance on large data sets. Partitioning allows your cluster to parallelize a query across nodes. But simply having more partitions doesn't automatically improve performance; how you work with them matters just as much. So let's go through some common points and best practices for Spark partitioning.

Pick the right number and size of partitions

The number of partitions should not be less than the total number of cores in the cluster. Otherwise you're not going to use the cluster's full capacity and are basically losing money.

On the other hand, a very high level of parallelism can turn into a problem known as distribution overhead. To manage so many tasks at the same time, the cluster must:

  • Fetch data, seek disks
  • Shuffle data
  • Distribute tasks
  • Track tasks

At some point the cost of these operations can outweigh the gains from parallelization.

As a common recommendation you should have 2–3 tasks per CPU core, so the maximum number of partitions can be taken as number of cores * 3.

At the same time a single partition shouldn't contain more than 128 MB, and a single shuffle block cannot be larger than 2 GB.
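As a rough sketch of how these rules of thumb can be combined (the multiplier of 3, the 128 MB target and the dataset size are just the heuristics and example values from above, not an official API):

// Derive a target partition count from the rules of thumb above.
// defaultParallelism usually equals the total number of cores in the cluster.
val totalCores = spark.sparkContext.defaultParallelism
val byCores = totalCores * 3                                  // 2-3 tasks per core

val datasetSizeBytes = 64L * 1024 * 1024 * 1024               // example: ~64 GB of input
val bySize = (datasetSizeBytes / (128L * 1024 * 1024)).toInt  // keep partitions under ~128 MB

// Take the larger value so both constraints are respected.
val numPartitions = math.max(byCores, bySize)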

Increasing the number of partitions with repartition() and repartitionByRange()

We can see that partitions play a significant role in data processing. Spark, being a powerful platform, gives us methods to manage partitions on the fly. There are two main partitioners in Apache Spark:

  • HashPartitioner is the default partitioner. It corresponds to the repartition() method and distributes data evenly across all the partitions. You can pass the number of partitions and the columns to partition by as arguments.
  • RangePartitioner works similarly but distributes data across partitions based on a range of values. It corresponds to the repartitionByRange() method and uses a column as the partition key. Spark analyzes that column (it doesn't scan all the values but only samples them, for obvious performance reasons, so be aware) and distributes the values based on it.

Let's take an example. Say we have a monotonically increasing column id. We add a new column with the Spark internal partition ID to see how the data is actually stored, plus some aggregations to visualize it:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(0 to 1000000: _*).toDF("id")
df
  .withColumn("partition", spark_partition_id())   // internal partition ID of each row
  .groupBy(col("partition"))
  .agg(
    count(col("id")).as("count"),
    min(col("id")).as("min_value"),
    max(col("id")).as("max_value"))
  .orderBy(col("partition"))
  .show()

Then let's apply the repartition() function with the number of partitions set to 4 and run the same aggregation on the result:

df.repartition(4, col("value"))

We’ll see the following:

+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |249911|12 |1000000 |
|1 |250076|6 |999994 |
|2 |250334|2 |999999 |
|3 |249680|0 |999998 |
+---------+------+---------+---------+

As you can see, we have 4 partitions and each contains a similar amount of data, distributed essentially at random. Now let's apply

df.repartitionByRange(4, col("value"))

So the result will be:

+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |244803|0 |244802 |
|1 |255376|244803 |500178 |
|2 |249777|500179 |749955 |
|3 |250045|749956 |1000000 |
+---------+------+---------+---------+

Now the values are stored in monotonically increasing chunks.

Decreasing the number of partitions with coalesce()

As we've seen before, partitions can not only increase performance but also slow the process down because of shuffling among nodes. In some cases it can be useful to move data to just one or a couple of nodes/partitions and process it there, because the data is physically closer together. This is where the coalesce() function comes in: it avoids a full shuffle and tries to decrease the number of partitions in a smart way. Let's say we have the following data set:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

After coalesce(2) we’re going to have

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

So the data that was already on Node 1 and Node 3 didn't have to be moved at all.
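A quick sketch of the same idea in code (the numbers are just illustrative):

// Start with 4 partitions, then merge down to 2 without a full shuffle.
val df4 = spark.range(0, 12).toDF("id").repartition(4)
println(df4.rdd.getNumPartitions)   // 4

val df2 = df4.coalesce(2)           // merges existing partitions instead of reshuffling
println(df2.rdd.getNumPartitions)   // 2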

Balancing the partitions

After heavy shuffling of data (operations like groupBy, sort, join) your partitions will no longer hold equal amounts of data: some may contain nothing, others far too much. You can either repartition your dataset using repartitionByRange or, if you use Spark 3, simply enable adaptive query execution, which will balance the partitions for you automatically.
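For reference, adaptive query execution and its automatic partition coalescing are controlled by configuration flags in Spark 3.x (whether they are on by default depends on your version):

// Enable adaptive query execution and let it coalesce small shuffle partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")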

Those were the common recommendations for managing your partitions. The trickiest part is, of course, choosing the right setup for your particular data, which has to be decided individually based on its structure. In the next parts of the series we'll look at transformations, low-level and platform-specific optimizations. So stay tuned!

If your company needs support with Big Data processing or building a Data Lake, we at Thinkport are always ready to help. We also offer workshops, including Apache Spark workshops with a deeper, hands-on approach.
