Spark Performance Optimization Series: #2. Spill

Apache Spark optimization techniques for better performance

Himansu Sekhar
road to data engineering
3 min read · Dec 28, 2020



An easier way to think about spill: every task processes a corresponding partition, and if the memory allocated to that task cannot hold the partition, the data that represents the partition is written to disk and read back again later.

Spill is the term used to refer to the act of moving an RDD from RAM to disk, and later back into RAM again.

This occurs when a given partition is simply too large to fit into RAM. In the previous article we saw that skew is a common way of inducing this problem.

The consequence is that Spark is forced into expensive disk reads and writes to free up local RAM and avoid the Out of Memory (OOM) error that would otherwise crash the application.

Spill can be induced in a number of different ways:

→ By raising spark.sql.files.maxPartitionBytes, whose default is 128 MB per partition read into Spark, to something much higher, such as the 1 GB range. The initial ingestion may not induce a spill because we have not yet crossed a shuffle boundary, but the moment we hit a wide transformation those partitions will be too large and cause spill.
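As a rough sketch (assuming a SparkSession and a hypothetical Parquet dataset path), raising the read-side partition size looks like this; the oversized partitions tend to show up as spill only once a wide transformation forces a shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spill-demo").getOrCreate()

# Raise the maximum bytes packed into one partition on read
# (default is 128 MB; 1 GB here makes each input partition much larger).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(1024 * 1024 * 1024))

# Hypothetical dataset path, purely for illustration.
events = spark.read.parquet("/data/events")

# The scan itself may be fine; the oversized partitions typically
# start spilling once a wide transformation crosses a shuffle boundary.
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("/tmp/daily_counts")
```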

→ Another way is the explode() operation on a column that contains arrays. Similarly, join() and crossJoin(), especially crossJoin() of two tables, can create a very large partition.
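A minimal sketch of both patterns, using tiny hypothetical DataFrames and the spark session from above, shows how each one multiplies rows and can inflate partition sizes:

```python
from pyspark.sql import functions as F

# explode(): each array element becomes its own row, so a partition
# can grow by roughly the average array length.
orders = spark.createDataFrame(
    [(1, ["a", "b", "c"]), (2, ["d", "e"])],
    ["order_id", "items"],
)
order_items = orders.withColumn("item", F.explode("items"))

# crossJoin(): every row on the left is paired with every row on the
# right, so the output has |left| x |right| rows.
dates = spark.range(0, 1_000).withColumnRenamed("id", "date_id")
products = spark.range(0, 10_000).withColumnRenamed("id", "product_id")
combos = dates.crossJoin(products)  # 10 million rows from two small inputs
```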

→ Also, as we saw in the previous article, aggregating results by a skewed feature can create large partitions and cause spill.
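As a hedged illustration (a toy DataFrame, not the dataset from the previous article), grouping by a heavily skewed column funnels most rows into one shuffle partition, which is exactly the partition that ends up spilling:

```python
from pyspark.sql import functions as F

# Toy clickstream where one country dominates the rows.
clicks = spark.createDataFrame(
    [("US", i) for i in range(100_000)] + [("NZ", 0), ("IS", 1)],
    ["country", "click_id"],
)

# collect_list keeps every row of a group; the huge "US" group lands
# in a single shuffle partition, the one that spills.
per_country = clicks.groupBy("country").agg(
    F.collect_list("click_id").alias("click_ids")
)
```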

Spill is represented by two values, which are always presented together:

Spill (Memory): the size of the spilled data as it existed in memory before being spilled.

Spill (Disk): the size of the spilled data after it has been serialized, compressed, and written to disk.

When a spill does occur, it is reported in two places:

→ On the details page of a specific stage in the Spark UI:

  • In the Summary metrics
  • Aggregated metrics by executor
  • The Tasks table

→ You can also see the spill details in the corresponding query plan or query details page.

What can we do to mitigate Spill? 🤔

→ The easiest (and costliest) way is to allocate a cluster with more memory per worker.
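A sketch of what that looks like when building the session (the memory values are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

# Illustrative sizes only: more memory per executor gives each task
# more room to hold its partition before anything has to spill.
spark = (
    SparkSession.builder
    .appName("bigger-workers")
    .config("spark.executor.memory", "16g")          # heap per executor
    .config("spark.executor.memoryOverhead", "2g")   # off-heap overhead
    .getOrCreate()
)
```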

→ If the spill is caused by skew, address the skew and the spill will disappear with it.
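One hedged example: on Spark 3.x, Adaptive Query Execution can split skewed shuffle partitions in joins so that no single partition grows large enough to spill (skew handling itself is covered in the previous article):

```python
# Spark 3.x only: let AQE detect and split skewed shuffle partitions
# during joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```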

→ Decrease the size of each partition by increasing the number of partitions (see the sketch after this list):

  • By managing spark.sql.shuffle.partitions
  • By explicitly repartitioning
  • By managing spark.sql.files.maxPartitionBytes
  • Note that this strategy is not effective against skew; if the spill is caused by skew, you need to fix the skew first.
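A sketch of the three knobs above, assuming the spark session from earlier and a hypothetical DataFrame df with a customer_id column (the partition counts are illustrative):

```python
# 1. More shuffle partitions -> each post-shuffle partition is smaller.
spark.conf.set("spark.sql.shuffle.partitions", "800")  # default is 200

# 2. Explicit repartitioning before a heavy wide transformation.
df_smaller_parts = df.repartition(800, "customer_id")

# 3. Smaller read partitions by lowering the max bytes per input split.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))
```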

To be continued…

Read more about these strategies at:

https://databricks.com/session_na20/fine-tuning-and-enhancing-performance-of-apache-spark-jobs
