Should I repartition?

About Data Distribution in Spark SQL.

David Vrba
Towards Data Science
14 min read · Jun 16, 2020


In a distributed environment, proper data distribution is key to performance. In the DataFrame API of Spark SQL, the repartition() function allows you to control how data is distributed across the Spark cluster. Using the function efficiently is, however, not straightforward, because changing the distribution comes at the cost of physically moving data between cluster nodes (a so-called shuffle).
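As a quick look at the API before we dive in (a minimal sketch on a synthetic DataFrame; the column names are illustrative, not from the article), repartition() accepts a target number of partitions, one or more columns, or both:

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Synthetic DataFrame used throughout the sketches below.
df = spark.range(1_000_000).select(
    (f.col("id") % 1000).alias("user_id"),
    f.rand().alias("amount"),
)

df.repartition(200)            # 200 partitions, rows redistributed round-robin
df.repartition("user_id")      # hash-partitioned by user_id
df.repartition(50, "user_id")  # 50 partitions, hash-partitioned by user_id
```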

A general rule of thumb is that repartition is costly because it induces a shuffle. In this article, we will go further and see that there are situations in which adding one shuffle in the right place removes two other shuffles, making the overall execution more efficient. We will first cover a bit of theory to understand how information about data distribution is used internally in Spark SQL, and then go over some practical examples where repartition becomes useful.
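To make the "one shuffle instead of two" idea concrete before diving into the theory, here is a hedged sketch reusing the synthetic DataFrame from above: two aggregations over the same data, joined back on the key. Each groupBy would normally shuffle the full dataset on its own; repartitioning by the key up front can let both aggregations (and the join) reuse a single distribution instead.

```python
from pyspark.sql import functions as f

# Without the explicit repartition, each groupBy below would typically
# introduce its own shuffle of the full dataset.
dfr = df.repartition("user_id")

counts = dfr.groupBy("user_id").count()
totals = dfr.groupBy("user_id").agg(f.sum("amount").alias("total"))

# Both aggregations (and the join) can now reuse the single distribution
# produced by the repartition; check the number of Exchange nodes:
counts.join(totals, "user_id").explain()
```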

The theory described in this article is based on the Spark source code, specifically the current 3.1 snapshot (as of June 2020), and most of it also holds in earlier 2.x versions. The theory and the internal behavior are language-agnostic, so it doesn't matter whether you use the Scala, Java, or Python API.

Query Planning

The DataFrame API in Spark SQL allows users to write high-level transformations. These transformations are lazy, which means they are not executed immediately; Spark merely records them in a logical plan, and computation is triggered only when an action (such as count, collect, or write) is called.
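As a quick illustration of this laziness (a minimal sketch continuing the synthetic DataFrame from above; nothing here comes from the original article):

```python
from pyspark.sql import functions as f

# No job runs here: Spark only records the transformations in a logical plan.
transformed = (
    df.filter(f.col("amount") > 0)
      .groupBy("user_id")
      .count()
)

transformed.explain(True)  # prints the logical and physical plans; still no execution
transformed.count()        # an action finally triggers the computation
```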
