Spark Performance Optimization Series: #3. Shuffle

Apache Spark optimization techniques for better performance

[Diagram: shuffle map and reduce phases — source: Planning above and beyond]

A shuffle operation is the natural side effect of a wide transformation. We see it with wide transformations such as join(), distinct(), groupBy(), orderBy(), and a handful of others.

It does not matter which stage we are in; data needs to be read in from somewhere, either from the source or from the previous stage. As the data is read in, it is processed in a way that is unique to the algorithm requesting the shuffle. If we look at a groupBy() operation, say a groupBy on colour, the mapping operation identifies which record belongs to which colour, as shown in the map phase of the diagram above.

The data is then moved to stage 2, which brings all of the like records together into the same partition. The records then go through some kind of reduce operation; for a groupBy() followed by count(), the reduce operation is the count of those records. As in every case, we conclude the stage by writing the result back out: either we finish the job with some kind of termination, or we write out yet another set of shuffle files for the stage that follows.
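
To make that flow concrete, here is a minimal PySpark sketch of the groupBy-then-count example; the colour column and the file paths are placeholders of my own, not part of the original example.

```python
from pyspark.sql import SparkSession

# Minimal sketch of the groupBy example above. The "colour" column and the
# paths are illustrative placeholders.
spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

df = spark.read.parquet("/data/events")   # stage 1: read the source data

# groupBy() is a wide transformation: the map side tags each record with its
# colour and writes shuffle files; the reduce side pulls matching records
# into the same partition and counts them (stage 2).
counts = df.groupBy("colour").count()

# Conclude the stage by writing the result back out.
counts.write.mode("overwrite").parquet("/data/colour_counts")
```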

The real performance hit is that on the map side we are doing data processing and writing the results to disk, while on the read side we are asking the other executors to read that data back from disk, send it across the wire between executors, and then do yet more processing on it.

An important point to note about shuffle is that not all shuffles are the same.

distinct — aggregates many records based on one or more keys and reduces all duplicates to one record.

groupBy / count — this combination aggregates many records based on a key and then returns one record per key, which is the count of the records with that key.

join — takes two datasets, aggregates each of those by a common key, and produces one record for each matching combination.

crossJoin — takes two datasets and produces one record for every possible combination of rows from the two (the Cartesian product); no join key is involved. This is a very heavy and expensive shuffle operation.
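
For reference, here is a hedged sketch of what each of these operations looks like in the DataFrame API; the DataFrames df, left, and right and their columns are hypothetical.

```python
# Hypothetical DataFrames: df has columns (id, colour); left and right share "key".
deduped = df.distinct()                    # drops duplicate rows -> shuffle on all columns

per_colour = df.groupBy("colour").count()  # one output record per colour -> shuffle by colour

joined = left.join(right, on="key")        # one record per matching key combination -> shuffle by key

everything = left.crossJoin(right)         # Cartesian product: every left row paired with every right row
```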

There are also similarities between the different shuffle operations:

  • They read data from some source.
  • They aggregate records across all partitions together by some key.
  • The aggregated records are written to disk (Shuffle files).
  • Each executor reads its aggregated records from the other executors.
  • This requires expensive disk and network IO.
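
You can observe these steps in the physical plan: the shuffle shows up as an Exchange node between the map-side and reduce-side aggregations. A small sketch, again with an assumed colour column:

```python
# Print the physical plan for a shuffling aggregation; "colour" is a
# placeholder column name.
df.groupBy("colour").count().explain()
# Look for an "Exchange hashpartitioning(colour, ...)" node sitting between
# the partial HashAggregate (map side) and the final HashAggregate (reduce side).
```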

What can we do to mitigate Performance issues caused by Shuffle? 🤔

  1. Our number one strategy is to reduce network IO by using fewer, larger workers. The larger the machines and the fewer of them we have, the less data we have to move between them. We still incur the disk IO, but with fewer machines we significantly reduce the network IO.
  2. The next strategy is to reduce the amount of data being shuffled as a whole. A few of the things we can do are: get rid of the columns you don't need, filter out unnecessary records, and optimize data ingestion (see the first sketch after this list).
  3. De-normalize the datasets, specifically if the shuffle is caused by a join.
  4. If you are joining tables, you can employ a BroadcastHashJoin, in which case the smaller of the two tables is redistributed to the executors to avoid the shuffle operation entirely. This is controlled by the spark.sql.autoBroadcastJoinThreshold property (the default is 10 MB). If the smaller of the two tables meets the 10 MB threshold, then we can broadcast it (see the broadcast-join sketch below).
  5. For joins, pre-shuffle the data with a bucketed dataset. The goal is to eliminate the exchange and sort by pre-shuffling the data: the data is aggregated into N buckets, optionally sorted, and the result is saved to a table that is available for subsequent reads (see the bucketing sketch below).
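
As referenced in strategies 2 and 4 above, the sketch below trims the data before it is shuffled and broadcasts the smaller side of a join. The table names, columns, and filter are illustrative assumptions, and spark is the SparkSession from the earlier sketch.

```python
from pyspark.sql import functions as F

# Strategy 2: shuffle less data -- filter out unnecessary records early and
# keep only the columns we actually need. Names here are made up.
orders = (spark.table("orders")
               .filter(F.col("order_date") >= "2021-01-01")   # drop unnecessary records
               .select("order_id", "customer_id", "amount"))  # drop unnecessary columns

# Strategy 4: broadcast the smaller table so the join needs no shuffle at all.
# Spark does this automatically when the smaller side fits under
# spark.sql.autoBroadcastJoinThreshold (10 MB by default) ...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# ... or we can ask for it explicitly with a broadcast hint.
customers = spark.table("customers").select("customer_id", "segment")
joined = orders.join(F.broadcast(customers), on="customer_id")
```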
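
And for strategy 5, a sketch of pre-shuffling with bucketed tables, continuing from the previous sketch; the bucket count and table names are assumptions.

```python
# Strategy 5: pre-shuffle into bucketed (and optionally sorted) tables once,
# so later joins on customer_id can skip the exchange and sort.
# The bucket count (16) and table names are illustrative.
(orders.write
       .bucketBy(16, "customer_id")
       .sortBy("customer_id")
       .mode("overwrite")
       .saveAsTable("orders_bucketed"))

(customers.write
          .bucketBy(16, "customer_id")
          .sortBy("customer_id")
          .mode("overwrite")
          .saveAsTable("customers_bucketed"))

# Joining the two bucketed tables on the bucket key avoids the shuffle exchange.
bucketed_join = (spark.table("orders_bucketed")
                      .join(spark.table("customers_bucketed"), on="customer_id"))
```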

To be continued…

Read more about these strategies at:

https://databricks.com/session_na20/on-improving-broadcast-joins-in-apache-spark-sql
