Spark Tuning, Optimization, and Performance Techniques

Navanee
5 min read · Jan 4, 2022


8 Performance Optimization Techniques Using Spark

  1. Serialization. Serialization plays an important role in the performance for any distributed application. …
  2. API selection. …
  3. Advance Variable. …
  4. Cache and Persist. …
  5. ByKey Operation. …
  6. File Format selection. …
  7. Garbage Collection Tuning. …
  8. Level of Parallelism.
  • tuning maxSimultaneousSubmitAndMonitorThreadsInDriver to throttle the number of threads that submit and monitor an application at a given time
  • increasing spark.scheduler.listenerbus.eventqueue.capacity
  • increasing or decreasing spark.default.parallelism
  • increasing or decreasing spark.sql.shuffle.partitions

Simply increasing the number of threads that can submit and monitor Spark applications simultaneously (even with a throttle system in place) tends to end in out-of-memory errors on the driver.

For spark.default.parallelism and spark.sql.shuffle.partitions there is no universally correct value. Without scheduling within an application (that is, with only one application per driver), setting them to the total number of cores in the cluster (192 in this example) is a good starting point.
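
As a rough sketch, these settings can be applied when the session is built. The numbers below assume the hypothetical 192-core cluster from the example above and are starting points to benchmark, not universal recommendations.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: parallelism-related settings for an assumed 192-core cluster.
val spark = SparkSession.builder()
  .appName("parallelism-tuning-sketch")
  // Default partition count for RDD shuffles (join, reduceByKey, ...)
  .config("spark.default.parallelism", "192")
  // Partition count for Spark SQL shuffles (joins, aggregations)
  .config("spark.sql.shuffle.partitions", "192")
  // Larger listener-bus queue so scheduler events are not dropped under heavy load
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
  .getOrCreate()
```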

To set the context, let me describe the three main Spark application entities — Driver, Cluster Manager, and Cache:

  1. The Driver sits on a node in the cluster and runs your main Spark function. It also maintains Spark application information; responds to user input; and analyzes, distributes, and schedules work across executors.
  2. The Cluster Manager acts as the liaison between the Spark Driver and executors. Executors are responsible for running tasks and reporting back on their progress. The Cluster Manager can be Spark's own standalone scheduler, YARN, Kubernetes, or Mesos.
  3. The Cache is shared between all tasks running within the executor.
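
To make this division of labour concrete, here is a minimal sketch (the YARN master, the /data/events path, and the status column are placeholders): the driver runs this code and plans the job, the cluster manager provides executors, and the cached DataFrame lives in executor memory, where it is shared by all tasks on that executor.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("architecture-sketch")
  .master("yarn")                                   // or k8s://..., mesos://..., spark://..., local[*]
  .getOrCreate()

val events = spark.read.parquet("/data/events")     // placeholder path
events.cache()
events.count()                                      // first action fills the executor-side cache
events.filter(col("status") === "ERROR").count()    // later tasks on the same executors reuse that cache
```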

Now let’s look at some of the ways Spark is commonly misused and how to address these issues to boost Spark performance and improve output.

Data skew

Data skew is probably the most common mistake among Spark users. Data is skewed when data sets aren't properly or evenly distributed. Skewed data can impact performance and parallelism. The main reason data becomes skewed is that transformations like join, groupBy, and orderBy change the data partitioning. This can cause discrepancies in the distribution across a cluster, which prevents Spark from processing data in parallel.

Data skew can cause performance problems because a single task that is taking too long to process gives the impression that your overall Spark SQL or Spark job is slow. Actually, it’s only a problem with one task, or more accurately, with skewed data underlying that task. Once the skewed data problem is fixed, processing performance usually improves, and the job will finish more quickly.
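
A quick way to confirm that skew, rather than the job as a whole, is the problem is to look at how rows are spread across keys and partitions. This is a minimal sketch, assuming a DataFrame named df and a hypothetical join key customer_id:

```scala
import org.apache.spark.sql.functions.{col, count, spark_partition_id}

// Rows per key: a handful of dominant keys indicates key skew.
df.groupBy("customer_id")
  .agg(count("*").as("rows"))
  .orderBy(col("rows").desc)
  .show(20)

// Rows per partition: a few oversized partitions indicate partition skew.
df.groupBy(spark_partition_id().as("partition"))
  .count()
  .orderBy(col("count").desc)
  .show(20)
```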

The rule of thumb is to aim for roughly 128 MB per partition so that tasks execute quickly. The associated cost of reading the underlying blocks won't be excessive if partitions are kept to this prescribed size, and it becomes possible to execute large numbers of tasks in parallel, which is ultimately more efficient than overloading one particular partition.
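
Two common ways to stay close to that size are letting Spark SQL cap scan partitions at 128 MB and repartitioning an oversized dataset explicitly. The sizing arithmetic and the df / customer_id names below are illustrative assumptions:

```scala
import org.apache.spark.sql.functions.col

// Cap file-scan partitions at 128 MB (this is also Spark SQL's default).
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728") // 128 MB in bytes

// Illustrative sizing: a ~50 GB dataset at ~128 MB per partition
// suggests roughly 50 * 1024 / 128 = 400 partitions.
val resized = df.repartition(400, col("customer_id"))
```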

Executor misconfiguration

Executors can run several Spark tasks in parallel. Although conventional logic states that the greater the number of executors, the faster the computation, this isn’t always the case. The reality is that more executors can sometimes create unnecessary processing overhead and lead to slow compute processes.

How does this happen? Although Spark users can create as many executors as there are tasks, this can create issues with cache access. If too many executors are created, individual executors have to query the underlying data sources directly and don't benefit from rapid cache access.

The second common mistake with executor configuration is to create a single executor that is too big or tries to do too much. This can create memory allocation issues when a single task can't hold all of its data and additional resources are needed for other processes, for example those supporting the OS. Dynamic allocation can help, but not in all cases.

The best way to think about the right number of executors is to determine the nature of the workload, data spread, and how clusters can best share resources. Dynamic allocation can help by enabling Spark applications to request executors when there is a backlog of pending tasks and free up executors when idle.
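
As a sketch of that last point, dynamic allocation might be enabled like this on a YARN cluster with the external shuffle service available on the NodeManagers; the executor counts and sizes are placeholders to adapt to the actual workload:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")          // placeholder
  .config("spark.dynamicAllocation.maxExecutors", "40")         // placeholder
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Shuffle files must outlive removed executors, hence the external shuffle service.
  .config("spark.shuffle.service.enabled", "true")
  // Moderately sized executors rather than one huge JVM per node.
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "8g")
  .getOrCreate()
```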

Join/Shuffle

Spark applications require significant memory overhead when they shuffle data as part of group or join operations. Remember that normal data shuffling is handled by the executor process, and if the executor is overloaded, it can't handle shuffle requests. This issue can be handled with an external shuffle service. Keep in mind that data skew is especially problematic for data sets with joins. Joins can quickly create massive imbalances that impact queries and performance.

Cartesian products frequently degrade Spark application performance because Spark does not handle such joins well. By using nested structures or types, you can deal with fewer rows at each stage rather than moving data around.

The key is to fix the data layout, and salting the key to distribute the data is the best option. Pay attention to the reduce phase as well: the aggregation runs in two stages, first on the salted keys and then again on the unsalted keys (a sketch of this two-stage approach follows). Another strategy is to isolate the keys that destroy performance and compute them separately.
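
Here is a minimal salting sketch for a skewed aggregation, assuming a DataFrame df with a hot key column key and a numeric column value; SALT_BUCKETS is a tuning knob of this example, not a Spark setting:

```scala
import org.apache.spark.sql.functions.{col, concat_ws, floor, rand, split, sum}

val SALT_BUCKETS = 16

// Stage 1: spread each hot key over SALT_BUCKETS salted keys and pre-aggregate.
val salted = df
  .withColumn("salted_key",
    concat_ws("#", col("key"), floor(rand() * SALT_BUCKETS).cast("string")))
  .groupBy("salted_key")
  .agg(sum("value").as("partial_sum"))

// Stage 2: strip the salt and reduce the partial sums down to the real, unsalted keys.
val result = salted
  .withColumn("key", split(col("salted_key"), "#").getItem(0))
  .groupBy("key")
  .agg(sum("partial_sum").as("total"))
```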

Also note that Spark's external shuffle typically runs as an auxiliary service inside the YARN NodeManager. The NodeManager's default memory is about 1 GB, and applications that shuffle a lot of data are liable to fail because the NodeManager exhausts its memory. This brings up issues of configuration and memory, which we'll look at next.

Memory issues

Spark users will invariably encounter an out-of-memory condition at some point in their development; this is not unusual, since Spark is based on a memory-centric architecture. These memory issues are typically observed in the driver node, in the executor nodes, and in the NodeManager.

Note that Spark’s in-memory processing is directly tied to its performance and scalability. In order to get the most out of your Spark applications and data pipelines, there are a few things you should try when you encounter memory issues.

First off, shuffling data back to the driver is to be avoided at all costs. Prefer reduceByKey over groupByKey: with groupByKey, every value is shipped across the network into the executor's shuffle memory, so avoid it wherever possible. treeReduce is preferable to a standard reduce because partial results are merged on the executors instead of all at once on the driver. Prefer complex and nested structures over Cartesian joins, and pay attention to the ordering of data, particularly for historical data.
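
On a toy pair RDD, the difference looks like this: reduceByKey combines values on the map side before the shuffle, groupByKey ships every value across the network, and treeReduce merges partial results in rounds on the executors instead of pulling everything to the driver at once. A minimal sketch, assuming an existing SparkSession named spark:

```scala
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// Preferred: per-partition combining keeps shuffle and executor memory small.
val sums = pairs.reduceByKey(_ + _)

// Avoid on large data: all values for a key are materialised together.
// val grouped = pairs.groupByKey().mapValues(_.sum)

// treeReduce aggregates in `depth` rounds on the executors before the final merge.
val total = sc.parallelize(1 to 1000000).treeReduce(_ + _, depth = 2)
```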
