Apache Spark Optimisation — Part 1

Dinesh Kumar A S
2 min read · Nov 15, 2022


Image source: https://www.pngwing.com/

Writing an efficient Spark program is essential for achieving better performance and avoiding unnecessary spend on infrastructure.

There are many optimisation techniques, such as:

  • Choosing the right number of partitions at the input, shuffle, and output stages
  • Repartitioning
  • Persisting and Caching
  • Handling skews by Salting
  • Speculative execution

But before we get there (those will be covered in another blog), it is important to write efficient code irrespective of the nature of the data it handles.

The following is a basic checklist for writing any Spark program (batch or streaming).

No extra actions in the flow of the code

  • Every action triggers its own job, and every job takes time.
  • If a collect is not needed, or the same result can be achieved another way within the logic of the program, avoid it, as in the sketch below.
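
A minimal sketch of this idea (the paths and column names are made up for illustration): let the final write be the only action, instead of sprinkling count or collect calls through the flow.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("avoid-extra-actions").getOrCreate()

orders = spark.read.parquet("/data/orders")  # hypothetical input path
enriched = orders.withColumn("total", F.col("quantity") * F.col("price"))

# Avoid: each of these is an action and triggers a full job of its own.
# enriched.count()
# enriched.collect()

# Prefer: the write below is the single action that triggers the job.
enriched.write.mode("overwrite").parquet("/data/orders_enriched")
```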

Filter data as early as possible

  • Read only the data that the application actually needs.
  • Make use of the predicate pushdown feature.
  • If the data has to be read as a whole due to a limitation, apply the filter immediately after reading and keep only the required data for the subsequent operations, as in the sketch below.
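
A rough sketch with made-up paths and columns: filtering and selecting right after the read lets Spark push the predicates down to the Parquet scan and prune partitions and columns before anything else runs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter-early").getOrCreate()

events = (
    spark.read.parquet("/data/events")                    # hypothetical dataset
         .filter(F.col("event_date") == "2022-11-15")     # partition pruning
         .filter(F.col("status") == "ACTIVE")             # predicate pushdown
         .select("user_id", "event_type", "event_date")   # column pruning
)

# Every subsequent operation works only on the reduced data.
daily_counts = events.groupBy("event_type").count()
```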

Cache - a double-edged sword

  • Caching is very effective when you run repeated actions on an intermediate dataset, because it avoids recomputing the same steps.
  • But caching holds memory. If the dataset is really reused and small enough to fit in memory, cache it; otherwise it is better to avoid caching. A sketch follows below.
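
A small sketch (the paths are illustrative) of caching an intermediate DataFrame that several actions reuse, and releasing it afterwards.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Intermediate dataset reused by more than one action below.
txns = spark.read.parquet("/data/transactions").filter("amount > 0")
txns.cache()  # or persist(StorageLevel.MEMORY_AND_DISK) if it may not fit in memory

high_value = txns.filter("amount > 10000").count()        # first action fills the cache
txns.groupBy("country").sum("amount") \
    .write.mode("overwrite").parquet("/data/by_country")  # reuses the cached data

txns.unpersist()  # free the memory once the reuse is over
```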

No Python UDFs

  • The Catalyst query optimiser cannot look inside a Python UDF, so it is always better to avoid UDFs and leverage the built-in functions.
  • Scala UDFs are better than Python UDFs since they avoid the extra serialisation and deserialisation step.
  • If we still have to use a UDF, we can consider alternatives like Pandas UDFs or calling Scala UDFs from a PySpark program, as sketched below.
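
A hedged sketch of the options on a tiny example column: the plain Python UDF (commented out) is a black box to Catalyst, the built-in upper() stays fully optimisable, and a vectorised Pandas UDF is the fallback when custom logic is unavoidable.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("udf-alternatives").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Avoid: a plain Python UDF is opaque to the Catalyst optimiser.
# upper_py = F.udf(lambda s: s.upper(), "string")

# Prefer: the equivalent built-in function remains fully optimisable.
df.withColumn("name_upper", F.upper("name")).show()

# If custom logic is unavoidable, a vectorised Pandas UDF cuts the
# per-row serialisation overhead of a plain Python UDF.
@pandas_udf("string")
def upper_pd(s: pd.Series) -> pd.Series:
    return s.str.upper()

df.withColumn("name_upper", upper_pd("name")).show()
```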

Can I use broadcast join?

  • Even though Spark offers many join strategies, the recurring question is: when should a broadcast join be used, and when should it not?
  • Use a broadcast join only if it actually improves performance.
  • If one of the tables is small enough to be broadcast, broadcast it; otherwise it is better not to, to avoid OOM issues. See the sketch below.
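
A minimal sketch with hypothetical fact and dimension tables: the broadcast() hint ships the small table to every executor so the large table can be joined without a shuffle. The automatic threshold (spark.sql.autoBroadcastJoinThreshold, 10 MB by default) can also be tuned or disabled.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table (hypothetical)
countries = spark.read.parquet("/data/countries")  # small dimension table (hypothetical)

# Explicitly broadcast the small side; the large side then avoids a shuffle.
joined = orders.join(broadcast(countries), on="country_code", how="left")

# Optional: raise, lower, or disable (-1) the automatic broadcast threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
```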

Prefer foreachPartition over foreach

  • Evaluate the situation and use the right option.
  • If you have a scenario like opening a database connection for every record inside foreach, switch to foreachPartition: it creates one connection per partition instead of one per row, as in the sketch below.
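
A sketch assuming a hypothetical get_connection() helper that returns a connection object with insert() and close() methods; the point is that the connection is opened once per partition rather than once per row.

```python
def save_partition(rows):
    conn = get_connection()   # hypothetical helper: one connection per partition
    try:
        for row in rows:      # `rows` is an iterator over the partition's Rows
            conn.insert(row.asDict())
    finally:
        conn.close()

# foreach(f) would call f once per record (and open one connection per record);
# foreachPartition(f) calls f once per partition.
df.foreachPartition(save_partition)
```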

Stick to Spark Built-in functions

  • The primary advantage of using the Apache Spark APIs is that the built-in functions are optimised: the Catalyst query optimiser can only optimise our query/program when we stick to Spark's built-in functions.
  • If most of the logic is written as plain Python/Scala functions inside the PySpark/Spark application, the optimiser cannot see it and much of the benefit of using Spark is lost, as the sketch below illustrates.
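
A short sketch with made-up data and columns: the same logic expressed as Spark column expressions stays visible to Catalyst, whereas pulling the rows out into plain Python code would not be.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("builtins").getOrCreate()
users = spark.createDataFrame(
    [("2022-01-10", 1500.0, "a@example.com"), ("2022-06-01", 40.0, "b@test.org")],
    ["signup_date", "total_spend", "email"],
)

result = (
    users
    .withColumn("signup_age_days",
                F.datediff(F.current_date(), F.to_date("signup_date")))
    .withColumn("tier",
                F.when(F.col("total_spend") > 1000, "gold")
                 .when(F.col("total_spend") > 100, "silver")
                 .otherwise("bronze"))
    .withColumn("email_domain", F.regexp_extract("email", r"@(.+)$", 1))
)
result.show()
```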

See you in another blog!! Leave a clap if you find this interesting!!


Dinesh Kumar A S

Passionate about big data engineering, distributed systems, data warehousing, and data pipelines. https://www.linkedin.com/in/dineshkumaras/