Apache Spark Optimisation — Part 1

Dinesh Kumar A S
2 min read · Nov 15, 2022


Image source: https://www.pngwing.com/

Writing an efficient Spark program is essential for achieving better performance and avoiding unnecessary spend on infrastructure.

There are many optimisation techniques, such as:

  • Choosing the right number of partitions at the input, shuffle, and output stages
  • Repartitioning
  • Persisting and Caching
  • Handling skews by Salting
  • Speculative execution

But before we get there (those will be covered in another blog), it is important to write efficient code irrespective of the nature of the data it handles.

The following is a basic checklist for writing any Spark program (batch or streaming).

No extra actions in the flow of the code

  • Every action triggers its own job, and every job takes time.
  • If a collect is not needed, or the same result can be achieved another way within the logic of the program, avoid it, as in the sketch below.
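
A minimal sketch of this idea (the paths and column names are made up for illustration): let the final write be the only action, instead of sprinkling count or collect calls through the flow.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("avoid-extra-actions").getOrCreate()

orders = spark.read.parquet("/data/orders")  # hypothetical input path
enriched = orders.withColumn("total", F.col("quantity") * F.col("price"))

# Avoid: each of these is an action and triggers a full job of its own.
# enriched.count()
# enriched.collect()

# Prefer: the write below is the single action that triggers the job.
enriched.write.mode("overwrite").parquet("/data/orders_enriched")
```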

Filter data as early as possible

  • Read only the data that the application actually needs.
  • Make use of the predicate pushdown feature.
  • If the data has to be read as a whole due to a limitation, apply the filter immediately after reading and keep only the required data for the subsequent operations, as in the sketch below.
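
A rough sketch with made-up paths and columns: filtering and selecting right after the read lets Spark push the predicates down to the Parquet scan and prune partitions and columns before anything else runs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter-early").getOrCreate()

events = (
    spark.read.parquet("/data/events")                    # hypothetical dataset
         .filter(F.col("event_date") == "2022-11-15")     # partition pruning
         .filter(F.col("status") == "ACTIVE")             # predicate pushdown
         .select("user_id", "event_type", "event_date")   # column pruning
)

# Every subsequent operation works only on the reduced data.
daily_counts = events.groupBy("event_type").count()
```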

Cache - a double-edged sword

  • Caching is very effective when you run repeated actions on an intermediate dataset, because it avoids recomputing the same steps.
  • But caching holds memory. If the dataset is really reused and small enough to fit in memory, cache it; otherwise it is better to avoid caching. A sketch follows below.
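
A small sketch (the paths are illustrative) of caching an intermediate DataFrame that several actions reuse, and releasing it afterwards.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Intermediate dataset reused by more than one action below.
txns = spark.read.parquet("/data/transactions").filter("amount > 0")
txns.cache()  # or persist(StorageLevel.MEMORY_AND_DISK) if it may not fit in memory

high_value = txns.filter("amount > 10000").count()        # first action fills the cache
txns.groupBy("country").sum("amount") \
    .write.mode("overwrite").parquet("/data/by_country")  # reuses the cached data

txns.unpersist()  # free the memory once the reuse is over
```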

No Python UDFs

  • The Catalyst query optimiser cannot look inside a Python UDF, so it is always better to avoid UDFs and leverage the built-in functions.
  • Scala UDFs are better than Python UDFs since they avoid the extra serialisation and deserialisation step.
  • If we still have to use a UDF, we can consider alternatives like Pandas UDFs or calling Scala UDFs from a PySpark program, as sketched below.
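
A hedged sketch of the options on a tiny example column: the plain Python UDF (commented out) is a black box to Catalyst, the built-in upper() stays fully optimisable, and a vectorised Pandas UDF is the fallback when custom logic is unavoidable.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("udf-alternatives").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Avoid: a plain Python UDF is opaque to the Catalyst optimiser.
# upper_py = F.udf(lambda s: s.upper(), "string")

# Prefer: the equivalent built-in function remains fully optimisable.
df.withColumn("name_upper", F.upper("name")).show()

# If custom logic is unavoidable, a vectorised Pandas UDF cuts the
# per-row serialisation overhead of a plain Python UDF.
@pandas_udf("string")
def upper_pd(s: pd.Series) -> pd.Series:
    return s.str.upper()

df.withColumn("name_upper", upper_pd("name")).show()
```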

Can I use broadcast join?

  • Even though Spark offers many join strategies, the recurring question is: when should a broadcast join be used, and when should it not?
  • Use a broadcast join only if it actually improves performance.
  • If one of the tables is small enough to be broadcast, broadcast it; otherwise it is better not to, to avoid OOM issues. See the sketch below.
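
A minimal sketch with hypothetical fact and dimension tables: the broadcast() hint ships the small table to every executor so the large table can be joined without a shuffle. The automatic threshold (spark.sql.autoBroadcastJoinThreshold, 10 MB by default) can also be tuned or disabled.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table (hypothetical)
countries = spark.read.parquet("/data/countries")  # small dimension table (hypothetical)

# Explicitly broadcast the small side; the large side then avoids a shuffle.
joined = orders.join(broadcast(countries), on="country_code", how="left")

# Optional: raise, lower, or disable (-1) the automatic broadcast threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
```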

Prefer foreachPartition over foreach

  • Evaluate the situation and use the right option.
  • If you have a scenario like opening a database connection for every record inside foreach, switch to foreachPartition: it creates one connection per partition instead of one per row, as in the sketch below.
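
A sketch assuming a hypothetical get_connection() helper that returns a connection object with insert() and close() methods; the point is that the connection is opened once per partition rather than once per row.

```python
def save_partition(rows):
    conn = get_connection()   # hypothetical helper: one connection per partition
    try:
        for row in rows:      # `rows` is an iterator over the partition's Rows
            conn.insert(row.asDict())
    finally:
        conn.close()

# foreach(f) would call f once per record (and open one connection per record);
# foreachPartition(f) calls f once per partition.
df.foreachPartition(save_partition)
```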

Stick to Spark Built-in functions

  • The primary advantage of using the Apache Spark APIs is that the built-in functions are optimised: the Catalyst query optimiser can only optimise our query/program when we stick to Spark's built-in functions.
  • If most of the logic is written as plain Python/Scala functions inside the PySpark/Spark application, the optimiser cannot see it and much of the benefit of using Spark is lost, as the sketch below illustrates.
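
A short sketch with made-up data and columns: the same logic expressed as Spark column expressions stays visible to Catalyst, whereas pulling the rows out into plain Python code would not be.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("builtins").getOrCreate()
users = spark.createDataFrame(
    [("2022-01-10", 1500.0, "a@example.com"), ("2022-06-01", 40.0, "b@test.org")],
    ["signup_date", "total_spend", "email"],
)

result = (
    users
    .withColumn("signup_age_days",
                F.datediff(F.current_date(), F.to_date("signup_date")))
    .withColumn("tier",
                F.when(F.col("total_spend") > 1000, "gold")
                 .when(F.col("total_spend") > 100, "silver")
                 .otherwise("bronze"))
    .withColumn("email_domain", F.regexp_extract("email", r"@(.+)$", 1))
)
result.show()
```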

See you in another blog!! Leave a clap if you find this interesting!!


Dinesh Kumar A S

Passionate about big data engineering, distributed systems, data warehousing, and data pipelines. https://www.linkedin.com/in/dineshkumaras/