4 simple tips to improve your Apache Spark job performance!

Making your Apache Spark application run faster with minimal changes to your code!

Tomas Peluritis
The Startup



Introduction

While developing Spark applications, one of the most time-consuming parts for me has been optimization. In this blog post, I'll share some performance tips and configuration parameters that I didn't know about (at least) when I was starting out.

So I’m going to cover these topics:

What can we improve?

Working with multiple small files

openCostInBytes (from the documentation) is the estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. It is used when packing multiple files into a single partition. It is better to over-estimate it: partitions with small files will then be faster than partitions with bigger files (which are scheduled first). The default value is 4 MB.

spark.conf.set("spark.files.openCostInBytes", SOME_COST_IN_BYTES)
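To make this concrete, here is a minimal PySpark sketch of how you might apply this when reading a folder full of small files. The spark.sql.files.openCostInBytes key (the DataFrame/SQL counterpart of the setting above), the 16 MB value, and the input path are my own illustrative assumptions, not values from the article; tune the cost to your own workload.

from pyspark.sql import SparkSession

# A minimal sketch: over-estimate the per-file open cost so that fewer small
# files get packed into each partition task.
spark = SparkSession.builder.appName("small-files-example").getOrCreate()

# DataFrame/SQL file sources use spark.sql.files.openCostInBytes (default 4 MB).
# 16 MB here is an illustrative over-estimate, not a recommended value.
spark.conf.set("spark.sql.files.openCostInBytes", str(16 * 1024 * 1024))

# Hypothetical directory containing many small Parquet files.
df = spark.read.parquet("/data/many-small-files/")

# Check how the files were grouped into partitions after the change.
print(df.rdd.getNumPartitions())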
