4 simple tips to improve your Apache Spark job performance!

Making your Apache Spark application run faster with minimal changes to your code!

Tomas Peluritis
The Startup



Introduction

While developing Spark applications, one of the most time-consuming parts for me has been optimization. In this blog post, I'll share some performance tips and configuration parameters that I didn't know about (at least) when I was starting out.

So I’m going to cover these topics:

What can we improve?

Working with multiple small files

openCostInBytes (from the documentation) is the estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. It is used when packing multiple files into a single partition. It is better to over-estimate it: partitions with small files will then be faster than partitions with bigger files (which are scheduled first). The default value is 4 MB.

spark.conf.set("spark.files.openCostInBytes", SOME_COST_IN_BYTES)
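To make this concrete, here is a minimal PySpark sketch of how you might apply this when reading a folder full of small files. The spark.sql.files.openCostInBytes key (the DataFrame/SQL counterpart of the setting above), the 16 MB value, and the input path are my own illustrative assumptions, not values from the article; tune the cost to your own workload.

from pyspark.sql import SparkSession

# A minimal sketch: over-estimate the per-file open cost so that fewer small
# files get packed into each partition task.
spark = SparkSession.builder.appName("small-files-example").getOrCreate()

# DataFrame/SQL file sources use spark.sql.files.openCostInBytes (default 4 MB).
# 16 MB here is an illustrative over-estimate, not a recommended value.
spark.conf.set("spark.sql.files.openCostInBytes", str(16 * 1024 * 1024))

# Hypothetical directory containing many small Parquet files.
df = spark.read.parquet("/data/many-small-files/")

# Check how the files were grouped into partitions after the change.
print(df.rdd.getNumPartitions())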
