Performance Optimization in Apache Spark

Harun Raseed Basheer
3 min read · Nov 30, 2021


Even though Spark comes with in-memory processing and we may spend more on hardware, we still need to optimize our Spark code to achieve better results.

Spark offers many optimization techniques; we will look at some of them over the next few days.

1. Shuffle partition size:

Depending on our dataset size, number of cores, and available memory, Spark shuffling can either benefit or harm our jobs.

When we are dealing with a small amount of data, we should reduce the number of shuffle partitions; otherwise we end up with many partitions holding only a few records each, which means running many tasks that each have very little data to process.

On the other hand, when we have a large amount of data, having too few partitions results in a handful of long-running tasks, and sometimes we may even hit out-of-memory errors.
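As a quick sketch (assuming a SparkSession named spark; the value 64 is just illustrative, the default is 200), the shuffle partition count can be tuned per session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-tuning").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 64)  # default is 200; match it to data volume and core count

A common rule of thumb is to aim for shuffle partitions of roughly 100–200 MB each, but the right number ultimately depends on the workload.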

2. Bucketing:

Bucketing is commonly used to optimize the performance of a join query by avoiding shuffles of the tables participating in the join.

df.write.bucketBy(n, "column").saveAsTable("bucketed_table")

The above code creates a table (bucketed_table) whose rows are hash-distributed into "n" buckets based on the given column (adding sortBy would additionally sort the rows within each bucket).

Bucketing pays off when the pre-shuffled, bucketed tables are reused in more than one join or query, since the upfront cost of writing them is amortized.
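As an illustration, here is a hedged sketch of a bucketed join (the orders and customers DataFrames and the customer_id column are assumed for this example). Both sides are bucketed on the join key, so the later join can avoid a shuffle:

orders.write.bucketBy(16, "customer_id").sortBy("customer_id").saveAsTable("orders_bucketed")
customers.write.bucketBy(16, "customer_id").sortBy("customer_id").saveAsTable("customers_bucketed")

joined = spark.table("orders_bucketed").join(spark.table("customers_bucketed"), "customer_id")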

3. Serialization:

Serialization plays an important role in the performance of any distributed application. By default, Spark uses the Java serializer.
Spark can also use the Kryo serializer for better performance.

The Kryo serializer uses a compact binary format and can be up to 10x faster than the Java serializer.

To set the serializer properties:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
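A minimal sketch of wiring this into a session (the app name and the optional buffer setting are illustrative):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryoserializer.buffer.max", "128m")  # optional: larger buffer for big objects

spark = SparkSession.builder.config(conf=conf).appName("kryo-demo").getOrCreate()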

4. API selection:

Spark provides three APIs to work with: RDD, DataFrame, and Dataset.

* RDD is used for low-level operations and offers little built-in optimization.
* DataFrame is the best choice in most cases thanks to the Catalyst optimizer and its low garbage collection (GC) overhead.
* Dataset (available in Scala and Java) is highly type-safe and uses encoders; it relies on Tungsten for serialization into a binary format.
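As a small sketch (the file names, the event_type column, and the spark session are assumed), here is the same aggregation in both APIs; only the DataFrame version goes through the Catalyst optimizer:

rdd_counts = (spark.sparkContext.textFile("events.txt")
              .map(lambda line: (line.split(",")[0], 1))
              .reduceByKey(lambda a, b: a + b))  # low-level RDD API, no Catalyst

df_counts = (spark.read.option("header", "true").csv("events.csv")
             .groupBy("event_type")
             .count())  # DataFrame API, planned by Catalyst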

5. Avoid UDFs:

UDFs are used to extend the framework's built-in functionality; once created, a UDF can be reused across several DataFrames and SQL expressions.

However, UDFs are a black box to Spark, so it cannot optimize them and we lose all the optimizations Spark applies to DataFrames/Datasets. Whenever possible, we should use the Spark SQL built-in functions, which are designed to take advantage of those optimizations.
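A before/after sketch (the df DataFrame and its name column are assumed): the built-in upper() stays visible to the optimizer, while the equivalent Python UDF does not:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

to_upper = F.udf(lambda s: s.upper() if s else None, StringType())  # opaque to Catalyst
with_udf = df.withColumn("name_upper", to_upper("name"))

with_builtin = df.withColumn("name_upper", F.upper("name"))  # built-in, fully optimizable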

6. Use serialized data formats:

Most Spark jobs run as part of a pipeline, where one Spark job writes data to a file and another Spark job reads that data, processes it, and writes the result to another file for the next job to pick up.

For such use cases, prefer writing the intermediate files in serialized, optimized formats such as Avro or Parquet. Transformations on these formats perform better than on text, CSV, or JSON.
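A short sketch of such a hand-off using Parquet (the path and the df DataFrame are illustrative); the columnar, compressed files read far faster downstream than CSV or JSON would:

df.write.mode("overwrite").parquet("/tmp/pipeline/stage1_output")  # stage 1 writes Parquet

stage2_input = spark.read.parquet("/tmp/pipeline/stage1_output")   # stage 2 reads it back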
