BEST WAYS TO OPTIMIZE APACHE SPARK

Offisong Emmanuel
6 min read · Feb 24, 2023


Introduction

One of the greatest nightmares of a data engineer is slow and inefficient code. Data engineers are always mindful of the performance of their data pipelines. Whether they work at a startup or a Fortune 500 company, data engineers generally collect data from various sources, process it, and load it into an analytical database where end users can make the best use of it. The end users could be data scientists, data analysts, business intelligence developers, or machine learning engineers.

Time and speed are of great importance to data engineers, who focus on ingesting, processing, testing, and maintaining data pipelines. You don’t want to spend seven hours running a data ingestion pipeline from Salesforce to Redshift. Irrespective of the tools you use, performance optimization is a high priority. In this article, you will learn how to optimize Apache Spark for faster processing and improved performance.

What is Apache Spark

Apache Spark (frequently called Spark) is an open-source engine for large-scale data processing, licensed under the Apache License 2.0. It is a distributed processing framework used to achieve parallelism and build resilient, fault-tolerant data pipelines. Spark provides APIs in Python, Scala, Java, and R.

Spark is very easy to get started with, as it has comprehensive documentation and a large community on Stack Overflow and Reddit.
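
If you want to follow along with the examples in this article, here is a minimal sketch of starting a local PySpark session, assuming the pyspark package is installed (for example with pip install pyspark). The later snippets reuse this spark object.

# A minimal getting-started sketch; assumes `pip install pyspark` has been run.
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession. The examples below use this `spark` object.
spark = SparkSession.builder.appName('spark_optimization_examples').getOrCreate()
print(spark.version)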

Best practices in optimizing Apache Spark

Use Serialized file formats

In the field of data engineering, there are many different file formats: Avro, CSV, Parquet, JSON, ORC, and others, along with compression codecs like gzip. Most data pipelines involve reading data from a source, transforming it, and writing the transformed data to a destination. That might be the end of your pipeline in some cases; in more complex pipelines, you might have to read the transformed data, re-process it, and write it to a final destination. It all depends on the business problem you are trying to solve.

Parquet is a columnar file format, very similar to ORC. Unlike CSV, which stores and reads data row by row, Parquet stores data column by column. CSV’s row-based layout makes it slower to query and harder to store and process efficiently. This is what makes Parquet stand out: the columnar layout lets queries read only the columns they need, which speeds them up considerably. Parquet also supports efficient compression and encoding schemes, which makes it both smaller on disk and faster to process than CSV.

Let’s walk through an example:


import random

# Assumes the SparkSession `spark` created earlier.
schema = ['Company Name', 'Year founded', 'Revenue(usd)']

# Twelve rows of dummy company data with random founding years and revenues.
rows = [
    [f'Company {i}', random.randint(2001, 2022), random.randint(10000, 100000000)]
    for i in range(1, 13)
]

dataframe = spark.createDataFrame(rows, schema=schema)
dataframe.show()

The code sample above creates a basic DataFrame with twelve rows of dummy company data.

To save the dataframe as Parquet, you can run:

dataframe.write.parquet('companies_revenue.parquet')

The example above uses a small dummy dataset. In the real world, you will be working with terabytes or even petabytes of data, and choosing the right serialization format can deliver a significant performance boost for your workflows.
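
To make the difference concrete, here is a small sketch (the output paths are just placeholders based on the example above) that writes the same DataFrame as both CSV and Parquet and then reads a single column back; with Parquet, Spark only needs to read that one column from disk.

# Illustrative sketch only; the paths below are placeholders.
dataframe.write.mode('overwrite').csv('companies_revenue_csv', header=True)
dataframe.write.mode('overwrite').parquet('companies_revenue.parquet')

# With Parquet, selecting one column reads only that column from disk.
parquet_df = spark.read.parquet('companies_revenue.parquet')
parquet_df.select('Revenue(usd)').show()

# With CSV, Spark still has to scan whole rows to produce the same result.
csv_df = spark.read.csv('companies_revenue_csv', header=True, inferSchema=True)
csv_df.select('Revenue(usd)').show()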

Make use of Cache and Persist

Cache and persist are two very important optimization techniques in Apache Spark. Both store your RDDs, DataFrames, and Datasets so they can be reused later instead of being recomputed. cache() always uses a default storage level: memory only for RDDs, and memory plus disk for DataFrames and Datasets. persist() lets you choose where the intermediate results are stored: in memory, on disk, or a combination of the two. Some storage levels (those ending in _2) also replicate each partition to a second node, which helps with fault tolerance: if a node goes down, another node holding a copy of the data can carry on with the transformations.

The cache and persist methods save execution time because a computation is only run once: the first action materializes the result, and subsequent actions reuse it rather than rerunning your Spark code from scratch.

Making use of cache and persist is very easy and intuitive in Apache Spark. You can cache your dataframe by executing the code shown below.

df_cached = dataframe.cache()

The results of this cached DataFrame are stored in memory, spilling to disk if memory runs out. Similarly, you can persist a DataFrame with

df_persist = dataframe.persist()

persist() takes the storage level as an argument. If no storage level is specified, it uses the default, which is memory only for RDDs and memory plus disk for DataFrames.
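
For illustration, here is a short sketch (reusing the dataframe built earlier) that persists with an explicit storage level and releases it afterwards:

from pyspark import StorageLevel

# Persist with an explicit storage level: keep in memory, spill to disk if needed.
df_persist = dataframe.persist(StorageLevel.MEMORY_AND_DISK)

# The first action materializes and stores the data...
df_persist.count()

# ...later actions reuse the stored result instead of recomputing it from scratch.
df_persist.filter(df_persist['Year founded'] > 2010).show()

# Release the storage once the cached data is no longer needed.
df_persist.unpersist()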

Make use of Coalesce and Repartition

The key to Apache Spark’s speed is its ability to run several tasks at the same time. To make that possible, your data is divided into chunks that can be processed concurrently; these chunks are called partitions. The big question is, “How do I know the optimal number of partitions to choose?” Basing the number of partitions on the number of available cores is a good starting point. For example, if you have four available cores and you divide your data into four partitions, each core can process one partition independently, improving performance and speed. Ultimately, the number of partitions should be driven by the resources available to your job.

You can change the number of partitions using coalesce() and repartition(). coalesce() is an optimized way of reducing the number of partitions, while repartition() is the better choice when you want to increase it.

coalesced_dataframe = dataframe.coalesce(4)

If the original DataFrame had eight partitions, the call above reduces that to four; you can confirm it with coalesced_dataframe.rdd.getNumPartitions(). coalesce() is also valuable because it avoids the full shuffle that repartition() triggers. You will learn more about shuffle operations in the next section.

By tuning the number of partitions, you can optimize the performance of your Spark jobs.
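
Here is a quick sketch (again reusing the earlier dataframe; the partition counts in the comments depend on your cluster) of checking and changing the number of partitions:

# Check how many partitions the DataFrame currently has.
print(dataframe.rdd.getNumPartitions())            # e.g. 8, depending on your setup

# coalesce() shrinks the partition count without a full shuffle.
coalesced_dataframe = dataframe.coalesce(4)
print(coalesced_dataframe.rdd.getNumPartitions())  # 4

# repartition() performs a full shuffle and can increase the partition count.
repartitioned_dataframe = dataframe.repartition(8)
print(repartitioned_dataframe.rdd.getNumPartitions())  # 8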

Reduce expensive shuffle operations

This subheading sits directly underneath the previous point for a reason: the previous point covered selecting the right number of partitions with coalesce and repartition, and this one extends that idea to shuffles.

Certain transformations, such as groupByKey(), reduceByKey(), and join(), trigger a shuffle, in which data is redistributed across partitions. Apache Spark provides the configuration spark.sql.shuffle.partitions to control the number of shuffle partitions created. The default is 200, which can be far too high depending on the size of your data.

If your data is very small, it will still be split into two hundred shuffle partitions, which means many tiny partitions being exchanged between nodes and a lot of scheduling overhead.

A good starting point is to set the number of shuffle partitions based on the number of cores or other resources available to your job.
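
As a minimal sketch, here is how you could lower the setting for a small job; the value 8 is only an illustration, assuming roughly eight available cores:

# Reduce the number of shuffle partitions for a small job.
spark.conf.set('spark.sql.shuffle.partitions', 8)

# Wide transformations such as this groupBy now produce 8 shuffle partitions
# instead of the default 200.
dataframe.groupBy('Year founded').count().show()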

Conclusion

The end goal of any data pipeline is to generate insights or solve a business problem. Whether you code in Python, Java, or Scala, it is always good practice to write efficient and optimized code. This leads to faster pipelines and quicker queries.

In this article, you learned some best practices for optimizing Apache Spark. Some methods covered in this article include:

  • Making use of serialized file formats
  • Making use of cache() and persist()
  • Making use of coalesce() and repartition()
  • Reducing expensive shuffle operations


Offisong Emmanuel

I am a big data engineer and machine learning engineer. I build resilient, fault-tolerant, distributed, and scalable big-data-driven applications.