5 Most Common Spark Performance Problems

Aditya Sahu · Published in Nerd For Tech · Oct 19, 2021

In this post, I am going to discuss the 5 most common Spark performance problems. They are very common, and people mostly ignore them or get confused about where to look for them.

We generally get confused about where to start 🤔. Should I look into the Spark UI? Or my cluster configuration? And within the Spark UI, what exactly should I look for? The Spark UI gives a whole lot of information, but there are a couple of tabs/sections we should look at ‘at least’ so we don't end up becoming victims of these common problems.

This series will be divided into 6 posts, each discussing 1 Spark problem and its mitigation in detail (the idea is not to put readers to sleep with 1 long post 😅). This is the 1st post in the series, covering the basic problem discussion and a warmup (yes, you read that right, it's a warmup 😉).

Hang on, guys, let me make sure to point this out: these will not be that confusing or tough to get into. Also, I can bet you will definitely find this blog interesting. So bear with me and let's complete this journey of Spark optimization together, ‘slowly and steadily’ 😁.

A special mention for those who gave me this knowledge and made it possible for me to write it down for you guys.

Thanks to Databricks and the Optimizing Apache Spark on Databricks course.

Let's do a bit of a warmup by reading through the Spark code below and trying to answer the questions that follow.

Step 1:
spark.read.parquet(file_path).count()
Execution time: ~28 sec.

Step 2:
spark.read.parquet(file_path).count()
Execution time: ~13 sec.

Q1: Why did Step 2 take less time than Step 1?

Step 3:
spark.read.schema(data_schema).parquet(file_path).count()
Execution time: ~13 sec.

Q2: Why does Step 3 have 1 less job compared to Step 2?

Step 4:
spark.read.schema(data_schema).parquet(file_path).foreach(_ => ())
Execution time: ~13 min.

Q3: Why did foreach() take significantly more time than the count() operation?

Step 5:
spark.read.schema(data_schema).parquet(file_path).foreach(lambda x: None)
Execution time: ~2 hrs.

Q4: Why did Step 5 (Python code) take significantly more time than Step 4 (Scala code)?

Take your time, analyse the code snippets, and try to answer the questions above. Once you are done, scroll down and let's walk through the reasons behind these mysteries together.

Let’s Review:

A1: If we look at Step 1 and Step 2, both snippets are identical, meaning we are reading the same data again. If we look at the Jobs view in the Spark UI for Step 1, we can see that 2 jobs are triggered, Job 2 and Job 3. The purpose of Job 2 is to read the schema, so it creates a stage with 1 task, while Job 3 reads the actual data, creating another stage with 825 tasks.

In Step 2, since the schema has already been read, Spark skips reading it again, so we are left with only the stage of 825 tasks.

So it looks like we can get a performance boost when running the same code multiple times (quite fascinating, right?).

A2: We already know the answer to this one from A1: because the schema is supplied up front in Step 3, Spark does not need to read it again, and the schema-reading job is skipped.
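For reference, here is a minimal PySpark sketch of what supplying an explicit schema looks like, reusing the spark session and file_path from the snippets above (the column names and types are just placeholders, since the real data_schema is not shown here):

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical schema for illustration; match it to your actual Parquet columns.
data_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])

# With the schema supplied up front, Spark can skip the schema-reading job
# and only run the job that scans the data.
spark.read.schema(data_schema).parquet(file_path).count()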

A3: The count() action is fairly optimized as an operation: with a columnar storage format such as Parquet, it is possible to determine the number of records without reading all of the data. foreach(), on the contrary, iterates over every record to perform its action, and so incurs much more execution time.
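To make that concrete, here is a small PySpark sketch of the two actions side by side (file_path and data_schema are the same names assumed in the steps above):

df = spark.read.schema(data_schema).parquet(file_path)

# count() on a Parquet source can largely be answered from the file metadata,
# so Spark does not have to materialize every record.
df.count()

# foreach() deserializes every row and hands it to the function, so the whole
# dataset is read even though the function does nothing with it.
df.foreach(lambda row: None)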

A4: First of all, Python is not slow; in fact, in some cases Python can be much faster than Scala. In this specific scenario, if we look at the code more attentively, there is a lambda function, which in Python needs to be pickled and sent to the executors, and the executors also need a Python interpreter to execute that code. So the cost of serialization is huge here, even though that lambda is technically doing nothing (hold your horses, guys, we will have a final section discussing serialization in detail; this is just to give you an idea).
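Just as an illustration (the column name here is hypothetical), one common way to sidestep that pickling cost is to stay with Spark's built-in expressions instead of shipping a Python function to the executors:

from pyspark.sql import functions as F

df = spark.read.schema(data_schema).parquet(file_path)

# A Python lambda is pickled, shipped to each executor, and run in a Python
# worker process, so every row crosses the JVM <-> Python boundary:
#   df.foreach(lambda x: None)

# Built-in Column expressions are executed inside the JVM, so rows do not need
# to be serialized out to Python ("value" is a hypothetical column name).
df.filter(F.col("value").isNotNull()).count()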

So this wraps up the very first blog of this series. I hope you found some interesting points about Spark here; I personally found these points interesting, which is why I thought of sharing them with you all.

Our next blog will be coming out very soon, so stay tuned 🤞. Just to give you a glimpse, we will be talking about Spark's ‘Skew Problem’ [it's available now]. Till then, keep surfing 😛.
