Things to take care of before “Productionizing” Apache Spark

Abid Merchant
Published in Analytics Vidhya · Nov 23, 2019

Since Apache Spark became a top-level Apache project in 2014, there has been no stopping it: with most tech giants embracing the technology, it has become the state of the art for unified data processing. Moreover, it ships with the Spark MLlib library, which makes it a one-stop shop for data engineering and data science projects doing large-scale data analysis. But apart from being fast, robust and really cool, Spark has some loopholes that gave me a tough time at work, so I have decided to put up a list of the challenges we face in Spark; most of these impediments can easily be overcome if you know about them in advance.

Null Pointer Exception

This exception is my worst nightmare, and I mean that literally; many of us will have faced it. It took me almost three days of debugging to pinpoint the exact issue, and even now a small mistake in code can lead to this problem. In the logs, all you typically get is a bare java.lang.NullPointerException with a stack trace that points nowhere near the real cause.

Encountering this issue, my first thought was to check for null rows or null data in columns where nulls are not allowed. But after much toil and trouble I decided to go down to the file level and check the files themselves. After some debugging I learned that the issue appears when there is a datatype mismatch within a column across files. To throw some more light on the issue, here is the code I used to recreate the exception:

import pyspark.sql.functions as fn

df = spark.range(1, 5)                                   # column "id" is written as bigint
df.write.orc("/mnt/data/nullPointer")

df1 = df.withColumn("id", fn.lit("id").cast("String"))   # same column, now a string
df1.write.orc("/mnt/data/nullPointer", mode="append")    # append succeeds; files now disagree on type

new = spark.read.orc("/mnt/data/nullPointer")
new.show()

So basically I created a dataframe whose column held integer data and wrote it to a folder, then appended the same column with a String datatype to the same location. Spark gave me no error on write, because Spark follows “schema on read” and does not care what type of data goes into the data lake. Reading the data back also raised no error, and even operations like show() or count() might not complain, since show() may pull data from only one partition and count() just counts rows. The issue shows up when you try to aggregate that column or display more rows; at that point you may hit this exception and, just like me, be clueless and start meddling with the data, which gets you nowhere near the real issue, which sits at the storage level where the files are effectively corrupted.
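For completeness, these are the kinds of follow-up operations that, in my experience, tend to surface the error on the mixed-type files created above (whether they fail immediately can depend on the file layout and Spark version):

# Continuing the snippet above; "spark" is the usual SparkSession from the shell or notebook.
new = spark.read.orc("/mnt/data/nullPointer")

# Showing more rows, or aggregating on the mismatched column, forces Spark to
# read the other files as well, which is where the exception tends to appear.
new.show(100)                       # may fail once the second file is touched
new.groupBy("id").count().show()    # aggregation on the corrupted column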

Resolution: This kind of issue tends to appear when you have delta loads and the data is written from different scripts, so every time, make sure you cast your columns and keep the datatypes of each column in sync across scripts. Once you do hit the issue, even I don’t know how to get the data back; the only workaround I found is recreating the table in Hive, as Hive is able to read the data even though the datatypes in the files do not match.
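As a preventive measure, one pattern I find useful is a small sketch like the one below, where every script casts its dataframe to an agreed-upon schema before appending (the schema dict, path and dataframe names are illustrative assumptions, not part of the original example):

import pyspark.sql.functions as fn

# Assumed target types for the table, shared by all delta-load scripts;
# a real table would list every column here.
TARGET_SCHEMA = {"id": "bigint"}

def append_with_schema(df, path):
    # Cast every column to the agreed type so the files on disk stay consistent.
    aligned = df.select(*[fn.col(c).cast(t) for c, t in TARGET_SCHEMA.items()])
    aligned.write.orc(path, mode="append")

# Example usage with a hypothetical incoming dataframe:
# append_with_schema(incoming_df, "/mnt/data/nullPointer")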

Broadcast Exception

This issue is very rare if you are working on small datasets. I have faced it only a few times, but when something is going to production we have to make sure every last bit is in its proper place. The issue appears when you broadcast-join a table and the broadcast takes very long; there is a default timeout, and if the broadcast exceeds it, your script raises an exception.

Resolution: You can start by increasing the broadcast timeout if you suspect that the dataframe you are broadcasting is large:

spark.conf.set("spark.sql.broadcastTimeout", valueAsPerYourNeed)

You can set it to “-1” for infinite time.
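To put the setting in context, here is a minimal sketch of how I would wire it up around an explicit broadcast join (the table names are assumptions; the timeout is in seconds and the default is 300, the value shown is arbitrary):

from pyspark.sql.functions import broadcast

# Raise the broadcast timeout (in seconds) before running the join.
spark.conf.set("spark.sql.broadcastTimeout", 1200)

# Hypothetical tables: force the small dimension table to be broadcast.
joined = fact_df.join(broadcast(dim_df), on="key", how="left")
joined.count()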

OutOfMemoryError: Java heap space

This issue is very common for me, given the datasets I meddle with. Seeing it, the first obvious resolution is to throw more memory at the job.

And that answer is not entirely wrong: I increased the memory for my job and it worked. But a few days later I hit the same issue again, and there is only so far you can increase the resources for a single job, so I looked for another resolution.

Resolution: First, you can try to increase the memory using the command below:

spark-submit --num-executors 5 --total-executor-cores 30 --driver-memory 5G --executor-memory 5G   # increase driver/executor memory as needed

Another resolution is to debug and find the dataframe whose transformation is hitting the issue, and increase its number of partitions using the command below:

df_repart = df.repartition(500)  # as per your requirement

The default number of shuffle partitions (spark.sql.shuffle.partitions) is 200, and it can be increased as per your need. Increasing the number of partitions reduces the load on each executor, resulting in more stable job execution.
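If the problem is shuffle-wide rather than tied to one dataframe, the same idea can be applied globally through the session config instead of repartitioning each dataframe (the value below is just illustrative):

# Raise the number of shuffle partitions for all wide transformations in this session.
spark.conf.set("spark.sql.shuffle.partitions", 500)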

Apart from these, here is a checklist of issues to watch out for before going to production:

  • Overwriting the same path you are reading from is not allowed in Spark: it will delete your data first and then throw an exception (very frustrating at times!).
  • Appending to an empty file will raise an exception.
  • Keep your URLs in a separate config file, or you will die changing them every time you move to a new environment (see the sketch after this list).
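For the last point, here is a minimal sketch of what I have in mind (the file name, keys and paths are all assumptions, not a prescription):

import json

# Environment-specific values live in one JSON file per environment,
# e.g. config/dev.json vs config/prod.json.
with open("config/prod.json") as f:
    conf = json.load(f)

df = spark.read.orc(conf["input_path"])
df.write.orc(conf["output_path"], mode="overwrite")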

So to conclude, Apache Spark is a very powerful tool, and having it as part of your project will reap performance improvements, but it has its flaws, and the list I created is not exhaustive. So we have to be agile and preemptive to create seamless Spark jobs.

Moreover, some of these issues are now solved with Delta Lake coming into the picture in 2019, bringing a new dawn of ACID operations to Spark. If you want to know more about it, check out this article.

So that’s all folks! If you liked my article, don’t forget to give it a clap and follow me for more exciting articles. Till then, goodbye!
