10 Questions You Can Expect in a Spark Interview

Questions frequently asked in Apache Spark Interviews

Abid Merchant
Analytics Vidhya
5 min read · Dec 17, 2019


Hey Fellas,

The Data Engineer role is in high demand these days, and with Apache Spark being the state of the art for batch processing and ETL, being proficient in it can easily land you a job as a Data Engineer. So, in this article I will showcase 10 questions you can expect in an Apache Spark interview. Please note that I won’t be including naive questions like “What is a Dataframe?”, “What is a Spark RDD?” or “How to read/write an ORC file?”, as I expect that anyone going into an Apache Spark interview already knows these things, and reiterating them here would be pointless.

With all that said, let’s jump to the Q/A.

Is Spark Better than Hadoop? Why?

Yes, Spark is evidently better than Hadoop. One of the major reasons is that it is faster, thanks to in-memory processing that reduces the latency of read/write operations. In the MapReduce paradigm, the output of each task is written to disk and read back whenever the data has to be used again. In Spark, on the other hand, processing happens in memory and dataframes can be cached for future use, which results in much better performance. Moreover, Spark comes with libraries like Spark ML, Spark SQL and Spark Streaming, which make it even richer.

What is the difference between coalesce and repartition?

This is the hotshot topic of discussion when it comes to optimising your Spark job. Both functions let us manipulate the number of partitions of a dataframe, but their uses are different. Repartition does a full shuffle of the data, so we can either increase or decrease the number of partitions. Coalesce only moves data from some partitions into the remaining ones, so it can only decrease the number of partitions. Coalesce is faster because it shuffles less, but if the number of partitions has to be increased, or the data is skewed and we want to decrease the partition count with a reshuffle, then we should go with repartition, as shown in the sketch below.
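A minimal sketch of the difference (the dataframe and partition counts are purely illustrative, and an existing SparkSession named spark is assumed):

# illustrative dataframe; the initial partition count depends on your cluster defaults
df = spark.range(0, 1000000)

df_re = df.repartition(200)   # full shuffle: the partition count can go up or down
df_co = df.coalesce(4)        # no full shuffle: the count can only go down

print(df_re.rdd.getNumPartitions())   # 200
print(df_co.rdd.getNumPartitions())   # 4 (or fewer, if df started with fewer partitions)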

What is Broadcast Join?

Broadcast join is also used for optimising a Spark job (particularly joins). When a small dataframe is joined with a relatively larger one, we can broadcast the small dataframe, which sends a copy of it to every node and results in faster join execution and less shuffling. The syntax is given below.

import pyspark.sql.functions as fn

# every executor receives a full copy of `small`, so the big dataframe is not shuffled
final = big.join(fn.broadcast(small), ["common_id"])

When broadcasting the smaller dataframe, we can also reduce its partitions to 1 for better performance (depending on your use case).

What is lazy evaluation?

There are two important kinds of operations in Apache Spark: transformations and actions. Transformations include functions like filter, where and when; when we call them, Spark does not actually perform the work but stacks the transformations up until an action is called. Only when an action is called are all the stacked transformations executed, and this is what lets Apache Spark optimise the job as a whole. Examples of actions are show(), count() and collect(). A small illustration is given below.
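A small illustration of lazy evaluation (an existing SparkSession named spark is assumed, and people.csv is a hypothetical input file):

from pyspark.sql import functions as fn

# transformations: these are only recorded in the query plan, the filtering is not executed yet
df = spark.read.option("header", True).csv("people.csv")
adults = df.filter(fn.col("age") > 18)
named = adults.select("name")

# show() is an action: only now is the whole plan executed and the data actually processed
named.show()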

What is the difference between cache() and persist()?

Both of these APIs are used to persist a dataframe across computations, but at different levels. With persist() we can specify the storage level, such as MEMORY_ONLY, MEMORY_AND_DISK or DISK_ONLY, whereas cache() takes no storage level and simply uses the default (MEMORY_ONLY for RDDs, and MEMORY_AND_DISK for dataframes in recent Spark versions).
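A short sketch of the two calls (an existing SparkSession named spark is assumed):

from pyspark import StorageLevel

df = spark.range(0, 100000)   # example dataframe

df.cache()       # default storage level
df.count()       # an action materialises the cache
df.unpersist()   # release it

df.persist(StorageLevel.DISK_ONLY)   # persist() lets us choose the level explicitly
df.count()
df.unpersist()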

Difference between Rank and Dense Rank?

This is a SQL question, but I included it because we can expect it if the interview moves into the window/partition area. Suppose we have a dataset as given below:

Name   Salary   Rank   Dense_rank
Abid   1000     1      1
Ron    1500     2      2
Joy    1500     2      2
Aly    2000     4      3
Raj    3000     5      4

Here salary is in increasing order and we compute rank() and dense_rank() over the dataset. As Ron and Joy have the same salary they get the same rank, but rank() leaves a hole and skips “3”, whereas dense_rank() leaves no gaps even when equal values are encountered. A sketch of both window functions is given below.
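A minimal PySpark sketch reproducing the table above (an existing SparkSession named spark is assumed):

from pyspark.sql import functions as fn
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [("Abid", 1000), ("Ron", 1500), ("Joy", 1500), ("Aly", 2000), ("Raj", 3000)],
    ["Name", "Salary"],
)

# order the window by salary, then add both ranking columns
w = Window.orderBy("Salary")
df.withColumn("Rank", fn.rank().over(w)) \
  .withColumn("Dense_rank", fn.dense_rank().over(w)) \
  .show()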

How to connect to Hive through Spark SQL?

The solution is to copy your hive-site.xml and core-site.xml into Spark’s conf folder, which gives the Spark job all the required metadata about the Hive metastore. You also have to enable Hive support and specify the location of Hive’s warehouse directory in the configuration while starting your Spark session, as given below:

from pyspark.sql import SparkSession

warehouse_location = "/user/hive/warehouse"   # adjust to your Hive warehouse path

spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.enableHiveSupport() \
.getOrCreate()

To read about this in detail, visit here.

Are RDDs better than Dataframes?

No, Dataframes execute faster than RDDs and are syntactically easier to use. One doubt you might have: if a Dataframe is converted to RDDs in the backend, how can an RDD be slower? The answer is that Dataframe operations go through the Catalyst optimizer, which makes them run faster, while operations on raw RDDs are not optimised at all. The comparison between these APIs was explained in detail in this video at Spark Summit 2017. A small comparison of the two APIs is given below.
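A minimal sketch of the same aggregation written against both APIs (illustrative only; an existing SparkSession named spark is assumed):

data = [("sales", 100), ("sales", 200), ("hr", 50)]

# RDD API: the lambdas are opaque to Spark, so there is no plan-level optimisation
rdd = spark.sparkContext.parallelize(data)
rdd_totals = rdd.reduceByKey(lambda a, b: a + b).collect()

# Dataframe API: the query plan goes through the Catalyst optimizer
df = spark.createDataFrame(data, ["dept", "amount"])
df_totals = df.groupBy("dept").sum("amount").collect()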

How to read an XML file in Spark?

This is very simple: Spark has the spark-xml package, which allows us to parse XML files into dataframes. Consider the XML file given below:

<person>
<name>John</name>
<address>Some more Data</address>
</person>

Then, to read it, we can specify the row tag during the read as follows:

xmlDF = spark.read \
.format("com.databricks.spark.xml") \
.option("rowTag", "person") \
.load("yourFile.xml")

This will give you a dataframe with “name” and “address” as columns.

Do you need to install Spark on all nodes of the YARN cluster while running Spark on YARN mode?

No, it is not necessary to install Spark on every node when submitting a job in YARN mode. Spark runs on top of YARN and uses the YARN resource manager to acquire all the required resources, so we only have to install Spark on the node from which the job is submitted. Read more about the YARN deployment mode here. A minimal sketch is given below.
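A minimal sketch of pointing a session at YARN from Python (in practice the job is usually launched with spark-submit; HADOOP_CONF_DIR or YARN_CONF_DIR must point at your cluster configuration for this to work):

from pyspark.sql import SparkSession

# "yarn" as the master hands resource management over to the YARN ResourceManager
spark = SparkSession.builder \
.appName("yarn example") \
.master("yarn") \
.getOrCreate()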

So, that’s all folks, I hope you found my article helpful. Do check out my previous article on Spark Delta, in which I explained ACID on Spark. Till then, ta ta!
