The Ultimate Guide to Spark Interview Questions - Part 1

Sourav Banerjee
May 29, 2020

Commonly asked Spark interview questions for experienced professionals as well as freshers.

What is the difference between SparkContext and SparkSession?

SparkContext is the entry point for accessing all Spark functionality. The Spark driver program uses it to connect to the cluster manager, submit Spark jobs, and determine which resource manager (YARN, Mesos, or Standalone) to communicate with.

Spark Architecture

Prior to Spark 2.0, SparkContext was the entry point for Spark jobs. RDD was one of the main APIs then, and it was created and manipulated using SparkContext. Every other API needed its own context: SQLContext for SQL, StreamingContext for Streaming, and HiveContext for Hive.

// Pre-2.0: a separate context for each API
val sc = new SparkContext(new SparkConf())
val hc = new HiveContext(sc)
val ssc = new StreamingContext(sc, Seconds(1)) // a batch interval is required

Spark 2.0 introduced SparkSession. SparkSession provides a single point of entry to interact with the underlying Spark functionality and allows programming with the Dataset and DataFrame APIs. All the functionality of SparkContext is now wrapped inside SparkSession.

To use the SQL, Hive, and Streaming APIs, there is no need to create separate contexts, as SparkSession includes all of them. Once the SparkSession is instantiated, we can configure Spark's runtime config properties:

val spark = SparkSession.builder().master(master).appName(name).getOrCreate()
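
As a quick illustration (the master URL, application name, and config key below are only placeholders), runtime configuration can be set on the session after it is created, and the underlying SparkContext is still reachable through it:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")                 // placeholder master URL
  .appName("interview-prep")          // placeholder application name
  .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", "50")   // a runtime config property

val sc = spark.sparkContext           // the SparkContext wrapped inside the session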

Spark YARN Client Mode vs Cluster Mode?

The major difference between YARN client mode and cluster mode is where the driver program runs.

In yarn-cluster mode, the driver program runs on the node where the application master is running, whereas in yarn-client mode the driver program runs on the node from which the job is submitted, typically a centralized gateway node or edge node.

Spark Deployment Mode

Here you need to understand that the driver program takes a significant amount of resources, so if many Spark jobs are submitted from the same centralized gateway node, that node can become a bottleneck. In yarn-cluster mode, the application master (and therefore the driver) may run on any worker node in the cluster, so driver programs are spread across the cluster. Hence it scales better.

Generally, we use client mode when the user has to interact with the Spark application through PySpark or spark-shell. Cluster mode is not appropriate for using Spark interactively.

In client mode, the driver component of the Spark job runs on the machine from which the job is submitted; hence the name client mode.

  • When the job-submitting machine is within or near the Spark infrastructure, there is no high network latency for data movement (such as final result generation) between the Spark infrastructure and the driver, so this mode works very well.
  • When the job-submitting machine is very remote from the Spark infrastructure, it has high network latency, and in that case this mode does not work well.

In cluster mode, the driver component of the Spark job does not run on the local machine from which the job is submitted; instead, Spark launches the driver inside the cluster, hence the name cluster mode.

  • When the job-submitting machine is remote from the Spark infrastructure, the driver still runs inside the Spark infrastructure, which reduces data movement between the submitting machine and the cluster. In such a case this mode works well (a sample spark-submit invocation is shown after this list).
  • The chance of a network disconnection between the driver and the Spark infrastructure is reduced, since they reside in the same infrastructure; this also reduces the chance of job failure.
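
For reference, the deployment mode is usually chosen with the --deploy-mode flag of spark-submit (the class name and jar path below are hypothetical placeholders):

spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar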

What are RDD and Partitions?

RDD stands for Resilient Distributed Dataset. It is an in-memory data structure in Spark. Prior to Spark 2.0, RDD was the main API for interacting with all Spark functionality.

An RDD essentially represents data that resides across multiple machines.

As an illustration, think of a large file segregated into 4 partitions; these partitions are then distributed and stored across different machines.

Resilient means that an RDD can withstand losses: lost partitions can be recomputed from their lineage. Distributed means the data is partitioned and resides at different locations. Dataset means a group of data on which we perform different operations. RDDs are also immutable.
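
A quick sketch of what partitioning looks like in practice (the numbers are purely illustrative; glom() is used here only to inspect how rows land in partitions):

val rdd = sc.parallelize(1 to 12, 4)    // ask Spark for 4 partitions
rdd.getNumPartitions                    // => 4
rdd.glom().collect()                    // => Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9), Array(10, 11, 12))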

There are different ways to create an RDD (see the sketch after this list):

  • Parallelizing an existing collection in the driver program
  • Referencing a dataset in an external storage system
  • Creating a new RDD from an already existing RDD
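
A minimal sketch of the three approaches (the HDFS path is a hypothetical placeholder):

// 1. Parallelize an existing collection
val numbers = sc.parallelize(Seq(1, 2, 3, 4))

// 2. Reference a dataset in an external storage system (hypothetical path)
val lines = sc.textFile("hdfs:///data/input.txt")

// 3. Create a new RDD from an existing RDD
val doubled = numbers.map(_ * 2)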

Spark Map vs FlatMap?

map and flatMap are two of the most commonly used transformations in Spark. Both apply a function to each element of an RDD and return the result as a new RDD.

map converts an RDD of size ’n’ into another RDD of size ’n’: the input and output sizes of the RDD will be the same. To put it another way, one element in the input gets mapped to exactly one element in the output.

So, for example, say I have an array, [1, 2, 3, 4], and I want to increment each element by 1. The input size and output size are the same, so we can use map.

Spark code: myrdd.map(x => x + 1)

That is what the map function does. While using map, you can be sure that the input and output sizes remain the same, so even if you chain a hundred map functions in series, the output will have the same number of elements as the input.

flatMap does a similar job: transforming one collection into another, or in Spark terms, one RDD into another RDD. But there is no condition that the output size has to equal the input size. To put it another way, one element in the input can be mapped to zero or more elements in the output.

Also, the output of flatMap is flattened. Although the function passed to flatMap returns a list of elements for each input element, the output of flatMap is an RDD in which all those elements are flattened into a single collection.

flatMap is nothing but a combination of flatten and map functionality.

Below is an example of applying map and flatMap to the same input:
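
Since the original illustration is not reproduced here, a small sketch (with an assumed two-line text input) shows the difference:

val rdd = sc.parallelize(Seq("hello world", "hi"))

rdd.map(line => line.split(" ")).collect()
// => Array(Array(hello, world), Array(hi))   -- one output element per input element

rdd.flatMap(line => line.split(" ")).collect()
// => Array(hello, world, hi)                 -- all results flattened into one collection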

map vs mapPartitions in Spark?

With a map operation, Spark operates on each row of a dataset: we call a lambda function or any user-defined function once for each row of the dataset.

But there can be scenarios where we need to do some expensive, repetitive setup for each row, for example opening and closing a database connection. If we use a map operation in this case, it turns into boilerplate: the database would be opened and closed once per row. In such cases we use mapPartitions. With mapPartitions, the function is called once for each partition, and Spark provides us an iterator over all the rows in that partition, so the database I/O is minimized.

If there are 10,000 rows and 50 partitions, then each partition will contain 10000/50 = 200 rows.

Now, when we apply the map(func) method to the RDD, func() is applied to each and every row; in this particular case func() will be called 10,000 times, which can be time-consuming in time-critical applications.

If we call the mapPartitions(func) method on the RDD, func() is called once per partition instead of once per row; in this particular case it will be called 50 times (the number of partitions). In this way you can avoid some processing overhead in time-critical applications.
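
A short sketch of the difference (the per-partition setup is only hinted at in a comment, since any real connection helper would be specific to your environment):

val rdd = sc.parallelize(1 to 10000, 50)

// map: the function runs once per element (10,000 calls)
val viaMap = rdd.map(x => x * 2)

// mapPartitions: the function runs once per partition (50 calls);
// expensive setup done here is shared by all rows of the partition
val viaMapPartitions = rdd.mapPartitions { rows =>
  // e.g. open one database connection here (hypothetical) and reuse it for every row
  rows.map(x => x * 2)   // `rows` is an iterator over all rows in this partition
}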
