Azure Spark (Databricks) Interview Questions

Sanjit Khasnobis
6 min read · Mar 26, 2023


As a Data Engineer, when I tried to appear for interviews myself, I often found myself lost in topics and confused about where to start.

With my many years of experience as a Data Engineer and Data Architect, I am trying to put the Spark interview questions in one place as a quick refresher. To be clear, you may not find anything here that is not already available on the public internet. But I have tried to explain it in a very simple manner, so anyone who wants a high-level revision just before an interview is welcome to read this blog. I will be really delighted if my small effort can be helpful to my beloved Data Engineer, Data Scientist, Data Analyst and DataOps community.

So, today we are going to cover some basic Azure Databricks Spark interview questions, so that anyone going for an interview can come here and glance through them before getting into the call or interview session. Hope this helps all of you. Let us start without delay.

Question 1: Which one is better to use: Hadoop or Spark?

[Image: Hadoop vs. Spark differences]

Question 2: What is the difference between Spark Transformation and Spark Actions?

[Image: Transformation vs. Action differences]

Transformations are lazily evaluated and only execute when an Action is called.
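
As a minimal sketch (assuming an existing SparkSession named spark), a transformation on its own does nothing; the job only runs when an action is called:

df = spark.range(1000)                    # DataFrame with ids 0..999
evens = df.filter(df["id"] % 2 == 0)      # transformation: nothing is executed yet
print(evens.count())                      # action: triggers the job and returns 500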

Question 3: What is the basic difference between PySpark, Databricks and Elastic MapReduce?

Apache Spark is an open-source framework used for big data processing; PySpark is its Python API.

Databricks is a commercial platform built on top of Spark. It provides a more reliable and efficient processing framework.
Companies that buy a Databricks license get support from the Databricks engineering team in case of any critical problems.

Elastic MapReduce (EMR) is a framework written by Amazon on top of Spark. It too provides a more reliable and efficient processing framework.

Question 4: What is Lazy Transformation?

If the execution of a PySpark job does not happen immediately but waits for an Action to arrive, we call it Lazy Evaluation.

When we call any operation or transformation on an RDD or DataFrame, it does not execute immediately. Spark adds it to a DAG (Directed Acyclic Graph) of computation, and only when the driver asks for some data does this DAG start executing.

It is an optimization technique that reduces the number of queries to be triggered. This process of lazily evaluating the execution plan and producing the best optimized plan before the actual execution of the Spark job is called Lazy Evaluation.
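
As a quick illustration (assuming a SparkSession named spark), explain() shows the plan Spark has built while nothing has executed yet:

filtered = spark.range(100).filter("id > 50").select("id")   # transformations only: builds the DAG
filtered.explain()                                           # prints the optimized plan; no job has run
result = filtered.collect()                                  # action: the DAG is executed now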

Question 5: What are RDDs and DataFrames?

RDD stands for Resilient Distributed Dataset.

Distributed, because these elements run and operate on multiple nodes to do parallel processing on a cluster.

RDDs are immutable, which means once you create an RDD you cannot change it.

RDDs are resilient because they are fault tolerant: in case of any failure, they recover automatically.

You can apply multiple operations on these RDDs to achieve a certain task.

There are two ways to apply operations on these RDDs −

  • Transformation and
  • Action

Transformation − These are the operations which are applied on an RDD to create a new RDD. filter, groupBy and map are examples of transformations.

Action − These are the operations that are applied on an RDD and instruct Spark to perform the computation and send the result back to the driver.
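
A small RDD sketch (assuming a SparkContext named sc) combining both kinds of operations:

rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)           # transformation: returns a new RDD
small = squared.filter(lambda x: x < 10)     # transformation: still nothing executed
print(small.collect())                       # action: computes and returns [1, 4, 9]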

As per Databricks -

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

In simpler terms, a DataFrame can be represented as rows and columns.

DataFrames can be constructed from a wide array of sources, such as the following (illustrated below):

  1. Structured data files.

  2. Tables in Hive.

  3. External databases.

  4. Existing RDDs.
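
For illustration only (the paths, table names and connection details below are hypothetical), each of these sources can be loaded as a DataFrame:

df_files = spark.read.parquet("path/to/data.parquet")                                                     # 1. structured data files
df_hive = spark.sql("SELECT * FROM my_hive_table")                                                        # 2. tables in Hive (needs Hive support enabled)
df_jdbc = spark.read.format("jdbc").options(url="jdbc:postgresql://host/db", dbtable="my_table").load()   # 3. external databases
df_rdd = sc.parallelize([(1, "a"), (2, "b")]).toDF(["id", "letter"])                                      # 4. existing RDDs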

Question 6: What is Spark Partition?

In Spark, data is internally stored as RDDs (Resilient Distributed Datasets).
Big data cannot fit into a single node.

Thus, we need to divide it into partitions across various nodes; Spark automatically partitions RDDs
and automatically distributes the partitions among different nodes.

In Apache Spark, partitions are the basic units of parallelism, and RDDs are collections of partitions.

Some transformations require shuffling of data across worker nodes and greatly benefit from partitioning, for example cogroup, groupBy, groupByKey and many more. These operations also need lots of I/O.

As a result, by applying partitioning we can reduce the number of I/O operations significantly, which speeds up data processing, because Spark works on the data locality principle.

It means worker nodes take the data which is nearest to them for processing. So partitioning reduces network I/O, and processing of data becomes a lot faster.

The command below creates an RDD of 100 integers with 10 partitions.

integer_RDD = sc.parallelize(range(100), 10)

The best way to decide the number of Spark partitions in an RDD (see the sketch below) -

  1. Make the number of partitions equal to the number of cores across the cluster.

  2. This way all the partitions will be processed in parallel, and the resources will be used in an optimal way.
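
A hedged sketch (assuming a SparkContext named sc) of checking and adjusting the partition count:

integer_RDD = sc.parallelize(range(100), 10)
print(integer_RDD.getNumPartitions())                        # 10
print(sc.defaultParallelism)                                 # usually the total number of cores available
balanced = integer_RDD.repartition(sc.defaultParallelism)    # one partition per core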

Question 7: What is shuffling in PySpark?

Shuffle is the process of redistributing data across partitions so that data with the same key is grouped together under one partition.

This happens between two stages whenever a wide transformation or repartition is called.

Suppose country-specific data is sitting on different partitions in Spark. The moment a user triggers a query that groups the data by country code, a shuffle starts, moving all the rows with the same country code to the same partition so the group-by operation can be calculated.

Shuffling generally happens when we perform by-key operations such as groupByKey or reduceByKey.
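
A small illustration (assuming a SparkContext named sc, with made-up data) of a by-key operation that shuffles records so rows with the same country code end up in the same partition:

sales = sc.parallelize([("IN", 10), ("US", 20), ("IN", 5), ("US", 7)])
totals = sales.reduceByKey(lambda a, b: a + b)   # wide transformation: triggers a shuffle by key
print(totals.collect())                          # e.g. [('IN', 15), ('US', 27)]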

Question 8: How do we read a file in Spark?

We read a file in Spark using the spark.read API.

We can specify the format of the file we want to read, such as text, CSV, JSON or Parquet.

We can specify extra parameters like header, multiLine, etc.

Below are examples of reading different kinds of files:

spark.read.text('file_path')                   # to read a text file
spark.read.csv('file_path')                    # to read a CSV file
spark.read.format('json').load('file_path')    # to read a JSON file
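
For example (the option values are illustrative), extra parameters are passed through option():

spark.read.option('header', 'true').csv('file_path')                      # CSV with a header row
spark.read.option('multiLine', 'true').format('json').load('file_path')   # JSON records spanning multiple lines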

Question 9: What is the difference between narrow and wide transformation in spark?

When the data to be computed lives on a single partition and there is no data movement between partitions to execute the transformation, we call it a "Narrow Transformation".
Examples of narrow transformations are filter() and map().

When the data to be computed lives on more than one partition and there is data movement between partitions to execute the transformation, we call it a "Wide Transformation".
Wide transformations are the result of operations such as groupByKey and reduceByKey (and join and aggregate functions).

As these transformations shuffle the data, we also call them Shuffle Transformations.
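
A short DataFrame sketch (assuming a SparkSession named spark) contrasting the two:

df = spark.createDataFrame([("IN", 10), ("US", 20), ("IN", 5)], ["country", "amount"])
narrow_df = df.filter(df["amount"] > 5)         # narrow: no data movement between partitions
wide_df = df.groupBy("country").sum("amount")   # wide: requires a shuffle across partitions
wide_df.show()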

Question 10: What is the use of inferSchema in Spark?

The inferSchema option tells the reader to infer data types from the source file. This results in an additional pass over the file, so two Spark jobs are triggered. It is an expensive operation because Spark must automatically go through the CSV file and infer the schema of each column.

df = spark.read.format("csv").option("inferSchema", "true").load(filePath)

So inferSchema scans the input file automatically and assigns a schema to the resulting DataFrame.

It can slow down the process because the whole file has to be scanned, so it is better to define your own schema in Spark (see the sketch below).
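
A minimal sketch of supplying an explicit schema instead of relying on inferSchema (the column names here are hypothetical):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df = spark.read.format("csv").option("header", "true").schema(schema).load(filePath)   # single pass, no inference job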


Sanjit Khasnobis

I am a passionate Data Architect/Engineer, computer programmer and problem solver who believes that presenting the right data can make a big difference in everyone's life.