Day 5 of Cracking the Data Engineering Interview at any company.

Priya Chauhan
4 min read · Mar 28, 2023


Note: This series will continue through Day 30. It does not target any specific company; the goal is to be able to crack any company's interview, because in the end we should know how to get the work done with our technical skills once we are onboarded.

Learning Apache Spark

What is Spark?

Before we answer that, let's understand RAM vs. storage.

Question: Which is faster, finding a file on your tidy desk or finding it on a messy bookshelf?

The answer is the desk.

Similar to this analogy, retrieving data from RAM is faster because RAM sits very close to the CPU and has a very high-bandwidth connection.

Apache Spark works in a similar manner. But wait, we don't yet know what it is.

Apache Spark is a compute engine, meaning it provides the infrastructure and resources needed to run your job faster.

Spark is claimed to be up to 100x faster than Hadoop MapReduce. Why do we say that?

Because Spark processes data in memory (RAM), while Hadoop MapReduce has to persist data back to disk after every Map or Reduce step.

For readers who are new to MapReduce: it is a programming model provided by Hadoop to process big data.
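To see this in-memory behaviour yourself, here is a minimal PySpark sketch (the app name and data size are illustrative assumptions): once the DataFrame is cached, the second action reuses the in-memory copy instead of re-reading the source.

from pyspark.sql import SparkSession

# Start a session; the app name is an illustrative choice.
spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

df = spark.range(1_000_000)   # a simple DataFrame of ids 0..999,999
df.cache()                    # ask Spark to keep it in memory once computed

print(df.count())                          # first action: computes and caches the data
print(df.filter(df.id % 2 == 0).count())   # second action: reuses the cached copy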

So now we are clear that Spark is a compute engine.

Great!

Tech fact : Spark was initially started by Matei Zaharia at UC Berkeley’s AMPLab in 2009, and open sourced in 2010 under a BSD license. In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project.

The Architecture

Consider Spark as the engine of a bus, a special bus whose engine configuration is done by the driver alone.

Once the driver starts the engine with this custom configuration, the engine turns on and idles until all the passengers are seated and the driver moves the bus ahead.

Please note that even after the driver starts the engine, one more thing is necessary: petrol/diesel in the fuel tank. Only if there is enough fuel can the driver drive the bus ahead.

In this scenario, the driver is the Spark driver.

The moment the driver starts the engine with its custom config, a Spark session is initiated.

Here the diesel is the resources in the Spark architecture, and the fuel tank is the cluster manager.

Turning the key to start the engine is the spark-submit, meaning you are submitting your application for execution.

Once you do a spark-submit, a driver program is launched. It requests resources from the cluster manager, and at the same time the main program of your processing code is initiated by the driver program.

Based on that, the execution logic is processed, and in parallel the Spark context is also created. Using the Spark context, the different transformations and actions are processed.
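To make the analogy concrete, here is a minimal sketch of "turning the key" in PySpark (the app name and the memory/core values are illustrative assumptions, not recommendations): building a SparkSession with a custom config, which also gives us the underlying Spark context.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("day5-architecture-demo")
    .config("spark.executor.memory", "2g")   # resources the cluster manager will be asked for
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

# The SparkContext comes along with the session and records the DAG of work.
print(spark.sparkContext.applicationId)

On a real cluster you would typically package this script and launch it with spark-submit; that is the step where the driver asks the cluster manager for these resources.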

Now, what are transformations?

Getting wheat flour from wheat is one transformation in life,

and getting good-quality, useful business data out of raw data is another.

A transformation can be filtering out IDs with duplicate names,

or it can be running a when/otherwise condition, similar to SQL's CASE WHEN.

So we are actually moulding the data.
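Both of those examples look roughly like this in PySpark (the DataFrame below is a made-up illustration): dropping duplicate names and adding a when/otherwise column. Note that both are transformations, so nothing has executed yet.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

# An illustrative DataFrame with duplicate names.
df = spark.createDataFrame(
    [(1, "Asha"), (2, "Asha"), (3, "Ravi"), (4, "Meena")],
    ["id", "name"],
)

deduped = df.dropDuplicates(["name"])      # keep one row per name

labelled = deduped.withColumn(             # when/otherwise, like SQL's CASE WHEN
    "id_type", F.when(F.col("id") % 2 == 0, "even").otherwise("odd")
)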

Now, what is an action? Spark is very lazy: until we ask it to show the results, it won't execute the task for us.

Say we have asked Spark to filter the even IDs out of a million IDs.

At this point it won't actually filter anything, unless we write a show command to view the even IDs.
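Here is a minimal sketch of that laziness, assuming a simple range of ids: the filter on its own returns instantly because it is only recorded; the show is the action that actually runs a job.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

ids = spark.range(1_000_000)              # source data: nothing is read or computed yet
even_ids = ids.filter(ids.id % 2 == 0)    # transformation: only recorded, not executed

even_ids.show(5)                          # action: Spark now actually runs the job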

So, until an action is encountered, all the transformations go into the Spark context in the form of a DAG, which builds up the RDD lineage.

Now, what is a DAG? A Directed Acyclic Graph.

And what is an RDD? A Resilient Distributed Dataset:

an immutable, distributed collection of the elements of your data, partitioned across the nodes in your cluster, which can be operated on in parallel with transformations and actions.
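Here is a minimal RDD sketch (the numbers and the four partitions are illustrative): each transformation produces a new immutable RDD, and the lineage of those RDDs can be inspected before any action runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10), numSlices=4)    # immutable, split across 4 partitions
squared = rdd.map(lambda x: x * x)              # transformation: new RDD, lineage recorded
evens = squared.filter(lambda x: x % 2 == 0)    # another transformation

print(evens.toDebugString().decode())           # the RDD lineage (PySpark returns bytes)
print(evens.collect())                          # action: triggers the whole chain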

Once the action is called, a job is created. A job is a collection of stages, and each stage is a group of tasks.

Once these tasks are created, they are launched on the worker nodes through the cluster manager, with the help of a component called the task scheduler.

The conversion of the RDD lineage into tasks is done by the DAG scheduler. The DAG is built from the different transformations in the program; once the action is called, it is split into stages of tasks, which are submitted to the task scheduler as they become ready.

These tasks are then launched on the different executors on the worker nodes through the cluster manager. The cluster manager handles the resource allocation, while the driver tracks the progress of the jobs and tasks.
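To see the stage split for yourself, here is a small sketch (names are illustrative): the groupBy is a wide transformation, so the physical plan contains an Exchange (shuffle) and the job is broken into more than one stage. While the application runs, the Spark UI (by default on port 4040 of the driver) shows the jobs, stages and tasks.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stages-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
counts = df.groupBy("bucket").count()    # wide transformation: needs a shuffle

counts.explain()    # physical plan: the Exchange node marks a stage boundary
counts.collect()    # action: one job, split into stages and tasks by the schedulers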

As soon as you do a spark-submit, your user program and the other configuration you specified are copied onto the available nodes in the cluster, so the program becomes a local read on all the worker nodes. Hence the parallel executors running on the different worker nodes do not have to do any network routing to fetch the application code.

This is how the entire execution of a Spark job happens.

Join me for an upcoming webinar on Apache Spark:

https://topmate.io/priya_chauhan/210580

Hope this helped. See you on Day 6!
