Spark Hierarchy

Tharun Kumar Sekar
Published in Analytics Vidhya · 4 min read · Dec 24, 2019

This article will give you a basic understanding of the terms used in the Big Data and Spark World.

Spark Hardware Hierarchy

Hardware Hierarchy

Cluster:

  1. Driver
  2. Executor
  • Cores/Slots — Each executor can be thought of as a server, and each has cores. A core can be considered a slot into which work is placed; each core takes one piece of work at a time. (How executors and cores are sized is sketched below.)
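As a rough illustration, the number of executors, the cores per executor, and the memory per executor are set when the application is created or submitted. This is a minimal PySpark sketch assuming a cluster manager (such as YARN) that honors these properties; the numbers are placeholders, not recommendations.

```python
from pyspark.sql import SparkSession

# Minimal sketch: sizing executors and cores when the session is created.
# The numbers are illustrative placeholders, not recommendations.
spark = (
    SparkSession.builder
    .appName("hierarchy-demo")
    .config("spark.executor.instances", "4")  # 4 executors
    .config("spark.executor.cores", "4")      # 4 cores (slots) per executor -> up to 16 tasks in parallel
    .config("spark.executor.memory", "8g")    # memory available to each executor
    .getOrCreate()
)
```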

Memory

Each server also has memory, but not all of it is given to Spark; on average about 90% of this memory is made available to Spark.
At a high level, this memory is divided into Storage Memory and Working Memory.

Storage memory is where data is cached or persisted.

Working memory is where Spark does all of its in-memory computation.

Storage Memory

  • Used to store cached/persisted objects
  • Default configured limit — 50% of Spark's total memory

Working Memory

  • Utilized by Spark workloads for computation
  • The remaining 50% of Spark's total memory is used for these workloads (the knobs behind this split are sketched below)
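For reference, this split is controlled by two configuration properties. A minimal sketch of setting them when the session is created; the values shown are the defaults in recent Spark versions and are written out explicitly only for illustration, while the 90%/50% figures above are rough approximations.

```python
from pyspark.sql import SparkSession

# Minimal sketch: the two knobs behind the storage/working split.
# spark.memory.fraction        - share of the heap handed to Spark's unified memory region
# spark.memory.storageFraction - share of that region reserved for storage (cached data)
spark = (
    SparkSession.builder
    .appName("memory-demo")
    .config("spark.memory.fraction", "0.6")         # default; shown only for illustration
    .config("spark.memory.storageFraction", "0.5")  # default; the 50/50 split described above
    .getOrCreate()
)
```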

Disks

Each server also has locally attached/mounted storage.

  • RAM/SSD/NFS drives
  • Faster disks mean faster shuffling of data

These disks are extremely important because Spark frequently shuffles data (moves it around between nodes). In the intermediate stages, when data is moved around, that data is written to disk. The faster the disk, the faster the shuffle.
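If fast local disks are available, Spark can be pointed at them for its scratch space (shuffle and spill files). A minimal sketch, where the mount paths are assumptions; note that on YARN the node manager's local directories usually take precedence over this setting.

```python
from pyspark.sql import SparkSession

# Minimal sketch: directing shuffle/spill files to fast local disks.
# The paths are assumptions; use whatever SSD mounts exist on your workers.
spark = (
    SparkSession.builder
    .appName("shuffle-disk-demo")
    .config("spark.local.dir", "/mnt/ssd1/spark,/mnt/ssd2/spark")  # comma-separated local dirs
    .getOrCreate()
)
```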

To understand more about Spark Memory Management, refer to this article.

Software Hierarchy

Spark Software Hierarchy

Transformations (lazy):

  • Narrow (all the data needed for the transformation is already in the same partition, so no data movement is needed)
  • Wide (data needs to be moved across nodes; requires a shuffle — see the example below)
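A quick PySpark sketch of the two kinds of transformations; the data and column names are made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

# A tiny DataFrame of names, made up for illustration.
df = spark.createDataFrame([("Andy",), ("Alice",), ("Bob",)], ["name"])

# Narrow transformations: each output partition depends on a single input partition.
upper = df.withColumn("name", F.upper("name"))
short = upper.filter(F.length("name") <= 5)

# Wide transformation: rows with the same key must end up together, so a shuffle is required.
counts = short.groupBy(F.substring("name", 1, 1).alias("first_char")).count()

# Nothing has executed yet -- transformations are lazy until an action is called.
```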

Action

When we call an action, Spark executes all the transformations it has staged up to that point. An action launches one or many jobs, depending on the transformations.

  • Jobs
    — 1 job can have many stages.
  • Stages
    — A stage is a section of work that is going to be done.
    — 1 stage can have many tasks.
  • Jobs and stages are part of the orchestration.
  • Tasks (interact with the hardware directly)
  • Every task inside a stage does the same thing, only on a different segment of the data.
  • If a task is required to do something different, it has to be part of a different stage.
  • One task is executed by one core on one partition (illustrated below).
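Continuing the sketch above, calling an action is what actually launches a job; Spark splits the job into stages at the shuffle boundary (the groupBy) and runs one task per partition within each stage.

```python
# Continuing the earlier sketch: an action finally triggers execution.
print(df.rdd.getNumPartitions())  # partitions of the input = tasks in the first stage
result = counts.collect()         # action: launches the job (visible as a job in the Spark UI)
print(result)
```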

Shuffle

A shuffle happens whenever Spark cannot perform a task on individual partitions alone and needs data from other partitions for the computation.

In the above picture, we have 3 names in 3 partitions and we are trying to get the number of names for each first character. Getting the first character of each name can be done independently. But the groupBy task requires all names starting with "A" to be in a single partition, hence the shuffle. The shuffle gathers all the "A" names into a single partition and then counts the names for each first character.

Shuffle

During a shuffle, data from each partition is written to disk based on hash keys. In this case, Stage 1 writes data to disk partitioned by the first character, and Stage 2 pulls that data from disk and computes the count.
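Continuing the same sketch, the shuffle is visible in the physical plan, and the number of partitions it produces (and hence the number of tasks in Stage 2) is configurable.

```python
# Continuing the earlier sketch: the shuffle appears as an "Exchange" (hash partitioning) step.
counts.explain()  # look for the "Exchange hashpartitioning" step in the printed plan

# The number of shuffle partitions is configurable at runtime; 8 is just an
# illustrative value, the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "8")
```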

Spark Data Read Speed

Spark Data Read Speed

One of the main reasons Spark is blazingly fast is that it does all processing in memory. Spark leverages memory heavily because the CPU can read data from memory at a speed of about 10 GB/s. If Spark instead reads from spinning disks, the speed drops to about 100 MB/s, and SSD reads are in the range of 600 MB/s.

If the CPU has to read data over the network, the speed drops to about 125 MB/s.
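One practical way to exploit the fast memory read path is to persist a DataFrame that will be reused. A minimal sketch continuing the example above; the storage level choice is illustrative.

```python
from pyspark import StorageLevel

# Continuing the earlier sketch: persisting controls where the data lives between reuses.
counts.persist(StorageLevel.MEMORY_ONLY)   # served from RAM on subsequent actions
# counts.persist(StorageLevel.DISK_ONLY)   # alternative: trade read speed for capacity
counts.count()                             # action that materializes the cache
```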

I hope you are now able to understand the basic terms in Spark. If you want to understand how memory is utilized by Spark in each executor, refer to this article.
If you want to get into Advanced Spark, refer to this article.
