Apache Spark - Level 2

When a Spark job starts, a new Driver process is launched, and then the Executors [which run Tasks for parallel data processing] are started on the Worker nodes of the cluster.

So, the Master program is called the Driver.

The Worker programs [run on slave nodes] are called the Executors.

Master and Executors talk to each other.

Master process:

1) coordinates the Executors

2) sends the application JAR to all Executors; it is copied to every Worker node

3) later, sends Tasks to the Executors [by serializing the function objects]

4) the functions passed to higher-order operations (HOFs) such as map, flatMap, join, and filter become the Tasks

5) aggregates the final results

6) once the Driver quits, the Executors shut down
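Step 3 above can be sketched in plain Python. This is a minimal illustration of the idea, not Spark's actual mechanism: the "driver" serializes a function with `pickle`, and the "executor" deserializes it and applies it to its slice of the data. The function and data names here are made up for the example.

```python
import pickle

def double(x):
    """The function passed to a map-like HOF; it becomes the Task."""
    return x * 2

# Driver side: serialize the function object so it can be shipped.
task_bytes = pickle.dumps(double)

# Executor side: deserialize the Task and run it on a partition.
task = pickle.loads(task_bytes)
result = [task(x) for x in [1, 2, 3]]
print(result)  # [2, 4, 6]
```

Note that `pickle` serializes a module-level function by reference (its qualified name), which is why the executor process must also have the application JAR (or, here, the same module) available.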

In Spark standalone mode, one Spark application has (by default) at most one Executor running on each Worker node.

The Executor runs Tasks as threads on the available cores: one thread per core [CPU] for parallelization.
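The one-thread-per-core idea can be sketched with a standard thread pool. This is an analogy only, assuming a hypothetical task function `square`; inside a real Executor the threading is managed by Spark itself.

```python
from concurrent.futures import ThreadPoolExecutor
import os

def square(x):
    """A hypothetical Task run by one executor thread."""
    return x * x

# One worker thread per available core, mirroring the Executor model.
n_cores = os.cpu_count() or 2
with ThreadPoolExecutor(max_workers=n_cores) as pool:
    results = list(pool.map(square, [1, 2, 3, 4]))

print(results)  # [1, 4, 9, 16]
```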

  • An RDD is like a big array that can be broken into Partitions. Each Partition [slice] is passed to an Executor for processing.
  • A Partition is also a collection [a subset of the RDD].
  • One RDD [x1, x2, x3, x4, x5, x6] = partition1 [x1, x2, x3, x4] and partition2 [x5, x6]
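The splitting above can be sketched as a simple partitioner. This is one possible scheme (contiguous, roughly equal slices), not Spark's actual partitioning logic; the helper name `partition` is made up for the example.

```python
def partition(elements, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal slices."""
    size = -(-len(elements) // num_partitions)  # ceiling division
    return [elements[i:i + size] for i in range(0, len(elements), size)]

rdd = ["x1", "x2", "x3", "x4", "x5", "x6"]
print(partition(rdd, 2))  # [['x1', 'x2', 'x3'], ['x4', 'x5', 'x6']]
```

In real Spark the number of partitions is controlled when the RDD is created (e.g. the `numSlices` argument to `parallelize`), and partitions need not be equal in size, as the example in the notes shows.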

Executor: runs the Task on the elements of its Partition.
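Putting the pieces together, here is a toy end-to-end sketch: each "executor" applies the Task to its own partition, and the "driver" aggregates the partial results. The partition layout mirrors the [x1..x6] example above; everything else is illustrative, not Spark API.

```python
# One RDD split into two partitions, as in the example above.
partitions = [[1, 2, 3, 4], [5, 6]]

def task(x):
    """The Task sent by the driver (e.g. the function given to map)."""
    return x * 10

# Executor side: each executor runs the Task on its partition's elements.
partials = [[task(x) for x in part] for part in partitions]

# Driver side: aggregate the final results from all executors.
final = [y for part in partials for y in part]
print(final)  # [10, 20, 30, 40, 50, 60]
```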

Spark Cluster — Processes [JVM]

  1. Master [ JVM process — Single]
  2. Worker [ JVM process — Multiple]
  3. Executor [ JVM process — Multiple ]
  4. Driver [ JVM process — Single]