Spark’s Architecture and Cluster Manager
Introduction
Continuing our discussion of Spark's components, in this post I will go through Spark's architecture, focusing on the cluster manager, an independent, pluggable component with several solutions to choose from.
Spark’s Architecture
Spark is based on a master/slave architecture that runs on a cluster of nodes. The master node hosts the Spark application and acts as the driver program, which creates the SparkContext (or SparkSession) object. The SparkContext allocates the required resources on the cluster by communicating with the cluster manager, which oversees the worker nodes where the executors run.
Spark supports several cluster managers, including its standalone cluster manager, Kubernetes, Apache Hadoop YARN, and Apache Mesos.
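As a quick illustration, here is a minimal PySpark sketch of how a driver program creates the SparkSession that talks to a cluster manager. The application name and master URL are placeholders; in a real deployment the master URL would point at your cluster manager (for example "yarn", "spark://<host>:7077", or "k8s://https://<host>:6443").

```python
from pyspark.sql import SparkSession

# Placeholder master URL: "local[*]" runs everything in one JVM for testing.
spark = (
    SparkSession.builder
    .appName("architecture-demo")   # name shown by the cluster manager and Web UI
    .master("local[*]")             # which cluster manager the driver talks to
    .getOrCreate()
)

# The SparkContext lives inside the SparkSession and holds the connection
# to the cluster manager that allocated the executors.
print(spark.sparkContext.master)

spark.stop()
```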
Spark Application
As we know, Spark is a directed acyclic graph (DAG) execution engine that offers in-memory computation.
Each Spark application is converted into a DAG that the core engine executes in parallel across multiple nodes of the cluster.
A Spark application involves a few key concepts that we should know about:
- Spark application: Here you have the source code program built using Spark APIs, which logically consists of the driver and executor processes on the cluster.
- Spark driver: This is the main process of the application and is responsible for creating the SparkContext that coordinates with the cluster manager to launch executors in a cluster and schedule work tasks on them.
- Cluster manager: The external (standalone, Kubernetes, YARN, or Mesos) component that manages and facilitates the cluster resources for your Spark application is the cluster manager.
- Executors: These are the distributed processes on the worker nodes responsible for the concurrent execution of the tasks assigned to them by the driver. Each Spark application gets its own set of executors.
- Task: This entails the actual computation units of the Spark application handled by the executors.
- Job: A Spark application consists of one or more jobs triggered by Spark actions (collect, show, …), which run in parallel and are composed of multiple stages. Under the hood, the driver converts each job into a DAG; when all of its stages finish their work, the job is complete (see the sketch after this list).
- Stage: The subdivided sets of tasks of the Spark job are called stages.
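To make the job/stage/task terminology concrete, here is a small hedged sketch (the data and column name are made up): the groupBy introduces a shuffle, so the job triggered by the show() action is split into more than one stage, and each stage runs as parallel tasks, roughly one per partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-stage-task-demo").getOrCreate()

# A toy DataFrame; in a real application this would come from a distributed source.
df = spark.createDataFrame(
    [("spark",), ("yarn",), ("spark",), ("mesos",)],
    ["word"],
)

# Transformations only build up the DAG -- nothing runs yet.
counts = df.groupBy("word").count()

# The action triggers a job. The shuffle required by groupBy splits the job
# into stages, and each stage is executed as tasks on the executors.
counts.show()

spark.stop()
```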
Cluster Manager
Apache Spark is an engine for big data processing that can run in distributed mode on a cluster made up of one master and any number of workers. The cluster manager schedules and divides the resources of the host machines that form the cluster. Its primary job is to divide resources across applications, and it works as an external service for acquiring resources on the cluster.
Apache Spark supports four types of cluster managers, namely:
a) Standalone Cluster Manager
b) Hadoop YARN
c) Apache Mesos
d) Kubernetes
We will focus on two of them: Hadoop YARN and the standalone cluster manager.
Hadoop YARN
The YARN data computation framework is a combination of two components: the ResourceManager and the NodeManager.
The ResourceManager is made up of a Scheduler and an ApplicationsManager. The Scheduler allocates resources to the various running applications; its API is specifically designed to negotiate resources, not to schedule tasks, and it performs no monitoring or tracking of application status. The Scheduler performs its scheduling function based on the resource requirements of the applications, using the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, and network. The ApplicationsManager manages applications across all the nodes.
The NodeManager hosts the ApplicationMaster and the containers. A container is where a unit of work happens; each MapReduce task, for example, runs in one container. The per-application ApplicationMaster is a framework-specific library whose job is to negotiate resources from the ResourceManager and to work with the NodeManager(s) to execute and monitor the tasks.
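As a hedged sketch of how a Spark application asks YARN for containers, the configuration below requests a number of executors with a given amount of memory and cores per container. The resource values are illustrative only, and running against YARN also requires HADOOP_CONF_DIR / YARN_CONF_DIR to point at your cluster's configuration (in practice these settings are often passed to spark-submit instead of being set in code).

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-demo")
    .master("yarn")                            # let the ResourceManager allocate containers
    .config("spark.executor.instances", "2")   # how many executor containers to request
    .config("spark.executor.memory", "2g")     # memory per executor container
    .config("spark.executor.cores", "2")       # CPU cores per executor container
    .getOrCreate()
)
```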
Standalone Cluster Manager
This is a simple cluster manager that ships with Spark and makes it easy to set up a cluster that Spark itself manages. It has a master and a number of workers, each with a configured amount of memory and CPU cores. In standalone cluster mode, Spark allocates resources based on cores; by default, an application will grab all the cores in the cluster.
Each Apache Spark application has a Web UI that provides information about executors, storage usage, and the tasks running in the application. The standalone cluster manager also provides its own Web UI for viewing cluster and job statistics.
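Here is a hedged sketch for the standalone cluster manager; the master host is a placeholder for your own cluster (7077 is the standalone master's default port). Since an application grabs all available cores by default, spark.cores.max is used here to cap how many cores it takes.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("standalone-demo")
    .master("spark://master-host:7077")     # placeholder standalone master URL
    .config("spark.cores.max", "4")         # cap total cores; otherwise the app takes them all
    .config("spark.executor.memory", "1g")  # memory per executor on the workers
    .getOrCreate()
)
```

While the application runs, its Web UI is served by the driver (port 4040 by default), and the standalone master's own UI (port 8080 by default) shows cluster-wide worker and application status.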