Spark Architecture and Deployment Environment

A Spark application consists of a driver, which runs either on the client or on the application master node, and many executors, which run across the worker (slave) nodes in the cluster.

An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived server continually satisfying requests. Unlike MapReduce, an application will have processes, called Executors, running on the cluster on its behalf even when it’s not running any jobs. This approach enables data storage in memory for quick access, as well as lightning-fast task startup time.

Job of Spark Driver

The driver is responsible for creating the SparkContext, building the DAG, breaking each job into stages and tasks, and scheduling those tasks. It defines the transformations and actions applied to the data set.

At its core, the driver has instantiated an object of the SparkContext class. This object allows the driver to acquire a connection to the cluster, request resources, split the application actions into tasks, and schedule and launch tasks in the executors.

The driver first asks the application master to allocate resources for containers on the worker/slave nodes and to create the executor processes.

Once the executors are created, the driver coordinates directly with the worker nodes and assigns tasks to them.
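
To make the idea of breaking a job into stages and tasks concrete, here is a toy Python sketch. It is not Spark's scheduler. It only assumes the commonly described rule that a new stage begins wherever an operation needs a shuffle (a "wide" operation), and that each stage runs one task per partition. The function names and the `WIDE_OPS` set are illustrative assumptions.

```python
# Toy model of how a driver might cut a chain of operations into stages.
# Assumption: every shuffle ("wide") operation starts a new stage, because
# it consumes data that has been redistributed across the cluster.
WIDE_OPS = {"reduceByKey", "groupByKey", "join", "repartition"}  # illustrative subset


def split_into_stages(operations):
    """Group a linear list of operation names into stages."""
    stages = []
    current = []
    for op in operations:
        # A wide operation closes the stage built so far and opens a new one.
        if op in WIDE_OPS and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


def count_tasks(stages, num_partitions):
    """Simplification: every stage runs one task per partition."""
    return len(stages) * num_partitions


job = ["textFile", "flatMap", "map", "reduceByKey", "filter", "collect"]
stages = split_into_stages(job)
for index, stage in enumerate(stages):
    print(f"stage {index}: {' -> '.join(stage)}")
print("tasks to schedule:", count_tasks(stages, num_partitions=4))
```

In this model the word-count style job splits into two stages, with the boundary falling just before `reduceByKey`.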

Job of Executor

An executor is a JVM process that is responsible for executing tasks. Many tasks can run in parallel within a single executor.

MapReduce runs each task in its own process. When a task completes, the process goes away. In Spark, many tasks can run concurrently in a single process, and this process sticks around for the lifetime of the Spark application, even when no jobs are running.

The advantage of this model, as mentioned above, is speed: tasks can start up very quickly and process in-memory data. The disadvantage is coarser-grained resource management. When the number of executors for an app is fixed and each executor has a fixed allotment of resources, the app takes up the same amount of resources for the full duration that it’s running. (Spark’s dynamic allocation feature, enabled with spark.dynamicAllocation.enabled, can relax this by acquiring executors and giving them back as the workload changes.)
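
The long-lived executor model can be sketched with Python's standard library. This is an analogy, not Spark code: a `ThreadPoolExecutor` stands in for an executor process, its worker threads play the role of task slots, and the `ToyExecutor` class and its methods are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyExecutor:
    """Stand-in for a Spark executor: one long-lived process with fixed task slots."""

    def __init__(self, task_slots):
        # The pool is created once and reused by every job, so later tasks
        # do not pay a process start-up cost.
        self._pool = ThreadPoolExecutor(max_workers=task_slots)
        self.cache = {}  # data kept in memory between jobs

    def run_job(self, func, partitions):
        """Run one task per partition concurrently; results keep partition order."""
        return list(self._pool.map(func, partitions))

    def stop(self):
        self._pool.shutdown()


executor = ToyExecutor(task_slots=4)
partitions = [[1, 2], [3, 4], [5, 6]]

# Job 1: compute per-partition sums and keep them in memory.
executor.cache["sums"] = executor.run_job(sum, partitions)

# Job 2: runs on the same pool and reads the cached data; nothing is restarted.
doubled = executor.run_job(lambda value: value * 2, executor.cache["sums"])

executor.stop()
```

The fixed `task_slots` value also mirrors the trade-off described above: the pool holds its resources for as long as it lives, whether or not a job is running.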

Cluster Deployment

The SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources for the containers in which Spark executors run.

Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
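
Which cluster manager the SparkContext connects to is selected by the master URL. The helper below is a hypothetical illustration (the function name is invented here) that maps the commonly documented master URL forms to the cluster manager they select. The host names and ports are placeholders, and `local` is included only as the non-cluster mode used for testing.

```python
def cluster_manager_for(master_url):
    """Return the cluster manager implied by a Spark master URL."""
    if master_url.startswith("spark://"):
        return "standalone"
    if master_url.startswith("mesos://"):
        return "mesos"
    if master_url == "yarn":
        return "yarn"
    # "local", "local[4]" and "local[*]" run Spark in a single JVM,
    # without any cluster manager.
    if master_url == "local" or master_url.startswith("local["):
        return "local"
    raise ValueError(f"unrecognised master URL: {master_url!r}")


for url in ["spark://master-host:7077", "mesos://mesos-host:5050", "yarn", "local[4]"]:
    print(url, "->", cluster_manager_for(url))
```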

Submitting Application (spark-submit)

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application especially for each one.

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)

If your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the driver and the executors.
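
When spark-submit is driven from a script, the same options can be assembled programmatically. The Python sketch below builds the argument list from the template above. The function name, the example class, the JAR path, and the input file are all hypothetical, and only the flags already listed in this section are used.

```python
import shlex


def build_spark_submit(main_class, master, app_jar,
                       deploy_mode="client", conf=None, app_args=()):
    """Return the spark-submit command as a list of arguments."""
    cmd = [
        "./bin/spark-submit",
        "--class", main_class,
        "--master", master,
        "--deploy-mode", deploy_mode,  # "client" is the default
    ]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    cmd.extend(app_args)  # arguments for the application itself follow the JAR
    return cmd


cmd = build_spark_submit(
    main_class="com.example.WordCount",   # hypothetical class
    master="yarn",
    deploy_mode="cluster",                # run the driver inside the cluster
    conf={"spark.executor.memory": "2g"},
    app_jar="/path/to/wordcount.jar",     # hypothetical path
    app_args=["input.txt"],               # hypothetical argument
)
print(shlex.join(cmd))
```

Keeping the command as a list means it can be passed directly to `subprocess.run(cmd)` without shell quoting issues.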

Cluster Manager

Spark Standalone cluster

The Standalone Master is the resource manager for the Spark Standalone cluster.

A Standalone Worker (aka standalone slave) is a worker node in the Spark Standalone cluster.

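By default, standalone cluster mode allocates only one executor on each worker per application. Setting spark.executor.cores allows an application to run several executors on the same worker, provided the worker has enough cores.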

A client first connects to the standalone master and requests resources from it. The executor processes are then started on the worker nodes, and the driver runs either on the client node itself or on one of the worker nodes.

Here the client acts as the application master, which is responsible for requesting resources from the resource manager (the standalone master).

Hadoop YARN

YARN (Yet Another Resource Negotiator) is the resource management layer for the Apache Hadoop ecosystem.

YARN is a software rewrite that decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.

The Application Master oversees the full lifecycle of an application, all the way from requesting the needed containers from the Resource Manager to submitting container lease requests to the NodeManager.

Each application framework written for Hadoop must have its own Application Master implementation, and Spark provides one as well.

YARN Architecture

YARN vs Spark Standalone cluster

  • YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on a Spark application, without any changes in configuration.
  • In its default configuration, Spark standalone mode runs an executor for each application on every worker node in the cluster, whereas with YARN, you choose the number of executors to use.
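
The difference in executor placement can be put in numbers. The Python sketch below is a deliberately simplified model, not Spark's or YARN's actual scheduler. It assumes identical nodes, ignores memory limits and any cap on total cores, and treats standalone mode as one executor per worker unless an executor core size (corresponding to the spark.executor.cores setting) is given. The function names are hypothetical.

```python
def standalone_executors(num_workers, cores_per_worker, executor_cores=None):
    """Executors one application gets on a standalone cluster (simplified)."""
    if executor_cores is None:
        # Default behaviour in this model: one executor on every worker.
        return num_workers
    # With a fixed executor size, each worker hosts as many as fit in its cores.
    return num_workers * (cores_per_worker // executor_cores)


def yarn_executors(requested, num_nodes, cores_per_node, executor_cores):
    """Executors one application gets on YARN: the requested count, capped by capacity."""
    capacity = num_nodes * (cores_per_node // executor_cores)
    return min(requested, capacity)


# A hypothetical cluster of 5 nodes with 16 cores each.
print(standalone_executors(5, 16))                    # one executor per worker
print(standalone_executors(5, 16, executor_cores=4))  # several per worker
print(yarn_executors(8, 5, 16, executor_cores=4))     # exactly what was asked for
```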