Understanding Spark Deployment Modes: Client vs Cluster vs Local

Sephinreji
Feb 14, 2023


Spark cluster modes refer to the different ways in which Spark can be deployed on a cluster of machines to process large-scale data. Spark is a distributed computing framework that can run in several modes, ranging from a single machine to a large-scale cluster of machines.

Before diving into each mode, we need to know some basic terms:

  1. Resource Manager: In Spark, the resource manager is responsible for allocating resources to Spark applications running on a cluster. Spark has a pluggable architecture that allows different cluster managers to be used as resource managers, such as Apache Mesos, Hadoop YARN, and Kubernetes.
  2. Application Master: The Application Master is a program responsible for managing a specific Spark application running on a cluster. Its major responsibilities include allocating resources, coordinating tasks, handling fault tolerance, and reporting the status of the Spark application to the driver program.
  3. Container: Each container represents an allocation of resources (CPU and memory) on a node in the cluster, within which a process such as an executor runs its tasks. Containers are created by the resource manager, such as YARN or Mesos.
  4. Executor: In Spark, an executor is a process that runs on a worker node in a cluster and is responsible for executing tasks. Executors are launched by the Application Master and are responsible for processing data and performing computations.

Client Mode:

When you start a Spark shell, the application driver creates the Spark session on your local machine, which requests the Resource Manager in the cluster to create a YARN application. The YARN Resource Manager then starts an Application Master (AM container). In client mode, the Application Master acts only as the executor launcher: it reaches out to the Resource Manager to request further containers, and the Resource Manager allocates them.

Note: In all the images below, the box containing the Resource Manager, Application Master, and containers/executors is considered the cluster.

Yarn Cluster

Once the containers are allocated with the requested configurations, the Application Master launches an executor in each container. Each executor runs in its own JVM and can run multiple tasks concurrently. Executors can be added or removed dynamically during the execution of a Spark application, depending on the available resources in the cluster.

After this setup, the executors communicate directly with the driver, which runs on the machine from which you submitted the Spark application.

In this mode, the client has more control over the application. This enables user interaction with the application while it is running, progress monitoring, and modification of the application as needed. Additionally, it gives the user direct access from the client machine to the application's output and logs, which can be helpful for debugging and troubleshooting.

Client mode is frequently used in interactive settings where the user has to engage with the Spark application and get real-time feedback, such as Jupyter notebooks or development environments. It can also be helpful in situations where the user wants to execute a short-lived Spark job with no need for ongoing monitoring.

The major disadvantage of client mode is that the driver runs on the client machine, so if that machine is shut down or any issue occurs on it, the entire application run is affected.

Client mode is not suitable for production. It is useful for debugging, because you can see the output directly in your terminal.
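As a sketch, a client-mode submission to a YARN cluster looks like the following. The script name and resource sizes are placeholders; adjust them to your application:

```shell
# Submit a PySpark script in client mode on YARN.
# The driver runs on this machine; executors run in cluster containers.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  my_app.py   # placeholder application script
```

Because the driver stays on your machine, any `print` or log output from the driver appears directly in your terminal.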

Cluster Mode:

In cluster mode, there is one key difference from client mode: the placement of the driver. Here the Application Master creates the driver inside itself, and the driver reaches out to the Resource Manager for containers.

Yarn Cluster

For production systems, cluster mode is the best choice, since the driver runs inside the cluster and is not tied to the client machine.
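The submission differs from client mode only in the `--deploy-mode` flag; again, the script name and resource sizes below are placeholders:

```shell
# Submit in cluster mode on YARN.
# The driver runs inside the Application Master in the cluster,
# so the client machine can disconnect after submission.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  my_app.py   # placeholder application script
```

Note that driver output no longer appears in your terminal; you retrieve it from the cluster logs (for example, via `yarn logs -applicationId <app_id>`).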

Local Mode:

In local mode, Spark runs on a single machine, using one or more of that machine's cores. It is the simplest mode of deployment and is mostly used for testing and debugging.

Here you don't need to worry much about complex cluster setups or configurations. This is a good choice for users who are just starting with Spark, or for those who need to test small-scale data processing workflows.

Conclusion:

Spark provides several deployment modes, and you can choose one based on your requirements and use case. Understanding the advantages and disadvantages of each mode is important for a developer before proceeding with a given use case.

For more references:

Lineage vs Dag: https://medium.com/@sephinreji98/lineage-vs-dag-109e921d7d48

Apache Spark vs Hadoop: https://medium.com/@sephinreji98/spark-vs-hadoop-47715dc3fd16
