Apache Spark Architecture
Spark is an open-source distributed computing framework that processes Big Data in less time by working in parallel across a cluster of commodity machines.
Spark does not manage clusters, nor does it run any nodes itself; instead, it provides the instructions on how to work with data at large scale.
Distributed computing is a paradigm for executing applications in parallel on commodity hardware, much like ants (computers) working together to move larger weights (big data). Instead of using a supercomputer to solve big data problems, the same work can be performed on many smaller machines: there is no practical limit on how many machines can be added, whereas a supercomputer's computing performance is bounded.
Spark resolves a user application into a distributed Spark plan. Spark does not manage the cluster of workers; they are managed by cluster managers such as YARN, Mesos, and Kubernetes, to name a few. Spark has two types of nodes:
- Driver
- Executor
DataFrame, Dataset and RDD
Spark holds data in memory and executes the requested commands on it, which makes it a lot faster than MapReduce applications. Spark can hold data in memory in three different forms:
- RDD (Resilient Distributed Dataset)
- Dataset
- DataFrame
RDD
An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. The RDD is the building block of a Spark application: all data is internally stored as RDDs, and any operation used on a DataFrame or Dataset is internally converted into RDD operations.
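A minimal sketch, assuming a SparkSession named `spark` is already available: an RDD can be created and transformed through the underlying SparkContext.

```scala
// Minimal RDD sketch; assumes an existing SparkSession called `spark`.
val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations such as map are evaluated lazily.
val squares = numbers.map(n => n * n)

// An action such as collect triggers execution and returns the result to the driver.
println(squares.collect().mkString(", ")) // 1, 4, 9, 16, 25
```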
Dataset
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations.
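For illustration, a Dataset can be built from a case class (the Person type below is hypothetical, used only for this sketch), so each row carries compile-time type information:

```scala
import org.apache.spark.sql.Dataset

// Hypothetical domain class for illustration.
case class Person(name: String, age: Int)

import spark.implicits._ // provides encoders for case classes

val people: Dataset[Person] = Seq(Person("Ada", 36), Person("Linus", 29)).toDS()

// Functional, type-safe operations; `p.age` is checked at compile time.
val adults = people.filter(p => p.age >= 30)
adults.show()
```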
DataFrame
DataFrame is an abstraction layer on top of Dataset, a Dataset of generic rows. It is the most widely used of the three and has the most optimized functions to work with. Its SQL-like APIs make it a lot easier to work with and leave less room for error. Users can also write custom functions with the help of UDFs (User Defined Functions) and UDAFs (User Defined Aggregate Functions).
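A short sketch of the DataFrame API with a user-defined function; the column names and the UDF below are made up for illustration, and the same `spark` session is assumed:

```scala
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// Hypothetical sample data for illustration.
val orders = Seq(("A-1", 120.0), ("A-2", 80.0)).toDF("order_id", "amount")

// A UDF that flags large orders; built-in functions are preferred when one exists.
val isLarge = udf((amount: Double) => amount > 100.0)

orders
  .withColumn("is_large", isLarge(col("amount")))
  .select("order_id", "is_large")
  .show()
```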
Spark Plan
Spark APIs are lazily evaluated; Spark doesn't execute anything until it encounters an action. For example, reading a Parquet file and performing data transformations will not trigger Spark execution. Instead, all the queries are noted down by Spark to build a plan, and an action has to be called to execute them.
A few actions on a DataFrame are show, collect, and write. Spark executes all of the instructions in the requested order; under the hood, it does a lot of optimization to make execution faster and might change the order of execution.
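A minimal sketch of lazy evaluation; the file path and column name below are assumptions for illustration:

```scala
import org.apache.spark.sql.functions.col

// Nothing is executed yet: read and filter are only recorded in the plan.
val events = spark.read.parquet("/data/events.parquet") // hypothetical path
val recent = events.filter(col("year") >= 2023)         // hypothetical column

// Only the action below triggers the actual execution of the plan.
recent.show(10)
```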
Spark first converts the queries into an unresolved logical plan; with the help of the catalog, the unresolved plan is resolved into a logical plan. The logical plan is then optimized and one or more physical plans are formed. Based on a cost model, the best physical plan is selected and executed on the data in the form of tasks.
Key takeaways: the logical plan doesn't contain instructions on how to run the queries on RDDs, while the physical plan does. The catalog is a registry of metadata, such as the tables and columns available to the DataFrame, used to resolve the plan.
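These plans can be inspected directly; calling explain with the extended flag prints the parsed, analyzed, optimized, and physical plans (reusing the hypothetical `recent` DataFrame from the earlier sketch):

```scala
// Prints the Parsed Logical Plan, Analyzed Logical Plan,
// Optimized Logical Plan, and Physical Plan for the query.
recent.explain(true)
```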
Cluster Manager & Nodes
The Cluster Manager (Master) is an application that manages the lifecycle of nodes (Workers). It creates or destroys nodes as per user requests or application needs. Nodes are instances that perform the tasks assigned by Spark.
Spark requests the cluster manager to provision the driver, and the driver in turn requests the cluster manager to provision the worker nodes. Later, the driver assigns jobs to the executors, which are created on top of the worker nodes.
Driver
The Spark Driver is responsible for assigning work to the executors and tracking the status of that work, but creating and maintaining the executors is not the driver's responsibility. When a Spark application is submitted to the cluster manager, a new driver is created, and the SparkSession lives on the driver node. The driver can be created outside of the cluster or within it; there are different modes for submitting a Spark application:
- Cluster Mode
- Client Mode
- Standalone Mode
Cluster Mode
The Driver is allocated within the cluster and co-resides with the executors, making it one of the best combinations to work with: the driver can leverage the cluster's internal network when communicating with the executors. This mode is the most preferred among Spark users.
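As a rough sketch, cluster mode is usually selected at submission time through spark-submit's --deploy-mode flag; the master URL, main class, JAR path, and resource sizes below are placeholders:

```
# Placeholder class, JAR, and resource sizes; swap in real values.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --executor-memory 4g \
  --executor-cores 2 \
  path/to/my-app.jar
```

Switching --deploy-mode to client would run the same application in client mode, described next.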
Client Mode
In client mode, the driver runs on the client machine (where spark-submit is invoked), sends the instructions to the cluster manager, and communicates with the executors from outside the cluster network.
Standalone Mode
This mode mimics a distributed system on a single machine. Its sole purpose is to test and develop Spark applications locally on a personal computer. Both the driver and the executors run on the same machine, using threads to mimic core-level parallelism.
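A minimal sketch of this single-machine setup, using a local master URL so that threads on one machine act as the executors (the application name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// local[*] uses as many threads as there are cores on this machine.
val spark = SparkSession.builder()
  .appName("local-dev")   // arbitrary name for illustration
  .master("local[*]")
  .getOrCreate()

spark.range(1, 100).count() // runs entirely on the local threads
spark.stop()
```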
Executors
Executors are virtual instances of a Spark application that run on a node; a node can host one or more executors. An executor's memory and cores can be configured through the Spark configuration. Executors run tasks, which are the basic blocks of Spark execution.
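For example, executor resources can be set through standard Spark configuration properties in another standalone sketch; the values below are arbitrary and would need tuning for a real cluster:

```scala
import org.apache.spark.sql.SparkSession

// Arbitrary example values; tune them to the actual cluster.
val spark = SparkSession.builder()
  .appName("executor-config-example")
  .config("spark.executor.instances", "4") // number of executors (YARN/K8s)
  .config("spark.executor.memory", "8g")   // memory per executor
  .config("spark.executor.cores", "4")     // cores (task slots) per executor
  .getOrCreate()
```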
Execution: Job, Stage & Task
The task is the basic execution unit of a Spark application. By default, an individual compute core is assigned to each task; this can be changed through Spark configuration options for optimization purposes. All the tasks are packed into a compact unit called a stage, and multiple stages are combined into a job.
Spark creates separate stages whenever a shuffle operation is involved. A shuffle is a data exchange across the executors and is performed when an operation such as groupBy or repartition is used.
A shuffle generally means a task can't be performed with its existing data partition alone; the task needs data from other partitions to perform the requested action.
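A small sketch of how these pieces line up, assuming the `spark` session and illustrative column names: one action produces one job, and the shuffle introduced by groupBy splits that job into two stages.

```scala
import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._

// Hypothetical data for illustration.
val sales = Seq(("US", 10), ("US", 20), ("EU", 5)).toDF("region", "amount")

// Stage 1: narrow work (filter) on the existing partitions.
// Stage 2: aggregation per region, after the shuffle caused by groupBy.
val perRegion = sales
  .filter(col("amount") > 0)
  .groupBy("region")
  .agg(sum("amount").as("total"))

perRegion.collect() // the action that submits the job (one job, two stages in the Spark UI)
```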
Wide & Narrow Transformations
Narrow transformations are operations that don't require a full data shuffle. Operations like filter, select, where, and limit are narrow transformations.
On the other hand, wide transformations are operations that require a full data shuffle. Operations like groupBy, repartition, and join are wide transformations.
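Reusing the hypothetical `sales` DataFrame and the `col` import from the previous sketch, the same distinction can be annotated inline:

```scala
val result = sales
  .filter(col("amount") > 0)   // narrow: works within each partition
  .select("region", "amount")  // narrow: no data movement
  .repartition(8)              // wide: full shuffle across executors
  .groupBy("region")           // wide: shuffle to bring each key together
  .sum("amount")
```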
Cores & Slots
Spark uses the executors on the worker nodes to get things rolling fast; a worker node can have one or more executors. Each executor is allocated a number of cores, and Spark creates a slot per core in which a task can be executed. The default configuration is one core per slot working on one task, so the number of slots is defined by the cores available in an executor.
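A back-of-the-envelope sketch of the resulting parallelism under assumed settings (the numbers are illustrative, not defaults):

```scala
// Illustrative values, not defaults.
val executorInstances = 4  // spark.executor.instances
val coresPerExecutor  = 4  // spark.executor.cores
val cpusPerTask       = 1  // spark.task.cpus (default)

// Each executor exposes cores / cpusPerTask slots, so the cluster
// can run executorInstances * slotsPerExecutor tasks at the same time.
val slotsPerExecutor = coresPerExecutor / cpusPerTask       // 4
val maxParallelTasks = executorInstances * slotsPerExecutor // 16
println(s"Up to $maxParallelTasks tasks can run in parallel")
```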
When it comes to developer friendliness, Spark is one of the best. Check out the official Spark documentation for more details.