Among the many technical frameworks in the big data space, Spark has been widely adopted since its release. Hadoop and Spark are arguably the mainstream choices for enterprise data platforms: depending on the application scenario, either can serve as the foundation of a big data system. Today we will focus on Spark, and specifically on its core running process.
Spark computing mode
Spark is a newer-generation computing framework that emerged after Hadoop. Like Hadoop, it focuses on offline batch processing, but compared with Hadoop's native computing engine, MapReduce, it achieves a 10–100x performance improvement. It has therefore gradually displaced the native MapReduce engine within the Hadoop ecosystem.
Spark inherits key ideas from Hadoop MapReduce and is a typical Master/Worker architecture. In this architecture, a computing job is divided into tasks that are distributed to multiple workers (the Map step); after the workers finish their assigned tasks, the partial results are aggregated back at the Master (the Reduce step). This is the core idea of MapReduce.
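The Map/Reduce idea can be illustrated with a toy word count in plain Python. This is a hypothetical sketch of the concept, not Spark's actual implementation: the "workers" each run a map task over one split of the input, and the "master" reduces (merges) the partial results.

```python
from collections import Counter

def map_task(split):
    """Run on a worker: count words in one split of the input."""
    return Counter(split.split())

def reduce_on_master(partials):
    """Run on the master: merge the partial counts from all workers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

splits = ["spark spark hadoop", "hadoop mapreduce", "spark"]
partials = [map_task(s) for s in splits]   # map phase, one task per split
result = reduce_on_master(partials)        # reduce phase on the master
print(result["spark"])  # 3
```

In a real cluster the splits live on different machines and the map tasks run in parallel; the shape of the computation, however, is exactly this.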
Spark running process
When a Spark application starts, its Driver creates a SparkContext. The purpose of creating the SparkContext is to prepare the application's running environment: the SparkContext is responsible for communicating with the ClusterManager, applying for resources, assigning tasks, and monitoring their execution.
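The resource-application part of these responsibilities can be sketched with a toy, hypothetical model in which a context asks a cluster manager for executor slots. The class names here (`ToyClusterManager`, `ToyContext`) are invented for illustration and are not part of Spark's API.

```python
class ToyClusterManager:
    """Invented stand-in for a cluster manager (e.g. standalone master)."""
    def __init__(self, total_slots):
        self.free_slots = total_slots

    def request_executors(self, wanted):
        """Grant up to `wanted` executor slots from the free pool."""
        granted = min(wanted, self.free_slots)
        self.free_slots -= granted
        return granted

class ToyContext:
    """Invented stand-in for the context that prepares the environment."""
    def __init__(self, manager):
        self.manager = manager
        self.executors = 0

    def prepare_environment(self, wanted):
        """Apply for resources, as a real SparkContext does at startup."""
        self.executors = self.manager.request_executors(wanted)
        return self.executors

cm = ToyClusterManager(total_slots=4)
ctx = ToyContext(cm)
print(ctx.prepare_environment(wanted=6))  # 4 -- only 4 slots were free
```

The point of the sketch is the division of labor: the context asks, the cluster manager decides what it can actually grant.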
The name Driver is apt: once the application starts, the Driver drives the entire run. The user's job is decomposed by the Driver into smaller units of work, which the Driver then schedules for execution.
After resources are granted, Spark typically asks the resource manager to start its worker processes inside containers; these are the Executor processes. The launch command carries the Driver's URL, so that each Executor can register itself with the Driver once it starts.
Once the Executors have registered with the Driver, both sides know about each other and can communicate and interact according to the protocol; at that point the whole distributed system is up and running.
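The registration handshake can be sketched as follows. This is a hypothetical toy model: real Spark does this over RPC, while here a plain method call stands in for the network message, and all names (`ToyDriver`, `ToyExecutor`) are invented for illustration.

```python
class ToyDriver:
    """Invented stand-in for the driver's registration endpoint."""
    def __init__(self):
        self.registered = {}

    def register_executor(self, executor_id, host):
        """Handle a registration message from an executor."""
        self.registered[executor_id] = host
        return "ACK"  # acknowledge, completing the handshake

class ToyExecutor:
    """Invented stand-in for an executor process."""
    def __init__(self, executor_id, host, driver):
        # In real Spark the launch command carries the driver's URL;
        # here we simply hold a direct reference to the driver object.
        self.executor_id = executor_id
        self.host = host
        self.driver = driver

    def start(self):
        """On startup, register with the driver from the launch command."""
        return self.driver.register_executor(self.executor_id, self.host)

driver = ToyDriver()
ack = ToyExecutor("exec-1", "node-a", driver).start()
print(ack, driver.registered)  # ACK {'exec-1': 'node-a'}
```

After this exchange the driver has a live view of its executors, which is what makes task assignment in the next step possible.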
The Driver and the Executors are connected directly to each other via an RPC protocol. Historically, Spark has used two RPC implementations internally: an Akka Actor-based one and a Netty-based one.
The Executor is the component that actually does the work: once it receives a Task, it runs it and then reports the result back to the Driver.
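The run-and-report cycle can be sketched with the same kind of toy model. In real Spark a Task is a serialized unit of work shipped over RPC; in this hypothetical sketch a plain Python function stands in for it.

```python
class ToyDriver:
    """Invented stand-in for the driver's result-collection side."""
    def __init__(self):
        self.results = {}

    def report_result(self, task_id, value):
        """Collect a finished task's result, as the real Driver does."""
        self.results[task_id] = value

class ToyExecutor:
    """Invented stand-in for an executor running tasks."""
    def __init__(self, driver):
        self.driver = driver

    def run_task(self, task_id, func, data):
        """Run the task, then report its result back to the driver."""
        value = func(data)
        self.driver.report_result(task_id, value)

driver = ToyDriver()
executor = ToyExecutor(driver)
executor.run_task("task-0", sum, [1, 2, 3])
executor.run_task("task-1", len, [1, 2, 3])
print(driver.results)  # {'task-0': 6, 'task-1': 3}
```

The driver never computes the results itself; it only hands out tasks and gathers what comes back, which is exactly the division of labor described above.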
The Driver and each Executor run as their own JVM processes, which may be located on the same machine or on different machines.
Spark resource management
As for the resource manager, there are several options: Spark's own built-in resource manager (standalone mode), or a more general one such as YARN or Mesos. This is why Spark can run independently on its own, or be integrated with and coordinated by a Hadoop cluster.
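The choice of cluster manager surfaces in the master URL passed when an application is submitted. The following configuration examples are illustrative only; host names, ports, and the application file are placeholders.

```shell
# Standalone mode: Spark's own built-in resource manager
spark-submit --master spark://master-host:7077 app.py

# YARN: let Hadoop's resource manager allocate the containers
spark-submit --master yarn app.py

# Mesos
spark-submit --master mesos://mesos-host:5050 app.py

# Local mode: run everything in a single JVM, useful for development
spark-submit --master local[*] app.py
```

The rest of the running process described above is the same in every mode; only who grants the containers for the Executors changes.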
That covers Spark's core running process; I hope today's sharing has given everyone a clearer picture of it. Spark is a core framework that anyone working in big data must master, and mastering it means a firm grasp of its operating principles and architecture design as well as proficiency in actually using it.