Essential concepts in Spark

Spark Core is the basic execution engine of the Spark platform; all other functionality is built on top of it. It provides in-memory computation to improve speed as well as a general execution model that supports a wide range of applications, and it exposes APIs in Java, Scala and Python for application development. Spark Core is built on a unified abstraction, the RDD, which lets the various Spark components be combined freely, so different components can be used in the same application to complete complex big data processing tasks.
What is an RDD
RDD (Resilient Distributed Dataset) was originally designed to address two classes of applications that existing computing frameworks handle inefficiently: iterative algorithms and interactive data mining. In both scenarios, keeping data in memory can improve performance by several orders of magnitude. Iterative algorithms, such as PageRank, K-means clustering and logistic regression, often need to reuse intermediate results. The other scenario, interactive data mining, involves running multiple ad hoc queries on the same data set. In frameworks such as Hadoop MapReduce, intermediate results are written to an external storage system (such as HDFS), which adds extra data replication, disk I/O and serialization, and therefore increases the overall workload of the application.
RDDs enable data reuse across many applications. An RDD is a fault-tolerant, parallel data structure that lets users explicitly persist intermediate results in memory and control data placement through partitioning. In addition, RDDs support a rich set of operators that users can easily apply to manipulate them.
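As an illustration of explicit in-memory persistence, here is a minimal Scala sketch; the numbers, the local master setting and the partition count are placeholders chosen for illustration, not taken from the original text. It caches an intermediate RDD so that two subsequent actions reuse it instead of recomputing it from scratch:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PersistExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build an RDD and keep the filtered intermediate result in memory,
    // so both actions below reuse it instead of recomputing the lineage.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 8)
    val evens = numbers.filter(_ % 2 == 0).persist(StorageLevel.MEMORY_ONLY)

    println(evens.count()) // first action materializes and caches the RDD
    println(evens.sum())   // second action reads the cached partitions

    sc.stop()
  }
}
```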
Basic concept
An RDD is a distributed collection of objects: a read-only, partitioned collection of records. Each RDD can be divided into multiple partitions, and different partitions can be stored on different cluster nodes. The RDD is a highly restricted shared-memory model; because it is a read-only collection of partitioned records, it cannot be modified after creation. There are two ways to create an RDD: build one from physically stored data, or derive a new RDD from an existing one through transformation operations such as map, filter and join.
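A minimal sketch of the two creation paths might look like the following; the input path and the specific transformations are illustrative assumptions only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddCreation").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1) Create an RDD from physically stored data
    //    ("data/input.txt" is a placeholder path for illustration).
    val lines = sc.textFile("data/input.txt")

    // 2) Create new RDDs by transforming existing ones; each
    //    transformation returns a new, read-only RDD.
    val words     = lines.flatMap(_.split("\\s+"))
    val longWords = words.filter(_.length > 3)
    val counts    = longWords.map(w => (w, 1)).reduceByKey(_ + _)

    counts.take(10).foreach(println) // action that triggers the computation
    sc.stop()
  }
}
```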
- Spark is written in Scala, so studying and analyzing its source code is a good way to learn Spark in depth.
- Writing Spark programs in Scala is relatively easy and convenient; the code is concise and often more efficient than the Java equivalent.

Apache Spark is a fast, general-purpose, scalable, fault-tolerant, memory-based big data analysis engine for iterative computation. It is worth emphasizing that Spark is a computing engine that processes data, not a storage system.
Spark RDD and Spark SQL
Spark RDD and Spark SQL are mostly used in offline (batch) scenarios. Spark RDD can handle both structured and unstructured data, whereas Spark SQL handles structured data and internally processes distributed data through the Dataset abstraction.
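To make the contrast concrete, here is a hedged sketch in which the Person case class and sample rows are made up for illustration; it filters the same data once through the RDD API and once through Spark SQL on a Dataset:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

object RddVsSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddVsSql").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 34), Person("Bob", 19), Person("Eve", 45))

    // RDD style: works on arbitrary objects, no schema required.
    val rdd = spark.sparkContext.parallelize(people)
    val adultsRdd = rdd.filter(_.age >= 21).map(_.name).collect()

    // Spark SQL style: structured data with a schema, queried as a Dataset.
    val ds = people.toDS()
    ds.createOrReplaceTempView("people")
    val adultsSql = spark.sql("SELECT name FROM people WHERE age >= 21").collect()

    println(adultsRdd.mkString(", "))
    adultsSql.foreach(println)
    spark.stop()
  }
}
```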
Spark Streaming and Structured Streaming
Both are used for stream processing, but it should be emphasized that Spark Streaming processes data in micro-batches. Even though Structured Streaming improves latency, compared with record-at-a-time engines such as Flink and Storm, Spark's streaming is, for the time being, near-real-time rather than true real-time processing.
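A minimal Structured Streaming sketch, assuming a local socket source (e.g. started with `nc -lk 9999`; the host and port are placeholders), shows the micro-batch style of a streaming word count:

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingWordCount").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read a text stream from a local socket; host and port are placeholders.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Each micro-batch of lines is split into words and counted.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```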
MLlib
Used for machine learning; PySpark also supports Python-based applications for data processing.
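As a small illustration, the following sketch fits a logistic regression model with MLlib's DataFrame-based API on a tiny made-up data set:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MllibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MllibSketch").master("local[*]").getOrCreate()

    // A tiny made-up training set of (label, features) rows.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Logistic regression is one of the iterative algorithms mentioned
    // earlier that benefits from in-memory computation.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)
    println(s"Coefficients: ${model.coefficients}")

    spark.stop()
  }
}
```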
GraphX
Used for graph computation.
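For illustration, a minimal GraphX sketch (the vertices and edges below are made up) builds a tiny graph and runs PageRank, the iterative algorithm mentioned earlier:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphxSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A tiny made-up graph: vertices carry names, edges carry weights.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(3L, 1L, 0.5)))
    val graph = Graph(vertices, edges)

    // PageRank iterates until the ranks converge within the given tolerance.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach(println)

    sc.stop()
  }
}
```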
SparkR
Data processing and statistical analysis based on the R language.
Features of Spark
- Fast
Spark has a DAG execution engine and performs iterative computation on data in memory: intermediate results of an analysis can be kept in memory, so there is no need to repeatedly read and write data from external storage systems. This makes Spark better suited than MapReduce to scenarios that require iterative operations, such as machine learning and data mining.
- Ease of use
Supports multiple languages such as Scala, Java, Python and R; supports many high-level operators (currently more than 80), allowing users to quickly build different applications; supports interactive queries through shells such as the Scala and Python shells.
- General
Spark emphasizes a one-stop solution that integrates batch processing, stream processing, interactive query, machine learning and graph computation, avoiding the resource waste of deploying a separate cluster for each computing scenario.
- Good fault tolerance
Distributed data sets are made fault tolerant through lineage and checkpoints: when an operation in the lineage chain fails, the computation does not need to be redone from the beginning but can recover from a checkpoint stored on HDFS (see the sketch after this list).
- Strong compatibility
Spark can run on YARN, Kubernetes, Mesos and other resource managers, provides the Standalone mode as a built-in resource scheduler, and supports multiple data sources.
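The fault-tolerance point can be illustrated with a minimal checkpoint sketch; the checkpoint directory below is a local placeholder, whereas a real cluster would normally point at an HDFS path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Directory for checkpoint data; on a real cluster this would
    // typically be an HDFS path (the path below is a placeholder).
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // Checkpointing cuts the lineage of a derived RDD, so recovery can
    // restart from the saved data instead of recomputing everything
    // from the original source.
    val base = sc.parallelize(1 to 100000)
    val derived = base.map(_ * 2).filter(_ % 3 == 0)
    derived.checkpoint()

    println(derived.count()) // the action materializes and checkpoints the RDD
    sc.stop()
  }
}
```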