Essential concepts in Spark

Sajjad Hussain
Jan 11 · 4 min read

The Apache Spark core is the basic execution engine of the Spark platform; all other functionality is built on top of it. It not only provides in-memory computing to improve speed, but also a general execution model that supports a wide variety of applications. Users can develop applications with the Java, Scala, and Python APIs. Spark Core is built on a unified abstraction, the RDD, which allows the various components of Spark to be combined freely, so different components can be used in the same application to complete complex big data processing tasks.

What is RDD

RDD (Resilient Distributed Dataset) was originally designed to address two classes of applications that existing computing frameworks handle inefficiently: iterative algorithms and interactive data mining. In both scenarios, keeping data in memory can improve performance by several orders of magnitude. Iterative algorithms, such as PageRank, k-means clustering, and logistic regression, often need to reuse intermediate results. Interactive data mining, such as running multiple ad hoc queries on the same data set, has the same need. In frameworks such as Hadoop MapReduce, intermediate results are written to external storage (such as HDFS), which adds extra data replication, disk I/O, and serialization overhead, increasing the workload of the application.
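The benefit of in-memory reuse for iterative algorithms can be illustrated without Spark at all. The following is a plain-Python sketch (not Spark code; `parse` is a hypothetical stand-in for an expensive load from external storage such as HDFS):

```python
# Sketch (plain Python, not Spark): why iterative algorithms benefit from
# keeping an intermediate dataset in memory instead of reloading it each pass.

def parse(raw_records):
    """Stand-in for an expensive load-and-deserialize step (e.g. reading HDFS)."""
    return [float(r) for r in raw_records]

raw = ["1.0", "2.0", "3.0", "4.0"]

def mean_without_cache(iterations):
    """Without reuse: every iteration repeats the expensive parse."""
    results = []
    for _ in range(iterations):
        data = parse(raw)              # re-done each time, like re-reading HDFS
        results.append(sum(data) / len(data))
    return results

def mean_with_cache(iterations):
    """With reuse: parse once, keep the intermediate result in memory."""
    data = parse(raw)                  # analogous to persisting an RDD in memory
    return [sum(data) / len(data) for _ in range((iterations))]

print(mean_with_cache(3))  # [2.5, 2.5, 2.5]
```

Both functions compute the same answers; the cached version simply pays the load cost once, which is exactly the saving an in-memory RDD offers an iterative job.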

RDDs enable data reuse in many applications. An RDD is a fault-tolerant, parallel data structure that lets users explicitly persist intermediate results in memory and optimize their storage through partitioning. In addition, RDDs support a rich set of operators that users can easily apply to manipulate them.

Basic concept

An RDD is a read-only, partitioned collection of records distributed across the cluster: each RDD can be divided into multiple partitions, and different partitions are stored on different cluster nodes. The RDD is a highly restricted shared-memory model; because it is read-only, it cannot be modified in place. There are two ways to create an RDD: from physically stored data, or by deriving a new RDD from an existing one through transformation operations such as map, filter, and join.
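The read-only property and the "create by transformation" path can be sketched in plain Python. `MiniRDD` below is a hypothetical toy class, not the Spark API: its map and filter return new collections and never modify the original, just as RDD transformations do.

```python
# Sketch (plain Python, not the Spark API): an RDD-like read-only collection
# whose map/filter return *new* datasets instead of modifying the original.

class MiniRDD:
    def __init__(self, records):
        self._records = tuple(records)   # immutable: an RDD cannot be modified

    def map(self, f):
        # transformation: produces a brand-new dataset
        return MiniRDD(f(r) for r in self._records)

    def filter(self, pred):
        return MiniRDD(r for r in self._records if pred(r))

    def collect(self):
        # action: materialize the records
        return list(self._records)

base = MiniRDD([1, 2, 3, 4])                                   # from stored data
derived = base.map(lambda x: x * 10).filter(lambda x: x > 15)  # by transformation

print(base.collect())     # [1, 2, 3, 4]  -- unchanged
print(derived.collect())  # [20, 30, 40]
```

Note that `base` is untouched after the transformations: every operation yields a new dataset, which is what makes lineage-based fault tolerance possible.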

  1. Spark is written in Scala, so to learn Spark well it helps to study and analyze its source code.
  2. Writing Spark programs in Scala is relatively easy and convenient: the code is concise and often more efficient than the Java equivalent.

Apache Spark is a fast, general-purpose, scalable, fault-tolerant, memory-based iterative big data analysis engine. Note that Spark is a computing engine that processes data, not a storage system.

Spark RDD and Spark SQL

Spark RDD and Spark SQL are mostly used in offline (batch) scenarios. Spark RDDs can handle structured or unstructured data, while Spark SQL handles structured data, processing distributed data sets internally through Datasets.
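The structured/unstructured distinction can be illustrated with a plain-Python sketch (hypothetical data, not Spark code): RDD-style code parses raw text ad hoc with ordinary functions, while Spark SQL-style code operates on records with named columns.

```python
# Sketch (plain Python): the structured/unstructured distinction.
# Spark SQL works on records with a known schema; an RDD can also hold raw text.

# "RDD-style": unstructured lines, parsed ad hoc inside the query itself.
lines = ["alice,34", "bob,29", "carol,41"]
over_30_rdd_style = [l.split(",")[0] for l in lines if int(l.split(",")[1]) > 30]

# "Spark SQL-style": structured rows, filtered by named column.
rows = [{"name": "alice", "age": 34}, {"name": "bob", "age": 29},
        {"name": "carol", "age": 41}]
over_30_sql_style = [r["name"] for r in rows if r["age"] > 30]

print(over_30_rdd_style)  # ['alice', 'carol']
print(over_30_sql_style)  # ['alice', 'carol']
```

Both yield the same answer, but only the structured form exposes column names and types, which is what lets Spark SQL plan and optimize queries.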

Spark Streaming and Structured Streaming

Both are used for stream processing, but note that Spark Streaming processes data in micro-batches. Even though Structured Streaming improves latency, for the time being Spark's streaming remains near-real-time processing when compared with record-at-a-time engines such as Flink and Storm.
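The micro-batch model described above can be sketched in plain Python (a conceptual illustration, not the Spark Streaming API): the stream is chopped into small batches and each batch is processed as a unit.

```python
# Sketch (plain Python): micro-batch processing in the style of Spark Streaming.
# The incoming stream is split into small batches, and each batch is processed
# as a unit, rather than record-by-record as in Flink or Storm.

def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # hand a full micro-batch to the batch engine
            batch = []
    if batch:
        yield batch              # flush the final partial batch

stream = range(1, 8)             # stand-in for an unbounded source
sums = [sum(b) for b in micro_batches(stream, 3)]
print(sums)  # [6, 15, 7]  -- one result per micro-batch
```

Results arrive once per batch rather than per record, which is why micro-batching trades a little latency for the simplicity of reusing the batch execution engine.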

Spark MLlib

Used for machine learning; PySpark also supports Python-based applications for data processing.

Spark GraphX

Used for graph computation.

SparkR

Used for data processing and statistical analysis based on the R language.

Features of Spark

  • Ease of use
    Supports multiple languages such as Scala, Java, Python, and R; provides more than 80 high-level operators, allowing users to quickly build different applications; supports interactive queries from the Scala and Python shells
  • General
    Spark emphasizes a one-stop solution, integrating batch processing, stream processing, interactive query, machine learning, and graph computing, avoiding the resource waste of deploying separate clusters for different computing scenarios
  • Good fault tolerance
    Fault tolerance in distributed computation is achieved through lineage and checkpointing: when an operation in the chain fails, lost data can be recomputed from its lineage or restored from a checkpoint on HDFS instead of being recalculated from the very beginning
  • Strong compatibility
    Spark can run on YARN, Kubernetes, Mesos, and other resource managers, ships with the Standalone mode as a built-in resource scheduler, and supports multiple data sources
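The fault-tolerance point above can be sketched in plain Python (a conceptual illustration, not Spark internals): a lost result is rebuilt by replaying the recorded chain of transformations from the source data, rather than restarting the whole job.

```python
# Sketch (plain Python): lineage-based recovery. A lost partition is rebuilt by
# replaying the recorded chain of transformations over the source data, instead
# of recomputing the whole job from scratch.

source = [1, 2, 3, 4, 5, 6]
lineage = [lambda xs: [x * 2 for x in xs],        # map step
           lambda xs: [x for x in xs if x > 4]]   # filter step

def compute(partition):
    """Replay the lineage: apply each recorded transformation in order."""
    for step in lineage:
        partition = step(partition)
    return partition

result = compute(source)      # normal run
# Simulate losing the computed partition: recover by replaying the lineage.
recovered = compute(source)
print(result == recovered)  # True
print(recovered)            # [6, 8, 10, 12]
```

Because RDDs are read-only and every dataset records how it was derived, replaying the lineage is always sufficient to rebuild a lost partition; checkpointing to HDFS just shortens how far back the replay has to go.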

Data Prophet
