Notes about Spark internals

Asma Zgolli, PhD
DataNess.AI
Dec 19, 2023

This article presents a summary of Apache Spark concepts and internal characteristics. I also include a brief overview of its different APIs.

Note: For the sake of reading simplicity, we will refer to Apache Spark as ‘Spark’ throughout the rest of the article.

Introduction

Spark is a cluster computing framework designed for managing and processing big data. It leverages commodity hardware to scale horizontally, enabling the processing of larger datasets. It also offers a rich API for large-scale distributed data processing and scalable analytics.

Spark employs parallel processing and distributed computing techniques to ensure the efficiency of its data-driven applications. It was developed as a successor to MapReduce, specifically designed to overcome the latter's latency and efficiency limitations. Indeed, MapReduce tends to be slow because it stores intermediate results on disk. In contrast, Spark offers an API with concepts and data structures that let it execute data flows in memory, effectively minimizing resource-intensive I/O operations.

Spark internals and characteristics

Multiple concepts make Spark an excellent choice for efficient big data workload execution:

  • Distributed processing: The figure below depicts Spark’s architecture, which comprises a master node and worker nodes. The master node receives the workload submitted by the driver program (the client) through the SparkContext defined in the application and orchestrates its processing: the workload (called a job in Spark) is decomposed into tasks that are scheduled on the worker nodes, which take care of the execution.
Cluster overview [1]
  • Parallelism: Under the hood, Spark relies on Resilient Distributed Datasets (RDDs): immutable (in-memory, read-only), fault-tolerant collections of objects partitioned across the cluster. RDDs are built via parallel transformations like map, filter, etc. Splitting the data across the worker nodes and executing these transformations in parallel is what makes Spark a fast and scalable processing engine. RDDs are the physical representation of data in Spark. Using the RDD API makes it more challenging to develop optimized applications, as it relies solely on the developer’s expertise. Higher-level APIs like the DataFrame API and SparkSQL benefit from the query optimizer, which makes it easier to develop efficient applications.
  • Lazy evaluation: Spark operations fall into two primary categories: transformations and actions. The core concept here is that one can chain as many transformations on an RDD as necessary and then invoke an action to trigger the computation. Every transformation creates a new RDD (immutability). There are two kinds of transformations: (i) narrow transformations, which execute where the data resides and do not require any data movement, e.g. map, flatMap, filter, union, etc.; and (ii) wide transformations, which require shuffling partitions across the nodes (a very expensive operation), e.g. reduceByKey, join, aggregate, repartition, etc. Actions, on the other hand, are operations that return something other than an RDD (e.g. a value or a collection of data) or produce a side effect such as displaying data. Examples of actions are: collect, show, count, countByValue, max, etc. The short sketch after this list illustrates the difference.
  • DAG: Processing in Spark is modelled as a directed acyclic graph (DAG). Indeed, when Spark executes tasks in memory, they are structured within a graph having RDDs as vertices and the transformations producing them as edges [2]. The flow of execution follows a single direction (directed) and contains no cycles (acyclic). This structure helps prevent redundant task execution. Moreover, because the DAG preserves the lineage of the transformations, Spark can automatically reconstruct an RDD in the event of a failure. A DAG is composed of tasks grouped into stages; the intermediate RDDs within a stage are computed in memory and kept in cache, while the resulting RDD of each stage is materialized (to enable failure recovery). On the other hand, storing these intermediary results on disk can introduce additional latency. More modern MPP systems like Impala, which execute data processing entirely in memory, can be more efficient but are less resilient to failure; they are better suited for less critical analytics applications. The choice between Spark and Impala, for example, involves a trade-off between fault tolerance and efficiency and should depend on the user’s requirements regarding these characteristics.
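To make the transformation/action distinction concrete, here is a minimal Scala sketch of lazy evaluation (my own illustration, assuming a Spark shell where sc is the SparkContext and a hypothetical input file events.log): nothing is computed until the action at the end is invoked.

// `sc` is the SparkContext (created automatically in the Spark shell);
// `events.log` is a hypothetical input file used only for illustration.
val lines = sc.textFile("events.log")                      // transformation: builds an RDD lazily
val errors = lines.filter(line => line.contains("ERROR"))  // narrow transformation: no data movement
val countsByType = errors.map(line => (line.split(" ")(0), 1))
                         .reduceByKey(_ + _)               // wide transformation: shuffles data by key
println(countsByType.count())                              // action: triggers execution of the whole DAG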

Spark APIs:

In the previous section, we explained the characteristics that make Spark a scalable, distributed, highly available and reliable general-purpose processing engine. In what follows, we present its main APIs and walk through some standard examples that use them.

Spark RDD API: As explained earlier, this API is low-level, and developing optimised data pipelines with it can be challenging for beginners. A classic example of a Spark RDD application is word count.

# `sc` is the SparkContext (available by default in the PySpark shell).
# Read the input file and compute the word counts.
text_file = sc.textFile("firstprogram.txt")
countsRDD = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y)


# Print each word with its respective count.
output = countsRDD.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

In the above code snippet, I illustrate the word count example using PySpark. The first line of code reads a text file from the path given as an input argument, creating an RDD as output. The second statement chains a sequence of transformations that build an RDD of key-value pairs, where the key is a unique word and the value is the count of its occurrences. These transformations are: (i) ‘flatMap,’ which creates a new collection from the input by splitting each line on spaces and ‘flattening’ the resulting words into a single collection; (ii) ‘map,’ which transforms the input (a collection of words) into a collection of key-value pairs, where each key is a word and the value is the number 1; and (iii) ‘reduceByKey,’ which matches the keys of the key-value collection and executes a lambda function (in our example, a sum) that merges the values.
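For comparison (a sketch of my own, not taken from the article’s repository), the same word count can be written with the higher-level DataFrame API in Scala, introduced in the next section, where the query optimizer can plan the aggregation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("wordcount-df").getOrCreate()

// Read the file as a DataFrame with a single `value` column, then split, explode and count.
val counts = spark.read.text("firstprogram.txt")
  .select(explode(split(col("value"), " ")).as("word"))
  .groupBy("word")
  .count()

counts.show()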

SparkSQL and DataFrame API: Spark provides the concept of a DataFrame to facilitate structured data processing. This concept is equivalent to a table in the relational data model or a dataframe in the pandas library in Python. DataFrames can be manipulated through a Domain Specific Language (DSL) API composed of functions that resemble SQL operators, making it intuitive for SQL engineers to develop distributed and parallel data processing applications. For instance, in a Spark DataFrame transformation application, one can filter data using conditions, aggregate it using predefined aggregation functions, join it with another DataFrame, and transform it using User Defined Functions (UDFs).
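As a rough illustration only (the DataFrames, column names and values below are hypothetical, not taken from the article’s dataset), a typical DSL pipeline in Scala that combines filtering, joining, aggregation and a UDF could look like this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dsl-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input data standing in for real tables.
val sales = Seq(("FR", 120.0), ("FR", 80.0), ("US", 300.0)).toDF("country_code", "amount")
val countries = Seq(("FR", "France"), ("US", "United States")).toDF("country_code", "country_name")

val toUpper = udf((s: String) => s.toUpperCase)              // a simple user-defined function

val report = sales
  .filter(col("amount") > 100)                               // filter rows with a condition
  .join(countries, Seq("country_code"))                      // join with another DataFrame
  .groupBy(col("country_name"))                              // aggregate with a predefined function
  .agg(sum("amount").as("total_amount"))
  .withColumn("country_name", toUpper(col("country_name")))  // enrich the result via the UDF

report.show()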

On the other hand, SparkSQL allows users to query those DataFrames using SQL (or, more precisely, an SQL-like query language; starting from version 3.0, Spark even supports ANSI-compliant SQL). In recent versions of Spark, both SparkSQL queries and DataFrame API programs are optimised by the query optimizer using statistics and cost models. They also integrate out of the box with tables defined in Apache Hive.
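For instance (continuing the hypothetical sketch above), the report DataFrame can be registered as a temporary view and queried with spark.sql, which returns another DataFrame:

// Register the DataFrame as a temporary view and query it with SQL.
report.createOrReplaceTempView("sales_report")

val topCountries = spark.sql(
  """SELECT country_name, total_amount
    |FROM sales_report
    |ORDER BY total_amount DESC""".stripMargin)

topCountries.show()

The article’s own end-to-end example, shown next, loads the real datasets and applies the DataFrame DSL to them.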

//file_type, infer_schema, first_row_is_header, delimiter and file2_location
//are defined in the data loading code (available in the GitHub repository)
val df2 = spark.read.format(file_type)
  .option("inferSchema", infer_schema)
  .option("header", first_row_is_header)
  .option("sep", delimiter)
  .load(file2_location)


//Creating a temporary view that will be deleted when the notebook (Databricks) session is shut down.
val temp_table2_name = "gdp_over_hours_worked"
df2.createOrReplaceTempView(temp_table2_name)

//Viewing the schema inferred by Spark for the DataFrame.
df2.printSchema()

//Casting some column types (code available in the GitHub repository)

//Transforming the data and enriching the DataFrame with a new column
var enrichedDF = transformedDF.withColumn("unemployment_category",
  when(col("unemployment_r") <= 4, "Low")
    .when((col("unemployment_r") > 4) && (col("unemployment_r") <= 11), "Moderate")
    .otherwise("High")
)

//Saving the data to a Hive table
val permanent_table_name = "enriched_gdp_over_hours_worked"
enrichedDF.write.format("parquet").saveAsTable(permanent_table_name)

In the code example above, we use two datasets, df and df2, loaded from a distributed file system (Databricks DBFS). These datasets are retrieved from Kaggle [4] and contain the data used to compare productivity growth and the economy in Europe and the USA in The Economist article [3]. The full code, including the data loading lines, is provided in Dataness.AI’s GitHub. In our example, we employed the Spark DataFrame DSL, which offers a concise and elegant approach to writing data pipelines, rather than the alternative of mixing Spark Scala code with SQL query strings.

The proposed data pipeline executes a series of transformations on df and df2 to enrich df2 with new descriptive columns that will be helpful in the subsequent data analysis. It executes the following steps (a rough, hypothetical sketch of these steps is given further below):

  • Classification of the data into different categories based on the unemployment rate value.
  • Aggregation of the existing data using various grouping combinations.
  • Transformation of date columns to enable clearer comparisons with existing records.
  • Filtering, joining and ordering the data based on specific columns.

The resulting aggregated data is stored in Hive for future use (e.g. in a dashboard).
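As a hedged sketch only (the actual column names, grouping keys and join conditions live in the GitHub repository; country, date, gdp_per_hour and the output table name below are assumptions of mine), the aggregation, join and ordering steps listed above could look roughly like this:

import org.apache.spark.sql.functions._   // if not already imported

// Hypothetical continuation of the pipeline: column names are illustrative only.
val aggregatedDF = enrichedDF
  .withColumn("year", year(col("date")))                        // derive a year column for clearer comparisons
  .groupBy(col("country"), col("year"), col("unemployment_category"))
  .agg(avg(col("gdp_per_hour")).as("avg_gdp_per_hour"))         // aggregate over a grouping combination
  .join(df, Seq("country"))                                     // join with the other dataset
  .filter(col("avg_gdp_per_hour").isNotNull)                    // filter on a specific column
  .orderBy(col("avg_gdp_per_hour").desc)                        // order the result

aggregatedDF.write.format("parquet").saveAsTable("aggregated_gdp_over_hours_worked")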

Conclusion:

In this article, I’ve explored the internal workings of Spark and its components, which enable the creation of scalable and efficient data processing applications, leveraging its distribution and parallelism capabilities. I also showcased an example of an ETL application that reads and loads data from external sources. Please feel free to post any questions you have about Spark and the libraries discussed here in the comment section below. Also, I welcome your suggestions for the next data engineering tools that you would like to see explained.

Bibliography:

[1] Apache Spark. (n.d.). Cluster Overview. Spark Official Documentation. https://spark.apache.org/docs/latest/cluster-overview.html

[2] Chaudhari, V. (2023, Sept 12). Understanding Spark DAGs. Medium. https://medium.com/plumbersofdatascience/understanding-spark-dags-b82020503444

[3] The Economist and Solstad, Sondre. (2023, Oct 4). “All work and no play”. The Economist, October 4th issue.

[4] The Economist and Solstad, Sondre. (2023, Oct 4). GDP per hour worked. Kaggle. https://www.kaggle.com/datasets/joebeachcapital/gdp-per-hour-worked/data
