Big Data Basics Series

Exploring Big Data with Apache Spark: Introduction and Key Components

Indraneel Dutta Baruah · Published in Nerd For Tech · 12 min read · Dec 16, 2023


A comprehensive guide on how Apache Spark works and how to use it efficiently!

Source: Image by Mika on Unsplash

Introduction

If you have ever worked on big data, there is a good chance you had to work with Apache Spark. It is an open-source, multi-language platform that enables the execution of data engineering and data science tasks on single-node machines and clusters. It is compatible with popular programming languages, including SQL, Scala, Java, Python, and R, making it a preferred choice for data analysts and scientists. This 3-part blog series is dedicated to understanding how Apache Spark works. In the first part, let’s try to understand its key components and how they work!

Table of contents

  • A bit of history
  • Why is it so efficient in handling big data?
  • Key Concepts: RDD
  • Key Concepts: DataFrame and Dataset
  • Key Concepts: DAG
  • Apache Spark Workflow
  • Conclusion
  • Frequently Asked Questions
  • References

A bit of history

Before the arrival of Apache Spark, Hadoop MapReduce was the most popular option for handling big datasets using parallel, distributed algorithms. MapReduce has a multi-step, sequential process for job execution. In each step, MapReduce retrieves data from the cluster, performs operations, and writes results back to Hadoop Distributed File System (HDFS). Since every step necessitates disk reads and writes, the execution of MapReduce jobs is hindered by the inherent latency of disk I/O.

Source: Image by Author

To overcome this issue, Apache Spark emerged as a research project at UC Berkeley’s AMPLab in 2009. This journey led to Spark’s publication (paper titled “Spark: Cluster Computing with Working Sets” ) and it was open-sourced in June 2010. Subsequently, in June 2013, Spark entered the incubation stage within the Apache Software Foundation (ASF) and later earned its status as an Apache Top-Level Project in February 2014.

Why is it so efficient in handling big data?

Spark achieves its efficiency by processing data in memory, thereby reducing the number of steps in a job and promoting data reuse across parallel operations. With Spark, only one step is required: data is read into memory, operations are executed, and results are written back. This streamlined approach results in execution times that are 10 to 100 times faster than MapReduce.

Spark’s in-memory caching also enhances the performance of machine learning algorithms, which repeatedly invoke functions on the same dataset. This data reuse is facilitated through DataFrames, an abstraction built on top of the Resilient Distributed Dataset (RDD); we will discuss RDDs and DataFrames in detail later. This innovation makes Spark significantly faster than MapReduce, particularly for machine learning workloads, while preserving the scalability and fault tolerance inherent in Hadoop MapReduce.
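
For instance, here is a minimal PySpark sketch of that data-reuse pattern, assuming a local Spark installation; the file path and column names are purely illustrative:

```python
from pyspark.sql import SparkSession

# Minimal sketch of in-memory data reuse (file path and columns are illustrative).
spark = SparkSession.builder.appName("caching-demo").master("local[*]").getOrCreate()

# Read once and cache in memory; without cache(), every action below
# would re-read and re-process the source file from scratch.
events = spark.read.parquet("/tmp/events.parquet").cache()

total_rows = events.count()                                    # first action materializes the cache
daily_counts = events.groupBy("event_date").count().collect()  # second pass reuses the cached data
error_rows = events.filter(events.status == "error").count()   # third pass reuses it again

events.unpersist()  # release the cached partitions when done
spark.stop()
```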

Key Concepts: RDD

Source: Image by Author

Resilient Distributed Datasets (RDDs) are distributed collections of immutable Java Virtual Machine (JVM) objects that enable very fast computation, and they are the backbone of Apache Spark. RDDs are purpose-built for distributed computing: they partition the dataset logically, and this logical division paves the way for efficient and scalable processing by distributing different parts of the data across various nodes within the cluster. RDDs can be generated from a range of data sources, including the Hadoop Distributed File System (HDFS) or local file systems, and can also be created from existing RDDs through transformation operations. They have the following properties:

1. In-Memory Computation: In-memory computation involves processing data stored in the computer’s main memory (RAM), which is significantly faster than reading from disk, resulting in faster data processing.

2. Lazy Evaluations: Lazy evaluations delay the execution of operations until the result is explicitly needed, optimizing performance and resource utilization in data processing.

3. Fault Tolerance: Fault tolerance ensures a system can continue to operate without disruption in the presence of hardware or software failures, enhancing system reliability.

4. Immutability: Immutability refers to data structures that cannot be changed after creation, enhancing data consistency and enabling safe parallel processing.

5. Partitioning: Partitioning involves splitting a dataset into smaller, manageable portions, allowing parallel processing and efficient data distribution in distributed systems.

6. Persistence: Persistence in computing refers to the ability to store data or intermediate results to disk or memory for later use, optimizing performance and reliability.

7. Coarse Grained Operations: Coarse-grained operations involve applying functions to large chunks of data rather than individual elements, reducing overhead and improving processing efficiency in distributed systems.

Finally, two operations are applicable to RDD:

1. Transformations: Operations applied to an RDD that produce another RDD, using functions like filter(), union(), map(), and more. They are evaluated lazily.

2. Actions: Operations that return results to the driver program or write them to storage, thereby triggering computation. Examples include count(), first(), collect(), and reduce(); actions can return different data types. Both kinds of operations are illustrated in the sketch below.
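
A minimal PySpark sketch of both kinds of operations, assuming a local Spark installation (the numbers are illustrative): the transformations build up lazily, and nothing runs until the first action is called.

```python
from pyspark.sql import SparkSession

# Minimal sketch of RDD transformations vs. actions (values are illustrative).
spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11), numSlices=4)  # RDD split across 4 partitions

# Transformations: return new RDDs and are evaluated lazily; nothing runs yet.
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Actions: trigger computation and return results to the driver.
print(squares.count())                     # 5
print(squares.collect())                   # [4, 16, 36, 64, 100]
print(squares.reduce(lambda a, b: a + b))  # 220

spark.stop()
```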

RDDs are schema-less, meaning they lack a defined structure for their data, and they offer no compile-time checking of columns and their data types. They can hold any type of data, be it structured, semi-structured, or unstructured, and data is typically processed using functional operations (e.g., map, reduce). RDDs are highly versatile but may be less efficient for certain operations, as they do not benefit from Spark’s built-in optimizations for structured data.

Key Concepts: DataFrame and Dataset

Source: Image by Author

DataFrames serve as a structured data representation, akin to relational database tables. They represent distributed data collections organized into named columns and have a well-defined schema, making the data structure explicit. This schema-driven approach enables Spark to streamline query execution, making them an ideal choice for structured data processing. DataFrames harness the power of Spark’s Catalyst optimizer and Tungsten execution engine, ensuring optimal performance for structured data operations. Their automatic optimizations span logical and physical levels, culminating in enhanced performance, particularly for SQL-like queries. Like RDDs, DataFrames are available in Java, Python, Scala, and R.

Datasets, a more recent abstraction, blend the strengths of RDDs and DataFrames. They are distributed data collections equipped with schemas, embracing the advantages of both structured and unstructured data processing. These schemas bring type safety and query optimization to the forefront: Datasets are type-safe, supporting compile-time type validation. However, they remain versatile, permitting untyped operations when necessary, akin to RDDs. While benefiting from the same optimization mechanisms as DataFrames for efficient structured data handling, Datasets also accommodate custom optimizations when dealing with intricate or unstructured data. Just like DataFrames, Datasets benefit from automatic query optimization, while granting the flexibility to employ custom optimizations when the need arises. Unlike DataFrames, however, they are available only in Scala and Java.

In summary, RDDs are flexible, schema-less, and offer fine-grained control, making them suitable for unstructured data and custom processing. DataFrames are structured, schema-driven, and highly optimized for SQL-like queries on structured data. Datasets, the most versatile of the three, combine the benefits of both RDDs and DataFrames, providing structured data processing with the flexibility to handle unstructured or complex data when needed.
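
To make the DataFrame side concrete, here is a minimal PySpark sketch (Datasets are not available in Python, so only the DataFrame API is shown; the data and column names are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Minimal DataFrame sketch (data and column names are made up for illustration).
spark = SparkSession.builder.appName("dataframe-demo").master("local[*]").getOrCreate()

schema = StructType([
    StructField("city", StringType(), True),
    StructField("amount", DoubleType(), True),
])
rows = [("Delhi", 120.0), ("Mumbai", 80.5), ("Delhi", 45.0)]
sales = spark.createDataFrame(rows, schema)

# The schema is explicit, so Catalyst can optimize this SQL-like aggregation.
sales.groupBy("city").agg(F.sum("amount").alias("total_amount")).show()

# The same query expressed through Spark SQL goes through the same optimizer.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT city, SUM(amount) AS total_amount FROM sales GROUP BY city").show()

spark.stop()
```

Because the schema is declared up front, both the DataFrame query and the equivalent SQL statement benefit from the same Catalyst optimization.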

Key Concepts: DAG

Source: Image by Author

The concept of a Directed Acyclic Graph (DAG) is of utmost importance in the context of Spark, forming the bedrock of its execution model. In this context, “graph” refers to the way tasks are organized, while “directed” and “acyclic” describe how these tasks are structured: “directed” reflects the ordered execution of tasks, and “acyclic” signifies the absence of loops or cyclic dependencies. This architecture ensures that each processing stage depends on the successful completion of the prior one, while allowing individual tasks within a stage to operate independently. In essence, the DAG serves as the blueprint for the logical execution plan of a Spark job, breaking it down into stages and tasks that can be executed concurrently across a distributed cluster of machines.

Key components of DAG

In the context of a Directed Acyclic Graph (DAG) in Spark, stages, tasks, and dependencies play crucial roles in managing the execution of a Spark job:

1. Stages: Stages are the fundamental building blocks of a Spark job, representing a set of tasks with a common set of transformations that can be executed in parallel.

In a DAG, Spark divides the job into stages based on the transformations applied (in the image above, there are three stages). These transformations fall into two types: narrow transformations and wide transformations. Narrow transformations can be computed within a single partition of the RDD and do not require data shuffling, while wide transformations involve shuffling data between partitions. This distinction determines how stages are formed.

Stages are essential for optimizing the execution of a Spark job: narrow transformations within a stage can be pipelined together for efficient processing, whereas a wide transformation introduces a barrier between stages because data shuffling is required. This structure allows optimizations such as pipelining narrow transformations and minimizing data shuffling (a short code sketch after this section’s summary illustrates a stage boundary introduced by a wide transformation).

2. Tasks: Tasks are the smallest units of work in Spark and are the actual computations that are performed on the data. Each stage is composed of multiple tasks, and these tasks are responsible for executing the transformations defined in the stage on a specific partition of the data. Spark’s task scheduler distributes these tasks across the available worker nodes in the cluster. Each worker node processes a portion of the data in parallel, resulting in faster execution. We will be discussing worker nodes in detail in the next blog in the series.

3. Dependencies: Dependencies in a Spark DAG represent the relationship between stages and indicate how data flows from one stage to another. There are two primary types of dependencies (similar to transformations):

a. Narrow Dependencies: In a narrow dependency, each partition of the child stage depends on a specific partition of the parent stage. This means that there is a one-to-one mapping between partitions, and no data shuffling is required. Narrow dependencies are common in stages with narrow transformations.

b. Wide Dependencies: In a wide dependency, the partitions of the child stage depend on multiple partitions of the parent stage, and data shuffling is necessary. This data shuffling can be resource-intensive and can introduce performance overhead.

In summary, each stage comprises a set of tasks aligned with the partitions of the RDD, facilitating parallel execution of similar computations. Within the DAG view, users can drill into each stage to inspect its details; the stage view lets you expand and explore all the RDDs connected to that particular stage. The scheduler divides the Spark RDD graph into discrete stages based on the transformations applied.
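
As referenced above, here is a minimal PySpark sketch of a stage boundary, assuming a local Spark installation; the word list is illustrative. The narrow map stays inside one stage, while the wide reduceByKey forces a shuffle and therefore a new stage:

```python
from pyspark.sql import SparkSession

# Minimal sketch of narrow vs. wide transformations (word list is illustrative).
spark = SparkSession.builder.appName("stages-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "dag", "spark", "rdd", "dag", "spark"], 3)

# Narrow transformation (map) stays within each partition: no shuffle, same stage.
pairs = words.map(lambda w: (w, 1))

# Wide transformation (reduceByKey) shuffles data between partitions,
# so the DAG scheduler inserts a stage boundary here.
counts = pairs.reduceByKey(lambda a, b: a + b)

# The lineage shows the ShuffledRDD created by the wide dependency;
# the indentation change in the output marks the stage boundary.
lineage = counts.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)

print(counts.collect())  # e.g. [('spark', 3), ('dag', 2), ('rdd', 1)]
spark.stop()
```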

Fault tolerance using DAG

DAG ensures the robustness and reliability of data processing using the concept of lineage. Lineage is a record of the sequence of transformations that were applied to generate an RDD. As transformations are executed on RDDs, Spark constructs the DAG and simultaneously builds the lineage. The lineage keeps track of the parent-child relationships between RDDs and the transformations that were used to derive each RDD. This lineage information is crucial for recovering lost data in case of failures. If a partition of an RDD is lost due to a node failure, Spark can trace back through the lineage to determine the original source data and recompute the lost partition.

Apart from lineage, RDDs can be made “persistent” using operations like `cache()` or `persist()`, which tell Spark to store their partition data in memory or on disk, depending on the chosen storage level. This means that even if a node fails, Spark can rebuild the lost partitions from the persisted data rather than recomputing the entire RDD. Finally, checkpointing is a mechanism for periodically saving RDDs to a stable storage system like HDFS, which reduces the amount of recomputation required in the event of failures: if a node fails, the RDDs can be reconstructed from the latest checkpoint and their lineage.
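
A minimal PySpark sketch of persistence and checkpointing, assuming a local installation; the checkpoint directory and data are illustrative (on a real cluster the checkpoint directory would typically be an HDFS path):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

# Minimal sketch of persistence and checkpointing (paths and data are illustrative).
spark = SparkSession.builder.appName("fault-tolerance-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")  # use an HDFS path on a real cluster

pairs = sc.parallelize(range(100000)).map(lambda x: (x % 10, x))

# Persist: keep partitions in memory, spilling to disk if needed, so lost
# partitions can be rebuilt from the persisted copies instead of from scratch.
pairs.persist(StorageLevel.MEMORY_AND_DISK)

# Checkpoint: write the RDD to stable storage and truncate its lineage,
# bounding how much recomputation a failure can trigger.
pairs.checkpoint()
pairs.count()  # an action forces both the persistence and the checkpoint

lineage = pairs.toDebugString()  # the lineage now references the checkpointed data
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)
spark.stop()
```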

Data Processing Optimization using DAG

Optimizing big data processing using a Directed Acyclic Graph (DAG) in Apache Spark is achieved using the following techniques:

1. Pipelining: It examines task dependencies and identifies opportunities to execute subsequent tasks as soon as their input data becomes available, without waiting for the completion of all previous tasks. Pipelining reduces latency between stages and enhances data processing throughput.

2. Task Fusion: It identifies consecutive operations or transformations that can be combined into a single task. By reducing the number of tasks and the overhead of data shuffling, task fusion improves performance. This minimizes data movement between stages, which can be a significant performance bottleneck.

3. Shuffle Optimization: Spark’s DAG optimizer applies various techniques to optimize data shuffling, including minimizing data transfer and reducing the amount of data to be shuffled, thus improving overall performance.

4. Data Locality: It considers the physical location of data when scheduling tasks. Spark aims to schedule tasks on nodes where the data is already present or is being computed. This minimizes data transfer across the network, reducing network overhead, and improving job performance. Data locality ensures that data is processed as close to where it’s stored as possible.

5. Stage Concurrency: It determines which stages within a job can be executed independently. Stages lacking data dependencies can run concurrently, making efficient use of available cluster resources.
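
One way to observe some of these optimizations is to inspect the plan that Spark builds before executing a query. Below is a minimal PySpark sketch (the data and column names are illustrative): the physical plan groups pipelined operators together and shows an Exchange node where the shuffle for the groupBy happens.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch: inspect the plan to see pipelined operators and shuffle boundaries
# (data and column names are illustrative).
spark = SparkSession.builder.appName("plan-demo").master("local[*]").getOrCreate()

df = spark.range(0, 1000000).withColumn("bucket", F.col("id") % 10)

query = (df.filter(F.col("id") > 100)                 # narrow: pipelined with the next projection
           .withColumn("double_id", F.col("id") * 2)  # narrow: stays in the same stage
           .groupBy("bucket")                         # wide: introduces an Exchange (shuffle)
           .agg(F.sum("double_id").alias("total")))

# Prints the logical and physical plans; pipelined operators appear together,
# and the Exchange node marks where data shuffling will occur.
query.explain(True)

spark.stop()
```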

Apache Spark Workflow

Source: Image by Saikrishna Seruvu on Linkedin

In the image above, let’s understand how the DAG scheduler works in Apache Spark workflow:

1. As you input your code into the Spark console, including the creation of RDDs and application of operators, Spark constructs an operator graph.

2. When a user initiates an action, such as “collect,” the graph is sent to the Directed Acyclic Graph (DAG) Scheduler. The DAG scheduler divides the operator graph into stages of tasks, distinguishing between mapping and reducing operations.

3. Each stage comprises tasks based on partitions of the input data. The DAG scheduler optimizes the graph by grouping operators together. For instance, numerous map operators can be combined into a single stage. This optimization significantly contributes to Spark’s performance. The outcome of the DAG scheduler is a set of well-organized stages.

4. These stages are then handed over to the Task Scheduler, which launches tasks via the cluster manager, be it Spark Standalone, YARN, or Mesos. Notably, the task scheduler operates without knowledge of the dependencies between stages.

5. The actual execution of tasks takes place on the worker nodes, inside executor Java Virtual Machine (JVM) processes, and each worker node solely processes the code it is provided, without awareness of the broader context.
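
Tying these steps back to code, here is a minimal PySpark word-count sketch, assuming a local installation; the input strings are illustrative. The transformations in step 1 return immediately, and only the collect() action hands the operator graph to the DAG scheduler:

```python
from pyspark.sql import SparkSession

# Minimal sketch mapping the workflow steps above to code (input data is illustrative).
spark = SparkSession.builder.appName("workflow-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Step 1: build the operator graph; these lines return instantly, nothing executes yet.
lines = sc.parallelize(["a b a", "b c", "a c c"])
counts = (lines.flatMap(lambda s: s.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# Steps 2-5: the action sends the graph to the DAG scheduler, which splits it into
# stages at the reduceByKey shuffle; the task scheduler then runs tasks on the workers.
sc.setJobDescription("word count demo")  # label the job so it is easy to spot in the Spark UI
print(counts.collect())                  # e.g. [('a', 3), ('b', 2), ('c', 3)]

print(sc.uiWebUrl)  # the Spark UI (while the app runs) shows the resulting stages and tasks
spark.stop()
```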

Conclusion

In conclusion, Apache Spark has revolutionized big data processing with its efficient and flexible approach. It’s fascinating to trace back its history, evolving from the limitations of Hadoop MapReduce to becoming the open-source giant it is today. Spark’s ability to handle large datasets is unmatched, thanks to its in-memory processing, data reuse, and streamlined execution.

We’ve delved into key concepts that underpin Spark’s functionality, starting with Resilient Distributed Datasets (RDDs) which serve as the building blocks of Spark. The introduction of DataFrames and Datasets brought structured data processing to the forefront. Finally, a pivotal component of Spark’s execution model is the Directed Acyclic Graph (DAG). The DAG, with its stages, tasks, and dependencies, plays a crucial role in managing the execution of Spark jobs. We also discussed how DAG is able to optimize data processing while being fault-tolerant.

As you journey into the world of Spark, understanding these key components and concepts will empower you to harness its full potential. In the next part, we will discuss the details of the Spark Architecture (Driver node, worker node, cluster manager, etc.) along with how to set up an ideal Spark session configuration. Stay tuned!

Hope you found this article useful. Connect with me on LinkedIn or Medium!

References
