Apache Spark … the big data platform that crushed Hadoop?

R RAMYA
May 3, 2022 · 8 min read


Fast, flexible, and developer-friendly, Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing, and machine learning.

Sounds interesting, right? Ready to dive in and learn Apache Spark?


“Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. Spark is the most actively developed open source engine for this task”.

→ Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.

It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.

→ The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.

Spark vs. Hadoop: Why use Apache Spark?

In short: Hadoop MapReduce writes intermediate results back to disk between every step, while Spark keeps them in memory, which is why Spark wins on iterative and interactive workloads.

Evolution of Apache Spark

Spark began as a research project at UC Berkeley’s AMPLab in 2009, was open-sourced in 2010, and became a top-level Apache project in 2014.

Spark SQL

Spark SQL is a Spark library for structured data processing. Spark SQL brings native SQL support to Spark as well as the notion of DataFrames. Users are free to use either interface, or toggle between the two, while the underlying execution engine remains the same.
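Here is a minimal sketch of that toggle in Scala (the data and table name are made up): the DataFrame API and the SQL string compile down to the same plan on the same engine.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlDemo").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Ramya", 25), ("Arun", 34)).toDF("name", "age")

// DataFrame API:
people.filter($"age" > 30).show()

// Plain SQL over the same data:
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
```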

Spark Internals

This is how the entire execution of a Spark job happens: the driver turns your code into a job, the job into stages, and the stages into tasks that run on executors. The pieces are defined below.

Components of Spark

Job:

A job is the piece of work triggered by an action; it reads input from HDFS or the local file system and runs some computation over it.

Stages:

Jobs are divided into stages. Stages are classified as map or reduce stages, by analogy with Hadoop MapReduce. A new stage begins wherever the data must be shuffled, so not all operators can be grouped into a single stage.

Tasks:

Each stage has some tasks, one task per partition.

one task → one partition → one executor

DAG:

Directed Acyclic Graph: in the present context, the graph of stages and their dependencies that Spark builds for each job.

Executor:

The process responsible for executing a task.

Driver:

The process responsible for running the job over the Spark Engine.

Master:

The machine on which the Driver program runs.

Slave:

The machine on which the Executor program runs.
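To see job, stage, and task together, here is a small sketch (Scala, assuming the `sc` SparkContext that spark-shell provides):

```scala
// 4 is the number of partitions.
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hdfs"), 4)

// Transformations only record lineage in the DAG; nothing runs yet.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// This action triggers ONE job. reduceByKey needs a shuffle, so the
// job splits into two stages, each running one task per partition.
counts.collect().foreach(println)
```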

Ecosystem

Spark Core

Spark Core is the underlying execution engine that all other functionality is built on top of. Spark Core provides basic functionality such as task scheduling, memory management, and fault recovery, as well as Spark’s primary data abstraction — Resilient Distributed Datasets (RDDs).
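A minimal sketch of the RDD abstraction itself (spark-shell Scala again):

```scala
val nums = sc.parallelize(1 to 10)     // an RDD, partitioned across the cluster
val evens = nums.filter(_ % 2 == 0)    // transformation: recorded in lineage only
println(evens.count())                 // action: prints 5

// If a partition is lost, Spark recomputes it from the recorded lineage;
// that is the "resilient" in Resilient Distributed Dataset.
```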

Spark Streaming

Spark Streaming can ingest and process live streams of data at scale. Since Spark Streaming is an extension of the core Spark API, streaming jobs can be expressed in the same manner as writing a batch query.
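For example, a DStream word count reads almost exactly like its batch counterpart. A sketch that assumes text lines arriving on a local socket (e.g. from `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // run until stopped
```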

MLlib (Machine Learning)

Spark’s Machine Learning Library (MLlib) provides a common set of algorithms (classification, regression, clustering, etc) and utilities (feature transformations, ML pipeline construction, hyper-parameter tuning, ML persistence, etc) to perform machine learning tasks at scale.
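A hedged sketch of an MLlib pipeline on a tiny made-up dataset (assuming a `spark` session as created earlier): a feature assembler feeds a logistic regression, and both are fitted as one unit.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Made-up training data: two features and a binary label.
val training = spark.createDataFrame(Seq(
  (0.0, 1.1, 0.0),
  (2.0, 1.0, 1.0),
  (2.0, 1.3, 1.0),
  (0.0, 1.2, 0.0)
)).toDF("f1", "f2", "label")

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// The pipeline chains the transformer and the estimator into one unit.
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
model.transform(training).select("label", "prediction").show()
```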

GraphX (Graph Processing)

GraphX is a Spark library that allows users to build, transform and query graph structures with properties attached to each vertex and edge.
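GraphX exposes a Scala/Java API. A minimal sketch over a made-up follower graph (spark-shell’s `sc` assumed): vertices carry names, edges carry a relationship label.

```scala
import org.apache.spark.graphx.{Edge, Graph}

val users = sc.parallelize(Seq((1L, "Ramya"), (2L, "Arun"), (3L, "Divya")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

val graph = Graph(users, follows)
graph.inDegrees.collect().foreach(println)   // (2,2): vertex 2 is followed twice
```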

Apache Spark Architecture

How the Apache Spark architecture works:

The Driver Program in the Apache Spark architecture runs the main program of an application and creates the Spark Context.

A Spark Context provides all the basic functionality of the application.

The Spark Driver contains various other components, such as:

1. DAG Scheduler

2. Task Scheduler

3. Backend Scheduler

4. Block Manager

→ Whenever an RDD is created in the Spark Context, it can be distributed across many worker nodes and can also be cached there.

→ Worker nodes execute the tasks assigned by the Cluster Manager and return the results to the Spark Context.
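A minimal sketch of that flow in code (Scala; names are illustrative): building a SparkSession creates the Spark Context, and the action at the end sends tasks to the executors and brings the result back to the driver.

```scala
import org.apache.spark.sql.SparkSession

// The driver program starts here: this creates the Spark Context
// that talks to the cluster manager.
val spark = SparkSession.builder()
  .appName("ArchitectureDemo")
  .master("local[*]")   // cluster manager URL; local[*] runs in-process
  .getOrCreate()

val sc = spark.sparkContext
println(sc.defaultParallelism)   // how many tasks Spark schedules by default

// The action ships tasks to the executors and returns the result
// to the driver through the Spark Context.
println(sc.parallelize(1 to 100).sum())   // 5050.0

spark.stop()
```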

Modes of Execution

There are three modes of execution:

1. Cluster Mode

2. Client Mode

3. Local Mode
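In local mode everything runs inside one JVM, which is the easiest way to try the snippets in this post. A minimal sketch (the client/cluster distinction is normally chosen at submit time, shown in the comments):

```scala
import org.apache.spark.sql.SparkSession

// Local mode: the driver and the "executors" are threads in a single JVM.
val spark = SparkSession.builder()
  .appName("LocalModeDemo")
  .master("local[4]")   // 4 worker threads on this machine
  .getOrCreate()

// Client vs. cluster mode is usually chosen when submitting, e.g.:
//   spark-submit --master yarn --deploy-mode client  app.jar   (driver on your machine)
//   spark-submit --master yarn --deploy-mode cluster app.jar   (driver inside the cluster)
```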

Features of Apache Spark

Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves this fault tolerance through the DAG and RDD lineage.

Dynamic nature: Spark offers over 80 high-level operators that make it easy to build parallel apps.

Lazy Evaluation: Spark does not evaluate any transformation immediately; all transformations are lazily evaluated (see the sketch after this list).

Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up to 10x faster on disk.

Real Time Stream Processing: Spark Streaming brings Apache Spark’s language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.

Reusability: Spark code can be used for batch-processing, joining streaming data against historical data as well as running ad-hoc queries on streaming state.

Advanced Analytics: Apache Spark has rapidly become the de facto standard for big data processing and data science across multiple industries. Spark provides both machine learning and graph processing libraries.

In Memory Computing: Unlike Hadoop MapReduce, Apache Spark is capable of processing tasks in memory and it is not required to write back intermediate results to the disk.
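Here is a minimal sketch of that lazy evaluation (Scala, assuming spark-shell’s `sc`): the transformations build up a plan, and only the action at the end makes Spark execute it.

```scala
// map and filter only extend the lineage graph; nothing runs yet.
val planned = sc.parallelize(1 to 1000000)
  .map(_ * 2)
  .filter(_ % 3 == 0)

// The action below is what actually triggers computation.
println(planned.take(5).mkString(", "))
```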

Storage?

Spark leverages existing distributed file systems like Hadoop HDFS, cloud storage solutions like AWS S3, or even big data databases like Cassandra for large data sets.

→ Spark can also read data from the local file system, but this is not ideal, since the data must then be available on all nodes in the cluster.
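A hedged sketch of reading from each kind of storage; the paths and bucket names below are made up, and the S3 example assumes the `hadoop-aws` connector is on the classpath:

```scala
// Distributed storage: every executor can reach these paths.
val fromHdfs = spark.read.text("hdfs://namenode:8020/data/events.txt")
val fromS3   = spark.read.parquet("s3a://my-bucket/warehouse/sales/")

// Local files must exist at the same path on every node.
val local = spark.read.csv("file:///tmp/sample.csv")
```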

Spark Built on Hadoop

There are three ways of deploying Spark, as explained below.

  • Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
  • Hadoop YARN − Hadoop YARN deployment means, simply, that Spark runs on YARN with no pre-installation or root access required. It helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on top of the stack.
  • Spark in MapReduce (SIMR) − SIMR is used to launch Spark jobs inside MapReduce, in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.

Spark Memory Management:

Storage Memory:

Storage Memory is used for storing cached data, broadcast variables, unrolled data, and so on.

→ “unroll” is essentially the process of deserializing serialized data.

Reserved Memory:

Reserved Memory is the memory reserved for the system and is used to store Spark’s internal objects.

User Memory:

User Memory is the memory used to store user-defined data structures, Spark internal metadata, any UDFs created by the user, and the data needed for RDD conversion operations, such as RDD dependency information.

Executor memory

An executor is a JVM process launched on a worker node, so it is important to understand JVM memory management.

JVM memory management is categorized into two types:

  1. On-Heap memory management (In-Heap memory) — Objects are allocated on the JVM Heap and bound by GC.
  2. Off-Heap memory management (External memory) — Objects are allocated in memory outside the JVM by serialization, managed by the application, and are not bound by GC.
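These pools are steered through configuration. The settings below are real Spark options, but the sizes are illustrative only:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")           // on-heap JVM size per executor
  .set("spark.memory.offHeap.enabled", "true")  // let Spark also use memory outside the JVM
  .set("spark.memory.offHeap.size", "1g")       // off-heap pool, invisible to the GC
```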

Memory needs of a task

  1. Execution Memory
  2. Storage Memory

Example: a sort task uses Execution Memory to hold its sort buffers.

Spark arbitrates between Execution Memory (EM) and Storage Memory (SM) within a task in one of two ways:

Static Assignment:

Splitting the total available on-heap memory (the size of your JVM heap) into 2 parts:

  1. Execution Memory
  2. Storage Memory.

The memory split is static.

While a task is running, if the execution memory fills up, it spills to disk.

If the storage memory fills up, the least recently used (LRU) blocks are evicted to disk.

The drawback: Execution Memory can only ever use its fixed fraction, even when Storage Memory is sitting empty!!!

Don’t worry, we can solve this using:

Unified Memory Management

→ It allocates a region of memory as a Unified memory container that is shared by storage and execution.

→ When execution memory is not used, the storage memory can acquire all the available memory and vice versa.

→ If either storage or execution memory needs more space, a function called acquireMemory() expands one of the memory pools and shrinks the other.
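The unified region is controlled by two knobs; the values below are Spark’s defaults:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")          // share of the heap given to the unified region
  .set("spark.memory.storageFraction", "0.5")   // part of that region shielded from eviction
```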

Hope you got some idea of Spark… 💥💥💥

See you in the next blog, on Spark RDDs… 🤍

Cheers! ❤

Ramya R :)

