A Quick Intro to Apache Spark

Vaishali S
6 min read · May 3, 2022


Hi everyone! Have you heard of Apache Spark? It is designed for fast performance and uses RAM for caching and processing data.

Let’s learn about it…

What is Spark?

Apache Spark is an open-source cluster computing framework. Its primary purpose is to handle real-time generated data.

Spark was built on top of Hadoop MapReduce. It is optimized to run in memory, whereas alternative approaches like Hadoop’s MapReduce write data to and from computer hard drives. So Spark processes data much faster than those alternatives.
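
As a minimal sketch of what “in memory” buys you in practice (assuming PySpark is installed; the file name events.log is just a placeholder):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; the application name is arbitrary.
spark = SparkSession.builder.appName("quick-intro").getOrCreate()

# Placeholder input file -- replace with a real path.
df = spark.read.text("events.log")

# cache() keeps the data in RAM, so repeated actions are served from memory
# instead of being re-read from disk the way a MapReduce-style job would.
df.cache()
print(df.count())  # first action reads the file and fills the cache
print(df.count())  # answered from memory
```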

History of Spark:

Features of Apache Spark:

Fast → It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

Easy to Use → It lets you write applications in Java, Scala, Python, R, and SQL. It also provides more than 80 high-level operators (see the short sketch after this list).

Generality → It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.

Lightweight → It is a light, unified analytics engine used for large-scale data processing.
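
Here is a quick taste of those high-level operators in PySpark (a minimal sketch; the tiny sales DataFrame is invented purely for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("operators-demo").getOrCreate()

# A tiny in-memory DataFrame, just for illustration.
sales = spark.createDataFrame(
    [("books", 12.0), ("books", 8.5), ("games", 30.0)],
    ["category", "amount"],
)

# A few of the high-level operators: filter, groupBy, agg, orderBy.
(sales
 .filter(F.col("amount") > 5)
 .groupBy("category")
 .agg(F.sum("amount").alias("total"))
 .orderBy("total", ascending=False)
 .show())
```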

Limitations of Apache Spark

➸ No File Management system

➸ No Support for True Real-Time Processing (streaming is handled in micro-batches)

➸ Small File Issue

➸ Not Cost-Effective (in-memory processing requires a lot of RAM)

➸ Window Criteria

➸ Latency

➸ Fewer Algorithms

➸ Iterative Processing

➸ Manual Optimization

➸ Back Pressure Handling

Spark Eco-System

As you can see in the image below, the Spark ecosystem is composed of various components: Spark SQL, Spark Streaming, MLlib, GraphX, and the Spark Core API.

Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing.
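
A minimal sketch of the core API via PySpark (the numbers and partition count are arbitrary): the driver distributes a collection across the cluster and reduces it in parallel.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("core-demo").getOrCreate()
sc = spark.sparkContext  # entry point to the Spark Core API

# Distribute a local collection over 4 partitions and process it in parallel.
rdd = sc.parallelize(range(1, 1001), numSlices=4)
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)
```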

Spark Streaming
Spark Streaming is the component of Spark which is used to process real-time streaming data. Thus, it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams.
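
The classic word-count sketch below uses the DStream API (assuming a text source on localhost:9999, e.g. started with `nc -lk 9999`; newer Spark versions also offer Structured Streaming for the same job):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-demo")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

# Placeholder source: a text stream on localhost:9999.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```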

Spark SQL
Spark SQL is the Spark module that integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language.
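
A minimal sketch of both styles side by side (the people DataFrame is a made-up example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view and query it with plain SQL...
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

# ...or express the same query with the functional DataFrame API.
people.filter(people.age > 30).select("name").show()
```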

GraphX
GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.

MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
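
A minimal sketch using the DataFrame-based pyspark.ml API (the toy features and labels below are invented):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy training data: two numeric features and a binary label.
data = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (1.0, 8.2, 1.0), (0.5, 0.9, 0.0), (2.0, 9.1, 1.0)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into a single vector column, then fit a model.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(data)
model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()
```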

SparkR
It is an R package that provides a distributed data frame implementation. It also supports operations like selection, filtering, and aggregation, but on large datasets.

Components of Spark

  • Job: A piece of code that reads some input from HDFS or local storage, performs some computation on the data, and writes some output data.
  • Stages: Jobs are divided into stages. Stages are classified as map or reduce stages (see the sketch after this list).
  • Tasks: Each stage has some tasks, one task per partition.
  • DAG: DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.
  • Executor: The process responsible for executing a task.
  • Driver: The program/process responsible for running the job over the Spark engine.
  • Master: The machine on which the Driver program runs.
  • Slave: The machine on which the Executor program runs.
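
As an illustrative sketch (toy data; names are arbitrary): the single action at the end submits one job, the shuffle introduced by reduceByKey splits it into two stages, and each stage runs one task per partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-stage-task-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=2)

# Transformations only build up the DAG of operators; nothing runs yet.
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)  # shuffle => stage boundary

# The action below triggers one job: a map stage and a reduce stage,
# each with one task per partition. Executors run the tasks; the driver
# coordinates the whole job.
print(counts.collect())
```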

Spark Architecture Overview

Apache Spark has a well-defined layered architecture in which all the Spark components and layers are loosely coupled. This architecture is further integrated with various extensions and libraries.

Apache Spark Architecture is based on two main abstractions:

  • Resilient Distributed Dataset (RDD)
  • Directed Acyclic Graph (DAG)

➽ Resilient Distributed Datasets (RDD)

Resilient Distributed Datasets are groups of data items that can be stored in memory on worker nodes (a short sketch follows the list below). Here,

  • Resilient: Restores the data on failure.
  • Distributed: Data is distributed among different nodes.
  • Dataset: Group of data.
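
A minimal sketch of the three properties (the numbers are arbitrary): the data is split over partitions, kept in executor memory, and recomputed from its lineage if a node is lost.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distributed: the data is split into 3 partitions across worker nodes.
nums = sc.parallelize(range(10), numSlices=3)
squares = nums.map(lambda x: x * x)

# Dataset kept in executor memory; Resilient: if a partition is lost,
# Spark recomputes it from the recorded lineage instead of failing.
squares.persist(StorageLevel.MEMORY_ONLY)
print(squares.sum())
```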

➽ Directed Acyclic Graph (DAG)

A Directed Acyclic Graph is a finite directed graph that performs a sequence of computations on data. Each node is an RDD partition, and each edge is a transformation on top of the data. Here, the graph refers to the navigation, whereas directed and acyclic refer to how it is done.
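
You can inspect the DAG Spark builds for a chain of transformations (a sketch; the transformations below are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

rdd = (sc.parallelize(range(100), numSlices=4)
         .map(lambda x: (x % 5, x))
         .reduceByKey(lambda a, b: a + b)
         .filter(lambda kv: kv[1] > 100))

# toDebugString() shows the lineage: the RDDs are the nodes of the DAG
# and the transformations between them are the edges.
print(rdd.toDebugString().decode("utf-8"))
```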

What is Cluster Manager in Apache Spark?

A cluster manager is the platform on which we run Spark. Simply put, the cluster manager provides resources to all worker nodes as needed and operates all the nodes accordingly.

A cluster has a master node and worker nodes; the master node provides an efficient working environment to the worker nodes.

Spark supports three types of cluster managers (a sketch of how one is selected follows the list):

  1. Standalone cluster manager - It is part of the Spark distribution and is available to us as a simple cluster manager. The standalone cluster manager is resilient in nature and can handle work failures. It is capable of managing resources according to the requirements of the applications.
  2. Hadoop YARN - This cluster manager works as a distributed computing framework. It also handles job scheduling as well as resource management. In this cluster, masters and slaves are highly available. Executors and a pluggable scheduler are also available.
  3. Apache Mesos - It is a distributed cluster manager. Like YARN, it also provides high availability for masters and slaves. It can also manage resources per application. We can easily run Spark jobs, Hadoop MapReduce, or any other service applications on it.
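
A minimal sketch of how the cluster manager is chosen via the master URL (the host names and ports are placeholders; in practice the URL is usually passed to spark-submit with --master rather than hard-coded):

```python
from pyspark.sql import SparkSession

# local[*]                  -> run locally on all cores (no cluster manager)
# spark://master-host:7077  -> Spark standalone cluster manager
# yarn                      -> Hadoop YARN (reads settings from HADOOP_CONF_DIR)
# mesos://mesos-host:5050   -> Apache Mesos
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("local[*]")  # swap in one of the URLs above
         .getOrCreate())

print(spark.sparkContext.master)
```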

Spark Memory Management:

A task is the smallest unit of execution and operates on a single partition of our dataset. Spark is an in-memory processing engine, so all of the computation that a task does happens in memory.

Execution Memory:

  • Execution memory is the memory used to buffer intermediate results.
  • As soon as the operation is done, it can be released; it is short-lived.
  • For example, a task performing a sort operation would need some kind of collection to store the intermediate sorted values.

Storage Memory:

  • Storage memory is more about reusing data for future computation.
  • This is where cached data is stored, and it is long-lived.
  • Storage memory stays in place until the allotted storage gets filled.
  • LRU eviction is used to spill the stored data when it gets full (see the configuration sketch after this list).
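
A hedged sketch of the knobs involved (the values shown are Spark’s defaults, not tuning advice): spark.memory.fraction controls how much of the heap is shared by execution and storage memory, spark.memory.storageFraction controls how much of that region is protected for storage before eviction, and caching a dataset is what actually places it in storage memory.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = (SparkSession.builder
         .appName("memory-demo")
         # Fraction of (heap - 300MB) shared by execution and storage (default 0.6).
         .config("spark.memory.fraction", "0.6")
         # Portion of that region protected for storage before LRU eviction (default 0.5).
         .config("spark.memory.storageFraction", "0.5")
         .getOrCreate())

df = spark.range(1_000_000)

# Cached data lives in storage memory and is evicted LRU-style under pressure.
df.persist(StorageLevel.MEMORY_ONLY)
print(df.count())
```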

When to Use and When to Not Use Apache Spark?

The feature set is more than enough to justify using Apache Spark for big data analytics, yet outlining the scenarios where Spark should and should not be used gives a broader picture.

Can Use for -

Deployments related to data streaming, machine learning, collaborative filtering, interactive analysis, and fog computing should surely use the perks of Apache Spark to experience a revolutionary change in decentralized storage, data processing, classification, clustering, data enrichment, complex session analysis, triggered event detection, and ETL streaming.

Can’t Use for -

Spark is not a good fit for a multi-user environment. As of now, Spark cannot handle high user concurrency; maybe future updates will overcome this issue. For large batch projects, an alternative engine like Hive is the better choice.

Catch you all in the next blog… Any questions? Please ping me in the comment section and I will get back to you :)

Image Resources:

https://data-flair.training/blogs/spark-tutorial/
