Apache Spark #2

Bennison J
YavarTechWorks
Apr 28, 2023

Hello everyone, I am Bennison. In this article we are going to learn about Apache Spark.

  • Basically, Apache Spark is for processing data, and it is mostly used in big data processing. Several other options are also available for processing big data, such as Hadoop and Storm. Even though several technologies exist, Hadoop and Storm have been the leading technologies in big data processing. Spark is widely cited as being around ten times faster than Hadoop MapReduce, largely because it processes data in memory.
  • When you start to learn Apache Spark, you need some basic knowledge of Hive, MapReduce, and the Hadoop Distributed File System (HDFS). To learn Spark, you also need to know at least one of the following languages: Python, Scala, or Java.

What is Spark?

  • The major role of Spark is processing data, and that processing is distributed across a cluster.
  • The Spark core engine has four frameworks on top of it: Spark SQL, Spark Streaming, MLlib, and GraphX.
  • MLlib is for machine learning, and GraphX is for graph-based analysis. Data engineers must have knowledge of Spark core, Spark Streaming, and Spark SQL.
  • Basically, Apache Spark processes data in two ways: one is batch processing and the other is stream processing.
  • In Spark, we do batch processing using the Spark core component, and we do stream processing using Spark Streaming.
  • Before Spark came along, batch processing was done by Hadoop, and stream processing was done by other technologies such as Apache Storm.
  • By giving an SQL structure to the data, we can process it like an SQL query engine, as shown in the sketch below.
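
Below is a minimal PySpark sketch of that idea: it registers a tiny DataFrame as a table and queries it through Spark SQL. The app name, sample data, and column names are illustrative assumptions, not from the article.

```python
# A minimal Spark SQL sketch; the data and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Register a tiny DataFrame as a temporary SQL table.
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

# Query it like an SQL engine.
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
```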

Batch Processing

  • In batch processing, we basically store the data somewhere first and process it after some time, based on some condition.
  • The stored data is processed only when there is a need to process it, as in the sketch below.
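
As a rough sketch of batch processing with Spark, the snippet below reads data that was stored earlier and aggregates it in one go. The input and output paths and column names are assumptions for illustration.

```python
# Hedged batch-processing sketch; paths and columns are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-example").getOrCreate()

# Read data that was stored earlier (assumed CSV path).
sales = spark.read.option("header", True).csv("/data/sales.csv")

# Process the whole stored dataset in one batch.
daily_totals = sales.groupBy("date").agg(F.sum("amount").alias("total_amount"))

# Write the processed result back out (assumed output path).
daily_totals.write.mode("overwrite").parquet("/data/daily_totals")
```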

Stream Processing

  • In stream processing (live data processing), we process the data immediately as it arrives, instead of storing it somewhere first.
  • Hadoop supports only batch processing, while Spark supports both batch and stream processing, as in the sketch below.
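
For comparison, here is a minimal Spark Structured Streaming sketch that counts words on lines arriving over a socket. The host and port are assumptions; the point is that each micro-batch of live data is processed as it arrives instead of being stored first.

```python
# Hedged stream-processing sketch using Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-example").getOrCreate()

# Read live lines from a socket (host/port are assumptions).
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Count words as the data arrives.
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the running counts for every micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```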

Data Layers

We can break the big data stack into data layers and see where the Spark framework fits.

Types of Data Layers
  • You may not have heard about data pipelines and data lakes, so I have added their use cases here. A data pipeline is used to migrate data from one place to another. A data lake is a centralized repository used to store, process, and secure large amounts of structured, semi-structured, and unstructured data. A rough sketch follows.
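
As a rough illustration of a data pipeline feeding a data lake, the sketch below moves data from a source location into a lake-style store as Parquet. Both paths are hypothetical.

```python
# Hedged data-pipeline sketch; source and destination paths are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-example").getOrCreate()

# Read raw data from its source location.
raw = spark.read.option("header", True).csv("/source/events.csv")

# Land it in the data lake as Parquet for later processing.
raw.write.mode("append").parquet("/datalake/raw/events")
```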

Hadoop supports all data layers, so why do we need to choose Spark?

  • Here the thing is: Spark SQL, Spark MLlib, Spark Streaming, and Spark GraphX all come under data processing, so Spark is only for data processing. Hadoop, on the other hand, has Hive, HDFS, Sqoop, Flume, Pig, Mahout, MapReduce, and so on.
  • Of the above, HDFS, Hive, and HBase come under data storage, and Hive, Pig, and MapReduce come under data processing. Sqoop and Flume help in the data pipeline, Mahout is used in data science, and Apache Oozie is used for data scheduling.
  • When we compare Spark with Hadoop, Spark supports only the data processing layer, while Hadoop supports all the data layers.
  • Now you may ask: if I go with Spark and then need storage or a pipeline, what should I do? The answer to this question comes in the next section.

Spark integration with Hadoop

  • Spark supports data storage such as NoSQL databases, RDBMSs, distributed file systems, and standalone file systems.
  • HBase, Cassandra, and MongoDB are among the top NoSQL databases, and Spark can be connected to these kinds of NoSQL databases. The HBase database comes with Hadoop.
  • Spark can be connected to any type of RDBMS database by using JDBC or ODBC.
  • Spark can be connected to any type of distributed file system. HDFS is a distributed file system, and by default Spark uses it. Spark is usually deployed on Hadoop, which is why it uses HDFS as its default data storage.
  • Here is the important thing: in big data we can see projects without Spark, but we rarely see a project with only Spark and no Hadoop, because Spark runs on Hadoop and uses some of the other Hadoop frameworks. (Spark itself can also run directly on Windows and Linux.)
  • We need to remember one thing: Spark is not a replacement for Hadoop; Spark is a replacement for MapReduce within Hadoop.
  • We can integrate Spark with all of Hadoop's components except MapReduce, because Spark is a replacement for MapReduce. A hedged sketch of such integration follows.
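
The sketch below shows the kind of integration described above: reading a table from an RDBMS over JDBC and reading a file from HDFS. The JDBC URL, table name, credentials, and HDFS path are assumptions, and the JDBC driver jar must be available to Spark.

```python
# Hedged integration sketch; connection details and paths are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("integration-example").getOrCreate()

# Read a table from any JDBC-compatible RDBMS.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/shop")
    .option("dbtable", "orders")
    .option("user", "spark_user")
    .option("password", "secret")
    .load()
)

# Read a file stored on HDFS, Spark's default storage when deployed on Hadoop.
logs = spark.read.text("hdfs:///data/logs/app.log")
```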

Daemon Process

  • When we install and run Spark, some processes run in the background; these are called daemon processes.
  • Two background processes will be running: one is the master, and the other is the worker.

Deployment

  • Standalone: if we install and run Spark only on our own cluster, without Hadoop, that is called a standalone deployment.
  • YARN (Yet Another Resource Negotiator): installing Spark on top of Hadoop and letting YARN manage resources is called a YARN deployment. Another deployment method is Mesos, which works in a similar way to YARN. A short sketch follows.
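
In practice, the deployment mode is largely a matter of which master URL the application uses. The sketch below assumes a standalone master at a hypothetical host and port; on a Hadoop cluster you would point it at YARN instead.

```python
# Hedged sketch: the master URL selects the deployment mode.
from pyspark.sql import SparkSession

# Standalone deployment: connect to the Spark master daemon directly.
spark = (
    SparkSession.builder
    .master("spark://master-host:7077")   # assumed host and port
    .appName("standalone-example")
    .getOrCreate()
)

# On a Hadoop cluster, .master("yarn") (usually via spark-submit) would
# run the same application as a YARN deployment instead.
```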

I hope you all understood these concepts of Apache Spark. Thanks for reading this article.
