Apache Spark | A Processing Friend

Gobalakrishnan Viswanathan
Published in The Startup
7 min read · Aug 7, 2020

Apache Spark is an open-source, distributed data processing system for big data applications. It uses in-memory caching to deliver fast responses at almost any data size. From its official site,

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs

Four advantages of Apache Spark, as stated by its developers:

1. Speed

Runs approximately 100x faster than its competitor, the Hadoop ecosystem (MapReduce). It achieves high performance for both batch and streaming data.

2. Ease of Use

Offers over 80 high-level operators for building parallel apps, and can be used from industry-ruling languages like Scala, Python, R, SQL, and more.

3. Generality

Spark ships with a stack of libraries for SQL, machine learning, streaming, and graph processing. These libraries can be combined seamlessly within the same application.

4. Runs Everywhere

No special infrastructure is required. Spark runs on already available environments like Hadoop, Mesos, Kubernetes, and the cloud, or it can run as a standalone cluster.

Why is Spark winning Hearts?

The main reason for using Apache Spark is its unified engine. Before Spark entered the scene, many separate tools were used to do specific jobs on top of HDFS and MapReduce.

Why Apache Spark over other tools? (Image Credits to GreatLearning.in team)

Before Apache Spark, there was a king who started this big data processing era, named HDFS-MapReduce, which was used to store and process large volumes of data. But as time went on, data grew along with its very own friend, complexity. This made it difficult to handle the different types of requirements, so lots of tools were developed to take care of specific needs. For example,

  1. Impala: An MPP (Massively Parallel Processing) SQL query engine for processing huge volumes of data stored in a Hadoop cluster.
  2. Storm: A tool for real-time data processing.
  3. Mahout: Used for creating scalable machine learning algorithms.
  4. Drill: A low-latency distributed query engine for large-scale datasets.
    and so on.

So it became very difficult to manage all of these tools for data processing and manipulation. This is where Apache Spark comes in: its unified engine can do the job of all these tools.

Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality such as Spark SQL, Spark Streaming, MLlib, and GraphX is built.

Yes, obviously SPEED is one more important reason to pick SPARK.

Spark Architecture:

Once again, thanks to the greatlearning.in team for the great picture. The one below explains the core architecture of Spark neatly.

Architecture of Apache Spark. (Thanks again for the picture greatlearning.in)

So let's get into the image. The red-and-yellow thing in the center is our beloved Spark.
The two things above it refer to the disk and the RAM of the environment Spark runs on. Those are important because Spark uses them very effectively, especially memory, which is why its responses are so speedy.

The first layer is about the languages supported by Apache Spark.

  • Scala (Spark itself is written in Scala!)
  • Python
  • Java
  • R
  • SQL

The second layer is about the libraries available in Spark.
At the heart of Spark is Spark Core. It is not a library; it is the Spark engine itself, and it connects all the libraries. In total, four libraries are available in the Spark package. (Most of the definitions below are taken from the Databricks website.)

  1. Spark SQL:
    Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. (A short sketch of the DataFrame and SQL APIs follows this list.)
  2. MLlib:
    Apache Spark MLlib is the Apache Spark machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.
  3. GraphX:
    GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction.
  4. Spark Streaming:
    Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming provides a high-level abstraction called DStream, which represents a continuous stream of data.
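
To make the Spark SQL part a bit more concrete, here is a minimal Scala sketch of the DataFrame and SQL APIs. It is only an illustration and not from the official docs; the object name SparkSqlSketch and the sample data are made up for this example.

    import org.apache.spark.sql.SparkSession

    object SparkSqlSketch {
      def main(args: Array[String]): Unit = {
        // One SparkSession is the single entry point to Spark SQL (and the other libraries).
        val spark = SparkSession.builder()
          .appName("spark-sql-sketch")
          .master("local[*]")        // local mode: use all cores of this machine
          .getOrCreate()

        import spark.implicits._     // enables toDF() on Scala collections and the $"col" syntax

        // A tiny in-memory DataFrame standing in for real data.
        val people = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")

        // The DataFrame API and plain SQL are two views of the same engine.
        people.filter($"age" > 30).show()

        people.createOrReplaceTempView("people")
        spark.sql("SELECT name FROM people WHERE age > 30").show()

        spark.stop()
      }
    }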

The final layer is about the different modes in which we can run Spark. (A short sketch of the corresponding master URLs follows this list.)

  1. Local Mode:
    We can run Spark on the local machine itself, which is useful for development and testing purposes.
  2. Standalone Mode:
    Used when we want to install Spark on servers but do not want to bring in other tools such as YARN or Mesos. Spark's own built-in cluster manager handles the cluster, with no further tools attached to Spark.
  3. YARN:
    When Spark needs to run on a cluster and a dedicated cluster management tool is needed, YARN comes in. Simply put, YARN is a generic resource-management framework for distributed workloads.
  4. Mesos:
    Apache Mesos is a centralized, fault-tolerant cluster manager designed for distributed computing environments. It provides resource management and isolation, and scheduling of CPU and memory across the cluster.
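
As a rough illustration of how these modes are selected in practice, the sketch below chooses the master URL when building a SparkSession (on real clusters this is usually passed to spark-submit with --master instead). The host names are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("master-url-sketch")
      // Pick one master URL depending on where the cluster manager lives:
      .master("local[*]")                     // Local mode: all cores of this machine
      // .master("spark://master-host:7077")  // Standalone mode: Spark's own cluster manager
      // .master("yarn")                      // YARN: cluster located via HADOOP_CONF_DIR / YARN_CONF_DIR
      // .master("mesos://mesos-host:5050")   // Mesos cluster manager
      .getOrCreate()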

ZooKeeper:

ZooKeeper is something like a communication bridge between the instances of the cluster.

ZooKeeper is the tool that ensures high availability of the Spark cluster. When the current master fails, ZooKeeper promotes a standby instance to master, recovers the old master's state, and resumes scheduling.

YARN is the resource-management tool that handles resource allocation, coordination between resources, and scheduling, whereas ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization.
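
For the standalone cluster manager, ZooKeeper-based high availability is usually wired in through the spark.deploy.* properties, typically exported in conf/spark-env.sh on every master node. Below is only a hedged sketch; the ZooKeeper host names and the /spark directory are placeholder values.

    # conf/spark-env.sh on each standalone master (example values only)
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"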

Some Sparky Details:

  1. Spark is not a storage system; it is just an execution engine. Spark gets its data from storage systems such as HDFS, the local machine, and so on.
  2. We can use YARN, Mesos, or Kubernetes as the resource manager for Apache Spark. In standalone mode, Spark itself acts as the resource manager.
  3. Spark can read data from many storage systems like HDFS, AWS S3, local storage, and more. In the same way, it can write the data back to many storage systems too (see the short sketch after this list).
  4. In Spark Streaming, Spark can get data from a Flume agent or a Kafka queue. Spark itself can receive streaming data directly from the sources, but to avoid losing data when an instance goes down, it is better to pull the stream through a system like Flume or Kafka. Then, when a receiver instance in the Spark cluster fails, the data is not lost and can be read from Flume/Kafka by another Spark instance.
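
As a small illustration of point 3, the sketch below reads from a few different storage systems with the same DataFrame reader API and writes a result back. All paths, host names, and bucket names are made-up placeholders, and the S3 read additionally assumes the hadoop-aws connector is on the classpath.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("storage-sketch")
      .master("local[*]")
      .getOrCreate()

    // Spark only executes; the data itself lives in external storage systems.
    val fromLocal = spark.read.option("header", "true").csv("file:///data/sales.csv")  // local file system
    val fromHdfs  = spark.read.parquet("hdfs://namenode:8020/warehouse/sales")         // HDFS
    val fromS3    = spark.read.json("s3a://my-bucket/events/")                         // AWS S3

    // ...and the same engine can write results back to any of them.
    fromLocal.write.mode("overwrite").parquet("hdfs://namenode:8020/tmp/sales_parquet")
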
Hadoop vs Spark (Credits to greatlearning.in)

Here, Tachyon (now known as Alluxio) is something that stands between frameworks like Spark and the storage systems. Tachyon is a memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster jobs.

Apache Spark Windows 10 Installation

  • Spark is written in Scala. Scala source code is compiled to Java bytecode, so the resulting executable code runs on a Java virtual machine. Spark requires Java 8. Open the command prompt and
    type java -version to check whether Java is available on the system. If Java is not available, go to this link Java and download it.
Java installation check
  • Spark can work with languages like Scala, Python, Java, and R. Since Spark comes pre-built with Scala, we can continue with the Spark download now. Using Spark with other languages is covered later in this post.
  • Download the required Spark distribution from Spark_Download.
    Extract it to the desired path.
  • Spark comes with inbuilt Hadoop, but to make this work on Windows we need the winutils.exe file. Head to this Github Link, look inside the <hadoop_version>/bin path for winutils.exe, and download it.
  • Once the file is downloaded, create a folder named hadoop, create a folder named bin inside it, and paste the downloaded winutils.exe file there. The winutils file should now be at the <selected_path>/hadoop/bin path.
  • Now we have to update the environment variables to make Spark and Hadoop work. Press the Windows key, type "Edit environment variables for your account", and click it, as shown in the picture below.
Updating ENV variables for Spark and Hadoop
  • Using the New button, add the SPARK_HOME variable (pointing to the extracted Spark folder) and the HADOOP_HOME variable (pointing to the hadoop folder created above). Click OK to save the changes. Now we can use the Spark utility scripts.
  • Now go to <spark_extracted_folder>/bin, open the command line, type "spark-shell", and hit Enter. The prompt should give output something like the one below.
spark-shell output
  • If Python is installed on the system, use "pyspark" to open Spark in Python mode. The output will look mostly the same as in Scala mode. (A quick sanity check to run in either shell is shown after these steps.)
In both the Scala and Python images, the command throws two WARN messages: one about being unable to load the native library and one about an exception when trying to compute the page size. I will update this post if I resolve these warnings in the future.
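
Despite those warnings, a quick way to confirm the installation actually works is to run a tiny job. Inside spark-shell the session is already available as spark, so the following two lines are enough:

    // Typed at the scala> prompt inside spark-shell
    spark.version                  // prints the installed Spark version
    spark.range(1, 1000).count()   // runs a small distributed job; should return 999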

That's all for now, folks, on the introduction to Apache Spark and its environment installation. I am planning to write one more post on hands-on Spark with Scala. Hope we meet again soon on another day. Ta ta!!
