Quick Apache Spark Overview

Shivakanth Reddy · Published in The Startup · May 2, 2020

Welcome to my post. A little about myself to start: I am currently a software engineer in data, and I have worked with Apache Spark for over a year now. Here on Medium, I want to share my knowledge and experience working with Apache Spark, which has become a common big data technology for processing large amounts of data. My goal is to provide the nuts and bolts of Spark to those who are interested in learning it, those who are starting to use it professionally, and anyone who is simply curious about getting familiar with a new technology.

In this post, I will cover the internals of the Spark engine and go over Spark’s libraries at an introductory level. By the end, you should have a good understanding of the foundation of Spark’s data processing engine and a high-level understanding of Spark’s modules.

Let’s start with a little history. Spark was created in UC Berkeley’s AMPLab in 2009, open sourced in 2010, and donated to the Apache Software Foundation in 2013.

So what is it about Spark that makes it special? Spark is a cluster computing framework that transforms data in memory, at speeds far beyond older disk-based technologies. Here is an overview of Spark’s main traits.

The Spark framework is built in Scala and consists of Spark Core, the main engine that drives and manages the processing of data. On top of this engine and its APIs sit the provided libraries: Spark SQL, MLlib, GraphX, and Spark Streaming. I will go over these in the sections below, so stay tuned.

The next thing to know about Spark is that it does not have its own distributed storage system, so it depends on an external storage system for distributed computing, and in distributed mode it is typically paired with an external resource manager as well. Since many big data projects deal with multiple petabytes of data that need to live in distributed storage, many teams use Hadoop’s distributed file system (HDFS) along with its resource manager, YARN. Hadoop and Spark are not mutually exclusive and work well together; however, Hadoop is not needed when running Spark in standalone mode. Other options besides HDFS and YARN are Apache Mesos, a popular resource manager used in projects as an alternative to YARN, and AWS S3 as an alternative to HDFS. Which setup is better? I will compare the different setups side by side and cover this topic in greater detail in my future posts.
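To make this concrete, here is a minimal sketch in Scala using the standard SparkSession builder. The application name and master values are just assumptions for illustration; the point is that the same application can run locally without Hadoop, or be pointed at a cluster manager such as YARN.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the `master` setting decides where the work runs.
// "local[*]" runs everything in this JVM using all available cores,
// so no Hadoop/YARN is needed for a local experiment.
val spark = SparkSession.builder()
  .appName("spark-overview-demo")  // hypothetical application name
  .master("local[*]")              // on a Hadoop cluster this would be "yarn" (with the cluster config in place)
  .getOrCreate()

println(spark.version)
spark.stop()
```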

Now that we have covered the fundamentals of Spark’s framework, I am going to go through the two main abstractions of Spark Core in greater detail: the RDD and the DAG.

1. Resilient Distributed Dataset (RDD)

The key to understanding Apache Spark is RDD — Resilient Distributed Dataset.

  • R — Resilient, meaning it is fault tolerant. If the data in a partition is lost due to a hardware failure, it is recreated from the source, giving a full recovery of the lost data.
  • D — Distributed, meaning the data is split into multiple partitions across cluster nodes so it can be processed in parallel.
  • D — Dataset, representing the ingested data held in the partitions.

In Spark, the RDD is the core data object type underneath all the APIs. During data processing, each transformation produces and returns a new RDD, until the transformations are complete and the result is ready for output. Take a look at the figure below to get a visual understanding.

Visual representation of an RDD
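To make the idea concrete, here is a small sketch in Scala, assuming a SparkSession named `spark` (as created above, or provided by spark-shell) and a hypothetical input file. Each transformation returns a new RDD, and nothing is computed until the action at the end.

```scala
// `sc` is the SparkContext behind the SparkSession (spark-shell provides both).
val sc = spark.sparkContext

// Each step below returns a new RDD; the data stays partitioned across the cluster.
val lines   = sc.textFile("data/events.log")      // hypothetical input path
val errors  = lines.filter(_.contains("ERROR"))   // RDD of matching lines
val lengths = errors.map(line => line.length)     // RDD of line lengths

// Only this action triggers the actual processing of the partitions.
println(lengths.count())
```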

2. Directed Acyclic Graph (DAG)

In simple terms, a DAG in Spark is nothing but a graph that keeps track of the operations applied to an RDD.

Directed — the nodes are directly connected, one to the next. This creates a sequence in which each operation is linked from an earlier step to a later one in the appropriate order.

Acyclic — there are no cycles or loops. Once a transformation has taken place, execution never returns to an earlier position in the graph.

Graph — the sequence of operations to be executed, represented, as in graph theory, with vertices and edges.

Visual example of a DAG sequence of operations
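Here is a short sketch of the same idea in code, again assuming the `sc` SparkContext from above: the transformations only record operations in the DAG, which you can inspect with `toDebugString`, and the work runs when an action is called.

```scala
// Transformations: these only add nodes to the DAG, nothing executes yet.
val numbers = sc.parallelize(1 to 1000)
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Print the lineage (the recorded graph of operations) for this RDD.
println(evens.toDebugString)

// The action below makes Spark walk the DAG and compute the result.
println(evens.sum())
```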

Next, I want to dive into the libraries available in Spark.

Spark SQL — is a module that brings native SQL support to Spark, and a very important part of the framework. It provides a programming abstraction layer called the DataFrame API, which offers SQL-like methods that let developers express queries, making it easier to write data processing logic and conveniently blurring the line between RDDs and relational tables. Additionally, Spark SQL can be used for ad hoc analysis by reading data from an external file and running queries against it, which makes ad hoc analysis over big data just as easy.
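As a quick illustration, the sketch below (with invented column names and rows, and again assuming a `spark` session) shows the same small query expressed through DataFrame methods and through plain SQL over a temporary view.

```scala
import spark.implicits._

// A tiny in-memory DataFrame with invented example data.
val people = Seq(("Alice", 34), ("Bob", 45), ("Cara", 29)).toDF("name", "age")

// DataFrame methods that read like SQL...
people.filter($"age" > 30).select("name").show()

// ...or plain SQL against a temporary view of the same data.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```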

MLlib — is the machine learning library in Spark, another programming abstraction layer consisting of common learning algorithms and utilities. The library includes classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives. As of Spark 2.0, the primary MLlib API is DataFrame-based, which provides a more user-friendly interface than working directly with RDDs.
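Here is a minimal, hedged sketch of the DataFrame-based API: a tiny invented training set and a logistic regression model, just to show the shape of the workflow (assuming the `spark` session from earlier).

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Tiny invented training set: (label, features) rows.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(2.0, 1.1)),
  (0.0, Vectors.dense(0.1, 0.2)),
  (1.0, Vectors.dense(1.8, 0.9))
)).toDF("label", "features")

// Fit a simple classifier through the DataFrame-based API and inspect its predictions.
val model = new LogisticRegression().setMaxIter(10).fit(training)
model.transform(training).select("features", "prediction").show()
```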

GraphX — is the Spark API for graphs and graph-parallel computation. It includes a growing collection of graph algorithms and builders to simplify graph analytics tasks. GraphX extends the Spark RDD with a Resilient Distributed Property Graph: an RDD type in which the data is represented as vertices and edges. The vertices represent objects and the edges represent the relationships between those objects; the example below gives you a simple picture of this graph data representation.
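The sketch below uses invented people and relationships, along with the `sc` SparkContext from earlier, to show a property graph built from one RDD of vertices and one RDD of edges.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, property) pairs; edges connect vertex ids and carry a relationship label.
val vertices = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Cara")
))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "likes")
))

// The Resilient Distributed Property Graph built on top of the two RDDs.
val graph = Graph(vertices, edges)

// For example, count how many relationships point at each vertex.
println(graph.inDegrees.collect().mkString(", "))
```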

Spark Streaming is another extension of Spark; this module enables processing of live data streams. Data can be ingested from many sources, such as Kafka, Kinesis, or TCP sockets, and processed with complex logic, and the processed data can then be pushed out to filesystems, databases, or live dashboards. You can even apply Spark’s machine learning and graph processing algorithms to data streams. Under the hood, Spark Streaming provides an abstraction called a DStream (discretized stream), which represents a continuous stream of data as a sequence of RDDs. A DStream can be created either from input data streams from sources such as Kafka and Kinesis, or by applying high-level operations on other DStreams.

Spark Streaming process flow
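As a small illustration, here is a classic word-count sketch over a TCP socket stream; the host and port are placeholders, and Kafka or Kinesis would use their own connectors. Each five-second batch of the DStream is processed as an RDD.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batches arrive every 5 seconds; `spark` is the session from the earlier sketches.
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

// A DStream of text lines from a TCP socket (placeholder host/port).
val lines = ssc.socketTextStream("localhost", 9999)

// Count words in each batch; every batch is handled as an RDD under the hood.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // begin receiving and processing data
ssc.awaitTermination()  // keep the application running
```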

By now, I hope you have found the information above sufficient to understand the architecture of the Apache Spark framework, along with the introductions to Spark’s high-level libraries.

In the next posts, I will cover each library in more depth, along with common optimizations for different use cases, best practices, and setting up a Spark environment.
