What is Apache Spark?

Ashish Shah
5 min read · May 1, 2018


Most of you would have heard about Apache Spark and the other big data technologies (Hadoop, Hive, Cassandra) used around the world. But what is it about Spark that has made it explode onto the scene?

Source: EasyLearning

On its website, Apache Spark is explained as a ‘fast and general engine for large-scale data processing’. But that doesn’t even begin to encapsulate the reason it has become such a prominent player in the big data space. Its adoption by big data companies has been on the rise at an eye-catching rate. Understanding the reasons behind Spark’s rise can help predict the trajectory of upcoming big data solutions.

Apache Spark is a fast and general-purpose cluster computing system for real-time processing. It was developed at the AMPLab at U.C. Berkeley in 2009 and later donated to the Apache Software Foundation and open-sourced. It has a thriving open-source community and is the most active Apache project at the moment.

Spark vs Hadoop

A common question asked by developers is whether Spark is just another technology competing against Hadoop (meant to replace it), or whether they can keep their existing codebase and run Spark on top of it to leverage its capabilities.

Well, yes and no. Spark was built with Hadoop in mind. It is a potential replacement for Hadoop MapReduce, but it also extends the MapReduce model to work well with YARN and HDFS. Spark can run on top of HDFS to leverage its distributed, replicated storage.

Big Data Tech Stack

What sets Spark apart?

  1. Speed: In-memory processing
    As data volumes grow, computing efficiency becomes critical. Spark can handle several petabytes of data at a time, distributed across a cluster of thousands of physical or virtual servers. It does this through in-memory processing, which is what lets it deliver real-time analytics at lightning speed (up to 100x faster than Hadoop MapReduce).
  2. Support: Developer-friendly APIs
    Spark's APIs operate at a much higher level of abstraction than MapReduce's. It supports not only Java but also Python and Scala, a newer language with some attractive properties for manipulating data. It is often used alongside Hadoop's data storage module, HDFS, but it integrates equally well with other popular data stores such as HBase, Cassandra, MongoDB and Amazon's S3.
  3. Simple: Lazy evaluation
    Apache Spark delays evaluation until it is absolutely necessary, and this is one of the key factors contributing to its speed. It listens to what you ask it to do, writes down some shorthand for it so it doesn't forget, and then does absolutely nothing. It will continue to do nothing until you ask it for the final answer. Transformations are added to a DAG (Directed Acyclic Graph) of computation, and only when the driver requests some data does this DAG actually get executed (see the sketch below).
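
To make lazy evaluation concrete, here is a minimal Scala sketch you could paste into spark-shell (where the SparkContext sc is predefined); the log file path and its contents are hypothetical:

// Transformations only record lineage in the DAG; nothing runs yet.
val lines = sc.textFile("data/events.log")      // hypothetical input path
val errors = lines.filter(_.contains("ERROR"))  // transformation: recorded, not run
val lengths = errors.map(_.length)              // transformation: still nothing runs
println(lengths.count())                        // action: only now does the DAG execute

Until count() is called, Spark has merely noted the three steps; the action is what triggers an actual job on the cluster.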

Spark at its best (Features) ❤

The heart of Apache Spark is powered by the concept of Resilient Distributed Dataset (RDD). It is a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. This is how Spark can achieve fast and scalable parallel processing so easily.
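
As a minimal sketch of the abstraction in spark-shell (the numbers and partition count are arbitrary):

// parallelize() splits a local collection into partitions spread across the cluster.
val numbers = sc.parallelize(1 to 1000000, numSlices = 8)

// Transformations return a NEW immutable RDD; numbers itself is never modified.
val squares = numbers.map(n => n.toLong * n)

// Actions run in parallel across the partitions and return a result to the driver.
println(squares.reduce(_ + _))

Immutability is what makes the parallelism safe: since an RDD never changes, any lost partition can simply be recomputed from its lineage if a node fails.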

On top of Spark Core sit the following four libraries.

Spark Core Components

Spark SQL

Spark SQL has become more and more important to the Apache Spark project. It is the interface most commonly used by today’s developers when creating applications. Spark SQL focuses on the processing of structured data, using a dataframe approach borrowed from R and Python (Pandas).
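
The snippets below operate on a hypothetical soccerPlayersDF dataframe; in spark-shell it could be built from a made-up local collection like this:

import spark.implicits._  // enables .toDF on local collections (pre-imported in spark-shell)

val soccerPlayersDF = Seq(
  ("Lionel Messi", 30, "Argentina"),
  ("Luka Modric", 32, "Croatia"),
  ("Harry Kane", 24, "England")
).toDF("name", "age", "country")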

Selecting some columns from a dataframe is as simple as this line:

soccerPlayersDF.select("name", "age", "country")

After registering the dataframe as a temporary view, we can use SQL to query it.

soccerPlayersDF.createOrReplaceTempView("soccer")
spark.sql("SELECT name, age, country FROM soccer")
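
Note that spark.sql returns another dataframe and, in keeping with the lazy evaluation described above, computes nothing until an action such as show() is called:

spark.sql("SELECT name, age, country FROM soccer").show()  // action: runs the query and prints the rows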

Spark Streaming (Analytics)

Nowadays, much of the data we see is generated in “streams” from various sources. While it is certainly feasible to store these streams on disk and analyze them in batches later, it is often important to process and act on the data as it arrives, in real time.

Earlier, stream processing in Hadoop had to be done in a cumbersome manner: developers would use MapReduce for batch processing and Apache Storm for real-time processing. This led to problems, as developers now had to manage two codebases that ran on different frameworks and needed to be kept in sync with each other.

Spark brought batch processing into streaming by breaking the stream down into a continuous series of micro-batches, which can then be manipulated using the ordinary Apache Spark API. This reduces overhead, since batch and streaming operations now share most of their code and run on the same framework.
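
A minimal word-count sketch of the micro-batch model, assuming a local test source fed by nc -lk 9999 (the host, port, and batch interval are arbitrary):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every 5 seconds of input becomes a small RDD (a micro-batch)
// that is processed with the ordinary Spark API.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()  // runs once per micro-batch

ssc.start()
ssc.awaitTermination()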

Source: Datanami

Spark MLlib

Spark MLlib includes a framework for creating machine learning pipelines, covering preprocessing, feature extraction, selection, and transformation on any dataset. Spark’s ability to keep data in memory and rapidly run repeated queries makes it a good choice for training machine learning algorithms, cutting down training time. Data scientists can train models in Apache Spark using R or Python, save them with MLlib, and then import them into a Java-based or Scala-based pipeline.

Note: Apache Spark MLlib doesn’t include implementations for modeling or training deep learning algorithms.
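
A minimal sketch of that pipeline workflow in spark-shell (the toy data, column names, and save path are all made up):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// Toy training data: two numeric features and a binary label.
val training = Seq(
  (0.0, 1.1, 0.0),
  (2.0, 1.0, 1.0),
  (2.2, 0.9, 1.0),
  (0.1, 1.3, 0.0)
).toDF("f1", "f2", "label")

// Assemble the raw columns into the single vector column MLlib expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val lr = new LogisticRegression().setMaxIter(10)

// fit() trains the whole pipeline on the cluster.
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

// Persist the fitted model so another (e.g. Java- or Scala-based) pipeline can reload it.
model.write.overwrite().save("/tmp/soccer-lr-model")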

Spark GraphX

This library comprises various distributed algorithms for processing graph structures. These algorithms use Spark Core’s RDD approach to modeling data; the companion GraphFrames package lets you run graph operations on dataframes, including taking advantage of the Catalyst optimizer for graph queries.
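
A minimal GraphX sketch in spark-shell (the tiny follower graph is made up), running PageRank, one of the library’s built-in distributed algorithms:

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, attribute) pairs; edges carry their own attribute.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(3L, 2L, "follows"),
  Edge(2L, 3L, "follows")
))
val graph = Graph(vertices, edges)

// Run PageRank to convergence and print each vertex's rank.
val ranks = graph.pageRank(tol = 0.001).vertices
ranks.collect().sortBy(_._1).foreach { case (id, rank) => println(f"$id -> $rank%.3f") }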

Applications of Apache Spark

Databricks (a company founded by the creators of Apache Spark) lists the following use cases for Spark:

  1. Data integration and ETL
  2. High performance batch computation
  3. Machine learning analytics
  4. Real-time stream processing

These applications aren’t new at all, but they run much faster on Spark.

Thanks for reading!

If you’re interested, check out my article on How to get started with Apache Spark!

Sources: InfoWorld, MapR, Quora, Datafloq and our beloved Wikipedia
