Apache Spark for beginners

Data Science With Apache Spark 2

Evolution of Apache Spark

Spark began in 2009 as a research project in UC Berkeley’s AMPLab, created by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.

Industries use Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce), and it enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. The main concern, however, is speed when processing large datasets: the waiting time between queries and the waiting time to run a program.

Spark was introduced to speed up the computational process that Hadoop performs.

What Apache Spark Does

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Spark vs Hadoop

The Hadoop YARN-based architecture provides the foundation that enables Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response. Spark is now one of many data access engines that work with YARN in HDP. Spark is designed for data science and its abstraction makes data science easier.

Spark also includes MLlib, a library that provides a growing set of machine learning algorithms for common data science techniques: classification, regression, collaborative filtering, clustering and dimensionality reduction.
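
As a rough illustration of the MLlib RDD-based API (a minimal sketch only; the data points and the choice of k below are invented for the example, and a SparkContext `sc` such as the one in the pyspark shell is assumed), clustering a handful of 2-D points with k-means looks like this:

from pyspark.mllib.clustering import KMeans

# Four 2-D points that form two obvious clusters (toy data for illustration)
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [8.0, 9.0], [9.0, 8.0]])

# Train a k-means model with k=2 clusters
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)   # the two learned cluster centres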

Features of Apache Spark

Speed − Spark helps run applications in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. It achieves this by reducing the number of read/write operations to disk and by storing the intermediate processing data in memory.

Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also offers more than 80 high-level operators for interactive querying.

Advanced Analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
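
The SQL side of this can be sketched as follows (assuming a Spark 2 SparkSession named `spark`, as provided by the pyspark shell; the table and column names are invented for the example):

# Build a tiny DataFrame, register it as a temporary view, and query it with SQL
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 2").show()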

Resilient Distributed Datasets (RDD)

A Resilient Distributed Dataset (RDD) is the primary data abstraction in Apache Spark and the core of Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. The RDD class contains the basic operations available on all RDDs, such as map, filter, and persist.
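
A minimal sketch of those basic operations in PySpark (assuming an existing SparkContext `sc`, as in the examples later in this post; the data is made up):

nums = sc.parallelize([1, 2, 3, 4, 5])        # create an RDD from a local collection

squares = nums.map(lambda x: x * x)           # transformation: apply a function to every element
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: keep only the even squares
evens.persist()                               # keep this RDD in memory after it is first computed

print(evens.collect())                        # action: runs the computation, e.g. [4, 16]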

In addition, PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; DoubleRDDFunctions contains operations available only on RDDs of Doubles; and SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles.
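
In PySpark the equivalent pair operations are available directly on RDDs of 2-tuples; a small sketch (the keys and values are invented for the example):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
other = sc.parallelize([("a", "x"), ("b", "y")])

grouped = pairs.groupByKey()                  # group all values that share a key
joined = pairs.join(other)                    # inner join on the key
print(joined.collect())                       # e.g. [('a', (1, 'x')), ('a', (3, 'x')), ('b', (2, 'y'))]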

All of these operations are automatically available on any RDD of the right type (e.g. an RDD of pairs) through implicit conversions. By using RDDs, Spark hides data partitioning and distribution, which in turn allowed its designers to build a parallel computation framework with a higher-level programming interface (API) for four mainstream programming languages.

The features of RDDs:

  • Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.
  • Distributed with data residing on multiple nodes in a cluster.
  • Dataset is a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects that represent the records of the data you work with (see the short partitioning sketch after this list).
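
A tiny sketch of the partitioned aspect (the number of partitions and their layout below are illustrative; actual values depend on your cluster and configuration):

data = sc.parallelize(range(10), numSlices=4)   # explicitly split the collection into 4 partitions
print(data.getNumPartitions())                  # 4
print(data.glom().collect())                    # inspect which elements ended up in which partition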

Serialization and Data Split in Spark

RDDs can be created within Spark through PySpark, Spark SQL or the Spark Scala API. Data that is ingested, or that already exists on disk on the Linux file system or on the Hadoop Distributed File System (HDFS), can be read and converted into a distributed dataset.
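
For example (the file paths below are placeholders; the exact URI scheme depends on how your cluster is deployed):

local_rdd = sc.textFile("file:///tmp/events.log")    # read a file from the local Linux file system
hdfs_rdd = sc.textFile("hdfs:///data/events.log")    # read a file from HDFS
print(hdfs_rdd.count())                              # action: number of lines in the file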

The key reason RDDs are an abstraction that works better for distributed data processing is that they avoid several of the issues that affect MapReduce, the older data processing paradigm that Spark is increasingly replacing. Chiefly, these are:

  1. Replication: Replication of data on different parts of a cluster is a feature of HDFS that enables data to be stored in a fault-tolerant manner. Spark’s RDDs address fault tolerance instead by using a lineage graph. The different name (resilient, as opposed to replicated) reflects this difference of implementation in the core functionality of Spark.
  2. Serialization: Serialization in MapReduce slows it down in operations like shuffling and sorting.
  3. Disk IO: One of the most computationally expensive operations is writing files to disk and reading them back, and this kind of disk input-output hurts the performance of big compute jobs. Apache Spark can cache and persist RDDs to save time during in-memory computation; it is primarily an in-memory processing engine that depends on affordable access to RAM (which differs from the “commodity hardware” argument made for Hadoop). Disk IO is expensive and time-consuming in “big compute” jobs (as opposed to “big data”, which refers to storing and handling large data sets). In MapReduce there is disk IO at every map or reduce stage, whereas Spark avoids much of it by keeping intermediate results in memory.
  4. Optimisation and Lazy Evaluation: These are mentioned together since lazy evaluation (a la Scala) allows a sequence of transformations to be declared on RDDs without actually spending compute time on them. Spark represents these transformations internally as a Directed Acyclic Graph (DAG), which lets it optimise and stage the computation appropriately, based on the memory settings; for the SQL and DataFrame APIs this job is done by Spark’s Catalyst optimizer. Spark’s built-in standalone resource manager can handle tasks by itself in conjunction with a file system, but Spark also integrates with existing resource managers in Hadoop-based clusters, such as YARN. (A small lazy-evaluation sketch follows this list.)
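
A minimal sketch of lazy evaluation and caching (same assumptions as the other snippets: an existing SparkContext `sc` and a placeholder file path):

lines = sc.textFile("hdfs:///data/events.log")     # lazy: nothing is read yet, only the DAG is recorded
errors = lines.filter(lambda l: "ERROR" in l)      # still lazy: the DAG just grows by one transformation
errors.cache()                                     # ask Spark to keep the result in memory once computed

print(errors.count())                              # first action: the DAG actually runs, result is cached
print(errors.take(5))                              # second action: served from the cached partitions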

RDD API Examples

Word Count

# Count word occurrences; the input path is from the original example, and the
# output path (illustrative) must point to a directory that does not already exist.
text_file = sc.textFile("whatsapp")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("whatsapp_counts")

Pi Estimation

import random

# Estimate Pi by sampling random points in the unit square.
NUM_SAMPLES = 1000000  # illustrative sample count

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, NUM_SAMPLES)) \
          .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

# A few basic RDD actions on a small collection
a = range(100)
data = sc.parallelize(a)
data.count()   # 100
data.take(5)   # [0, 1, 2, 3, 4]
