Apache Spark

Apache Spark is an open-source cluster computing framework. It was initially developed at the AMPLab at the University of California, Berkeley, and was later donated to the Apache Software Foundation (ASF), which maintains it today. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Distributed Computing

Distribution introduces important concerns beyond those of parallelism in the shared-memory case:

partial failure: a subset of the machines may crash.

latency: network communication is far slower than local access (memory latency << disk latency << network latency).

Main ideas

Keep all data in memory and immutable.

Operations on the data are functional transformations.

Fault tolerance is achieved by replaying the functional transformations over the original dataset.
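
The replay idea can be sketched in plain Python. This is only a toy model, not Spark code: the names source, lineage and replay are made up for illustration, and a real cluster would track this per partition rather than for the whole dataset.

```python
from functools import reduce

# Immutable source dataset (a tuple cannot be modified in place).
source = (1, 2, 3, 4, 5)

# Lineage: the ordered list of pure functions applied to the source.
# (Hypothetical structure; it only mimics the idea of recorded transformations.)
lineage = [
    lambda xs: tuple(x * 2 for x in xs),       # map: double each element
    lambda xs: tuple(x for x in xs if x > 4),  # filter: keep values above 4
]

def replay(data, steps):
    """Rebuild a derived dataset by re-applying every recorded step in order."""
    return reduce(lambda acc, step: step(acc), steps, data)

derived = replay(source, lineage)    # first computation
derived = None                       # simulate losing the in-memory result
recovered = replay(source, lineage)  # recompute from the source and the lineage
print(recovered)                     # (6, 8, 10)
```

Because the source is never modified and every step is a pure function, replaying the steps always yields the same result, so no copy of the derived data has to be kept for recovery.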

Main characteristics of Apache Spark

Batch and stream processing of data.

Processes many kinds of data (structured, semi-structured and unstructured).

In-memory data processing and sharing.

A program is represented as a DAG (Directed Acyclic Graph) of operations.

Supports iterative algorithms and interactive analytics.

Rich API: more than 80 high-level operators are available for data processing, including map and reduce.

Disadvantages of MapReduce and Apache Hadoop

Involves a lot of interaction with the filesystem (HDFS) and the network.

Only two operations are available: map and reduce.
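
To make the two-operation model concrete, here is a word count written as one map step and one reduce step. It is a plain-Python sketch, not the Hadoop API: map_phase, reduce_phase and the sample lines are invented for illustration, and the shuffle that a real cluster performs between the two phases is left out.

```python
from functools import reduce

# Sample input; in Hadoop these lines would come from files on HDFS.
lines = ["spark is fast", "hadoop is reliable", "spark is easy"]

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(counts, pair):
    """Reduce step: fold one (word, n) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

# Every computation has to be phrased as a map followed by a reduce.
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce(reduce_phase, mapped, {})
print(word_counts["spark"], word_counts["is"])  # 2 3
```

Anything more involved, such as a join or an iterative algorithm, has to be expressed as a chain of such jobs, with intermediate results passing through the filesystem between them.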

Advantages of Apache Spark over Hadoop MapReduce

Spark can be up to 100x faster than Hadoop MapReduce for some large-scale workloads, by exploiting in-memory computing and other optimizations.

Spark also processes data stored on disk quickly.

Spark set a world record for large-scale on-disk sorting (the 2014 Daytona GraySort benchmark).

Easy to use: its rich API makes it simple to operate on large datasets.

The Spark engine comes with high-level libraries that support SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be combined smoothly into complex workflows.

Resilient Distributed Dataset (RDD)

It is the basic abstraction in Spark.

RDDs look a lot like immutable sequential or parallel Scala collections.

An RDD is a distributed collection with an API similar to List in Scala.

It follows the distributed data-parallel model.

RDDs are immutable.

RDDs are partitioned across the nodes of a cluster.

By default, an RDD is recomputed each time an action is run on it.
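
The cost of this default can be shown with a small plain-Python stand-in. ToyRDD below is hypothetical and is not the Spark API; its cache() method only imitates the effect of asking Spark to keep an RDD in memory (Spark's own RDDs offer cache() and persist() for that purpose).

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD, used only to count recomputations."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces the data
        self._use_cache = False
        self._cached = None
        self.runs = 0             # how many times the data was actually computed

    def cache(self):
        """Ask for the result to be kept after the first computation."""
        self._use_cache = True
        return self

    def collect(self):
        """Action: return the data, recomputing it unless a cached copy exists."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.runs += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result

squares = ToyRDD(lambda: [x * x for x in range(5)])
squares.collect()
squares.collect()
runs_without_cache = squares.runs   # 2: recomputed on every action

cached = ToyRDD(lambda: [x * x for x in range(5)]).cache()
cached.collect()
cached.collect()
runs_with_cache = cached.runs       # 1: computed once, then reused
```

When the same dataset feeds several actions, as in an iterative algorithm, keeping it in memory avoids repeating the whole chain of transformations.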

Ways of creating an RDD

By transforming an existing RDD.

From a SparkContext object (parallelize, textFile).

Operations in Spark


Transformations: always lazy. A transformation returns a new RDD, and its execution is delayed until an action that depends on it is invoked.


Actions: eager. An action is the final step of a computation; it returns a value rather than a new RDD.
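
The difference between the two kinds of operations can be sketched with a toy class in plain Python. LazyRDD and traced_double are invented names and this is not the Spark API; the sketch only mimics the behaviour described above, where map and filter record work and collect or count triggers it.

```python
class LazyRDD:
    """Hypothetical stand-in for an RDD that records steps instead of running them."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)    # immutable source data
        self._steps = tuple(steps)  # recorded transformations, not yet executed

    def map(self, fn):
        """Transformation: returns a new LazyRDD, runs nothing."""
        return LazyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        """Transformation: returns a new LazyRDD, runs nothing."""
        return LazyRDD(self._data, self._steps + (("filter", pred),))

    def collect(self):
        """Action: executes all recorded steps and returns a plain list."""
        items = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        """Action: returns a number, not a LazyRDD."""
        return len(self.collect())

calls = []  # records every element the mapping function actually processes

def traced_double(x):
    calls.append(x)
    return x * 2

numbers = LazyRDD([1, 2, 3, 4])
doubled = numbers.map(traced_double)     # nothing runs yet
large = doubled.filter(lambda x: x > 4)  # still nothing runs
calls_before_action = len(calls)         # 0
result = large.collect()                 # the action triggers the whole chain
calls_after_action = len(calls)          # 4
print(result)                            # [6, 8]
```

Deferring execution until an action gives the engine the full chain of transformations at once, which is what allows it to plan the work as a single DAG.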