The Power of Spark in Data Science

So what is Spark?

Apache Spark is an open-source Big Data framework. It is a fast, general-purpose data processing engine designed for rapid computation, and it covers a wide range of workloads such as batch, interactive, iterative, and streaming.

Apache Spark is a lightning-fast cluster computing tool. Spark runs applications in Hadoop clusters up to 100x faster than MapReduce in memory and 10x faster on disk. Spark makes this possible by reducing the number of read/write cycles to disk and storing intermediate data in memory.

Spark is being adopted by major players like Amazon, eBay, and Yahoo! Many organizations run Spark on clusters with thousands of nodes. According to the Spark FAQ, the largest known cluster has over 8000 nodes. Indeed, Spark is a technology well worth taking note of and learning about.

Components of Spark

Spark also adds libraries for machine learning, streaming, graph processing, and SQL, which makes things much easier for developers. These libraries are integrated, so improvements in Spark over time also benefit the additional packages. Most data analysts would otherwise have to resort to many unrelated packages to get their work done, which makes things complex. Spark's libraries are designed to work together, on the same data, which is more integrated and easier to use. Spark Streaming in particular provides a way to do real-time stream processing. The Apache Storm project was designed for this kind of work, but Spark is much easier to develop for than Storm. Spark enables developers to do real-time analysis of everything from trading data to web clicks, in an easy-to-develop environment and with tremendous speed.
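As a quick illustration of how these libraries share one engine, here is a minimal PySpark sketch (assuming PySpark is installed and runs locally; the app name and sample data are made up) that builds a DataFrame and queries it with Spark SQL:

    from pyspark.sql import SparkSession

    # One SparkSession gives access to Spark SQL; MLlib, Spark Streaming and
    # the graph libraries run on the same engine and the same data.
    spark = SparkSession.builder.appName("libraries-demo").getOrCreate()

    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()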

Architecture Of Spark

RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on different nodes of the cluster.

Decomposing the name RDD:

  • Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and therefore able to recompute missing or damaged partitions after node failures.
  • Distributed, since the data resides on multiple nodes.
  • Dataset, representing the records of the data you work with. The user can load a dataset from an external source, such as a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.

Hence, each dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. they possess self-recovery in the case of failure.

An RDD can be created in two ways (both are shown in the Basic Programming Examples section below):

1. Parallelizing an existing collection.

2. Loading an external dataset from HDFS (or any other Hadoop-supported storage system).

Features of RDD in Spark

i. In-memory computation

Spark RDDs provide in-memory computation: they store intermediate results in distributed memory (RAM) instead of stable storage (disk).

ii. Lazy evaluation

All transformations in Apache Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset.

Spark computes the transformations only when an action requires a result to be returned to the driver program.
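A minimal sketch of lazy evaluation (assuming an existing SparkContext available as sc, and a hypothetical input file data.txt):

    lines = sc.textFile("data.txt")              # transformation: nothing is read yet
    lengths = lines.map(lambda l: len(l))        # transformation: still nothing computed
    total = lengths.reduce(lambda a, b: a + b)   # action: Spark now runs the job
    print(total)

Only the reduce() call, an action, triggers the actual computation; the two transformations before it merely record what should be done.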

iii. Fault Tolerance

Spark RDDs are fault tolerant because they track lineage information and can rebuild lost data automatically on failure. Each RDD remembers how it was created from other datasets (through transformations such as map, join, or groupBy), so it can recreate itself when a partition is lost.
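A small sketch of inspecting that lineage (assuming an existing SparkContext sc); toDebugString() returns a textual description of the parent RDDs that Spark keeps so it can recompute lost partitions:

    words = sc.parallelize(["spark", "rdd", "spark"])
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    # The lineage is returned as text (bytes in some PySpark versions).
    print(counts.toDebugString())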

iv. Immutability

Because RDDs are immutable, data is safe to share across processes. An RDD can be recreated or retrieved at any time, which makes caching, sharing, and replication easy. Immutability is also a way to achieve consistency in computations.

v. Partitioning

Partitions are the fundamental units of parallelism in a Spark RDD. Each partition is one logical division of the data, and new partitions are created by applying transformations to existing partitions.
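A minimal sketch of controlling partitions (assuming an existing SparkContext sc):

    rdd = sc.parallelize(range(100), 4)       # explicitly request 4 partitions
    print(rdd.getNumPartitions())             # -> 4
    bigger = rdd.repartition(8)               # transformation producing new partitions
    print(bigger.getNumPartitions())          # -> 8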

vi. Persistence

Users can state which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage or on disk).
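A minimal sketch of persistence (assuming an existing SparkContext sc and a hypothetical logs.txt file):

    from pyspark import StorageLevel

    logs = sc.textFile("logs.txt")
    errors = logs.filter(lambda line: "ERROR" in line)
    errors.persist(StorageLevel.MEMORY_AND_DISK)   # keep it around for reuse
    print(errors.count())                          # first action materialises and stores it
    print(errors.take(5))                          # reuses the persisted data

cache() is shorthand for persisting with the default memory-only storage level.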

vii. Coarse-grained operations

Coarse-grained operations apply to all elements in a dataset, through operations such as map, filter, or groupBy.

viii. Action/Transformations

All computations on Spark RDDs are expressed as either actions or transformations.
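A minimal sketch of the two kinds of operations (assuming an existing SparkContext sc):

    nums = sc.parallelize([1, 2, 3, 4, 5])
    squares = nums.map(lambda x: x * x)             # transformation
    evens = squares.filter(lambda x: x % 2 == 0)    # transformation
    print(evens.collect())                          # action -> [4, 16]
    print(squares.reduce(lambda a, b: a + b))       # action -> 55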

Basic Programming Examples:

The SparkContext is the main entry point for Spark functionality. A SparkContext instance connects us to the Spark cluster and is used to create just about everything in that cluster. If everything has started up correctly, it is available as sc in the application.
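In the PySpark shell, sc already exists; in a standalone script we can create one ourselves. A minimal sketch (the master URL and app name are assumptions for a local run):

    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="rdd-examples")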

Creating a simple RDD

The most common way of creating an RDD is to load it from a file. Notice that Spark’s textFile can handle compressed files directly.

A simple thing we can do to check that we got the RDD contents right is to count() the number of lines loaded from the file into the RDD.
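A minimal sketch (the file name is a placeholder; a gzipped file works just as well as plain text):

    raw_data = sc.textFile("data.csv.gz")
    print(raw_data.count())    # number of lines loaded into the RDD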

Another way of creating an RDD is to parallelize an already existing list.
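A minimal sketch of parallelizing a local collection (assuming an existing SparkContext sc):

    a = range(100)
    data = sc.parallelize(a)
    print(data.count())    # -> 100
    print(data.take(5))    # -> [0, 1, 2, 3, 4]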

Below are articles for more information:

https://spark.apache.org/docs/latest/programming-guide.html

https://www.analyticsvidhya.com/blog/2016/09/comprehensive-introduction-to-apache-spark-rdds-dataframes-using-pyspark/

https://jaceklaskowski.gitbooks.io/mastering-apache-spark-2/spark-rdd.html