Apache Spark is a fast, in-memory data processing engine with elegant, expressive development APIs that let data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets.
The main advantages of Spark are its
· Speed — in-memory computation avoids repeated disk I/O,
· Ease of use — concise APIs in Scala, Java, Python, and R,
· Generality — one engine covers SQL, streaming, machine learning, and graph processing.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
With Spark running on Apache Hadoop YARN, developers everywhere can build applications that exploit Spark’s power, derive insights, and enrich their data science workloads against a single, shared dataset in Hadoop. Spark has largely displaced Hadoop MapReduce for these workloads because of its speed and richer APIs.
Resilient Distributed Datasets (RDD):
· Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
· RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
· Resilient, i.e. fault-tolerant: using the lineage recorded in the RDD DAG, Spark can recompute partitions that are missing or damaged due to node failures.
· Distributed: the data resides on multiple nodes of the cluster, and the dataset represents the records of the data you work with.
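To make the partitioning idea concrete, here is a minimal pure-Python sketch (not the Spark API) of how a dataset is divided into logical partitions that can each be processed independently, as Spark would do on different cluster nodes; the `split_into_partitions` helper is hypothetical:

```python
# Conceptual sketch of RDD-style partitioning (plain Python, not Spark).

def split_into_partitions(data, num_partitions):
    """Hypothetical helper: divide a list into roughly equal chunks."""
    size = (len(data) + num_partitions - 1) // num_partitions
    return [data[i:i + size] for i in range(0, len(data), size)]

data = ["spark", "rdd", "example", "sample", "example"]
partitions = split_into_partitions(data, 2)

# Each partition is mapped independently (Spark would ship this
# function to the node holding the partition).
mapped = [[(w, 1) for w in part] for part in partitions]

# Collecting merges the per-partition results back together.
result = [pair for part in mapped for pair in part]
print(result)
```

In real Spark the same split is what the second argument of `sc.parallelize(data, 2)` controls: the number of logical partitions of the RDD.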
Iterative Processing in MapReduce:
In MapReduce, iterative algorithms must write their intermediate results to HDFS after every iteration and read them back at the start of the next, which makes each pass expensive. RDDs solve this problem by enabling fault-tolerant, distributed, in-memory computation: the dataset can be kept in memory across iterations.
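To illustrate why in-memory iteration matters, the following hedged pure-Python sketch (not Spark code) contrasts the two styles: the MapReduce-style loop round-trips through a file on every iteration (standing in for HDFS), while the RDD-style loop keeps the working set in memory. The function names and the trivial doubling step are made up for the example:

```python
import json
import os
import tempfile

# MapReduce-style: every iteration serializes its output to disk
# (HDFS in a real cluster) and re-reads it on the next iteration.
def iterate_with_disk(data, iterations):
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    for _ in range(iterations):
        data = [x * 2 for x in data]
        with open(path, "w") as f:      # write intermediate result
            json.dump(data, f)
        with open(path) as f:           # read it back next time round
            data = json.load(f)
    return data

# RDD-style: the working set stays in memory across iterations, so the
# per-iteration serialization and I/O cost disappears.
def iterate_in_memory(data, iterations):
    for _ in range(iterations):
        data = [x * 2 for x in data]
    return data

print(iterate_with_disk([1, 2, 3], 3))   # [8, 16, 24]
print(iterate_in_memory([1, 2, 3], 3))   # [8, 16, 24]
```

Both loops compute the same result; the difference is only where the intermediate data lives between iterations, which is exactly the cost RDDs eliminate.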
For example, mapping each element of an RDD to a (word, 1) pair:
x = sc.parallelize(["spark", "rdd", "example", "sample", "example"], 2)
y = x.map(lambda word: (word, 1))
y.collect()
Output: [('spark', 1), ('rdd', 1), ('example', 1), ('sample', 1), ('example', 1)]
# Another example: pairing each string with its length
y = x.map(lambda word: (word, len(word)))
y.collect()
Output: [('spark', 5), ('rdd', 3), ('example', 7), ('sample', 6), ('example', 7)]
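The (word, 1) pairs above are the usual first half of a word count; in Spark you would follow the `map` with `reduceByKey(lambda a, b: a + b)`. Here is a small pure-Python sketch of what that reduction computes over the pairs (not the Spark call itself):

```python
# Sketch of what reduceByKey(lambda a, b: a + b) computes over the
# (word, 1) pairs, written in plain Python.
pairs = [("spark", 1), ("rdd", 1), ("example", 1), ("sample", 1), ("example", 1)]

counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n   # merge values sharing a key

print(sorted(counts.items()))
# [('example', 2), ('rdd', 1), ('sample', 1), ('spark', 1)]
```

In Spark the same result comes back from `y.reduceByKey(lambda a, b: a + b).collect()`, with the merging performed in parallel inside each partition before the partial counts are combined across the cluster.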