Understanding Spark – I
- What is Spark?
- Spark vs MapReduce
- Lazy evaluation
- What are RDDs?
- RDD creation – Python script
Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data and need low-latency (essentially, quick) processing that a typical MapReduce program cannot provide, Spark is the way to go.
Spark can read from the same HDFS storage that Hadoop uses, so you can run Spark and MapReduce side by side if you already have significant investment and infrastructure set up with Hadoop. You can also combine core Spark processing with Spark SQL, machine learning and Spark Streaming.
What are the Spark use cases?
Databricks (a company founded by the creators of Apache Spark) lists the following cases for Spark:
- Data integration and ETL (Extract – Transform – Load)
- Interactive analytics or business intelligence.
- High performance batch computation.
- Machine Learning and advanced analysis.
- Real-time stream processing.
What are the key differences between Spark and MapReduce?
1. Spark tries to keep intermediate data in memory, whereas MapReduce writes results to disk between steps and reads them back. That disk traffic is what makes MapReduce slow and laborious on multi-step jobs, and avoiding it is what makes Spark several times faster.
2. It’s easier to develop for Spark. Spark is much more expressive in how you tell it to crunch data: it has map and reduce operations like MapReduce, but it adds others such as filter, join and group-by.
3. Spark also adds libraries for machine learning, streaming, graph programming and SQL, which makes things much easier for developers. These libraries are integrated, so improvements to Spark over time benefit the additional packages as well.
Lazy Evaluation in Spark — Saves on time and resources
Spark is more intelligent about how it operates on data. Spark supports lazy evaluation. Lazy evaluation means that if you tell Spark to operate on a set of data, it listens to what you ask it to do, writes down some shorthand for it so it doesn’t forget, and then does absolutely nothing. It will continue to do nothing, until you ask it for the final answer.
Simplifying this -
This is a bit like when you were in high school, and your mom came in to ask you to do a chore (“fetch me some milk for tonight’s dinner”). Your response: say that you were going to do it, then keep right on doing what you were already doing. Sometimes your mom would come back in and say she didn’t need the chore done after all (“I substituted water instead”). Work saved!
Spark is the same. It waits until you’re done giving it operators, and only when you ask for the final answer does it evaluate, always looking to limit how much work it has to do. Suppose you first ask Spark to filter a petabyte of data (say, to find all the point-of-sale records for the Chicago store) and then ask for just the first result that comes back. If Spark ran each instruction eagerly as you gave it, it would load the entire file, filter for all the Chicago records, and only then pick out the first line for you. That’s a huge waste of time and resources. Instead, Spark waits to see the full list of instructions and considers the chain as a whole. If you only want the first line that matches the filter, Spark finds the first Chicago POS record, emits it as the answer, and stops. That is far less work than filtering everything first and then picking out only the first line.
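The same idea can be imitated in plain Python with generators. This is only an analogy and not how Spark is implemented; the records list and the examined counter are made up for illustration.

```python
# Hypothetical point-of-sale records, standing in for a very large file.
records = [("Boston", 5.0), ("Chicago", 20.0), ("Denver", 3.0), ("Chicago", 7.5)]

examined = []  # every record that actually gets read is noted here


def read_records():
    """Yield records one at a time, noting each one as it is read."""
    for rec in records:
        examined.append(rec)
        yield rec


# Describing the filter does no work: nothing has been read yet.
chicago_only = (rec for rec in read_records() if rec[0] == "Chicago")
reads_before = len(examined)

# Asking for the first result does just enough work to find it, then stops.
first_match = next(chicago_only)
reads_after = len(examined)

print(reads_before, first_match, reads_after)
```

Only the records up to the first Chicago match are read; the rest of the list is never touched.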
RDDs in Spark
The Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. An RDD is a read-only (immutable, that is, unchangeable), partitioned collection of records. RDDs can be created through deterministic operations on either data in stable storage or other RDDs.
- RDDs cannot be changed once they are created: they are immutable.
- They are partitioned collections of objects spread across a cluster, stored in memory or on disk, hence the term distributed.
- Spark tracks lineage information so it can efficiently recompute any lost data if a machine fails or crashes, hence the term resilient.
RDD Creation — Python Script
An RDD can be created in one of two ways:
1) You can .parallelize(…) a collection (a list or an array of elements):
# sc is the SparkContext; the pyspark shell creates it for you
# Create a Python collection with 10000 elements
data = range(1, 10001)
# Create an RDD named "data_RDD", with 8 partitions
data_RDD = sc.parallelize(data, 8)
2) Or you can reference a file (or files) located either locally or on external storage:
# Consider you have a file called XYZ.txt
# Create an RDD named "data_from_file", asking for at least 4 partitions
data_from_file = sc.textFile('/Users/drabast/Documents/XYZ.txt', 4)
You can check out the links below for more material: