Machine Learning in Spark-1: Understanding Spark and RDDs
Basic concepts and code snippets in Python
What is Apache Spark?
Apache Spark is an open-source big data framework built around speed, ease of use, and sophisticated analytics.
Apache Spark provides programmers with an API centered on a data structure called the resilient distributed dataset (RDD).
In Spark, data and computation are distributed across the nodes of a cluster.
What are RDDs?
RDD stands for Resilient Distributed Datasets.
RDDs are the basic data structure of Spark.
A single RDD is split into partitions that are distributed across the worker nodes and held in the RAM of each node.
Key features of RDDs:
- Resilient: RDDs are fault tolerant
- Distributed: Each RDD is distributed across multiple worker nodes
- Immutable: The data present inside an RDD cannot be changed
How do RDDs work?
RDDs are created and distributed across the nodes. The driver (SparkContext) distributes the code to the worker nodes, which then execute it on the RDD partitions held in their RAM. The results from the nodes are then sent back to the driver for aggregation.
Python code showing how RDDs work
Ensure that a Spark instance is running and that pyspark has been imported before running the code below.
sc = pyspark.SparkContext(appName="helloworld")

# Load a text file into an RDD and keep only the non-empty lines
lines = sc.textFile('README.md')
lines_nonempty = lines.filter(lambda x: len(x) > 0)

# Count the lines that contain the word "Python"
lines = sc.textFile("python.txt")
pythonLines = lines.filter(lambda line: "Python" in line)
print("No of lines containing 'Python':", pythonLines.count())

# Count every line in a file
filein = sc.textFile('/home/suvir/Documents/SparkFiles/treefile')
print('number of lines in file: %s' % filein.count())
Link to part-2 : Transformations and Actions on RDDs