Machine Learning in Spark-1: Understanding Spark and RDDs

Basic concepts and code snippets in Python


What is Apache Spark?

Apache Spark is an open-source big data framework built around speed, ease of use, and sophisticated analytics.

Apache Spark provides programmers with an API centered on a data structure called the resilient distributed dataset (RDD).

In Spark, both the data and the computation are distributed across the nodes of the cluster.

Basic Spark architecture
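
As a quick illustration, here is a minimal sketch of that entry point, assuming a local PySpark installation (the master URL "local[*]" and the application name here are just placeholders for the demo):

import pyspark

# "local[*]" runs Spark locally on all available cores; in a real cluster the
# master URL would point at the cluster manager instead.
sc = pyspark.SparkContext(master="local[*]", appName="spark-intro")

# parallelize() hands the collection to Spark, which splits it into partitions
# and distributes them across the executors.
numbers = sc.parallelize(range(1000))
print(numbers.getNumPartitions())  # how many partitions the data was split into

sc.stop()
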
What are RDDs?

RDD stands for Resilient Distributed Dataset.

RDDs are the basic data structure of Spark.

A single RDD is split into partitions that are distributed across the worker nodes, and each partition is held in the RAM of the node it lives on.

Key features of RDDs:

  1. Resilient: RDDs are fault-tolerant
  2. Distributed: Each RDD is distributed across multiple worker nodes
  3. Immutable: The data inside an RDD cannot be changed (see the sketch after this list)
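
To see the immutability point in practice, the short sketch below (same local setup as above, with placeholder names) shows that a transformation such as map never modifies an existing RDD; it always returns a new one:

import pyspark

sc = pyspark.SparkContext(master="local[*]", appName="rdd-immutability")

original = sc.parallelize([1, 2, 3, 4, 5])

# map() does not change `original`; it returns a brand-new RDD.
doubled = original.map(lambda x: x * 2)

print(original.collect())  # [1, 2, 3, 4, 5] -- unchanged
print(doubled.collect())   # [2, 4, 6, 8, 10]

sc.stop()
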
How do RDDs work?

RDDs are created and distributed across the nodes. The driver (the SparkContext) ships the code to the worker nodes, and each worker executes it on the RDD partition held in its RAM. The results from the nodes are then sent back to the driver for aggregation.

The RDD is partitioned and sent to different nodes, computation runs on each node, and the results are aggregated in the driver
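
A minimal sketch of that flow (again assuming a local installation; the names are placeholders): the data is split into partitions, each worker computes on its own partition, and the partial results are combined in the driver.

import pyspark

sc = pyspark.SparkContext(master="local[*]", appName="rdd-workflow")

# Explicitly ask for 4 partitions so the split is visible.
numbers = sc.parallelize(range(1, 101), 4)

# glom() groups each partition's elements into a list, so we can see the split.
print(numbers.glom().map(len).collect())  # e.g. [25, 25, 25, 25]

# Each worker sums its own partition; the partial sums are combined in the driver.
total = numbers.reduce(lambda a, b: a + b)
print(total)  # 5050

sc.stop()
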
Python code showing how RDDs work

Ensure that a Spark instance is running and pyspark has been imported before running the code below.


Example 1:

sc = pyspark.SparkContext(appName="helloworld")
lines = sc.textFile('README.md')                      # load the file as an RDD of lines
lines_nonempty = lines.filter(lambda x: len(x) > 0)   # keep only the non-empty lines
lines_nonempty.count()                                # action: triggers the computation and returns the count
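
Note that filter() here is a transformation and is evaluated lazily: nothing is actually read or filtered until the count() action is called, at which point the work runs on the workers and the resulting count is returned to the driver.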