RDD

Z.
3 min readDec 6, 2018

--

RDD — Resilient Distributed Datasets

resilient: able to withstand failures

distributed: spanning across multiple machines

formally, a read-only, partitioned collection of records

To adhere to RDD interface, a dataset must implement:

  1. partitions() -> Array[Partition]

> lookup blocks information from the NameNode

> make a partition for every block

> return an array of the partitions

2. iterator(p: Partition, parents: Array[Iterator]) -> Iterator

> parents are not used

> return a reader for the block of the given partition

3. dependencies() -> Array[Dependency]

> return an empty array

4. The dataset must be typed( Typed!RDD[T] — a dataset of items of type T), this allows type checking to catch errors early on before the actual execution

2 ways to construct RDDs

  1. data in a stable storage

Example: files in HDFS, objects in Amazon S3 bucket, lines in text files, …

RDD for data in a stable storage has no dependencies

2. from existing RDDs by applying a transformation

Example: filtered file, grouped records, …

RDD for a transformed data depends on the source data

transformations:

> Allow you to create new RDDs from existing RDDs by specifying how to obtain new items from the existing items

> The transformed RDD depends implicitly on the source RDD

[ As transformation can be executed in any place or at any time, we’d better to only get access to the resources within the RDD, and it’s a bad idea to interact with local files]

[Datasets are immutable in spark, and you cannot modify data in-place]

Actions:

triggers data to be materialized and processed on the executors and then passes the outcome to the driver

Resiliency:

For wide dependencies, the restarts are more fragile, as it require all the dependencies to be alive to recompute the output partition, and it can be very expensive when the dependencies we evict out of a cache.

Actions have side-effects in spark:

  1. due to restarts, spark cannot guarantee that an action would be executed, because its execution may be fail in between
  2. Actions have to be idempotent that is safe to be re-executed multiple times given the same input

--

--

Z.
0 Followers

This world is beautiful. But I don’t deserve it.