RDD — Resilient Distributed Datasets
resilient: able to withstand failures
distributed: spanning across multiple machines
formally, a read-only, partitioned collection of records
To adhere to RDD interface, a dataset must implement:
- partitions() -> Array[Partition]
> lookup blocks information from the NameNode
> make a partition for every block
> return an array of the partitions
2. iterator(p: Partition, parents: Array[Iterator]) -> Iterator
> parents are not used
> return a reader for the block of the given partition
3. dependencies() -> Array[Dependency]
> return an empty array
4. The dataset must be typed( Typed!RDD[T] — a dataset of items of type T), this allows type checking to catch errors early on before the actual execution
2 ways to construct RDDs
- data in a stable storage
Example: files in HDFS, objects in Amazon S3 bucket, lines in text files, …
RDD for data in a stable storage has no dependencies
2. from existing RDDs by applying a transformation
Example: filtered file, grouped records, …
RDD for a transformed data depends on the source data
transformations:
> Allow you to create new RDDs from existing RDDs by specifying how to obtain new items from the existing items
> The transformed RDD depends implicitly on the source RDD
[ As transformation can be executed in any place or at any time, we’d better to only get access to the resources within the RDD, and it’s a bad idea to interact with local files]
[Datasets are immutable in spark, and you cannot modify data in-place]
Actions:
triggers data to be materialized and processed on the executors and then passes the outcome to the driver
Resiliency:
For wide dependencies, the restarts are more fragile, as it require all the dependencies to be alive to recompute the output partition, and it can be very expensive when the dependencies we evict out of a cache.
Actions have side-effects in spark:
- due to restarts, spark cannot guarantee that an action would be executed, because its execution may be fail in between
- Actions have to be idempotent that is safe to be re-executed multiple times given the same input