Apache Spark: RDDs, DataFrames, Datasets

Life-is-short--so--enjoy-it
Dec 25, 2021


Spark APIs: Summarized by a New Starter

Introduction

As a newcomer to Spark, I was curious what RDDs, DataFrames, and Datasets are, and I wanted to learn more about them.

This article is a summary of these two links.

Characteristics of RDD ( Resilient Distributed Dataset )

  1. Logical Distributed Data abstraction
  2. Resilient / Immutable
  3. Compile-time Type-safe
  4. Unstructured / Structured Data
  5. Lazy Evaluation

1. Logical Distributed Data Abstraction
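An RDD is a logical handle for a dataset whose partitions are physically spread across the executors of a cluster; you program against it as if it were a single local collection. A minimal sketch, assuming a local SparkContext (the setup below is purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local setup, for illustration only.
val conf = new SparkConf().setAppName("rdd-abstraction").setMaster("local[4]")
val sc   = new SparkContext(conf)

// One logical dataset, physically split into 4 partitions.
val numbers = sc.parallelize(1 to 1000000, numSlices = 4)

println(numbers.getNumPartitions) // 4: the distribution is there, but abstracted away
```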

2. Resilient / Immutable

What is Resilient? An RDD can be recovered (recreated) at any point during execution if something goes wrong, because Spark keeps track of the “lineage” of each RDD: the sequence of operations that produced it.
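For example, Spark exposes the lineage it records for each RDD through `toDebugString`; if a partition is lost, Spark replays those recorded transformations to rebuild it. A small sketch, reusing the `sc` context from the first example and a hypothetical input path:

```scala
val lines  = sc.textFile("data/input.txt")   // hypothetical path
val words  = lines.flatMap(_.split(" "))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)

// Prints the recorded lineage (textFile -> flatMap -> map -> reduceByKey),
// which is exactly what Spark uses to recompute lost partitions.
println(counts.toDebugString)
```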

3. Compile-time Type-safe
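With RDDs, the element type is part of the RDD's type, so many mistakes are caught by the compiler instead of blowing up at runtime. A minimal sketch:

```scala
case class Person(name: String, age: Int)

val people: org.apache.spark.rdd.RDD[Person] =
  sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 52)))

val adults = people.filter(_.age >= 18)      // compiles: `age` is known to be an Int
// val broken = people.filter(_.salary > 0)  // would not compile: Person has no `salary` field
```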

4. Unstructured / Structured Data
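RDDs do not require a schema: the same API can hold raw, unstructured records (lines of text, media streams) or structured records you define yourself. For example:

```scala
// Unstructured: each element is just a raw line of text (hypothetical path).
val rawLines: org.apache.spark.rdd.RDD[String] = sc.textFile("logs/app.log")

// Structured: each element is a typed record parsed from those lines,
// assuming every line looks like "LEVEL message...".
case class LogEntry(level: String, message: String)
val entries = rawLines.map { line =>
  val Array(level, message) = line.split(" ", 2)
  LogEntry(level, message)
}
```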

5. Lazy Evaluation

Inside Apache Spark, the workflow is managed as a directed acyclic graph (DAG). Transformations only add steps to the DAG, each producing a new (not yet evaluated) RDD; the DAG is only executed when an action is called.
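In other words, transformations such as `map` and `filter` only add steps to the DAG; nothing runs until an action such as `count` or `collect` is called. A quick sketch, using the `numbers` RDD from the first example:

```scala
val doubled = numbers.map(_ * 2)      // transformation: nothing executed yet
val large   = doubled.filter(_ > 10)  // transformation: still nothing executed

val result  = large.count()           // action: the whole DAG is executed now
```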

Why Use RDDs?

  • offering control & flexibility
  • offering a low-level API (RDD is the lowest-level API in Spark)
  • offering type-safety
  • encouraging a “how-to” style: you describe how to do something

When to use RDDs?

  • when you care about a low-level API & fine-grained control over your dataset
  • when you are dealing with unstructured data (media streams or raw text)
  • when you want to manipulate your data with lambda functions rather than a DSL (Domain-Specific Language / high-level API)
  • when you don’t care about imposing a schema or structure on your data
  • when you can give up the optimization, performance, and efficiency benefits available with the structured APIs

What are the problems with using RDDs?

  • expressing the “how-to” of a solution rather than the “what-to”
  • not optimized by Spark (because Spark can’t see inside your lambda functions or the data they operate on)
  • slow for non-JVM languages like Python
  • inadvertent inefficiencies: for example, the code below would be cheaper if the order of execution were changed (filter before reduceByKey), but Spark runs it exactly as the user wrote it (see the sketch after this list)
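A sketch of the kind of inefficiency meant here, using a small hypothetical pair RDD. Filtering after the shuffle moves more data than necessary, and Spark will not reorder RDD code on your behalf:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// As written: reduceByKey shuffles every key, then most of the results are thrown away.
val slow = pairs
  .reduceByKey(_ + _)
  .filter { case (key, _) => key == "a" }

// Cheaper by hand: filter first, so the shuffle only sees the keys we keep.
// Spark does NOT make this rewrite for you when you use RDDs.
val fast = pairs
  .filter { case (key, _) => key == "a" }
  .reduceByKey(_ + _)
```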

Background: What is in an RDD?

  • Dependencies: the lineage links to parent RDDs, so Spark knows how to go from Point A to Point B.
  • Partitions (with optional locality info): how the data is physically split up.
  • Compute function: Partition => Iterator[T]. Both the computation and the data are opaque to Spark: it can’t optimize because it doesn’t know what the function does or what kind of data it handles, so it simply serializes the function, ships it to the executors, and runs it.
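These pieces are visible on any RDD. A small sketch, reusing the `counts` RDD from the lineage example above:

```scala
// Dependencies: the parent RDDs this RDD was derived from (the lineage links).
counts.dependencies.foreach(dep => println(dep.rdd))

// Partitions: how the data is physically split up.
println(counts.partitions.length)

// Optional locality info: preferred locations for the first partition, if any.
println(counts.preferredLocations(counts.partitions(0)))
```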

Structured APIs

  • Datasets
  • DataFrames
  • SQL tables and views

This is where the Structured APIs come into Spark:

  • providing better error detection
  • unifying the APIs

What does DataFrame API code look like?

  • It is like a table: it has a schema, and columns with a name and a type (see the sketch below).
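A minimal sketch of what that looks like, assuming a hypothetical JSON file with `name` and `age` fields:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("dataframe-example")
  .master("local[4]")
  .getOrCreate()

// Hypothetical input file, e.g. {"name": "Ada", "age": 36} per line.
val peopleDF = spark.read.json("data/people.json")

peopleDF.printSchema()                      // shows the named, typed columns
peopleDF.select(col("name"), col("age"))    // columns are referred to by name
  .filter(col("age") >= 18)
  .show()
```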

Why Structured APIs over RDD APIs?

  • easier to write code (you refer to columns by name rather than by index)
  • faster, because Spark can optimize the query plan before executing it
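A rough side-by-side sketch of the same aggregation, first with positional tuple access on an RDD and then with named columns on a DataFrame (reusing the `spark` session from the previous sketch):

```scala
import spark.implicits._

// RDD style: fields are addressed by position, and Spark cannot see inside the lambdas.
val salesRDD = spark.sparkContext.parallelize(Seq(("US", 100.0), ("DE", 80.0), ("US", 50.0)))
val totalsRDD = salesRDD
  .map(row => (row._1, row._2))    // which field is which? only the author knows
  .reduceByKey(_ + _)

// DataFrame style: fields are addressed by name, and the plan goes through Spark's optimizer.
val salesDF  = salesRDD.toDF("country", "amount")
val totalsDF = salesDF.groupBy("country").sum("amount")
```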

When to use DataFrames & Dataset?

  • High-level APIs and DSL
  • Strong type-safety (see the Dataset sketch after this list)
  • Ease-of-use & Readability
  • What-to-do
  • Structured Data Schema
  • Code optimization & Performance
  • Space efficiency with Tungsten
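For the type-safety point in particular, a Dataset of a case class gives compile-time checking that a DataFrame of generic `Row`s cannot. A small sketch, reusing `spark` and its implicits:

```scala
case class Sale(country: String, amount: Double)

import spark.implicits._
val sales: org.apache.spark.sql.Dataset[Sale] =
  Seq(Sale("US", 100.0), Sale("DE", 80.0)).toDS()

// `amount` is known to the compiler to be a Double.
val bigSales = sales.filter(s => s.amount > 90.0)

// sales.filter(s => s.amuont > 90.0)  // a typo like this fails at compile time, not at runtime
```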

Foundational Spark 2.x Components
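In Spark 2.x the structured APIs were consolidated: SparkSession became the single entry point, a DataFrame is simply a Dataset[Row], and DataFrames, Datasets, and SQL all pass through the Catalyst optimizer and the Tungsten execution engine, which ultimately run on top of RDDs.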

Conclusion
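RDDs are the low-level foundation of Spark: they give you control, flexibility, and compile-time type-safety, but you have to spell out how to compute things, and Spark cannot optimize what it cannot see. DataFrames and Datasets let you say what you want over structured data, so Catalyst and Tungsten can do the heavy lifting. For most workloads the structured APIs are the better default; drop down to RDDs only when you genuinely need low-level control or are working with unstructured data.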


Gatsby Lee | Data Engineer | City Farmer | Philosopher | Lexus GX460 Owner | Overlander