Apache Spark: RDDs, DataFrames, Datasets

Life-is-short--so--enjoy-it
Dec 25, 2021


Spark APIs: Summarized by a New Starter

Introduction

As a newcomer to Spark, I was curious what RDDs, DataFrames, and Datasets are, and I wanted to learn more about them.

This article is a summary of these two links.

Characteristics of RDD ( Resilient Distributed Dataset )

  1. Logical Distributed Data abstraction
  2. Resilient / Immutable
  3. Compile-time Type-safe
  4. Unstructured / Structured Data
  5. Lazy Evaluation

1. Logical Distributed Data Abstraction
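An RDD is a logical handle for a dataset whose partitions are physically spread across the executors of a cluster; you program against it as if it were a single local collection. A minimal sketch, assuming a local SparkContext (the setup below is purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local setup, for illustration only.
val conf = new SparkConf().setAppName("rdd-abstraction").setMaster("local[4]")
val sc   = new SparkContext(conf)

// One logical dataset, physically split into 4 partitions.
val numbers = sc.parallelize(1 to 1000000, numSlices = 4)

println(numbers.getNumPartitions) // 4: the distribution is there, but abstracted away
```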

2. Resilient / Immutable

What is Resilient? An RDD can be recovered (recreated) at any point during execution if something goes wrong, because Spark keeps track of the “lineage” of each RDD: the sequence of operations that produced it.
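For example, Spark exposes the lineage it records for each RDD through `toDebugString`; if a partition is lost, Spark replays those recorded transformations to rebuild it. A small sketch, reusing the `sc` context from the first example and a hypothetical input path:

```scala
val lines  = sc.textFile("data/input.txt")   // hypothetical path
val words  = lines.flatMap(_.split(" "))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)

// Prints the recorded lineage (textFile -> flatMap -> map -> reduceByKey),
// which is exactly what Spark uses to recompute lost partitions.
println(counts.toDebugString)
```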

3. Compile-time Type-safe
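With RDDs, the element type is part of the RDD's type, so many mistakes are caught by the compiler instead of blowing up at runtime. A minimal sketch:

```scala
case class Person(name: String, age: Int)

val people: org.apache.spark.rdd.RDD[Person] =
  sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 52)))

val adults = people.filter(_.age >= 18)      // compiles: `age` is known to be an Int
// val broken = people.filter(_.salary > 0)  // would not compile: Person has no `salary` field
```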

4. Unstructured / Structured Data
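RDDs do not require a schema: the same API can hold raw, unstructured records (lines of text, media streams) or structured records you define yourself. For example:

```scala
// Unstructured: each element is just a raw line of text (hypothetical path).
val rawLines: org.apache.spark.rdd.RDD[String] = sc.textFile("logs/app.log")

// Structured: each element is a typed record parsed from those lines,
// assuming every line looks like "LEVEL message...".
case class LogEntry(level: String, message: String)
val entries = rawLines.map { line =>
  val Array(level, message) = line.split(" ", 2)
  LogEntry(level, message)
}
```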

5. Lazy Evaluation

Inside Apache Spark, the workflow is managed as a directed acyclic graph (DAG). Transformations only add steps to the DAG, each producing a new (not yet evaluated) RDD; the DAG is only executed when an action is called.
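In other words, transformations such as `map` and `filter` only add steps to the DAG; nothing runs until an action such as `count` or `collect` is called. A quick sketch, using the `numbers` RDD from the first example:

```scala
val doubled = numbers.map(_ * 2)      // transformation: nothing executed yet
val large   = doubled.filter(_ > 10)  // transformation: still nothing executed

val result  = large.count()           // action: the whole DAG is executed now
```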

Why Use RDDs?

  • offering control & flexibility
  • offering a low-level API (RDD is the lowest-level API in Spark)
  • offering type-safety
  • encouraging a “how-to” style: you describe how to do something

When to use RDDs?

  • when you care about a low-level API & fine-grained control over your dataset
  • when you are dealing with unstructured data (media streams or raw text)
  • when you want to manipulate your data with lambda functions rather than a DSL (Domain-Specific Language / high-level API)
  • when you don’t care about imposing a schema or structure on your data
  • when you can give up the optimization, performance, and efficiency benefits available with the structured APIs

What are the problems with using RDDs?

  • expressing the “how-to” of a solution rather than the “what-to”
  • not optimized by Spark (because Spark can’t see inside your lambda functions or the data they operate on)
  • slow for non-JVM languages like Python
  • inadvertent inefficiencies: for example, the code below would be cheaper if the order of execution were changed (filter before reduceByKey), but Spark runs it exactly as the user wrote it (see the sketch after this list)
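A sketch of the kind of inefficiency meant here, using a small hypothetical pair RDD. Filtering after the shuffle moves more data than necessary, and Spark will not reorder RDD code on your behalf:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// As written: reduceByKey shuffles every key, then most of the results are thrown away.
val slow = pairs
  .reduceByKey(_ + _)
  .filter { case (key, _) => key == "a" }

// Cheaper by hand: filter first, so the shuffle only sees the keys we keep.
// Spark does NOT make this rewrite for you when you use RDDs.
val fast = pairs
  .filter { case (key, _) => key == "a" }
  .reduceByKey(_ + _)
```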

Background: What is in an RDD?

  • Dependencies: the lineage links to parent RDDs, so Spark knows how to go from Point A to Point B.
  • Partitions (with optional locality info): how the data is physically split up.
  • Compute function: Partition => Iterator[T]. Both the computation and the data are opaque to Spark: it can’t optimize because it doesn’t know what the function does or what kind of data it handles, so it simply serializes the function, ships it to the executors, and runs it.
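These pieces are visible on any RDD. A small sketch, reusing the `counts` RDD from the lineage example above:

```scala
// Dependencies: the parent RDDs this RDD was derived from (the lineage links).
counts.dependencies.foreach(dep => println(dep.rdd))

// Partitions: how the data is physically split up.
println(counts.partitions.length)

// Optional locality info: preferred locations for the first partition, if any.
println(counts.preferredLocations(counts.partitions(0)))
```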

Structured APIs

  • Datasets
  • DataFrames
  • SQL tables and views

This is where the Structured APIs come into Spark:

  • providing better error detection
  • unifying the APIs

What does DataFrame API code look like?

  • It is like a table: it has a schema, and columns with a name and a type (see the sketch below).
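A minimal sketch of what that looks like, assuming a hypothetical JSON file with `name` and `age` fields:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("dataframe-example")
  .master("local[4]")
  .getOrCreate()

// Hypothetical input file, e.g. {"name": "Ada", "age": 36} per line.
val peopleDF = spark.read.json("data/people.json")

peopleDF.printSchema()                      // shows the named, typed columns
peopleDF.select(col("name"), col("age"))    // columns are referred to by name
  .filter(col("age") >= 18)
  .show()
```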

Why Structured APIs over RDD APIs?

  • easier to write code (you refer to columns by name rather than by index)
  • faster, because Spark can optimize the query plan before executing it
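A rough side-by-side sketch of the same aggregation, first with positional tuple access on an RDD and then with named columns on a DataFrame (reusing the `spark` session from the previous sketch):

```scala
import spark.implicits._

// RDD style: fields are addressed by position, and Spark cannot see inside the lambdas.
val salesRDD = spark.sparkContext.parallelize(Seq(("US", 100.0), ("DE", 80.0), ("US", 50.0)))
val totalsRDD = salesRDD
  .map(row => (row._1, row._2))    // which field is which? only the author knows
  .reduceByKey(_ + _)

// DataFrame style: fields are addressed by name, and the plan goes through Spark's optimizer.
val salesDF  = salesRDD.toDF("country", "amount")
val totalsDF = salesDF.groupBy("country").sum("amount")
```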

When to use DataFrames & Dataset?

  • High-level APIs and DSL
  • Strong type-safety (see the Dataset sketch after this list)
  • Ease-of-use & Readability
  • What-to-do
  • Structured Data Schema
  • Code optimization & Performance
  • Space efficiency with Tungsten
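For the type-safety point in particular, a Dataset of a case class gives compile-time checking that a DataFrame of generic `Row`s cannot. A small sketch, reusing `spark` and its implicits:

```scala
case class Sale(country: String, amount: Double)

import spark.implicits._
val sales: org.apache.spark.sql.Dataset[Sale] =
  Seq(Sale("US", 100.0), Sale("DE", 80.0)).toDS()

// `amount` is known to the compiler to be a Double.
val bigSales = sales.filter(s => s.amount > 90.0)

// sales.filter(s => s.amuont > 90.0)  // a typo like this fails at compile time, not at runtime
```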

Foundational Spark 2.x Components
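In Spark 2.x the structured APIs were consolidated: SparkSession became the single entry point, a DataFrame is simply a Dataset[Row], and DataFrames, Datasets, and SQL all pass through the Catalyst optimizer and the Tungsten execution engine, which ultimately run on top of RDDs.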

Conclusion
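RDDs are the low-level foundation of Spark: they give you control, flexibility, and compile-time type-safety, but you have to spell out how to compute things, and Spark cannot optimize what it cannot see. DataFrames and Datasets let you say what you want over structured data, so Catalyst and Tungsten can do the heavy lifting. For most workloads the structured APIs are the better default; drop down to RDDs only when you genuinely need low-level control or are working with unstructured data.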


Gatsby Lee | Data Engineer | City Farmer | Philosopher | Lexus GX460 Owner | Overlander