Spark Transformation and Action: A Deep Dive

Misbah Uddin · Published in CodeX · 5 min read · May 8, 2021

A solid understanding of Spark transformations and actions is essential for writing effective Spark code. This article provides a brief overview of both.

Primer

For simplicity, this article focuses on PySpark and the DataFrame API. The concepts apply similarly to the other languages supported by the Spark framework. To follow the rest of the material easily, it helps to understand the following concepts first.

Resilient Distributed Dataset: Spark jobs are typically executed against Resilient Distributed Datasets (RDDs), which are fault-tolerant, partitioned collections of records that can be operated on in parallel. RDDs are immutable, meaning an RDD cannot be altered once it is instantiated; every operation on an RDD produces a new RDD.

DataFrame: A Spark data structure conceptually equivalent to a table in a relational database or a Pandas DataFrame, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files (e.g., CSV, Parquet), tables in Hive, external databases, or existing RDDs. A DataFrame is created and manipulated using the DataFrame API. Table 1 represents a sample DataFrame.

Transformations and Actions

Common Spark jobs are created using operations in the DataFrame API. These operations are either transformations or actions.

Transformation: A Spark operation that reads a DataFrame, manipulates some of the…
