Spark Lineage vs DAG

Jan 16, 2023

Lineage:

Lineage is the logical plan: the steps, or transformations, applied to one or more RDDs to arrive at the final RDD, that is, the RDD on which an action is eventually performed. It is the portion of the DAG that leads to the creation of a particular RDD.

Lineage plays a key role in how Spark achieves fault tolerance.

When a new RDD is derived from an existing RDD through a transformation, the new RDD holds a pointer to its parent RDD, and Spark keeps track of all the dependencies between these RDDs. This record of dependencies is the lineage. If data is lost, Spark uses the lineage to rebuild it, which is what makes RDDs fault tolerant. DataFrame, Dataset, and SQL operations are converted internally to RDDs for computation, since RDDs are the lowest level of abstraction in Spark. So the transformations behind a DataFrame, Dataset, or SQL query can be inspected by converting it to an RDD.

For example, if some data is lost while a map operation is running, Spark backtracks to the parent RDD and recomputes the lost data from it, recovering from the fault.
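The recovery idea above can be sketched in plain Python. This is a toy model, not Spark's API or implementation: the ToyRDD class, its compute_partition method, and the materialized dictionary are all hypothetical names made up for illustration. Each toy RDD stores only a pointer to its parent and the function to apply, so a lost partition can be rebuilt by replaying the chain from the source.

```python
# Toy model only: this is NOT Spark's API or implementation.
class ToyRDD:
    def __init__(self, name, parent=None, fn=None, source=None):
        self.name = name
        self.parent = parent    # pointer to the parent RDD (None for the source)
        self.fn = fn            # per-element function applied to the parent's data
        self.source = source    # list of partitions, set only on the source RDD

    def map(self, fn, name):
        # A transformation returns a new RDD that remembers its parent.
        return ToyRDD(name, parent=self, fn=fn)

    def lineage(self):
        # Walk the parent pointers back to the source.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def compute_partition(self, index):
        # Rebuild one partition by replaying the lineage from the source.
        if self.parent is None:
            return list(self.source[index])
        return [self.fn(x) for x in self.parent.compute_partition(index)]


rdd1 = ToyRDD("rdd1", source=[[1, 2], [3, 4]])
rdd2 = rdd1.map(lambda x: x * 10, "rdd2")
rdd3 = rdd2.map(lambda x: x + 1, "rdd3")

# Pretend the computed partitions of rdd3 are held in memory...
materialized = {i: rdd3.compute_partition(i) for i in range(2)}
# ...and that partition 1 is then lost.
del materialized[1]
# Only the lost partition is recomputed, by following the lineage.
materialized[1] = rdd3.compute_partition(1)

print(rdd3.lineage())    # ['rdd3', 'rdd2', 'rdd1']
print(materialized[1])   # [31, 41]
```

Note that nothing is replicated here: the toy recovers purely by recomputation, which is the role lineage plays in this description.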

Note: In the picture above, the nodes (RDDs) are shared. RDD1 in all three representations points to the same object.

To print the plan, use rdd.toDebugString() for an RDD or df.explain() for a DataFrame.

DAG:

The DAG is the physical plan, and it is created only when an action is performed. It shows how Spark will execute your program.

The DAG (directed acyclic graph) is a description of how Spark will run your program: each vertex in the graph represents a distinct operation, and the edges indicate the relationships between those operations. An RDD's lineage is simply the subsection of the DAG (one or more operations) that produced that specific RDD.
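The point that nothing runs until an action is performed can be illustrated with a small plain-Python sketch. It is a toy model under stated assumptions, not Spark's scheduler: LazyDataset, plan, and collect are hypothetical names chosen for the example. Transformations only record a step, and the recorded steps are executed when the action is called.

```python
# Toy model only: this is NOT Spark's API or scheduler.
class LazyDataset:
    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)   # recorded (name, function) pairs, not yet run

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        step = ("map", lambda xs: [fn(x) for x in xs])
        return LazyDataset(self.data, self.steps + (step,))

    def filter(self, pred):
        # Transformation: record the step and return a new dataset.
        step = ("filter", lambda xs: [x for x in xs if pred(x)])
        return LazyDataset(self.data, self.steps + (step,))

    def plan(self):
        # The recorded operations, in execution order.
        return [name for name, _ in self.steps]

    def collect(self):
        # Action: only now are the recorded steps actually executed.
        result = list(self.data)
        for _, step in self.steps:
            result = step(result)
        return result


calls = []

def double(x):
    calls.append(x)   # track when the function really runs
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)

calls_before_action = len(calls)   # still 0: nothing has executed yet
result = ds.collect()              # the action triggers execution

print(ds.plan())             # ['map', 'filter']
print(calls_before_action)   # 0
print(result)                # [6, 8]
print(len(calls))            # 4
```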

Conclusion:

Lineage is a graph that describes the relationship of each RDD to its parents, built up as transformations are applied to the RDD.

A single DAG may create multiple RDDs, and each RDD has its own lineage.
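To make the last point concrete, here is a rough plain-Python sketch of one graph containing several RDDs, each with its own lineage. It is a toy model only: Node and lineage are hypothetical helpers, and the comments about map, filter, and union are just example labels for the edges, not calls into Spark.

```python
# Toy model only: this is NOT Spark's API.
class Node:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = tuple(parents)   # pointers to parent nodes


def lineage(node):
    # Return the node's ancestors followed by the node itself, parents first.
    seen, order = set(), []

    def visit(n):
        if n.name in seen:
            return
        seen.add(n.name)
        for p in n.parents:
            visit(p)
        order.append(n.name)

    visit(node)
    return order


rdd1 = Node("rdd1")
rdd2 = Node("rdd2", [rdd1])         # e.g. a map over rdd1
rdd3 = Node("rdd3", [rdd1])         # e.g. a filter over rdd1
rdd4 = Node("rdd4", [rdd2, rdd3])   # e.g. a union of rdd2 and rdd3

print(lineage(rdd2))   # ['rdd1', 'rdd2']
print(lineage(rdd3))   # ['rdd1', 'rdd3']
print(lineage(rdd4))   # ['rdd1', 'rdd2', 'rdd3', 'rdd4']

# The parent is shared, not copied: both lineages point to the same object.
print(rdd2.parents[0] is rdd3.parents[0])   # True
```

In this sketch the whole graph plays the role of the DAG, and each call to lineage picks out the part of it that leads to one node.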

Written by Sephinreji