Understanding Spark DAGs

Vivek Chaudhary
Plumbers Of Data Science
Sep 12, 2023 · 5 min read

Directed Acyclic Graph (DAG)

A Directed Acyclic Graph (DAG) is a conceptual representation of a series of activities. The order of the activities is represented by a graph, visually presented as a set of circles, each representing an activity, connected by lines that represent the execution flow from one activity to another. The circles are known as vertices and the connecting lines as edges.

Directed here means that each edge has a defined direction, representing a unidirectional flow from one vertex to another.

Acyclic here means that there are no loops or cycles in the graph: for any given vertex, if we follow an edge from that vertex to another, there is no path in the graph that leads back to the initial vertex.
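
As a rough illustrative sketch (the activity names here are made up), a DAG can be written down as a simple adjacency list in Python, where every edge points forward and no path ever leads back to its starting vertex:

# Hypothetical activities; each key is a vertex, each value lists the
# vertices its outgoing (directed) edges point to.
dag = {
    "read": ["clean"],         # read -> clean
    "clean": ["aggregate"],    # clean -> aggregate
    "aggregate": ["write"],    # aggregate -> write
    "write": [],               # no outgoing edges, so no cycle back to "read"
}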

Directed Acyclic Graph (DAG) in Apache Spark

A DAG in Apache Spark, in similar terms, is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the transformations applied to those RDDs. A Spark DAG has a finite number of vertices and edges, and each edge is directed from one vertex to another.
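
As a quick, hedged illustration of this vertices-and-edges view (assuming an active SparkSession named spark), an RDD's lineage — its slice of the DAG — can be printed with toDebugString():

rdd = spark.sparkContext.parallelize(range(10))    # base RDD (vertex)
mapped = rdd.map(lambda x: (x % 2, x))             # edge: narrow transformation
reduced = mapped.reduceByKey(lambda a, b: a + b)   # edge: wide transformation

# toDebugString() returns bytes in PySpark; decode it before printing.
print(reduced.toDebugString().decode("utf-8"))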

This is what the Spark application model looks like:

Application — A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.

Job — A parallel computation consisting of multiple tasks that get spawned in response to a Spark action (e.g., save(), collect()). During interactive sessions with Spark shells, the driver converts your Spark application into one or more Spark jobs. It then transforms each job into a DAG. This, in essence, is Spark’s execution plan, where each node within a DAG could be a single Spark stage or multiple stages. So a Spark job is created whenever an action is called.

Stage — Each job gets divided into smaller sets of tasks called stages that depend on each other. As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel. Not all Spark operations can happen in a single stage, so they may be divided into multiple stages. Often stages are delineated on the operator’s computation boundaries, where they dictate data transfer among Spark executors.

Task — A single unit of work or execution that is sent to a Spark executor. Each stage is composed of Spark tasks (units of execution), which are distributed across the Spark executors; each task maps to a single partition of data.
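
To make the job/stage/task split concrete, here is a minimal sketch (assuming an active SparkSession named spark): transformations are lazy and create no job, while each action triggers a job that Spark breaks into stages and tasks visible in the Spark UI.

df = spark.range(1_000_000)                    # no job yet, just a plan
doubled = df.selectExpr("id * 2 AS doubled")   # transformation, still no job

doubled.count()      # action -> triggers a Spark job (Jobs tab in the UI)
doubled.collect()    # another action -> another job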

Dive Deep into DAGs

Let’s learn more about DAGs by writing some code and examining the jobs, stages, and tasks via the Spark UI.

Create a DataFrame

df_emp = spark.read.csv('<path>/emp.csv', header=True)
df_emp.show(15)

Output:

DAG output: Spark triggered two jobs for the above piece of code.

When inferSchema is enabled, Spark reads the CSV file twice; in this case inferSchema is not invoked.
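
As a hedged side note, one common way to avoid that extra pass is to supply the schema up front instead of enabling inferSchema. In the sketch below, DEPTNO and SAL are columns used later in this post, while EMPNO and ENAME are assumed column names for illustration:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# With inferSchema=True, Spark makes an extra pass over the CSV to infer
# column types, which typically shows up as an additional job in the UI.
df_inferred = spark.read.csv('<path>/emp.csv', header=True, inferSchema=True)

# Supplying a schema up front avoids that extra pass.
emp_schema = StructType([
    StructField("EMPNO", IntegerType()),
    StructField("ENAME", StringType()),
    StructField("SAL", IntegerType()),
    StructField("DEPTNO", StringType()),
])
df_typed = spark.read.csv('<path>/emp.csv', header=True, schema=emp_schema)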

DAG for Job ID 0.

The DAG for Job ID 0 is further divided into three parts:

  1. Scan text
  • FileScanRDD [0] represents reading the data from a file; it basically covers: Number of Files Read, File System Read, Data Size Total, Size of Files Read Total, and Rows Output.
  • MapPartitionsRDD [1] runs a series of computations on the RDD partitions created by the FileScanRDD step. In our case it creates the partitions of the data read from the CSV file.

2. WholeStageCodegen (1)

  • WholeStageCodegen is a physical query optimization in Spark SQL that fuses multiple physical operators together into a single generated Java function.
  • In simple terms, at this step the computations written against the DataFrame are compiled into generated Java code that builds the underlying RDDs (a quick way to see which operators get fused is explain(), sketched after this list).

3. mapPartitionsInternal

  • MapPartitionsRDD [1] runs a series of computations on the RDD partitions created by the FileScanRDD step. In our case it creates the partitions of the data read from the CSV file.
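
As referenced above, a hedged way to see WholeStageCodegen at work on df_emp is explain(); operators prefixed with *(1), *(2), and so on have been fused into a single generated Java function (the mode argument requires Spark 3.x).

df_emp.explain()                 # physical plan; *(n) marks codegen stages

# Optional, Spark 3.x only: dump the generated code itself.
df_emp.explain(mode="codegen")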

Detailed DAG for Stage 0:

DAG for Job ID 1.

The steps are similar to those of Job ID 0 and have been explained above.

  1. Scan csv
  • FileScanRDD [0] represents reading the data from a file; it basically covers: Number of Files Read, File System Read, Data Size Total, Size of Files Read Total, and Rows Output.
  • MapPartitionsRDD [1] runs a series of computations on the RDD partitions created by the FileScanRDD step. In our case it creates the partitions of the data read from the CSV file.

2. mapPartitionsInternal

  • MapPartitionsRDD [1] runs a series of computations on the RDD partitions created by the FileScanRDD step. In our case it creates the partitions of the data read from the CSV file.

Apply group by and aggregates.

# use the same dataframe to apply groupBy and sum operations
sum_df = df_emp.groupBy('DEPTNO').sum('SAL')
sum_df.show(5)

Output:

Scan csv and WholeStageCodegen have already been discussed above. The groupBy introduces another operation, Exchange. An Exchange occurs when there is a wide transformation.

Exchange

· Exchange represents a shuffle, i.e., physical data movement across the nodes of the cluster.

· Exchange is one of the most expensive operations in a Spark job (sketched below is how it shows up in the physical plan).
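
As a hedged sketch, the Exchange for our groupBy can be spotted in the physical plan of sum_df; the shuffle partition setting below is a standard Spark configuration, with 200 as the default.

sum_df.explain()   # shows Exchange hashpartitioning(DEPTNO, ...) for the shuffle

# Number of partitions produced by the Exchange (and hence tasks in the
# following stage); tune it if the default of 200 is too many or too few.
spark.conf.set("spark.sql.shuffle.partitions", "200")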

Stage 6 is skipped because its evaluation was already performed in Stage 5 above.

Under the hood, Spark has a two-stage plan to execute the groupBy operation.

· Step 1 reads the data from the dataset and writes it to the output exchange, which in our DAG is the Exchange box in Stage 5 (or 6).

· Step 2 reads the data from the output exchange and brings it into the input exchange for the next stage to process, represented by AQEShuffleRead in Stage 7.

The process of writing to the output exchange and reading it back through the input exchange is known as a shuffle sort.
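
The AQEShuffleRead node appears because Adaptive Query Execution is enabled (it is on by default since Spark 3.2). As a small sketch, the relevant settings can be inspected like this:

# AQE rewrites the shuffle read side at runtime, e.g. by coalescing small
# shuffle partitions when it reads the output exchange.
print(spark.conf.get("spark.sql.adaptive.enabled"))
print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled"))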

Summary

· Understanding DAGs.

· A coding and Spark UI deep dive into DAGs, jobs, stages, and tasks.
