How Spark Executes Code Written in the Structured APIs (DataFrames, Datasets, and SQL)

Harshit Dawar
Published in Analytics Vidhya · Apr 3, 2020

Code written in the Structured APIs is executed in the following way:

  • First of all, the code, which may be DataFrame, Dataset, or SQL code, is submitted to Spark either through the console or as a submitted job.
  • Then, if the code is valid, it is converted into a logical plan.

Logical planning has no connection to executors or drivers; it is just a set of abstract transformations. Its aim is to convert the user’s expression into the most optimized version. Spark first converts the user code into an unresolved logical plan. If the expression contains references that the Spark analyzer cannot resolve (for example, a column or table name that does not exist), the analyzer rejects the unresolved logical plan. If the analyzer resolves the plan, so that it becomes a resolved logical plan, it is passed on to the Catalyst optimizer, which optimizes it by pushing down predicates or selections and produces the optimized logical plan.
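You can see these plans for yourself with explain. The snippet below is a minimal sketch in Scala; the local SparkSession, the people.json file, and the column names are all assumptions for illustration. Calling explain(true) prints the parsed (unresolved) logical plan, the analyzed (resolved) logical plan, the Catalyst-optimized logical plan, and the physical plan.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the file path and column names below are assumptions.
val spark = SparkSession.builder()
  .appName("structured-api-plans")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Hypothetical sample data.
val people = spark.read.json("people.json")

// A simple Structured API query.
val adults = people.filter($"age" > 18).select($"name")

// Prints the parsed (unresolved) logical plan, the analyzed (resolved)
// logical plan, the Catalyst-optimized logical plan, and the physical plan.
adults.explain(true)
```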

  • Now, the physical planning process begins; the physical plan is also known as the Spark Plan.

The Spark Plan specifies how the optimized logical plan will be executed on the cluster. Spark generates different physical execution strategies and compares them with the help of a cost model.

Example of comparison 1: how to perform a join, by looking at the physical attributes of a table (for instance, a table small enough to fit in memory can be broadcast to every executor).

Example of comparison 2: finding the best order in which to execute the query so that it gives the optimized result. For example, say a query has to filter out some candidates’ information by joining tables. The better physical plan is to filter the candidates first, then join the tables and display the result. The other approach, joining the tables first and then filtering and displaying the result, would be more expensive in terms of resources.
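To make this concrete, here is a minimal sketch (the tables and column names are made up) that writes the query as "join first, then filter". Inspecting the plans shows the filter being pushed below the join and the physical planner choosing a join strategy:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch with hypothetical in-memory tables.
val spark = SparkSession.builder()
  .appName("join-filter-plan")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

val candidates = Seq((1, "Asha", "data"), (2, "Ravi", "web"))
  .toDF("id", "name", "track")
val scores = Seq((1, 95), (2, 80))
  .toDF("candidate_id", "score")

// The query is written as "join first, then filter" ...
val result = candidates
  .join(scores, $"id" === $"candidate_id")
  .filter($"track" === "data")

// ... but the optimized logical plan shows the filter pushed below the
// join, and the physical plan shows the join strategy Spark selected
// (for tiny tables like these, typically a broadcast hash join).
result.explain(true)
```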

Physical planning results in a series of RDDs and transformations. That is why Spark is referred to as a compiler: it takes DataFrames, Datasets, and SQL code and compiles them into RDD transformations.
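If you want to see the pieces Spark produces along the way, the queryExecution field on a DataFrame or Dataset exposes them. This is Spark’s internal developer API, so field names and output may differ between versions; the sketch below builds on the result DataFrame from the previous example.

```scala
// Continuing with the `result` DataFrame from the previous sketch.
// queryExecution is Spark's internal developer API, so field names and
// output may differ between Spark versions.
val qe = result.queryExecution

println(qe.logical)        // parsed logical plan
println(qe.analyzed)       // resolved logical plan
println(qe.optimizedPlan)  // Catalyst-optimized logical plan
println(qe.sparkPlan)      // chosen physical plan (the Spark Plan)
println(qe.executedPlan)   // final physical plan prepared for execution

// The "compiled" output: the underlying RDD of internal rows that the
// physical plan runs on the cluster.
println(qe.toRdd.toDebugString)
```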

  • Finally, the physical plan is executed. At runtime, Spark performs further optimizations by generating native Java bytecode that can remove entire tasks or stages during execution based on several conditions. At last, the output is returned to the user.
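You can inspect this generated code through Spark’s debug helpers. This is again an internal API, so availability and output depend on the Spark version in use; the sketch continues from the result DataFrame above.

```scala
// Continuing with the `result` DataFrame from above. The debug helpers
// are an internal Spark API, so availability and output depend on the
// Spark version in use.
import org.apache.spark.sql.execution.debug._

// Prints the Java source that whole-stage code generation produced for
// each codegen subtree of the physical plan; Spark compiles this source
// to bytecode at runtime.
result.debugCodegen()
```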

This is all about how Spark executes code written in the Structured APIs. I hope it is clear, crisp, and easy to understand.
