Spark SQL under the hood — part I

Summary & initial requirements

If you use Apache Spark and Spark SQL and have a basic understanding of their core concepts (RDDs, DataFrames, execution plans, jobs, stages and tasks, scheduling), then after reading this blog post you should be able to answer the following questions:

  • How do Spark SQL’s Datasets relate to RDDs?
  • What does in-depth execution of operations on Datasets look like?
  • How to debug and verify jobs containing operations on Datasets?

This blog post is based on Apache Spark version 2.1.1.
The code is available on GitHub.

Spark SQL

In recent years Apache Spark has received a lot of hype in the BigData community. It is seen as a silver bullet for all problems related to gathering, processing and analysing massive datasets. The same simple API and infrastructure can be easily used by analysts to quickly verify new Machine Learning algorithms (by using for example Zeppelin notebooks) and to run heavy-duty, ETL-like jobs on production clusters.

One of the main (and presumably most complex) components of Spark is Spark SQL, which is used to perform SQL-like queries on structured data. Due to its rapid evolution (do not forget that Spark is one of the most active open source projects), some of the ideas behind it are unclear and require digging into different blog posts and presentations. One such idea is the concept of essential Spark abstractions: if you started learning Spark in the old days, you mostly heard about RDDs as the building blocks for all your applications. Later, with the introduction of DataFrames and, most recently, Datasets, there seems to be a shift towards these new concepts (in almost all Databricks’ presentations there is a note to use them wherever possible; rewriting the whole ML library to use the DataFrame API seems to prove that point as well).

There are some great blog posts on the web explaining how these concepts relate to each other from a high-level perspective (please check the links section); however, the notion of a Dataset is still fuzzy and is usually described only as a replacement for RDDs for most operations. There is a ton of information about code generation in Catalyst or off-heap memory management included in the Tungsten project, but the actual technical relation of DataFrames/Datasets to RDDs is usually just reduced to “DataFrames use RDDs internally”.

In this blog post we are going to take a deeper look at this relation and verify our understanding by debugging a simple application using different mechanisms provided by Spark and Spark SQL. We focus on each step: from defining Dataset operations in code to the actual execution in a given environment.

Core abstractions


We will not focus on the details of RDDs, because this topic is covered in most Spark tutorials. For the sake of this blog post you just need to remember that RDDs are immutable and have a lineage. When we transform one RDD, using for example a map operation, the new RDD will contain a dependency on the first one. This can be checked by invoking the toDebugString or dependencies methods on an RDD.
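The lineage of a mapped RDD can be inspected with a minimal sketch like the one below. The object name, the local master setting and the sample data are assumptions made for illustration; only toDebugString and dependencies come from the text above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddLineageSketch {
  /** Builds a two-step lineage: a parallelized range, then a map over it. */
  def doubled(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 10).map(_ * 2)

  def main(args: Array[String]): Unit = {
    // Local master and application name are illustrative choices.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-lineage-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val rdd = doubled(sc)
      // Human-readable lineage, newest RDD first.
      println(rdd.toDebugString)
      // The same relation as objects: map() adds one narrow dependency on its parent.
      rdd.dependencies.foreach { dep =>
        println(s"${dep.getClass.getSimpleName} on ${dep.rdd}")
      }
    } finally sc.stop()
  }
}
```

The mapped RDD keeps a reference to its parent instead of copying data, which is what lets Spark recompute lost partitions and plan stages later on.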

This lineage feature is crucial for the execution of a Spark application. Based on these dependencies, the Spark engine builds a graph of computations: it groups Tasks (which can run in parallel) into Stages, with a stage boundary wherever data has to be shuffled between nodes.

Datasets and DataFrames

We will not get into the details of DataFrames as they are the old API for Spark SQL and were replaced by Datasets. You can think of them just as an untyped version of Datasets (actually DataFrame is just a type alias for Dataset[Row]).
The Dataset API was introduced in Spark 1.6 as a part of Spark SQL and provides the type safety of RDDs along with the performance of DataFrames. How can we use it?
We just need to add Spark SQL to the project dependencies, define a type that we will operate on and invoke any lambda function we like to transform the data.
Here is a simple scenario in which we load a file with each line containing a person’s details in JSON format, then filter out young users and save the result to disk. Here is a file with some people’s details:

We need to define a type that we will use in our code to operate on people’s details. It will be a simple case class:

The next step is to read the file using the JSON file reader provided by Spark (which is based on the Jackson library).
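A minimal sketch of the type definition and the read step could look as follows. The name field, the three sample records and the temporary file are made up for illustration; the post itself only confirms that Person has an age of type Option[Long] and that the input has three rows.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import org.apache.spark.sql.{Dataset, SparkSession}

object ReadPeopleSketch {
  // Hypothetical record type: only `age: Option[Long]` is confirmed by the post.
  case class Person(name: String, age: Option[Long])

  /** Writes three made-up records, one JSON object per line, to a temporary file. */
  def writeSampleFile(): Path = {
    val lines = Seq(
      """{"name":"Alice","age":34}""",
      """{"name":"Bob","age":15}""",
      """{"name":"Carol"}""" // no age field: decoded as None
    )
    val file = Files.createTempFile("people", ".json")
    Files.write(file, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
  }

  /** Reads line-delimited JSON and attaches the Person type to the result. */
  def readPeople(spark: SparkSession, path: String): Dataset[Person] = {
    import spark.implicits._ // brings Encoder[Person] into scope
    spark.read.json(path).as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("read-people-sketch").getOrCreate()
    try {
      val peopleDataset = readPeople(spark, writeSampleFile().toString)
      peopleDataset.printSchema()
      peopleDataset.show()
    } finally spark.stop()
  }
}
```

The case class lives in an object rather than inside a method so that Spark can derive an encoder for it.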

As you can see, the peopleDataset value carries information about the Person type, so the compiler can verify that all fields required by further operations are defined and have the correct types. For example we can filter users by age:

When using the Dataset API, we can also save the results to disk (the default file format is Parquet):
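The filter and save steps can be sketched as below. To keep the example short it builds the Dataset from an in-memory collection instead of a JSON file; the age threshold of 18, the sample records and the output location are assumptions, not values taken from the original application.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

object FilterAndSaveSketch {
  // Hypothetical record type: only `age: Option[Long]` is confirmed by the post.
  case class Person(name: String, age: Option[Long])

  /** Keeps people whose age is known and at least `minAge` (the threshold is an assumption). */
  def adults(people: Dataset[Person], minAge: Long = 18L): Dataset[Person] =
    people.filter(_.age.exists(_ >= minAge))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("filter-and-save-sketch").getOrCreate()
    import spark.implicits._
    try {
      val people = Seq(Person("Alice", Some(34L)), Person("Bob", Some(15L)), Person("Carol", None)).toDS()
      val out = Files.createTempDirectory("people-out").resolve("adults").toString
      // No format is specified, so the default (Parquet) is used.
      adults(people).write.save(out)
      // Read the files back to confirm what was written.
      spark.read.load(out).as[Person].show()
    } finally spark.stop()
  }
}
```

Because the lambda works on Person rather than on Row, a typo in a field name is a compile error instead of a runtime failure.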

But wait a second! We just created a fully functional Spark application which reads a file, transforms the data and saves it to the disk, and we did not touch the RDD API. How is that possible? Let us find out using an experimental approach.

Diving into Spark SQL

Before we get changed into a swimming suit, let us first recall how Datasets were supposed to work.
First of all, based on the operations we have defined, a logical plan of the query is created, which defines what operations should be performed to execute it. The next step applies optimizations (for example predicate pushdown or cost-based optimization; stay tuned for a future blog post on that topic!) and generates the optimized query plan. This optimized plan is then translated to a physical plan. Finally, a code generation step produces highly efficient code from the physical plan, and that code is executed. Please take a look at the diagram from Databricks’ blog post (ignore the DataFrame label, as the diagram was created before the introduction of Datasets, but they work in the same way):

In the diagram, multiple Physical Plans are created and only one is selected to be used for code generation. At the moment this selection happens only when choosing a join algorithm (here is a great presentation on the topic); the complete Cost Based Optimization will be shipped with Spark 2.2.

Query Execution Plans

Important note: we analyze the plans only for reading and filtering our data. Unfortunately we cannot do that for the whole processing, because saving of results (using the write method) is an Action (not a Transformation) and we do not have a way to use the Dataset API after invoking an action.
We can check what plans were generated (steps 1-2) using the explain(extended: Boolean) method available in the Dataset class.
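A minimal sketch of the call is shown below. It uses an in-memory Dataset with made-up records, so the leaf of each plan will be a local relation rather than the JSON relation discussed in this post; the Person fields and the filter threshold are assumptions.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object ExplainSketch {
  // Hypothetical record type: only `age: Option[Long]` is confirmed by the post.
  case class Person(name: String, age: Option[Long])

  /** An in-memory stand-in for the filtered Dataset; data and threshold are made up. */
  def adults(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(Person("Alice", Some(34L)), Person("Bob", Some(15L)), Person("Carol", None))
      .toDS()
      .filter(_.age.exists(_ >= 18))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("explain-sketch").getOrCreate()
    try {
      // extended = true prints the logical plans in addition to the physical plan.
      adults(spark).explain(extended = true)
    } finally spark.stop()
  }
}
```

Calling explain does not run the query; it only builds and prints the plans.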

The result is:

As you can see, there are three different types of Logical Plans and a Physical Plan. Let us go briefly through each of them. For a more detailed explanation please take a look at the TreeNode and QueryExecution classes (especially the analyzed, optimizedPlan and sparkPlan fields in the latter).
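The same plans can also be reached programmatically through the queryExecution field of a Dataset. The sketch below assumes an in-memory Dataset with made-up records; the fields it reads (logical, analyzed, optimizedPlan, sparkPlan, executedPlan) are members of QueryExecution.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object QueryExecutionSketch {
  // Hypothetical record type: only `age: Option[Long]` is confirmed by the post.
  case class Person(name: String, age: Option[Long])

  /** An in-memory stand-in for the filtered Dataset; data and threshold are made up. */
  def adults(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(Person("Alice", Some(34L)), Person("Bob", Some(15L)), Person("Carol", None))
      .toDS()
      .filter(_.age.exists(_ >= 18))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("query-execution-sketch").getOrCreate()
    try {
      val qe = adults(spark).queryExecution
      println(s"Parsed:\n${qe.logical}")
      println(s"Analyzed:\n${qe.analyzed}")
      println(s"Optimized:\n${qe.optimizedPlan}")
      println(s"Physical (before preparation rules):\n${qe.sparkPlan}")
      println(s"Physical (as executed):\n${qe.executedPlan}")
    } finally spark.stop()
  }
}
```

Each field is a tree, so it can be traversed or pattern matched when a textual dump is not enough.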

Parsed Logical Plan

This is the first step after parsing our Dataset code. Our query was reduced to two steps. The first is defined by TypedFilter, which applies a function to all elements (in this case after mapping them to the Person type) and filters the result.
The second just contains the Relation, which represents a collection of tuples with a known schema.

Analyzed Logical Plan

After parsing the query, Catalyst analyzes it to resolve the references and produces AnalyzedLogicalPlan. This is done by the Analyzer class. Let us see what the Scaladoc says about its role:

Provides a logical query plan analyzer, which translates UnresolvedAttributes and UnresolvedRelations into fully typed objects using information in a SessionCatalog and a FunctionRegistry.

So Catalyst keeps track of all the attributes and relations and their mapping to actual typed objects and uses this mapping during analysis.
You can see that in the results above:

unresolveddeserializer(newInstance(class com.virtuslab.sparksql.MainClass$Person))

was translated to

newInstance(class com.virtuslab.sparksql.MainClass$Person)

Optimized Logical Plan

The next step is to optimize the AnalyzedLogicalPlan using different strategies. You can learn how to define and add your own optimisation rules from this blog post. You can check what rules are used by looking at the SparkOptimizer class. In this step Catalyst also checks if there is any cached data. In our case there is (we cached the peopleDataset), so the OptimizedLogicalPlan has InMemoryRelation and FileScan steps instead of the Relation seen in the previous steps.
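The effect of caching on the optimized plan can be checked with a small experiment like the one below. It is only a sketch: the Dataset is built from a made-up in-memory collection, so no FileScan step appears, and the helper name optimizedPlanText is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object CachedPlanSketch {
  // Hypothetical record type: only `age: Option[Long]` is confirmed by the post.
  case class Person(name: String, age: Option[Long])

  /** Returns the optimized plan of a filter as text, with or without caching its input. */
  def optimizedPlanText(spark: SparkSession, cacheInput: Boolean): String = {
    import spark.implicits._
    val people = Seq(Person("Alice", Some(34L)), Person("Bob", Some(15L))).toDS()
    if (cacheInput) people.cache() // registers the plan as cached; nothing is computed yet
    try people.filter(_.age.exists(_ >= 18)).queryExecution.optimizedPlan.toString
    finally if (cacheInput) people.unpersist()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cached-plan-sketch").getOrCreate()
    try {
      println(optimizedPlanText(spark, cacheInput = false))
      println(optimizedPlanText(spark, cacheInput = true)) // mentions InMemoryRelation
    } finally spark.stop()
  }
}
```

Comparing the two printed plans shows where the cached representation is substituted for the original relation.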

Physical Plan

Lastly the Physical Plan is created by applying rules defined in SparkStrategies. The API (as seen in QueryPlanner) is able to return multiple PhysicalPlans, but right now only one is generated.
In our example the TypedFilter was replaced by InMemoryTableScan and Filter.

Code generation

For all steps in a PhysicalPlan, Catalyst generates highly optimised code.
We can use the debugCodegen() method (brought into scope by importing org.apache.spark.sql.execution.debug._) to check what was produced:
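One way to print the generated code is the debug package that ships with Spark SQL. The sketch below assumes an in-memory Dataset with made-up records, so the generated code will differ from the one produced for the JSON-based application in this post.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.execution.debug._ // adds debugCodegen() to Datasets

object CodegenSketch {
  // Hypothetical record type: only `age: Option[Long]` is confirmed by the post.
  case class Person(name: String, age: Option[Long])

  /** An in-memory stand-in for the filtered Dataset; data and threshold are made up. */
  def adults(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(Person("Alice", Some(34L)), Person("Bob", Some(15L)), Person("Carol", None))
      .toDS()
      .filter(_.age.exists(_ >= 18))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("codegen-sketch").getOrCreate()
    try {
      // Prints the number of WholeStageCodegen subtrees followed by the generated Java source.
      adults(spark).debugCodegen()
    } finally spark.stop()
  }
}
```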

It is barely readable (not only because it is in Java), but some important elements are clearly visible. A GeneratedIterator class was created, and it extends BufferedRowIterator. The interesting bits are in the processNext() method: it is executed for every InternalRow (so for every element in our dataset) and this is where all of the filtering and mapping takes place.
If our query were more complex, multiple classes would be generated. The line at the top of the result

Found 1 WholeStageCodegen subtrees.

shows that only one was generated here.

Looking for RDDs in Spark UI

We still have not touched anything that would connect us to the old Spark world with RDDs and stuff. But we still have some tools in our debugging toolbox, especially the Spark UI available on port 4040 by default.

We see that Spark executed two jobs, each of which had one stage. Based on the line numbers we can see that the first job was related to reading the file with data and mapping it to a Person type; the second was related to filtering and saving results as a file.
Let us check what each stage consisted of:

So Stage 0 consisted of reading a file (HadoopRDD) and mapping the values in parallel (MapPartitionsRDD). We got two tasks, which means we had two partitions. This is also visible in the Tasks table: one task processed 2 records and the other processed 1, and our input file has 3 rows.

Stage 1 is much more complex. It consists of a WholeStageCodegen step based on cached values, followed by mapPartitionsInternal (most probably to perform the filtering by age). Then comes InMemoryTableScan to get the results, and finally another WholeStageCodegen step to save the results in a file.
If you expected that there would only be one WholeStageCodegen and nothing else, then you must feel a little bit confused. In the next blog post in this series we will go through Spark SQL's code to see how those RDDs were generated.
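As a rough cross-check of what the Spark UI shows, the RDDs behind a Dataset can also be printed from code. This is only a sketch on a made-up in-memory Dataset, so the RDD names will not match the stages described above; it relies on Dataset.rdd and QueryExecution.toRdd together with the toDebugString method mentioned earlier.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetRddSketch {
  // Hypothetical record type: only `age: Option[Long]` is confirmed by the post.
  case class Person(name: String, age: Option[Long])

  /** An in-memory stand-in for the filtered Dataset; data and threshold are made up. */
  def adults(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(Person("Alice", Some(34L)), Person("Bob", Some(15L)), Person("Carol", None))
      .toDS()
      .filter(_.age.exists(_ >= 18))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("dataset-rdd-sketch").getOrCreate()
    try {
      val ds = adults(spark)
      // RDD of internal rows produced by the executed physical plan.
      println(ds.queryExecution.toRdd.toDebugString)
      // RDD of Person objects: the same plan plus a deserialization step.
      println(ds.rdd.toDebugString)
    } finally spark.stop()
  }
}
```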


Summing up: we recalled what RDDs and Datasets are, created and analysed a simple application using the Dataset API, went through the most important steps of the Catalyst optimizer and learned different ways to debug it. We also verified the relation between RDDs and Datasets at the lowest level by experimenting with different API functions and the Spark UI.
In the next blog post we will use a more scientific approach to see how and when RDDs are produced for Dataset-based jobs by analysing the Spark SQL code.

Spark schema inference side note

Why does the age attribute in our Person case class need to be an Option[Long] instead of Option[Int], even though we could easily fit it in Int? It is because of the schema inference for JSON sources: by default it uses Long for both Ints and Longs (there is a similar problem with Float/Double values). We can see that in the InferSchema.inferField function:

For the sake of this post, Long is enough; however, if you need the narrowest type possible, you need to define the column’s type manually. This problem will not occur when reading a CSV file and setting the inferSchema option to true (please take a look at the CSVInferSchema object).
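Defining the column type manually can be sketched as follows. The sample record, the name field and the narrowSchema value are assumptions made for illustration; the reader methods used (schema and json) are part of the DataFrameReader API.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object JsonSchemaSketch {
  /** Writes one made-up JSON record to a temporary file and returns its path. */
  def writeSampleFile(): String = {
    val file = Files.createTempFile("people", ".json")
    Files.write(file, """{"name":"Alice","age":34}""".getBytes(StandardCharsets.UTF_8))
    file.toString
  }

  // Explicit schema that narrows `age` to a 32-bit integer.
  val narrowSchema: StructType = StructType(Seq(
    StructField("name", StringType),
    StructField("age", IntegerType)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("json-schema-sketch").getOrCreate()
    try {
      val path = writeSampleFile()
      // Schema inferred from the data: whole numbers are read as long.
      spark.read.json(path).printSchema()
      // Schema supplied up front: no inference pass, `age` is read as integer.
      spark.read.schema(narrowSchema).json(path).printSchema()
    } finally spark.stop()
  }
}
```

With the explicit schema in place, the case class could declare age as Option[Int] instead of Option[Long].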


Links

* Great explanation of the difference between RDDs, DataFrames and Datasets
* Another good explanation
* Spark’s key terms
* Deep dive into Spark SQL’s Catalyst Optimizer
* Another presentation on the same topic
* Tungsten overview
* Tungsten: a more in-depth overview
* Great presentation about optimizing Spark SQL joins
* Shuffle architecture
* Great presentation about Spark internals by Aaron Davidson
* A look under the hood at Apache Spark’s API and engine evolutions
* Spark SQL programming guide
* Trends for BigData and Apache Spark in 2017