Authors: Dilip Biswal, Miao Wang, Douglas Paton
We use Apache Spark in Adobe Experience Platform as a way of helping us process and stream data in the ecosystem. It acts as a unified analytics engine that makes it possible to access and transform massive amounts of data with ease.
Spark provides many interfaces, including SQL, to access and transform data at scale. The current version of Spark allows users to view the query plan within SQL using an EXPLAIN statement. This provides users with three options when viewing the SQL plan:
- View the parsed plan.
- View an optimized plan.
- View the physical plan.
The first two options were generally most useful in the development stage of an application, while the third was helpful in both the development and production stages. The problem was that the physical plan in the EXPLAIN output could get cluttered. For example, running EXPLAIN on the query statement in Figure 1 would give users something that looked like Figure 2.
As you can see, Figure 2 isn’t particularly easy to read. Each operator is output into a single line (the bottom line in Figure 2) and, should there be a large number of operators, this becomes very hard to follow. Not only that, but subqueries (Figure 3) get displayed as part of the main plan and this becomes incredibly complex and time-consuming to read (as seen in Figure 4).
The more complex the SQL execution plan, the harder the EXPLAIN output was to read, especially when numerous subqueries were required. The output was further complicated by the fact that the string representation of each operator could be very wide and wrap around in the display, usually because the underlying query operated on a wide table or contained complex expressions.
Subqueries were also printed as part of the main plan. When multiple subqueries originated from a single enclosing operator, the output plan was confusing and hard to follow.
As a result of conversations we had with customers, we decided to improve the EXPLAIN output to reduce the clutter and make it easier to read.
Decluttering the EXPLAIN output
To achieve our goal, we decided to separate the output into two distinct sections: a header and a footer.
The header contains the basic operating tree for the SQL execution plan. Within that plan, the various operators are clearly identified using operator IDs, like this: (1). We included as little information as possible in the header to keep it simple and easy to follow.
In the footer, each operator is again identified by its operator ID, followed by the additional attributes of that operator. Because an operator's details are no longer restricted to a single line, we can lay out this section in a more user-friendly way.
As you can see in Figure 5, the resulting output is significantly easier to follow.
By separating the output into a header section and a footer section, we eliminated the massive single line of text that identified the operators. Even when a large number of operators are being used, the EXPLAIN output keeps everything organized in a way that is easy for anyone to follow, especially when compared to the single line output from Figure 2.
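To make the header and footer idea concrete, here is a minimal Python sketch of a formatter that prints a compact operator tree first and the per-operator details second. It is not Spark's implementation: the `Operator` class, the `format_plan` function, the attribute names, and the top-down numbering of operator IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Operator:
    """A node in a toy physical plan (hypothetical, not a Spark class)."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Operator"] = field(default_factory=list)


def format_plan(root: Operator) -> str:
    """Render a plan as a tree of names and IDs (header) plus details (footer)."""
    header: List[str] = []
    footer: List[str] = []
    counter = 0

    def visit(node: Operator, depth: int) -> None:
        nonlocal counter
        counter += 1  # assumption: IDs are assigned top-down in visit order
        op_id = counter
        indent = "   " * (depth - 1) + "+- " if depth else ""
        # Header: only the operator name and its ID, one short line each.
        header.append(f"{indent}{node.name} ({op_id})")
        # Footer: the same ID, then one attribute per line.
        footer.append(f"({op_id}) {node.name}")
        for key, value in node.attributes.items():
            footer.append(f"{key}: {value}")
        footer.append("")
        for child in node.children:
            visit(child, depth + 1)

    visit(root, 0)
    return "\n".join(header + [""] + footer).rstrip() + "\n"


# Illustrative plan; operator names and attributes are made up.
plan = Operator("HashAggregate", {"Keys": "[key]", "Functions": "[max(val)]"}, [
    Operator("Exchange", {"Arguments": "hashpartitioning(key, 200)"}, [
        Operator("Scan", {"Output": "[key, val]"}),
    ]),
])
print(format_plan(plan))
```

The header stays narrow no matter how wide an operator's attributes are, because long expressions only ever appear in the footer, where they can span as many lines as they need.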
Best of all, when there are uncorrelated subqueries in the plan, the new output makes the plan very easy to follow, as seen in Figure 6. All the subqueries track back to their parent using the operator ID.
In both instances, the structure of the plan can be identified much more easily with the new output than before.
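The way subqueries track back to their parent can be sketched in the same spirit. In the Python sketch below, each subquery is printed in its own section and labelled with the ID of the operator that hosts it, instead of being inlined into the main plan. The `Subquery` record, the `format_subqueries` function, and the exact label wording are assumptions for illustration, not Spark's actual code or output format.

```python
from typing import List, NamedTuple


class Subquery(NamedTuple):
    """Hypothetical record linking a subquery plan to its hosting operator."""
    host_op_id: int
    plan_lines: List[str]


def format_subqueries(subqueries: List[Subquery]) -> str:
    """Print each subquery separately, labelled with its hosting operator ID."""
    sections = []
    for index, sub in enumerate(subqueries, start=1):
        title = f"Subquery:{index} Hosting operator id = {sub.host_op_id}"
        sections.append("\n".join([title] + sub.plan_lines))
    return "\n\n".join(sections)


# Two illustrative subqueries that both belong to operator (2) of a main plan.
subs = [
    Subquery(2, ["HashAggregate (5)", "+- Scan (4)"]),
    Subquery(2, ["HashAggregate (7)", "+- Scan (6)"]),
]
print(format_subqueries(subs))
```

Because each section names its hosting operator, a reader can jump from an operator in the main tree to the subqueries it depends on, even when several subqueries hang off the same operator.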
Continuing the decluttering process
Now that we have the basic infrastructure in place, we can start looking into other areas of Spark that we can clean up. At this stage, we can only look at the printed output of the plan, and we can currently only access it through the traditional SQL interface. We are working on making it available through the Dataset and DataFrame APIs as well. We also hope to expand beyond SQL to other supported languages, such as Scala, Python, and R.