Member-only story
Boosting Spark Union Operator Performance: Optimization Tips for Improved Query Speed
Demystify Spark Performance in Union Operator
The union operator is one of the set operators to merge two input data frames into one. Union is a convenient operation in Apache Spark for combining rows with the same order of columns. One frequently used case is applying different transformations and then unioning them together.
The ways of using the union operation in Spark are often discussed widely. However, a hidden fact that has been less discussed is the performance caveat associated with the union operator. If we didn’t understand the caveat of the union operator in Spark, we might fall into the trap of doubling the execution time to get the result.
We will focus on the Apache Spark DataFrame union operator in this story with examples, show you the physical query plan, and share techniques for optimization in this story.
Union Operator 101 in Spark
Like Relational Database (RDBMS) SQL, the union is a direct way to combine rows. One important thing to note when dealing with a union operator is to ensure rows follow the same structure:
- The number of columns should be identical. The union operation won’t silently work or fill…