Combining Spark DataFrames with Union and Union All
A Beginner’s Guide to Combining Data in Spark
In this article, we’re going to learn about `union()` and `unionAll()` with examples. They are operations in Spark SQL that combine two or more DataFrames with the same schema into a single DataFrame.
In SQL, `UNION` combines result sets while eliminating duplicate rows, and `UNION ALL` combines them while keeping all rows, duplicates included. The Spark DataFrame API does not follow this naming: both `union()` and `unionAll()` behave like SQL `UNION ALL` and keep duplicates.
In summary:
union(): returns a new DataFrame with all rows from the input DataFrames, including duplicates.
unionAll(): despite the name, behaves identically to union(); it does not remove duplicates.
The unionAll() method has been deprecated since Spark 2.0.0 and replaced with union(). To get SQL `UNION` semantics (unique rows), chain a distinct() call after union(), as shown later in this article.
Combining two DataFrames in PySpark using `union()`
Here’s an example of using the `union()` method to combine two Spark DataFrames in PySpark:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("DataFrame Union Example").getOrCreate()
# Create two sample DataFrames
df1 = spark.createDataFrame([(1, "John", 25), (2, "Jane", 30), (3, "Jim", 35)], ["id", "name", "age"])
df2 = spark.createDataFrame([(4, "Jerry", 40), (2, "Jane", 30), (5, "Jill", 45)], ["id", "name", "age"])
# Use union operation to combine the two DataFrames
result_df = df1.union(df2)
# Show the resulting DataFrame
result_df.show()
The output of the above code will be:
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1| John| 25|
|  2| Jane| 30|
|  3|  Jim| 35|
|  4|Jerry| 40|
|  2| Jane| 30|
|  5| Jill| 45|
+---+-----+---+
As you can see, the resulting DataFrame contains all rows from both input DataFrames, and the duplicate row (2, Jane, 30) appears twice: union() does not remove duplicates.
Combining two DataFrames in PySpark using `unionAll()`
The DataFrame unionAll() method has been deprecated since Spark 2.0.0, and the documentation recommends using union() instead.
df3 = df1.unionAll(df2)
df3.show()
Since unionAll() is simply an alias of union(), it returns the same output as the union() example above, duplicates included.
Combining two DataFrames without Duplicates using `union()` and `distinct()`
You can combine two or more Spark DataFrames without duplicates by chaining the distinct() method after union(). The distinct() operation removes duplicate rows and returns a new DataFrame containing only unique rows, which matches SQL `UNION` semantics.
Here’s an example of using union() together with distinct() to combine two Spark DataFrames in PySpark:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("DataFrame Distinct Example").getOrCreate()
# Create two sample DataFrames
df1 = spark.createDataFrame([(1, "John", 25), (2, "Jane", 30), (3, "Jim", 35)], ["id", "name", "age"])
df2 = spark.createDataFrame([(4, "Jerry", 40), (2, "Jane", 30), (5, "Jill", 45)], ["id", "name", "age"])
# Combine the two DataFrames
combined_df = df1.union(df2)
# Remove duplicates using the distinct operation
result_df = combined_df.distinct()
# Show the resulting DataFrame
result_df.show()
The output of the above code will be:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 5| Jill| 45|
| 2| Jane| 30|
| 3| Jim| 35|
| 1| John| 25|
| 4|Jerry| 40|
+---+-----+---+
As you can see, the resulting DataFrame contains unique rows from both input DataFrames and any duplicates are removed.
Conclusion
In this article, you have learned how to combine two or more Spark DataFrames of the same schema into a single DataFrame using the union() method, and that unionAll() is a deprecated alias with identical behavior. Unlike SQL UNION, neither removes duplicates; chain distinct() after union() when you need unique rows.