SQL 101: Union vs Union All
A common operator found in SQL reports and analytics is the UNION operator. UNION (and UNION ALL) are used to combine the results of two or more SELECT statements.
UNION
Union allows you to combine the results of two (or more) select statements into a single result data set. Union removes any duplicates in the two queries. Imagine taking the results of query 1, the results of query 2, combining them into a data set and then taking DISTINCT records from this set.
Each of the select statements must have the same number of fields, and the fields across the two select statements must have compatible data types. To summarize the union operation combines two more queries and removes the duplicates.
For example, let’s take the the tables Actresses and Musicians and write a UNION statement.
SELECT
actress_name AS artist
FROM actressUNIONSELECT
musician_name AS artist
FROM musicians
The output of the above will be all of the actresses and all of the musicians, with duplicates removed. Take note that Jennifer Lopez only appears once.
UNION ALL
Union All allows you to combine the results of two (or more) select statements into a single result data set. Union All DOES NOT remove any duplicates in the two queries. Imagine taking the results of query 1, the results of query 2, combining them into a data set, if a record existed in both query 1 and query two it would occur twice in the result set.
For example, let’s take the same two tables Actresses and Musicians and write a UNION ALL statement.
SELECT
actress_name AS artist
FROM actressUNION ALLSELECT
musician_name AS artist
FROM musicians
The output of the above will be all of the actresses and all of the musicians, but instead of removing duplicates, duplicates remain. Take note that Jennifer Lopez only two times in the result set when using UNION ALL.
The Difference between UNION and UNION ALL
- UNION removes duplicate rows
- UNION ALL does not remove duplicate rows
Practical Examples
What should you do if you have two queries where the result data sets have a different number of columns? Since UNION requires the two queries to have the same number of columns, you can’t just combine the two results. You can use NULL or a hard coded value to populate the missing columns and ensure that both queries have the same number of columns.
We can use the below query to combine actresses and the number of films they have performed in and musicians and the number of films they have performed in.
SELECT
actress_name AS artist,
number_of_films
FROM actressesUNIONSELECT
musician_name AS artist,
0 AS number_of_films
FROM musicians
One issue we are going to run into is Jennifer Lopez. Because UNION removes duplicates you might assume that we will only see JLo once in our result set, but because UNION removes duplicates across all columns we will see two records for JLo, one with a film count of 38, and one with a film count of 0.
We can solve this issue by adding a little logic on to our query as seen below. By adding an outer query, grouping by artist and summing the number_of_films we can reduce the two JLo records down to one.
SELECT
artist,
SUM(number_of_films) AS number_of_films
FROM (
SELECT
actress_name AS artist,
number_of_films
FROM actresses UNION SELECT
musician_name AS artist,
0 AS number_of_films
FROM musicians
)
GROUP BY artist
Tip: be careful when using the UNION
operation on the same table as it can cause the query to read the same table multiple time. Instead of UNION
consider refactoring the query iso that all conditions are placed in one SELECT
instruction, or using an OUTER JOIN
instead of UNION
.
I have also covered all of the above in this helpful video tutorial:
You can crush the data engineering interview by learning the tips and tricks in my book, Ace The Data Engineering Interview, on kindle and paperback, and my free companion app on iOS.