Pay attention to this when UNNESTing in BigQuery

Constantin Lungu
Data Engineer’s Notes
2 min readOct 19, 2023
Photo by Pierre Bamin on Unsplash

Here’s a common confusion that I encounter while working with ARRAYs in BigQuery. This can lead to wrong queries in some situations. Consider the following two snippets:

FROM table, 
UNNEST(repeated_column) AS single_item

versus

FROM table 
LEFT JOIN UNNEST(repeated_column) AS single_item

They are very different.

In the first, the comma ‘,’ essentially behaves like a CROSS JOIN, pairing each row of the table with every element of the array in its repeated_column.

In the second, we’re using a LEFT JOIN, ensuring retention of all rows in the table, even those without a value in the repeated_column.

How is this important, you’ll ask. Consider the following case.

Imagine you’d need to compute the number of unique banks a person is a customer of in the example below.

A person can be a customer of one, multiple or no bank at all. This becomes important if a customer has no accounts, using UNNEST joined with the comma (CROSS JOIN) would exclude these entries.

SELECT 

person_id,
COUNT(DISTINCT bank_account.bank_name) AS count_distinct_banks

FROM input_data,
UNNEST(bank_accounts) AS bank_account

GROUP BY person_id

If we want to keep these entries, we’d need to use LEFT JOIN.

SELECT 

person_id,
COUNT(DISTINCT bank_account.bank_name) AS count_distinct_banks

FROM input_data
LEFT JOIN UNNEST(bank_accounts) AS bank_account

GROUP BY person_id

In this case, the first approach would miss out on the people without a bank account, while the second would include them. Quite a big difference in query results if you ask me.

Have fun writing good SQL!

Found it useful? Subscribe to my Analytics newsletter at notjustsql.com.

Originally published at https://datawise.dev.

--

--

Constantin Lungu
Data Engineer’s Notes

Software Engineer with a focus on Analytics — Data Engineering, Data Science & Business Intelligence | Opinions are my own