BigQuery Table Comparison

Mark Scannell
Sep 30 · 3 min read

Introduction

Sometimes you need to compare data across two BigQuery tables. You may want to do this generically (match entire-row-by-row), sometimes comparing by key. This is far easier than you may think!

Sample Data

NOTE: You need the transactions dataset for this SQL.

Create a left_table with a million rows. This will contain the base table for comparison.

CREATE OR REPLACE TABLE `transactions.left_table` AS
SELECT
r AS id,
CAST(r AS STRING) AS col1,
CAST(FLOOR(RAND() * 1000) AS STRING) AS col2,
'x' AS col3,
'y' AS col4
FROM
UNNEST(GENERATE_ARRAY(1, 1000000)) r;

Create a right_table with mostly the same data, but some deletes, updates, and inserts. There is a probability used for deletes and updates, while 100 rows are simply inserted.

CREATE OR REPLACE TABLE `transactions.right_table` AS
WITH
DataChanges AS (
SELECT
* EXCEPT (col3, col4),
IF(RAND() < 0.001, 'new-x', col3) AS col3,
col4
FROM
`transactions.left_table`
WHERE
RAND() > 0.001
),
NewData AS (
SELECT
r AS id,
CAST(r AS STRING) AS col1,
CAST(FLOOR(RAND() * 1000) AS STRING) AS col2,
'x' AS col3,
'y' AS col4
FROM
UNNEST(GENERATE_ARRAY(1000000, 1000100)) r
)
SELECT * FROM DataChanges
UNION ALL
SELECT * FROM NewData;

Whole Table Comparison

A whole table comparison will generate a FARM_FINGERPRINT per row. This will allow us to do a FULL OUTER JOIN on this hash and see what values exist on one side but not the other.

Note that without a key we cannot tell if a row has been updated.

WITH
LeftData AS (
SELECT
a AS data,
FARM_FINGERPRINT(FORMAT("%T", a)) AS h
FROM
`transactions.left_table` AS a
),
RightData AS (
SELECT
b AS data,
FARM_FINGERPRINT(FORMAT("%T", b)) AS h
FROM
`transactions.right_table` AS b
)
SELECT
IF(l.h IS NULL,"New on right","New on left") AS Change,
IF(l.h IS NULL,r.data,l.data).*
FROM
LeftData l
FULL OUTER JOIN RightData r
ON l.h = r.h
WHERE
l.h IS NULL OR
r.h IS NULL

The above could be changed so that — rather than listing the rows — it could count the rows with a GROUP BY.

Key by Key Comparison

Rather than just comparing the whole table, we want to compare row-by-row. So we will use the key to determine if it’s an insert, update, or delete.

SELECT
CASE
WHEN a.id IS NULL AND b.id IS NOT NULL THEN "I"
WHEN a.id IS NOT NULL AND b.id IS NULL THEN "D"
ELSE "U"
END AS op,
IF(b.id IS NULL, a, b).*
FROM
`transactions.left_table` a
FULL OUTER JOIN `transactions.right_table` b
ON a.id = b.id
WHERE
a.id IS NULL OR
b.id IS NULL OR
FARM_FINGERPRINT(FORMAT("%T", a)) <>
FARM_FINGERPRINT(FORMAT("%T", b));

This will generate a Change-Data-Capture like log of changes between left_table and right_table. Below is the sample output from running this.

Conclusion

BigQuery can be used for creating test cases, validating your data, generating CDC logs from snapshots fairly easily. This can be done without an explicit column-by-column comparison.

Thanks to Thinh Ha for the comparison queries.

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Mark Scannell

Written by

Strategic Cloud Engineer at Google Cloud

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade