BigQuery Deduplication
Do you ever get duplicate rows in BigQuery?
I’m going to explore some techniques for deduplication in BigQuery both for the whole table and by partition. It assumes you have the dataset transactions.
Create Your Test Data
This will create a table containing two columns (date, v). There will be 21 days of data (partitioned by date), and each day will have a million rows. The value v will be mostly unique, but there will be a number of duplicates.
Create the Data
CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT
date,
CAST((i*2)+(2-FLOOR(RAND()*4.0)) AS INT64) AS v
FROM
UNNEST(GENERATE_DATE_ARRAY(
CURRENT_DATE(),
DATE_ADD(CURRENT_DATE(), INTERVAL 20 DAY)
)) AS date
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000000, 1)) AS i;
Examine your Data
SELECT
date,
COUNT(DISTINCT v) AS unique_values,
COUNT(*) AS value
FROM
`transactions.testdata`
GROUP BY
1;
These are the first ten rows of output.
You can see that (depending on RAND()) there are quite a few unique_values, but always less than a full million. We want to remove the duplicates.
Deduplicate Everything — Easy Way
The easiest way is to re-create the whole table in place using DISTINCT. You need to use the same parameters (PARTITION BY) for the table.
Recreate the Table
CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;
This processed about 320MB and consumed 2 minutes 15 seconds of slot time.
Examine your Data
Running the query to examine the data shows it is now deduplicated:
However, if you have partitions and you’re appending to your table you don’t want to deduplicate the entire table every time. Rather, you want to de-duplicate just your partition. But how?
Deduplicate Some of the Table — Merge
Rerun the query to create the table again so that it is full of duplicates. This way you can see the partial de-duplication.
Deduplicate using MERGE
MERGE `transactions.testdata` t
USING (
SELECT DISTINCT *
FROM `transactions.testdata`
WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
This processed 15.3MB and consumed 18 seconds of slot time. As there is 21 days of data, this makes sense.
Examine your Data
One of the days — the current date as of writing — is no deduplicated, but none of the other days. It is as expected!
Conclusion
It is very easy to deduplicate rows in BigQuery across the entire table or on a subset of the table, including a partitioned subset.