BigQuery Deduplication

Published in

Google Cloud - Community

3 min readSep 10, 2019

Do you ever get duplicate rows in BigQuery?

I’m going to explore some techniques for deduplication in BigQuery both for the whole table and by partition. It assumes you have the dataset transactions.

Create Your Test Data

This will create a table containing two columns (date, v). There will be 21 days of data (partitioned by date), and each day will have a million rows. The value v will be mostly unique, but there will be a number of duplicates.

Create the Data

CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT
  date,
  CAST((i*2)+(2-FLOOR(RAND()*4.0)) AS INT64) AS v
FROM
  UNNEST(GENERATE_DATE_ARRAY(
    CURRENT_DATE(),
    DATE_ADD(CURRENT_DATE(), INTERVAL 20 DAY)
  )) AS date
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000000, 1)) AS i;

Examine your Data

SELECT
  date,
  COUNT(DISTINCT v) AS unique_values,
  COUNT(*) AS value
FROM
  `transactions.testdata`
GROUP BY
  1;

These are the first ten rows of output.

You can see that (depending on RAND()) there are quite a few unique_values, but always less than a full million. We want to remove the duplicates.

Deduplicate Everything — Easy Way

The easiest way is to re-create the whole table in place using DISTINCT. You need to use the same parameters (PARTITION BY) for the table.

Recreate the Table

CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;

This processed about 320MB and consumed 2 minutes 15 seconds of slot time.

Examine your Data

Running the query to examine the data shows it is now deduplicated:

However, if you have partitions and you’re appending to your table you don’t want to deduplicate the entire table every time. Rather, you want to de-duplicate just your partition. But how?

Deduplicate Some of the Table — Merge

Rerun the query to create the table again so that it is full of duplicates. This way you can see the partial de-duplication.

Deduplicate using MERGE

MERGE `transactions.testdata` t
USING (
  SELECT DISTINCT *
  FROM `transactions.testdata`
  WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

This processed 15.3MB and consumed 18 seconds of slot time. As there is 21 days of data, this makes sense.