How to deduplicate rows in a BigQuery table
Duplicate data sometimes can cause wrong aggregates or results. You probably need to remove those duplicate rows before doing any aggregation, join or calculation. There are various ways to deal with duplicate data and you can find one of these methods to deal with the duplicate keys/columns/rows in this post .
Case-1; Get only one unique row per a key column
Example data:
Here you see there are some duplicates of full rows, you can simply use ‘distinct *’ if full row is duplicated:
select distinct * from ds.duplicates
Below query can be used to deduplicate this table based on the PK column:
As you may have recognised we are creating a new table rather than overwriting the existing table. If you want to overwrite existing table, you can use the same table name with create or replace:
You may have some additional columns in the table such as “load_date”, case-2 and case-3 shows how to handle those with same methodology.
Case-2; Deduplicate with a criteria such as keep newest
If you have some criteria for the removal, such as a date column, you can change the ordering key like:
Case 3; Deduplicate based on multiple columns
If you have multiple column keys or remove duplicates based on some columns you can use those columns within PARTITION BY clause:
Example: deduplicate based on Firstname and Lastname
Please let me know your preferred method by commenting in this post.