How to deduplicate rows in a BigQuery table

Samet Karadag
Google Cloud - Community
2 min readNov 19, 2020

Duplicate data sometimes can cause wrong aggregates or results. You probably need to remove those duplicate rows before doing any aggregation, join or calculation. There are various ways to deal with duplicate data and you can find one of these methods to deal with the duplicate keys/columns/rows in this post .

Case-1; Get only one unique row per a key column

Example data:

Here you see there are some duplicates of full rows, you can simply use ‘distinct *’ if full row is duplicated:

select distinct * from ds.duplicates

Below query can be used to deduplicate this table based on the PK column:

As you may have recognised we are creating a new table rather than overwriting the existing table. If you want to overwrite existing table, you can use the same table name with create or replace:

You may have some additional columns in the table such as “load_date”, case-2 and case-3 shows how to handle those with same methodology.

Case-2; Deduplicate with a criteria such as keep newest

If you have some criteria for the removal, such as a date column, you can change the ordering key like:

Case 3; Deduplicate based on multiple columns

If you have multiple column keys or remove duplicates based on some columns you can use those columns within PARTITION BY clause:

Example: deduplicate based on Firstname and Lastname

Please let me know your preferred method by commenting in this post.

--

--