Member-only story
How to Build a Unique MD5 Row Hash Using SQL in BigQuery (Plus a Few Related Things)
Using native BigQuery functionality to generate a dynamic, unique row identifier in SQL
Motivation
There are a number of different reasons you might want to build a hash of each row of data in a BigQuery dataset, but they are mostly related to the fact that there is no enforced primary key in BigQuery.
This means that your data might contain duplicate rows, whether it’s from upstream data collection or transmission errors, misconfigured data ingestion tools, or simply a result of accidentally loading something twice.
Or something else. Who knows?
Whatever the source of duplicates, you should always try to understand what they are and probably filter them out (as well as identify the root cause, if you care about that kind of thing… FYI you probably should).
If you do have duplicate rows in your data, as soon as you start applying more complex transformations like joins, you risk amplifying the errors and invalidating downstream processes, models and decisions, costing hard-earned trust and potentially causing other unpredictable, unwanted outcomes.