DeDuplicating: SQL vs. Python
Both SQL and Python offer powerful functions to help data engineers clean data and eliminate dreaded ‘dupes’ in datasets.
One of the most important processes a data engineer can master is deduplicating values in order to provide clean data for data consumers. Since raw data can vary in format and cleanliness it is vital that data engineers take steps to automate the cleaning of data in ETL pipelines to rid the dataset of a hidden pest: The duplicate value. In some cases it’s not necessary to filter out duplicates. However, when your job is to facilitate the smooth transfer of data to organizational stakeholders, you’ll want to eliminate any ambiguity in your data.
Dupe(d)
A duplicate or ‘dupe’ (shorthand used by many engineers and data consumers) can be deceitful because most functions in SQL and Python won’t flag these values as duplicates since they might have information that differs between columns.
Filtering Duplicates in SQL
For this walkthrough, I created a quick dummy dataset in BigQuery to illustrate what a duplicate value is, why they can be deceptive and why, in many cases, they should be filtered.