Deduplicating: SQL vs. Python

Both SQL and Python offer powerful functions to help data engineers clean data and eliminate dreaded ‘dupes’ in datasets.


One of the most important skills a data engineer can master is deduplicating values to provide clean data for data consumers. Since raw data can vary in format and cleanliness, it is vital that data…
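To make the comparison concrete, here is a minimal sketch of removing exact duplicate rows both ways. It is only an illustration under assumptions: the `users` table, its `id` and `email` columns, and the sample rows are all hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for whatever database you actually use.

```python
import sqlite3

# Hypothetical sample rows: (id, email). The second and third rows are exact duplicates.
rows = [
    (1, "ana@example.com"),
    (2, "bo@example.com"),
    (2, "bo@example.com"),
    (3, "cy@example.com"),
]

# SQL approach: SELECT DISTINCT collapses rows that are identical in every selected column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")  # hypothetical table
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
sql_deduped = conn.execute(
    "SELECT DISTINCT id, email FROM users ORDER BY id"
).fetchall()
conn.close()

# Python approach: dict.fromkeys keeps the first occurrence of each row and preserves order.
py_deduped = list(dict.fromkeys(rows))

print(sql_deduped)
print(py_deduped)
```

Both approaches here only catch rows that match exactly. Near-duplicates, such as the same email with different casing or stray whitespace, would need normalizing first.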




Zach Quinn


DE @ Forbes. Pipeline: A Data Engineering Resource. Editor: Learning SQL. Opinions are my own.
