Deduplicating: SQL vs. Python

Both SQL and Python offer powerful functions to help data engineers clean data and eliminate dreaded ‘dupes’ in datasets.

[Image: a gloved hand holding a spray bottle. Photo by JESHOOTS.COM on Unsplash]

One of the most important processes a data engineer can master is deduplicating values to provide clean data for data consumers. Since raw data can vary in format and cleanliness, it is vital that data…
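To make the comparison concrete, here is a minimal sketch of exact-row deduplication done both ways. It is not taken from the article: the `users` table, the sample rows, and the helper names `dedupe_sql` and `dedupe_python` are all hypothetical, and SQLite (through Python's built-in `sqlite3` module) stands in for whatever database you actually use.

```python
import sqlite3

# Hypothetical sample data: the third row is an exact duplicate of the first.
rows = [
    ("a@example.com", "Ada"),
    ("b@example.com", "Bob"),
    ("a@example.com", "Ada"),
    ("c@example.com", "Cy"),
]


def dedupe_sql(records):
    """Remove exact duplicate rows with SELECT DISTINCT in an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (email TEXT, name TEXT)")  # assumed schema
    conn.executemany("INSERT INTO users VALUES (?, ?)", records)
    # DISTINCT collapses rows whose selected columns are all identical.
    result = conn.execute(
        "SELECT DISTINCT email, name FROM users ORDER BY email"
    ).fetchall()
    conn.close()
    return result


def dedupe_python(records):
    """Remove exact duplicate rows in plain Python, keeping first occurrences."""
    # dict.fromkeys keeps one key per distinct row and preserves insertion order.
    return list(dict.fromkeys(records))


print(dedupe_sql(rows))
print(dedupe_python(rows))
```

Both helpers only catch rows that match exactly. Near-duplicates, such as the same email with different casing or stray whitespace, would need to be normalized first.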


Zach Quinn


DE @ Forbes. Pipeline: A Data Engineering Resource. Editor: Learning SQL. Opinions are my own.
