Data Cleansing Series: Duplication

Ahmad Mizan Nur Haq · Published in Data And Beyond · Jul 12, 2023

In this post, we will learn how to handle duplication in a dataset.

Duplicate data is a common problem in datasets. It can arise from a variety of sources, such as merging different datasets, manual data entry, or errors in data collection. Duplicate data can have a number of negative consequences, including:

  • Biasing the results of data analysis
  • Making it difficult to identify trends and patterns in the data
  • Increasing the size of the dataset, making it harder to manage and analyze

Types of Duplication

  • Exact Duplication: Occurs when all attributes or variables of a record are identical to another record in the dataset. This type is the easiest to identify because the rows match exactly.
  • Partial Duplication: Occurs when some attributes or variables match another record, but not all of them. Identifying and resolving it may require further analysis or comparison of specific variables.
  • Near-Duplication: Occurs when records are very similar but not identical, usually because of minor variations or errors in data entry. Detecting it may require techniques such as fuzzy matching or similarity measures (see the sketch after this list).
  • Subset Duplication: Occurs when a specific subset of variables is identical across records, rather than the entire record.
  • Temporal Duplication: Occurs when the same entity has multiple similar records at different points in time, for example repeated measurements or updates for that entity.
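As a quick illustration, here is a minimal pandas sketch of how the exact, subset, and near-duplication cases can be flagged. The DataFrame and its column names are hypothetical, invented for this example:

```python
import pandas as pd
from difflib import SequenceMatcher

# Hypothetical customer records illustrating the duplication types above
df = pd.DataFrame({
    "name":  ["Alice",   "Alice",   "alice",   "Bob",     "Bob"],
    "email": ["a@x.com", "a@x.com", "a@x.com", "b@x.com", "b@x.com"],
    "city":  ["Jakarta", "Jakarta", "Jakarta", "Bandung", "Surabaya"],
})

# Exact duplication: every column matches an earlier row
print(df.duplicated())                          # row 1 is flagged

# Subset duplication: only the chosen columns need to match
print(df.duplicated(subset=["name", "email"]))  # rows 1 and 4 are flagged

# Near-duplication: "Alice" vs "alice" are not equal but very similar;
# a simple similarity ratio from the standard library can surface them
print(SequenceMatcher(None, "Alice", "alice").ratio())  # 0.8
```

Rows that differ only in case or spelling slip past duplicated(), which is why near-duplicates usually need a similarity threshold: here, anything above roughly 0.8 could be reviewed as a possible match.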

In Practice: Let's Cook It 🍳
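The code embedded in the original post is not reproduced here, but a minimal sketch of the cleanup steps described in the conclusion below might look like the following. The DataFrame, the id column, and the score values are hypothetical:

```python
import pandas as pd

# Hypothetical measurements, where id 1 is duplicated exactly
# and id 2 appears twice with different scores (a temporal update)
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3],
    "score": [10, 10, 20, 25, 30],
})

# Drop exact duplicate rows, keeping the first occurrence (the default)
deduped = df.drop_duplicates()

# Subset duplication: deduplicate on "id" only, keeping the last
# occurrence -- handy when later rows are updates of earlier ones
latest = df.drop_duplicates(subset=["id"], keep="last")

# Drop every row that has any duplicate, keeping none of the copies
unique_only = df.drop_duplicates(keep=False)

print(deduped)
print(latest)
print(unique_only)
```

keep="first" and keep="last" decide which copy survives, while keep=False discards all copies; combined with subset, these options cover the exact, subset, and temporal cases described above.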

In conclusion, handling duplication in a dataset is an important step in data cleaning and preprocessing. Python provides several methods to handle duplication effectively, including dropping exact duplicate rows, removing duplicates over a subset of columns, and keeping the first or last occurrence of duplicates. By using these techniques, you can ensure the integrity and quality of your dataset, eliminate redundant information, and avoid misleading or biased results in your analysis.

Check out the full code 👇

Hope you enjoyed the content. Hit the follow and subscribe buttons if you think the writer deserves it 👋
