Data Deduplication

Published in Engineering @ Housing/Proptiger/Makaan

Dec 26, 2022

Data deduplication, or data dedup, is a process that eliminates or marks redundant copies of data.

It analyses the data and identifies anything that is stored multiple times. When duplicates are found, the information from the copies can be merged into a single record, or the duplicate records can simply be linked to one another.

Example of deduplication

A typical email system might contain 100 instances of the same 1 megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is stored; each subsequent instance is referenced back to the one saved copy. In this example, a 100 MB storage demand drops to 1 MB.
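The arithmetic in this example can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular email or backup product works: the `DedupStore` class and its method names are made up for this sketch, and it assumes duplicates are detected by comparing SHA-256 hashes of the file contents.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical payloads are kept only once."""

    def __init__(self):
        self.blobs = {}       # content hash -> payload bytes (stored once)
        self.references = []  # one entry per saved instance, pointing at a hash

    def save(self, payload: bytes) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        # Store the bytes only the first time this content is seen.
        if digest not in self.blobs:
            self.blobs[digest] = payload
        # Every instance still gets a reference back to the single stored copy.
        self.references.append(digest)
        return digest

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


attachment = b"x" * (1024 * 1024)  # stand-in for a 1 MB attachment
store = DedupStore()
for _ in range(100):
    store.save(attachment)

print(len(store.references))  # 100 references to the attachment
print(store.stored_bytes())   # 1048576 bytes, i.e. 1 MB instead of 100 MB
```

All 100 instances remain addressable through their references, but the payload itself occupies storage only once.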

Problems with redundant data

It increases the chance of mistakes and confusion. For example, multiple sales representatives may contact the same customer through different contact details.
It increases the chance of fraud. For example, the same user can claim a one-time offer repeatedly by using multiple accounts.
It drives up data storage costs, because the same data is stored more than once.
It makes data recovery more difficult due to inconsistency in the data.

Techniques to deduplicate data

There are two main techniques to deduplicate redundant data — inline and post-processing deduplication.

Inline deduplication: Deduplication occurs when data is created or modified.
Post-processing deduplication: Deduplication runs as a background activity, either on demand or at defined intervals.
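To make the difference concrete, here is a minimal sketch of both techniques applied to the same records. It assumes, purely for illustration, that two records are duplicates when their emails match after trimming and lower-casing. The function names and sample records are hypothetical.

```python
def normalize(email: str) -> str:
    # Assumed matching key: case-insensitive, whitespace-trimmed email.
    return email.strip().lower()


def insert_inline(records: list, seen: set, record: dict) -> bool:
    """Inline: check for a duplicate at write time; return True if stored."""
    key = normalize(record["email"])
    if key in seen:
        return False  # duplicate caught before it is stored
    seen.add(key)
    records.append(record)
    return True


def dedupe_post_process(records: list) -> list:
    """Post-processing: scan stored records and keep the first of each key."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record["email"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


incoming = [
    {"name": "A", "email": "a@example.com"},
    {"name": "A again", "email": "A@Example.com "},
    {"name": "B", "email": "b@example.com"},
]

# Inline: duplicates never reach storage.
inline_records, seen_keys = [], set()
inline_results = [insert_inline(inline_records, seen_keys, r) for r in incoming]

# Post-processing: everything is stored first, then cleaned by a later job.
raw_records = list(incoming)
cleaned_records = dedupe_post_process(raw_records)

print(inline_results)                          # [True, False, True]
print(len(raw_records), len(cleaned_records))  # 3 2
```

The trade-off is visible in the sketch: the inline path does extra work on every write but never stores a duplicate, while the post-processing path keeps writes cheap and temporarily holds duplicates until the background job runs.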

Advantages of deduplication

Data deduplication plays an important role in an effective data management strategy. Below are some benefits of data deduplication:

Improve bandwidth and recovery efficiency: With less duplicate data to slow them down, systems run faster and teams operate more efficiently. If you ever need to perform a recovery, the data transfer completes in less time because you restore only unique data and no duplicate files.
Improve sales and marketing campaigns: Deduplication increases the accuracy of your organisation’s insights, so you have better information to base your strategies on.
Decrease fraud: Deduplication helps reduce fraud by identifying users with multiple accounts at the time of account verification. It also prevents customers from taking undue advantage of offers.
Decrease data storage costs: Storing large volumes of data is expensive. Deduplicating your database keeps it from bloating with redundant records that needlessly drive up storage costs.
Decrease data verification costs: It is best practice to deduplicate data before verification, so that you do not pay to verify the same record multiple times.
Increase return on investment: Effective data deduplication can deliver a high return on investment (ROI) for the business from the start. By eliminating redundant data, you decrease your overall storage costs, data verification costs, and marketing costs on direct mail or call campaigns.

Uses of deduplication at Housing

At Housing, we use deduplication to link duplicate accounts. Using these links, we merge the information from duplicate accounts to provide a better customer experience and better suggestions for properties or packages. We use deduplication for several purposes, such as identifying duplicate listings and duplicate accounts.

Some of our customers deal in several cities and may create a separate account for each city. Deduplication helps us understand the needs, expectations and requirements of these customers better. It also helps our sales and marketing representatives deal with customers more efficiently. With duplicate accounts linked, we are also able to offer our customers the best deals.

Flow chart: working of inline deduplication at Housing

Let’s take the deduplication of user accounts as an example. Here, the account, contact and legal entity tables contain fields such as name, email, phone, Aadhaar number and GST number. The deduplication logic is based on email, phone, Aadhaar number, GST number, IP address, etc. As shown in the flow chart above, we use inline deduplication to identify duplicate accounts whenever an account, contact or legal entity is created or updated. If the created or updated record matches any existing record, our deduplication logic links the matching records.

Let’s understand how deduplication works through the above figure. We have four accounts, each with a name, email and phone. The first and second accounts have the same email, and the first and third accounts have the same phone number. So our deduplication logic links the first and second accounts on the basis of the duplicate email, and the first and third accounts on the basis of the duplicate phone. The fourth account matches nothing and stays independent.
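The four-account example can be reproduced with a small sketch. This is not our production code: the `AccountLinker` class, the sample values and the choice of matching on exact `email` and `phone` values only are assumptions made for illustration. It handles account creation only. An update would also need to remove the old values from the index.

```python
from collections import defaultdict

# Assumed matching fields for this sketch; the real logic also uses
# Aadhaar number, GST number, IP address, etc.
MATCH_FIELDS = ("email", "phone")


class AccountLinker:
    """Links each new account to existing accounts sharing a field value."""

    def __init__(self):
        self.index = defaultdict(set)  # (field, value) -> ids with that value
        self.links = set()             # (lower_id, higher_id, matching_field)

    def on_create(self, account: dict) -> list:
        new_links = []
        for field in MATCH_FIELDS:
            value = account.get(field)
            if not value:
                continue
            # Look up existing accounts with the same value for this field.
            for other_id in sorted(self.index[(field, value)]):
                pair = sorted((other_id, account["id"]))
                new_links.append((pair[0], pair[1], field))
            self.index[(field, value)].add(account["id"])
        self.links.update(new_links)
        return new_links


# Hypothetical sample data mirroring the four accounts described above.
accounts = [
    {"id": 1, "name": "Account 1", "email": "a@example.com", "phone": "9000000001"},
    {"id": 2, "name": "Account 2", "email": "a@example.com", "phone": "9000000002"},
    {"id": 3, "name": "Account 3", "email": "c@example.com", "phone": "9000000001"},
    {"id": 4, "name": "Account 4", "email": "d@example.com", "phone": "9000000004"},
]

linker = AccountLinker()
for account in accounts:
    linker.on_create(account)

print(sorted(linker.links))  # [(1, 2, 'email'), (1, 3, 'phone')]

linked_ids = {i for a, b, _ in linker.links for i in (a, b)}
print([a["id"] for a in accounts if a["id"] not in linked_ids])  # [4]
```

Because each lookup goes through an index keyed by field and value, the check at creation time does not have to scan every existing account.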

Conclusion

Data deduplication carries many benefits, including better sales and marketing campaigns, customer engagement, and overall ROI. It also helps you save storage space, speed up your systems, and run your operations more smoothly and with less risk of error. Removing duplicate data should be an ongoing process for ensuring data protection and quality.