In-Depth Analysis.

Google Sheets Remove Duplicates with Fuzzy Matching

Fuzzy Duplicates: 10 Advanced Ways to Identify & Deduplicate Customer Master Data for Google Sheets

Bena Brin

Published in

Analytics Vidhya

10 min read · Oct 3, 2021

Download remove duplicates for google sheets tools based on text similarity

If you’re in charge of client data management, you’ve almost certainly dealt with the problems that duplicate data can cause. Whether the duplicate data entered your system as a result of clients filling out forms, your team manually inputting the data, or imports from other platforms, the implications are the same, and they’re expensive.

In fact, the costs of duplicate data are much larger than you might think. Every year, data integrity issues cost American firms more than $600 billion. Duplicate contacts, companies, and deals in your CRM may be the data issue most directly linked to those data-quality costs. They are detrimental to customer relationships, they are found in almost every CRM database, and their impact on your marketing, sales, and support efforts is typically obvious.

Duplicates have a significant impact on sales teams. Reps are compelled to change their usual sales processes to include checks for duplicates in databases with high duplicate rates, or risk engaging prospects and accounts without crucial context, causing client relationships to suffer.

They wreak havoc on your marketing automation by generating embarrassing gaffes that tarnish your brand’s image and waste your marketing budget.

Bad data is found in 40% of leads. Fixing those flaws creates a significant opportunity for development, as 33% of organizations have more than 100,000 client records in their CRM.

Duplicate contact records can also make it difficult to provide a satisfying customer support experience. When a customer contacts you via phone, email, or live chat, your service will be slower and less effective if your support reps must go through many customer records to find the right profile. Quick access to customer data is vital to their job.

Anyone who has done a significant amount of duplicate data cleaning knows that using basic exact match values to find duplicates leaves a lot of meat on the bone. It’s possible that you’re leaving most duplicates in your database.

You must delve deeper into data deduplication in your CRM to properly master it.

When you look beyond the obvious exact-match duplicates in a CRM database, you’ll notice that many more duplicates exist where the waters are murkier.

These less obvious duplicate scenarios are far more common than most people realize, and they must be considered if you want to eliminate duplicates from your database.

We’ll go over some of the more advanced forms of duplicate records that you’re likely to find in your CRM databases in this article.

Deduplication of Customer Data

1. Various Expressions of Common Terms

Common terms being expressed in different ways is one of the most frequent reasons duplicate client data goes undetected in a database.

Let’s look at a few examples. Let’s imagine you’re running a contact data deduplication process in HubSpot and one of the key ways to match duplicate entries in your database is by company name.

In separate customer records that are truly duplicates, the firm name may be expressed differently.

Consider the following example:

Microsoft Inc.
Microsoft Incorporated

Because the firm name is written differently, you are likely to overlook the duplicate records, even though it is evident to a human reader that they are redundant.

Let’s look at another example: job titles

This is why data standardization is so important. Without it, finding duplicate customer data is nearly impossible. If you don’t have standardized processes in place, you’ll almost certainly have duplicate records in your CRM.

2. Nicknames and short names

People are frequently referred to by many names. They may go by a nickname or initials, or they may use a shorter, more casual version of their first name.

If a man’s name was Jonathan Paul Johnson, for example, you might see his name written in a variety of ways across several duplicate CRM contact records:

He might also go by a nickname like “Bud,” “Junior,” or something else entirely. It would be quite easy to miss the duplicate record in any of these scenarios using standard duplication detection algorithms.
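One common way to handle short names is a lookup table that maps each nickname to a canonical first name before records are compared. The sketch below is only an illustration: the `NICKNAMES` table and both function names are hypothetical, and a real table would be far larger.

```python
# Hypothetical nickname table; a production table would be much larger.
NICKNAMES = {
    "jon": "jonathan",
    "bill": "william",
    "liz": "elizabeth",
}


def canonical_first_name(name: str) -> str:
    """Map a first name or nickname to a canonical form for comparison."""
    key = name.strip().lower().rstrip(".")
    return NICKNAMES.get(key, key)


def same_person_name(first_a: str, last_a: str, first_b: str, last_b: str) -> bool:
    """Compare two names after resolving known nicknames."""
    same_first = canonical_first_name(first_a) == canonical_first_name(first_b)
    same_last = last_a.strip().lower() == last_b.strip().lower()
    return same_first and same_last


print(same_person_name("Jon", "Johnson", "Jonathan", "johnson"))  # True
print(same_person_name("Bud", "Johnson", "Jonathan", "Johnson"))  # False
```

Free-form nicknames such as “Bud” cannot be resolved by a generic table, so those records still need a second signal such as email or phone number.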

3. Typos

When humans handle data entry, there will always be typos. If you have client- or employee-facing forms (i.e., you don’t collect all data by automated means), you can bet you have duplicate data in your database that is slipping past your checks because of typos.

The average human data-entry error rate is about 1%. That means roughly one out of every hundred keystrokes will be incorrect.

Any field that relies on human input can encounter problems, especially in larger customer datasets. Due to these concerns, locating duplicate customer data is challenging.

4. Suffixes and Titles

Contact data with a suffix or title can cause you to miss duplicate records in your client database that would otherwise be clear.

Using Jonathan Johnson as an example, you might have duplicate records that look like:

• Jonathan Johnson, Ph.D.
• Jon Johnson, M.D.
• Mr. Jonathan Johnson, Esq.
• Jonathan Johnson Jr.
• Jonathan Johnson III
• Jonathan Johnson, Attorney at Law

No matter where the data came from, whether it was entered by the person themselves or acquired from a third-party list, titles and suffixes need to be accounted for.
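A minimal sketch of stripping titles and suffixes before comparing names is shown below. The `TITLES` and `SUFFIXES` sets are illustrative and not exhaustive, the function name is hypothetical, and the code assumes names are stored in “First Last” order.

```python
# Illustrative, non-exhaustive lists of honorifics and suffixes.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "phd", "md", "esq"}
IGNORED = TITLES | SUFFIXES


def strip_titles_and_suffixes(full_name: str) -> str:
    """Return a lowercase name with honorifics and suffixes removed."""
    # Drop anything after a comma, e.g. ", Attorney at Law" or ", Ph.D.".
    base = full_name.split(",", 1)[0]
    # Remove periods so "Jr." becomes "Jr" and "Mr." becomes "Mr".
    tokens = base.replace(".", "").split()
    kept = [t for t in tokens if t.lower() not in IGNORED]
    return " ".join(kept).lower()


for name in ["Jonathan Johnson, Ph.D.", "Mr. Jonathan Johnson",
             "Jonathan Johnson Jr.", "Jonathan Johnson III"]:
    print(strip_titles_and_suffixes(name))  # jonathan johnson (each time)
```

A record like “Jon Johnson, M.D.” reduces to “jon johnson”, so it still needs nickname handling or fuzzy matching to be caught.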

5. Considerations for Website URLs

Using a website URL to discover duplicate company records in a CRM is popular. The URL field in two customer entries may or may not include the “www.” or “http://” prefix, which causes duplicate records to be missed.

Alternatively, different top-level domains may be used in different customer records. Compare, for example, microsoft.com and microsoft.co.uk. Subdomains are another common reason for duplicate data being missed. A university, for example, might have several departments, each with its own subdomain: math.school.edu, english.school.edu, physics.school.edu, and so on.

To guarantee that your database is free of potential concerns, evaluate all of these website URL factors.
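As a rough sketch, the scheme and “www.” differences can be removed with Python’s standard `urllib.parse` module before URLs are compared. The function name `normalize_domain` is hypothetical.

```python
from urllib.parse import urlparse


def normalize_domain(url: str) -> str:
    """Reduce a URL to a bare lowercase host name without 'www.'."""
    raw = url.strip().lower()
    # urlparse only fills in the host when a scheme (or "//") is present.
    if "://" not in raw:
        raw = "//" + raw
    host = urlparse(raw).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(normalize_domain("http://www.microsoft.com"))  # microsoft.com
print(normalize_domain("WWW.Microsoft.com/en-us"))   # microsoft.com
print(normalize_domain("math.school.edu"))           # math.school.edu
```

This sketch deliberately leaves top-level domains and subdomains alone. Deciding that microsoft.co.uk or math.school.edu belongs with another record needs a public-suffix list or a business rule, which is outside the scope of this example.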

6. Similarity-based matching (Fuzzy Matching or Fuzzy lookups)

Using only “exact match” identification will almost always leave many duplicates in your CRM. There are simply too many possible variations in many fields for this to be sufficient.

“Fuzzy matching” is a programmatic technique for examining data and identifying customer records that are similar but not exact matches. It works by measuring how close two values are. Closeness is measured by the number of changes required to turn one value into the other. The number of insertions, deletions, and substitutions required to make two different pieces of data match exactly is known as the “edit distance.”

insertion: bar → barn
deletion: barn → bar
substitution: barn → bark
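The edit distance described above (often called Levenshtein distance) can be computed with a short dynamic-programming routine. This is a minimal sketch for illustration, not the implementation used by any particular tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("bar", "barn"))   # 1 (insertion)
print(edit_distance("barn", "bar"))   # 1 (deletion)
print(edit_distance("barn", "bark"))  # 1 (substitution)
```

A small distance relative to the length of the strings suggests that the two values may be the same entry with a typo.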

You’ll never find all the duplicates in a larger database without similar and fuzzy-matching methods in place.

This can cause your team to miss out on engaging with key players within the account, resulting in missed sales in account-based marketing and sales.

Almost any field in your CRM can benefit from fuzzy matching of duplicate customer data. You’ll find a variety of small variations in your database, most of which you wouldn’t notice until you saw fuzzy matching in action.
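In practice, fuzzy matching is often applied as a similarity score with a cutoff. The sketch below uses `difflib.SequenceMatcher` from the Python standard library, which computes a ratio based on matching blocks of characters rather than a strict edit distance. The 0.85 threshold and the function names are assumptions chosen for illustration and would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two values as probable duplicates above an assumed threshold."""
    return similarity(a, b) >= threshold


print(is_fuzzy_match("Jonathan Johnson", "Jonathon Johnson"))  # True
print(is_fuzzy_match("Jonathan Johnson", "Maria Garcia"))      # False
```

A lower threshold catches more duplicates but also produces more false positives, so matches near the cutoff are usually reviewed by a person before merging.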

When you realize how widespread this issue is, you’ll naturally wonder how many of these flaws exist in your CRM and what impact they’re having on your bottom line.

7. External System IDs

External IDs are required for integrating and syncing two separate platforms in order to correlate client records between them.

To ensure that the contact data sync isn’t damaged, data deduplication processes frequently must take these external system IDs into account.

For instance, you might wish to send emails to your prospects and customers using marketing automation. You’ll want that to be reflected in your sales CRM as well, so reps get a complete picture of their interactions.

Connecting HubSpot and Salesforce can result in a slew of data issues between the two platforms. The same is true of integrations between any two CRMs or platforms that collect different types of data or use different field names to describe the same data.

One of the fields in any popular CRM is an ID number that is used to identify the record. This field is ideal for detecting duplicate records, yet it is frequently neglected during data cleansing processes.

For example, the Salesforce Contact ID may be used to identify duplicate contact entries in HubSpot. Because of changes to your HubSpot data, the sync may have created two separate entries when it should have appended or updated data in the original record.
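A simple way to check for this is to group records by the external ID and flag any ID that appears more than once. In the sketch below, the field name `salesforce_contact_id` and the record layout are hypothetical; real exports will use their own column names.

```python
from collections import defaultdict


def group_by_external_id(records, id_field="salesforce_contact_id"):
    """Return {external_id: [records]} for IDs shared by 2+ records."""
    groups = defaultdict(list)
    for record in records:
        external_id = record.get(id_field)
        if external_id:  # skip records that were never synced
            groups[external_id].append(record)
    return {k: v for k, v in groups.items() if len(v) > 1}


contacts = [
    {"email": "jon@example.com", "salesforce_contact_id": "003A"},
    {"email": "jonathan@example.com", "salesforce_contact_id": "003A"},
    {"email": "maria@example.com", "salesforce_contact_id": "003B"},
    {"email": "new@example.com", "salesforce_contact_id": None},
]
duplicates = group_by_external_id(contacts)
print(list(duplicates))  # ['003A']
```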

8. Duplicate Detection Field or column

One major concern is that many duplicate customer records get through the cracks because the organization is only focused on identifying duplicates using specified fields, with no secondary checks in place to verify that no duplicates are missed.

For example, you might use the first name, last name, and phone number to find duplicates. Checking that combination of fields catches most of your duplicate records.

When the first check fails to uncover a duplicate, adding a secondary check, such as First Name, Last Name, Address, can help you find and rectify free-floating duplicates that would otherwise go unnoticed.
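The primary and secondary checks can be expressed as a list of field combinations that are tried one after another. This is a minimal sketch with hypothetical field names (`first`, `last`, `phone`, `address`). It skips any key that has an empty component so that blank fields do not match each other.

```python
def make_key(record, fields):
    """Build a normalized comparison key from the given fields."""
    return tuple(str(record.get(f, "")).strip().lower() for f in fields)


def find_duplicate_pairs(records, key_sets):
    """Return index pairs of records that collide on any of the key sets."""
    pairs = set()
    for fields in key_sets:
        seen = {}  # key -> index of the first record with that key
        for idx, record in enumerate(records):
            key = make_key(record, fields)
            if not all(key):  # ignore keys with an empty component
                continue
            if key in seen:
                pairs.add((seen[key], idx))
            else:
                seen[key] = idx
    return pairs


records = [
    {"first": "Jon", "last": "Johnson", "phone": "1234567890", "address": "1 Main St"},
    {"first": "Jon", "last": "Johnson", "phone": "", "address": "1 Main St"},
    {"first": "Jon", "last": "Johnson", "phone": "1234567890", "address": ""},
]
primary = ("first", "last", "phone")
secondary = ("first", "last", "address")
print(sorted(find_duplicate_pairs(records, [primary])))             # [(0, 2)]
print(sorted(find_duplicate_pairs(records, [primary, secondary])))  # [(0, 1), (0, 2)]
```

The second call shows the pair that the primary check alone would have missed.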

9. Different Formats for Phone Numbers

In CRMs, phone numbers are frequently utilized to identify duplicate contacts and accounts.

It’s logical. It’s possible that a contact with two duplicate records used the same phone number for both. Furthermore, because mainline numbers are unlikely to change frequently, they can be used as a dependable field for duplicate detection.

However, using phone numbers as the primary field for this purpose has certain drawbacks.

First, a phone number can be formatted in your database in a variety of ways.

For example:

1234567890
123-456-7890
(123)-456-7890
123.456.7890
1-123-456-7890
123 456 7890
Etc.

Relying on the phone number field alone will almost always leave many unidentified duplicates in your database. Second, this field is prone to typos and other problems: values may contain stray spaces or wrong digits. They may also include an extension number, so a “#” appears in some of your phone fields.
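Formatting differences like the ones listed above can be removed by reducing each value to its digits before comparison. The sketch below assumes 10-digit North American numbers with an optional leading country code “1”, and it treats “#”, “x”, or “ext” as the start of an extension. Both are assumptions for illustration, and the function name is hypothetical.

```python
import re


def normalize_phone(raw: str, country_code: str = "1") -> str:
    """Reduce a phone number to its digits, dropping extension and country code."""
    # Cut off an extension introduced by "#", "x", or "ext".
    main = re.split(r"#|x|ext", raw.lower(), maxsplit=1)[0]
    # Keep digits only: spaces, dots, dashes, and parentheses disappear.
    digits = re.sub(r"\D", "", main)
    # Drop a leading country code from 11-digit numbers.
    if len(digits) == 11 and digits.startswith(country_code):
        digits = digits[len(country_code):]
    return digits


formats = ["1234567890", "123-456-7890", "(123)-456-7890",
           "123.456.7890", "1-123-456-7890", "123 456 7890",
           "123-456-7890 #22"]
print({normalize_phone(p) for p in formats})  # {'1234567890'}
```

Typos inside the digits themselves are not fixed by normalization, which is why phone numbers work best in combination with other fields.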

10. Cross-checking Similar Fields

Your CRM may collect data in fields that are similar to one another, increasing the chances of data being entered in the wrong field or duplicated in your system.

For example, for a contact, you might collect numerous different types of phone numbers:

· Phone Number

· Mobile Number

· Company Phone Number

· Fax

It’s possible that a contact’s cell number was accidentally entered into the company phone number field of a duplicate entry. Duplicate records like these would be difficult to identify unless you compared values across all of the similar fields.
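One way to cross-check similar fields is to pool every phone-like value of a record into a set and test whether two records share any number at all. The field names in `PHONE_FIELDS` are hypothetical stand-ins for the four fields listed above.

```python
# Hypothetical column names for the phone-like fields of a contact.
PHONE_FIELDS = ("phone", "mobile", "company_phone", "fax")


def phone_set(record):
    """Collect the digits of every non-empty phone-like field."""
    numbers = set()
    for field in PHONE_FIELDS:
        value = str(record.get(field) or "")
        digits = "".join(ch for ch in value if ch.isdigit())
        if digits:
            numbers.add(digits)
    return numbers


def share_any_phone(a, b) -> bool:
    """True if any phone-like field of a matches any phone-like field of b."""
    return bool(phone_set(a) & phone_set(b))


rec_a = {"name": "Jon Johnson", "mobile": "123-456-7890"}
rec_b = {"name": "Jonathan Johnson", "company_phone": "(123) 456 7890"}
rec_c = {"name": "Maria Garcia", "phone": "999 888 7777"}
print(share_any_phone(rec_a, rec_b))  # True
print(share_any_phone(rec_a, rec_c))  # False
```

A shared company phone number is expected among colleagues, so a match here is best treated as a hint to be combined with a name check.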

11. Partial Matches

This is a case of duplicate data that Excel functions like VLOOKUP would have a hard time detecting.

Consider the following scenario. Assume you have a contact from a major business, such as a university, in your CRM. Because decisions are made individually in each department, contacts in various departments should be treated differently.

Partial matching can be used to find duplicates that are similar to one another. For example, it may be used to find duplicate records for a prospect whose employer is listed in different ways:

· University of Washington

· University of Washington School of Business

· Washington University School of Business
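A rough sketch of partial matching on names like these is to compare word sets and measure how much of the shorter name is contained in the longer one. The stopword list and the function names are assumptions made for this illustration.

```python
# Small, illustrative list of words to ignore when comparing names.
STOPWORDS = {"of", "the", "and"}


def tokens(name: str) -> set:
    """Lowercase word set of a name without stopwords."""
    return {t for t in name.lower().split() if t not in STOPWORDS}


def partial_match_score(a: str, b: str) -> float:
    """Share of the shorter name's words that also appear in the other name."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


base = "University of Washington"
print(partial_match_score(base, "University of Washington School of Business"))  # 1.0
print(partial_match_score(base, "Washington University School of Business"))     # 1.0
print(partial_match_score(base, "Stanford University"))                          # 0.5
```

Because word order is ignored, a high score only marks a candidate for review. The University of Washington and Washington University, for instance, are different institutions despite sharing the same words.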

You want to make sure that when you engage with this person, you have a thorough understanding of who they are and how to approach them. This could have an impact on their lead score and prospect prioritization, as well as provide crucial context to sales teams and dictate the marketing campaigns they get.

12. Advanced Duplicate Detection with Fuzzy dedupe

For popular CRMs like HubSpot, Salesforce, Intercom, and Pipedrive, Fuzzy dedupe delivers powerful duplicate detection and smart merging.

You can utilize Fuzzy dedupe’s pre-built templates to find duplicates using a variety of field combinations, such as:

· Identical names

· Identical names, different IDs

· Same domain, same name

· Same company, same name

· Same domain, same last name

· Same phone number, same name

· And there are plenty more, including your own unique properties.

In fact, when you sign up for Fuzzy dedupe, the Customer Data Health Assessment checks your data for typical data inaccuracies and automatically tracks numerous sorts of duplication.

Fuzzy dedupe also comes with templates for “similar” or “fuzzy matching,” which are intended to help you catch more potentially duplicate records across your database. Many customer records are true duplicates, but typical matching algorithms will never detect them.

Data must be normalized before most deduplication operations can begin. This makes it easier to spot probable duplicates when using methods that look for exact-match duplicates in general.

Fuzzy dedupe, on the other hand, can detect duplicates that might otherwise go unnoticed. Earlier, when we discussed common terms expressed in different ways, we used this example:

Microsoft Inc.
Microsoft Incorporated

These values refer to the same company, but exact-match deduplication procedures would miss them.

By omitting common phrases in the values, Fuzzy dedupe can discover and match duplicates. In this situation, the common phrases would be “Inc.” and “Incorporated,” and despite the irregularities in the company naming convention, Fuzzy dedupe can match “Microsoft Inc.” and “Microsoft Incorporated.”
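The idea of ignoring common phrases can be illustrated with a few lines of Python. This is only a sketch of the general technique, not Fuzzy dedupe’s actual implementation; the `COMMON_TERMS` list and the function name `company_key` are hypothetical.

```python
import re

# Illustrative list of legal-form terms to ignore; extend as needed.
COMMON_TERMS = {"inc", "incorporated", "corp", "corporation",
                "llc", "ltd", "limited", "co", "company"}


def company_key(name: str) -> str:
    """Lowercase the name, strip punctuation, and drop common terms."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in COMMON_TERMS)


print(company_key("Microsoft Inc."))          # microsoft
print(company_key("Microsoft Incorporated"))  # microsoft
print(company_key("Microsoft Inc.") == company_key("Microsoft Incorporated"))  # True
```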

This feature isn’t just limited to corporation names. It can do the same thing with phone numbers, ignoring spaces, symbols, and formatting and comparing the digits in the field.

It’s critical to standardize your data. It’s crucial for data management and client satisfaction. However, even if the underlying data is untidy or inconsistent, firms without complete data standardization can still use Fuzzy dedupe to dedupe.

You’re also not restricted to the pre-built templates. In Fuzzy dedupe, you may construct your own duplicate detection templates utilizing any combination of fields and exact vs. similar matching.