Unifying Entities: Leveraging Name Matching Algorithms and AI/ML for Data Cleaning

Published in Trademo-engg · Jun 20, 2023

In data science, collecting and curating insightful data is crucial for companies like Trademo.com. A common challenge for data scientists is unclean data, and shipment data is a prime example: the names of buyers and suppliers are often misspelled or inconsistent across sources. In this post, we explore how Trademo.com tackles the issue by applying name matching algorithms and using PaLM 2, a large language model, for entity deduplication.

The Challenge of Inconsistent and Misspelled Names: In shipment data obtained from various countries, roughly 80% of buyer and supplier names are correct, while the remaining 20% are incorrect or misspelled. This inconsistency makes the data hard to use effectively, because one real-world company can appear under several different spellings. Trademo.com therefore combines name matching algorithms with PaLM 2 to ensure the accuracy and reliability of the collected information.
Phonetic Algorithms: Trademo.com uses phonetic algorithms to transform names into phonetic representations. Algorithms such as Soundex, Metaphone, and Double Metaphone encode a name by how it sounds rather than how it is spelled, so that similar-sounding names receive the same or similar codes. Comparing these codes helps identify matches despite misspellings or variations in naming conventions.
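
To make the idea concrete, here is a minimal sketch of American Soundex, one of the algorithms named above. It is an illustration only, not Trademo.com's implementation; the helper `phonetic_key`, which encodes each word of a multi-word name separately, and the sample company names are assumptions made for this example.

```python
# Soundex digit for each consonant group; vowels, Y, H and W have no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]                  # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")    # digit of the previous letter
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code               # new consonant sound
        prev = code                      # a vowel resets prev to ""
    return (result + "000")[:4]          # pad or truncate to 4 characters


def phonetic_key(name: str) -> str:
    """Hypothetical helper: Soundex of every word, joined by spaces."""
    return " ".join(soundex(token) for token in name.split())


print(soundex("Robert"), soundex("Rupert"))   # both encode to R163
print(phonetic_key("Acme Traders") == phonetic_key("Akme Tradars"))
```

Names that share a phonetic key are only candidates for merging. Soundex is coarse, so unrelated names can collide, which is one reason to combine it with the other techniques described here.
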
Edit Distance Algorithms: Edit distance algorithms are crucial in Trademo.com’s data cleaning process. Levenshtein distance, for example, measures how different two names are by counting the minimum number of single-character operations (insertions, deletions, substitutions) required to transform one string into the other; the fewer the operations, the more similar the names. By setting a similarity threshold, Trademo.com can identify and merge names that differ only by minor variations or typos.
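
As an illustration of the edit distance approach, the sketch below computes Levenshtein distance with the standard dynamic-programming recurrence and converts it to a similarity score between 0 and 1. The normalization by the longer name's length, the function names, and the 0.85 threshold are assumptions chosen for this example, not values taken from Trademo.com's pipeline.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical after case folding."""
    a, b = a.casefold().strip(), b.casefold().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Hypothetical decision rule: treat names above the threshold as the same."""
    return similarity(a, b) >= threshold


print(levenshtein("kitten", "sitting"))                   # 3
print(similarity("Acme Traders Ltd", "Acme Tradres Ltd")) # 0.875
```

The threshold trades precision for recall: a lower value merges more variants but also risks merging distinct companies with similar names.
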
Vector Space Algorithms: Trademo.com uses vector space algorithms to represent names as vectors in a high-dimensional space and applies cosine similarity to measure how close two name vectors are. Names whose vectors lie close to each other are considered potential matches. With this approach, Trademo.com can group similar entities effectively despite variations or misspellings.
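
The post does not say how names are turned into vectors, so the following sketch assumes one common choice: counting overlapping character trigrams. The cosine similarity formula itself is standard; the function names and sample company names are hypothetical.

```python
import math
from collections import Counter


def ngram_vector(name: str, n: int = 3) -> Counter:
    """Sparse vector of character n-gram counts (assumed representation)."""
    # Normalize case and whitespace, and pad so word edges form n-grams too.
    text = " " + " ".join(name.casefold().split()) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())  # missing grams count 0
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


reference = ngram_vector("Global Shipping Co")
print(cosine_similarity(reference, ngram_vector("Global Shiping Co")))
print(cosine_similarity(reference, ngram_vector("Zenith Textiles")))
```

In this sketch the misspelled variant scores much higher than the unrelated name, because a single typo changes only a few trigrams while the rest of the vector stays the same.
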
PaLM 2 for Entity Deduplication: Besides name matching algorithms, Trademo.com incorporates PaLM 2 for entity deduplication. PaLM 2 (the second generation of the Pathways Language Model) is a large language model developed by Google that understands and generates human-like text. Trademo.com uses PaLM 2 to identify and merge duplicate entities within the shipment data. By processing and comparing textual information, PaLM 2 improves the accuracy of entity deduplication and reduces the risk of duplicate entries in the data.
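
The post does not describe how PaLM 2 is prompted or called, so the sketch below is purely illustrative. It assumes that a pair of candidate names is put into a yes/no prompt and that the model is reached through a caller-supplied function, here the hypothetical `ask_model`, which takes a prompt string and returns the model's text reply. No real PaLM 2 API call is shown, and the prompt wording is invented for this example.

```python
from typing import Callable


def build_prompt(name_a: str, name_b: str) -> str:
    """Hypothetical prompt asking whether two names denote one company."""
    return (
        "Do the following two company names refer to the same legal entity?\n"
        f"Name A: {name_a}\n"
        f"Name B: {name_b}\n"
        "Answer with exactly one word: YES or NO."
    )


def same_entity(name_a: str, name_b: str,
                ask_model: Callable[[str], str]) -> bool:
    """Return True if the model's reply starts with YES.

    ask_model is an assumed wrapper around whatever LLM client is in use;
    it is injected so the logic can be tested without network access.
    """
    reply = ask_model(build_prompt(name_a, name_b))
    return reply.strip().upper().startswith("YES")


# Stub standing in for a real model call, for demonstration only.
def fake_model(prompt: str) -> str:
    return "YES" if "Intl" in prompt and "International" in prompt else "NO"


print(same_entity("Acme Intl Ltd", "Acme International Limited", fake_model))
```

Passing the model in as a function keeps the decision logic independent of any particular client library, and makes it easy to send only ambiguous pairs to the language model.
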
Continuous Learning and Improvement: Trademo.com recognizes that improving data quality is an ongoing process. As new shipment data is collected and processed, the name matching pipeline, including its use of PaLM 2, is continually adapted. Feedback loops are built into the system to validate matches and refine the algorithms over time. This iterative process keeps improving data cleaning and entity deduplication, raising the overall quality of buyer and supplier data.

Cleaning dirty data is an essential step in data science, and Trademo.com addresses this challenge by combining name matching algorithms with PaLM 2 for entity deduplication. Using phonetic algorithms, edit distance algorithms, vector space algorithms, and PaLM 2, Trademo.com unifies entities in shipment data and provides accurate, reliable insights about buyers and suppliers. By continuously refining these algorithms, Trademo.com maintains a high level of data quality, empowering businesses to make informed decisions based on clean and unified data.

Written by Rupesh Dubey