Dealing with Duplicates

John Smith. J Smith. Smith, John. How do you tell whether this John is the same as that John?

Thilina Rajapakse
Skil-AI

Photo by Ivy Barn on Unsplash

Have you ever searched for a contact on your phone and come up with several duplicate or near-duplicate entries? (I’m fairly certain this isn’t just me!) For me, this tends to happen when I forget that I already have a particular contact saved and create a new one for a new number. Duplicated contacts on a phone are a fairly minor annoyance and, despite my crappy memory, a fairly infrequent one at that.

However, for companies and organizations with huge databases of client information maintained by many different people, it is quite common to have multiple entries for the same entity with tiny variations in the data. These variations could include, for example, misspelt names, addresses written differently, use of special characters, and abbreviated/non-abbreviated names. Such duplicate entries can bloat databases, which in turn can slow down entire systems. They can also make proper data analysis difficult and can even produce misleading results.

Duplicate entries in this context are multiple entries created by mistake for the same entity. The entries are conceptually duplicates, but their raw content differs, so an exact comparison will not catch them.
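To make the distinction concrete, here is a minimal Python sketch using the three names from the subtitle. It is only an illustration, not the approach this article goes on to use: the `normalize` and `similarity` helpers are hypothetical names introduced here, and the character-level score comes from the standard library's `difflib.SequenceMatcher`. Treat any cut-off you apply to the score as an assumption to tune for your own data.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and sort the tokens so word order is ignored."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


names = ["John Smith", "J Smith", "Smith, John"]

# An exact comparison treats every raw string as a different entry.
print("distinct raw strings:", len(set(names)))

# A normalized, fuzzy comparison shows how close the entries really are.
reference = names[0]
for other in names[1:] + ["Jane Doe"]:
    print(f"{reference!r} vs {other!r}: {similarity(reference, other):.2f}")
```

"Smith, John" collapses to the same normalized string as "John Smith", while "J Smith" scores high without matching exactly, and an unrelated name like "Jane Doe" scores low. Real client records have more fields and messier variations than this, so a simple score like this one is only a starting point.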

Thilina Rajapakse
Skil-AI

AI researcher, avid reader, fantasy and Sci-Fi geek, and fan of the Oxford comma. www.linkedin.com/in/t-rajapakse/