Open Source Fuzzy-Matcher: Finding Data Similarities in Records

Manish Bhatia
Published in Intuit Engineering
5 min read · Aug 2, 2023

This article is co-authored by Manish Bhatia, Principal Software Engineer, and Jason Zesheng Chen, Technical Writer

Computers are pretty meticulous. They are incredible experts at spotting minute differences. In dealing with messy real-life data, however, this level of attention to detail can sometimes do us a disservice. If you find yourself combing through your data to identify Avenue, Ave., and Av., you might want to check out Fuzzy-Matcher, an open-source Java-based library to match and group similar elements in a collection of data.

At Intuit, we strive to empower smooth and efficient data management. Business data often involve duplicate contacts, misspelled names, similar numeric values, or inconsistent abbreviations. In computer science and data analysis, fuzzy matching refers to techniques that identify and match similar or approximate strings of text or data. It is used when exact matching is not possible or practical due to variations in spelling, punctuation, formatting, or other factors.

To learn more about fuzzy matching, its origins, and typical use cases, check out this excellent historical overview.

To implement fuzzy matching techniques in Java, we wrote and open-sourced the Fuzzy-Matcher library. In this article, we’ll show how Fuzzy-Matcher can determine whether items are similar in characters (individually and in sequence), sound, or value range, among other criteria, and how it produces a similarity score that measures how close the data points are.

Fuzzy-Matcher Logo

What can be fuzzy, and how it is matched

To appreciate Fuzzy-Matcher in action, let’s consider a few ways data elements can be fuzzy. As a simplified example, imagine you run a small business and your bookkeeping records have entries that look like this:

NOTE: the names, emails, addresses, and transaction amounts in this table are entirely fictional. Any resemblance to real persons, emails, and physical addresses is purely coincidental.

By design, computers will quickly spot that these are indeed separate entries. A human, on the other hand, might notice that some of these entries could belong to the same person, despite not being exact matches.

If you’re trying to merge the records or link them to one another, it is crucial to have a way to measure their similarity. Here are some possibilities for how Fuzzy-Matcher might identify them.

(1). Some of the names might be from the same person: “Wayne Grace Jr.” and “Grace, Hilton Wayne”.

With a simple tokenization process, each word is treated as a token, and two elements are scored on the number of tokens they share. In this example, Wayne and Grace appear in both names, so 2 of the 3 words in each element match. The scoring mechanism matches them with a result of 0.67.
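The word-token scoring described above can be sketched in a few lines of Java. This is a minimal illustration, not Fuzzy-Matcher's internal code: the class name, the punctuation stripping, and the choice to divide by the larger token count are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of word-token scoring; not the library's implementation.
final class WordTokenScoreSketch {

    // Lowercase, replace punctuation with spaces, and split on whitespace:
    // each remaining word becomes one token.
    static Set<String> tokens(String value) {
        String cleaned = value.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9\\s]", " ")
                .trim();
        return new HashSet<>(Arrays.asList(cleaned.split("\\s+")));
    }

    // Assumed scoring rule: shared tokens divided by the larger token count.
    static double score(String left, String right) {
        Set<String> a = tokens(left);
        Set<String> b = tokens(right);
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / Math.max(a.size(), b.size());
    }

    public static void main(String[] args) {
        // "wayne" and "grace" are shared: 2 of 3 tokens.
        System.out.println(score("Wayne Grace Jr.", "Grace, Hilton Wayne"));
    }
}
```

Because tokens are collected into a set, word order and punctuation do not affect the result, which is why “Grace, Hilton Wayne” still lines up with “Wayne Grace Jr.”.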

(2). Some of the names sound very similar, so they might just be typos or spelling variations of the same name: “Steven Wilson” and “Stephen Wilkson”.

Using the Soundex Indexing System, which encodes names according to how they sound, Steven and Stephen both encode to S315, and Wilson and Wilkson both encode to W425. The two elements therefore match exactly, with a score of 1.0.
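To see where those codes come from, here is a small, self-contained sketch of the standard American Soundex rules in Java. It is written for illustration only; the class name is hypothetical, and how Fuzzy-Matcher computes the encoding internally is not shown in this article.

```java
import java.util.Locale;

// Hypothetical sketch of American Soundex; not the library's implementation.
final class SoundexSketch {

    // Soundex digit for each letter A..Z.
    // '0' marks vowels and Y (they break a run of equal digits);
    // '-' marks H and W (they are skipped without breaking a run).
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        // The first letter is kept as-is; its digit still suppresses a repeat.
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char previous = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char code = CODES.charAt(letters.charAt(i) - 'A');
            if (code == '-') {
                continue; // H and W: ignore, keep the previous digit
            }
            if (code != '0' && code != previous) {
                out.append(code); // new consonant sound
            }
            previous = code; // a vowel resets the run
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Steven", "Stephen", "Wilson", "Wilkson"}) {
            System.out.println(name + " -> " + encode(name));
        }
    }
}
```

In “Stephen”, the P takes the same digit as the V in “Steven” and the H is skipped, so both reduce to S315. In “Wilkson”, K and S share a digit and collapse into one, giving W425, the same as “Wilson”.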

(3). The same person puts down two different email addresses: “james_parker@yahoo.com” and “parker.james@gmail.com”.

In this case, we use N-grams to identify similar entries. For example, the part of each email address before the @ sign can be split into trigrams.

  • parker.james -> [par, ark, rke, ker, er., r.j, .ja, jam, ame, mes]
  • james_parker -> [jam, ame, mes, es_, s_p, _pa, par, ark, rke, ker]

Comparing these N-grams, 7 of the 10 tokens match exactly, which gives a score of 0.7.
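The trigram comparison can be reproduced with a short Java sketch. The class and method names are hypothetical, and two choices are assumptions made to mirror the lists above: the domain is dropped before tokenizing, and the score is the number of shared trigrams divided by the larger trigram count.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of N-gram scoring; not the library's implementation.
final class TrigramScoreSketch {

    // Slide a window of length n across the value; short values become one token.
    static Set<String> ngrams(String value, int n) {
        Set<String> grams = new LinkedHashSet<>();
        if (value.length() <= n) {
            grams.add(value);
            return grams;
        }
        for (int i = 0; i + n <= value.length(); i++) {
            grams.add(value.substring(i, i + n));
        }
        return grams;
    }

    // Assumption: only the part before '@' is compared.
    static String localPart(String email) {
        int at = email.indexOf('@');
        String local = at < 0 ? email : email.substring(0, at);
        return local.toLowerCase(Locale.ROOT);
    }

    // Assumed scoring rule: shared trigrams divided by the larger trigram count.
    static double score(String leftEmail, String rightEmail) {
        Set<String> a = ngrams(localPart(leftEmail), 3);
        Set<String> b = ngrams(localPart(rightEmail), 3);
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / Math.max(a.size(), b.size());
    }

    public static void main(String[] args) {
        System.out.println(ngrams(localPart("parker.james@gmail.com"), 3));
        System.out.println(ngrams(localPart("james_parker@yahoo.com"), 3));
        System.out.println(score("james_parker@yahoo.com", "parker.james@gmail.com"));
    }
}
```

The seven shared trigrams are par, ark, rke, ker, jam, ame, and mes. Only the trigrams that span the separator character differ between the two addresses.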

(4). Two different entries record the same transaction, with the only difference being the rounding convention (e.g., how many digits to keep after the decimal point): “89.00” and “89.17”.

In this case, the match is based not on tokens being equal but on how close the values are (the neighborhood range in which they appear). This closeness is configurable; with a 99% closeness setting, the two values match with a score of 1.0.
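One way to picture a closeness threshold is as a ratio test between the two values. The sketch below is an assumption made for illustration: it treats two numbers as neighbors when the smaller magnitude is at least 99% of the larger one. The library's actual neighborhood calculation may differ, and the class and method names here are hypothetical.

```java
// Hypothetical sketch of range-based matching; the library's exact
// neighborhood calculation may differ.
final class NearestNeighborSketch {

    // Assumed rule: values are neighbors when they have the same sign and
    // the smaller magnitude is at least `closeness` times the larger one.
    static boolean isNeighbor(double a, double b, double closeness) {
        double smaller = Math.min(Math.abs(a), Math.abs(b));
        double larger = Math.max(Math.abs(a), Math.abs(b));
        if (larger == 0.0) {
            return true; // both values are zero
        }
        boolean sameSign = Math.signum(a) * Math.signum(b) >= 0;
        return sameSign && smaller / larger >= closeness;
    }

    // Neighbors score 1.0; everything else scores 0.0.
    static double score(double a, double b, double closeness) {
        return isNeighbor(a, b, closeness) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // 89.00 / 89.17 is roughly 0.998, which clears a 0.99 threshold.
        System.out.println(score(89.00, 89.17, 0.99));
        // 89.00 / 95.00 is roughly 0.937, which does not.
        System.out.println(score(89.00, 95.00, 0.99));
    }
}
```

Under this rule the match is all-or-nothing: a value either falls inside the neighborhood and scores 1.0, or it falls outside and does not match at all.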

How Fuzzy-Matcher works

To achieve the above, Fuzzy-Matcher matches and scores your data. It accepts data as a list of entities called documents, each of which can contain one or more elements (such as names, addresses, or emails). Internally, each element is further broken down into one or more tokens, which are then matched using a configurable match type.

This diagram provides a high-level overview of the process:

The architecture of Fuzzy-Matcher

Fuzzy-Matcher comes predefined with a list of `element types` with sensible defaults that can be reconfigured later by the user.

Fuzzy-Matcher supplies three kinds of match services:

1. Match a list of documents: useful for checking for potential duplicates in an existing list of documents.

matchService.applyMatchByDocId(List<Document> documents)

2. Match a list of documents with an existing list: useful for matching a new list of documents with an existing list in your system, for example, checking a bulk import against existing data.

matchService.applyMatchByDocId(List<Document> documents, List<Document> matchWith)

3. Match a document with an existing list: useful when creating a new document, to ensure that a similar one does not already exist.

matchService.applyMatchByDocId(Document document, List<Document> matchWith)

Why not try it out yourself? Fuzzy-Matcher is published on Maven Central. To use it, simply add the following to your pom.xml file.

<dependency>
    <groupId>com.intuit.fuzzymatcher</groupId>
    <artifactId>fuzzy-matcher</artifactId>
    <version>1.2.0</version>
</dependency>

Alternatively, clone the Fuzzy-Matcher GitHub repository and run this command to build the library and install it into your local Maven repository.

mvn clean install

To recap, in this article we’ve shown you possible scenarios where fuzzy matching can be useful, various ways Fuzzy-Matcher can help, and steps you’ll need to take to start using it.

Learn more and get started!

If you are interested in learning more, check out the YouTube video below, in which Intuit’s Lucy Shen and Manish Bhatia give a 10-minute tour of Fuzzy-Matcher’s use cases, architecture, and installation. Or, head straight to Fuzzy-Matcher’s GitHub repository for more in-depth documentation and to learn how you can contribute to this open-source project!

To find out more about open source at Intuit, or just to learn about open source history, trivia, and best practices, visit Intuit Open Source on LinkedIn!
