Entity Matching using TF-IDF in Snowpark Python

During a recent POC with a customer, we were exploring ways in which we could implement a simple yet fast method of matching customer records entirely in Snowflake. This is a common requirement in MDM (Master Data Management): as new data is ingested it needs to be resolved against existing records in order to avoid duplicates.

The initial method they tried used a fuzzy matching library, in this case fuzzywuzzy, inside a Python UDTF (User Defined Table Function), which was then executed against new data. The intention was that as new records were ingested they would be run through a matching function that scored each new record against the existing set and produced a confidence score; the highest confidence score would determine the existing record it was best matched with.

The nature of this approach led to very long (comparatively speaking) execution times. In the simple scenario of trying to match 100 new rows against an existing dataset of 1 million golden records, the execution time was about 4 minutes on an XS warehouse.

And here’s why: each of the 100 new records has to be evaluated against every single one of the 1 million existing records, so essentially the matching has to be done 100 million times. As you scale up on either side (new data or golden records), that execution time is only going to increase.

Now, we could increase the warehouse size; this would give us more nodes, and with the added compute some parallelization would be introduced as part of the SQL query, which would speed things up. But this isn’t what we want: we want to run this as fast and as cost-effectively as possible.

This is where we turn to a different method: using n-grams with TF-IDF (Term Frequency–Inverse Document Frequency). By using TF-IDF vectorization from the scikit-learn library, along with cosine similarity, we can create a much faster method for performing entity matching.

Getting Started

First let’s look at the main imports we’ll need to bring into our notebook:
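
A Snowpark session (referred to as session throughout) is assumed to already exist; the list below is a representative sketch of the imports used in the rest of this post:

```python
# A Snowpark session is assumed to exist, e.g.:
# session = Session.builder.configs(connection_parameters).create()
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F
from snowflake.snowpark.types import (
    StringType, IntegerType, FloatType, VariantType,
    PandasSeriesType, PandasDataFrameType,
)

# Vectorization, similarity and serialization
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import joblib

# Standard library helpers
import io
import os
import re
import sys
```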

Good Data and Bad Data

Next, we need some customer data to experiment with — I’m including this step so that this can be easily reproduced in your own environment if needed.

Note — all the following code examples will be Snowpark Python based; however, for the data synthesis I will embed SQL into Snowpark to make life a little easier.

To get started, we need a way to create fake data - this is where we will use the Python Faker library in a Snowflake UDF:
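
A minimal sketch of such a UDF is shown below; the function name GENERATE_FAKE_CUSTOMER and the exact set of fields (name, street address, city, state, ZIP) are assumptions based on the records shown later in this post:

```python
from snowflake.snowpark.types import VariantType

# Cache the Faker instance so it is only built once per Python process
_fake = None

def generate_fake_customer() -> dict:
    global _fake
    if _fake is None:
        # Imported inside the function so it resolves in the UDF's Python sandbox
        from faker import Faker
        _fake = Faker()
    return {
        "FULL_NAME": _fake.name(),
        "STREET_ADDRESS": _fake.street_address(),
        "CITY": _fake.city(),
        "STATE": _fake.state_abbr(),
        "ZIP": _fake.zipcode(),
    }

# Register as a temporary UDF; faker is available in the Snowflake Anaconda channel
session.udf.register(
    func=generate_fake_customer,
    name="GENERATE_FAKE_CUSTOMER",
    return_type=VariantType(),
    input_types=[],
    packages=["faker"],
    replace=True,
)
```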

With this now registered, we can create a table of fake data:
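
For example, something along these lines; the table name CUSTOMER_DATA is an assumption:

```python
# Generate 1 million fake customer rows using the UDF sketched above
session.sql("""
    CREATE OR REPLACE TABLE CUSTOMER_DATA AS
    SELECT
        fake:FULL_NAME::string      AS FULL_NAME,
        fake:STREET_ADDRESS::string AS STREET_ADDRESS,
        fake:CITY::string           AS CITY,
        fake:STATE::string          AS STATE,
        fake:ZIP::string            AS ZIP
    FROM (
        SELECT GENERATE_FAKE_CUSTOMER() AS fake
        FROM TABLE(GENERATOR(ROWCOUNT => 1000000))
    )
""").collect()
```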

And finally, we take a small subset of this data and apply some random modifications to it in order to create some partially matchable data:
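
A sketch of how that might look; the table name NEW_DATA and the specific corruptions (abbreviated first names, upper-casing, dropped city/ZIP values) are assumptions modelled on the examples further down:

```python
# Take a 100-row sample and randomly mangle it to create the "messy" data
session.sql("""
    CREATE OR REPLACE TABLE NEW_DATA AS
    SELECT
        -- randomly abbreviate the first name or upper-case the whole name
        CASE UNIFORM(1, 3, RANDOM())
            WHEN 1 THEN LEFT(FULL_NAME, 1) || ' ' || SPLIT_PART(FULL_NAME, ' ', 2)
            WHEN 2 THEN UPPER(FULL_NAME)
            ELSE FULL_NAME
        END AS FULL_NAME,
        STREET_ADDRESS,
        -- randomly drop the city and/or ZIP
        IFF(UNIFORM(1, 2, RANDOM()) = 1, NULL, CITY) AS CITY,
        STATE,
        IFF(UNIFORM(1, 2, RANDOM()) = 1, NULL, ZIP)  AS ZIP
    FROM CUSTOMER_DATA SAMPLE (100 ROWS)
""").collect()
```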

We now have two sets of data: some nice clean data, and new data that has different casing, abbreviated words and missing values:

Main Data — 1 million rows

A nice set of clean data

New Data — 100 rows

Messy data — missing values, abbreviated words, different cases

So our new data will be a partial match to the existing data, but not 100%. The process we’re about to walk through will provide us with the index of the highest match score, and at that point we can determine what percentage threshold we find acceptable.

As we move forward, for the sake of simplicity we’ll be using a single column in both tables in which the data is concatenated, i.e.:
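
A sketch of that concatenation in Snowpark, using the names assumed so far; MATCH_STRING is a hypothetical name for the combined column, and COALESCE guards against the NULLs we introduced in the messy data:

```python
from snowflake.snowpark.functions import coalesce, col, concat_ws, lit

def with_match_string(df):
    # Build a single concatenated matching string per row
    return df.with_column(
        "MATCH_STRING",
        concat_ws(
            lit(" "),
            coalesce(col("FULL_NAME"), lit("")),
            coalesce(col("STREET_ADDRESS"), lit("")),
            coalesce(col("CITY"), lit("")),
            coalesce(col("STATE"), lit("")),
            coalesce(col("ZIP"), lit("")),
        ),
    )

main_df = with_match_string(session.table("CUSTOMER_DATA"))
new_df = with_match_string(session.table("NEW_DATA"))
```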

N-grams

The first step of our data processing is to convert our matching string into n-grams, whose occurrences the TF-IDF vectorizer will then assess. During initial research and testing I stumbled over an interesting article about using 3-letter n-grams and their impact on accuracy; testing certainly demonstrated the benefit of this approach, so we can take this method and put it into a Python UDF like so:
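
A sketch of that UDF follows; the exact text cleaning is an assumption (uppercasing and stripping anything that isn’t a letter or digit), while the 3-character sliding window follows the referenced article:

```python
import re
from snowflake.snowpark.types import StringType

def ngrams(text: str) -> str:
    # Normalize: uppercase and keep only letters and digits,
    # then slide a 3-character window across the string
    n = 3
    cleaned = re.sub(r"[^A-Za-z0-9]+", "", (text or "").upper())
    grams = [cleaned[i : i + n] for i in range(len(cleaned) - n + 1)]
    return " ".join(grams)

session.udf.register(
    func=ngrams,
    name="NGRAMS",
    return_type=StringType(),
    input_types=[StringType()],
    replace=True,
)
```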

Running this against the full name and address of an entity would yield this:
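
Using the sketch above on one of the example records, the output would look something like this (truncated):

```python
ngrams("Daniel Conway 385 Brian Harbor Karenview ME 76487")
# 'DAN ANI NIE IEL ELC LCO CON ONW NWA WAY AY3 Y38 385 85B 5BR BRI ...'
```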

We’ll use this function on our main data, and on the data we want to match with.
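
For example, sticking with the assumed names, the main data can be persisted with its n-gram value, while the new data will get the same treatment at query time:

```python
from snowflake.snowpark.functions import call_udf, col

# main_df is the dataframe with the MATCH_STRING column built earlier
main_df.with_column("NGRAMS", call_udf("NGRAMS", col("MATCH_STRING"))) \
       .write.save_as_table("CUSTOMER_DATA_NGRAMS", mode="overwrite")
```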

Building a TF-IDF Matrix

With our main data transformed by our n-gram function, we can now use it to build our TF-IDF matrix. In this example we’ll be using the scikit-learn library.

We want this to run entirely in a Snowflake Stored Procedure, so our code will be deployed and then executed from there. As part of this process, we’ll also be creating an index table. This is needed because when we run our cosine similarity method we’ll want to pull the index of the record that has the highest match and then reference it back to the main data; if the underlying data were changed after our TF-IDF matrix was created there would be a potential mismatch, so part of the matrix creation is storing an index value.
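
Below is a sketch of what that procedure could look like; the procedure, table and stage names (BUILD_TFIDF, TFIDF_INDEX, @TFIDF_MODELS) are assumptions, as is the use of put_stream to upload the serialized artifacts straight from memory:

```python
import io
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from snowflake.snowpark import Session

def build_tfidf(session: Session, source_table: str, stage: str) -> str:
    # Pull the pre-computed n-gram strings and assign a stable row index
    df = session.table(source_table).select("NGRAMS").to_pandas()
    df["ROW_INDEX"] = range(len(df))
    session.write_pandas(df, "TFIDF_INDEX", auto_create_table=True, overwrite=True)

    # Fit the vectorizer on every n-gram string and build the TF-IDF matrix
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(df["NGRAMS"])

    # Serialize both artifacts and stream them to the stage from memory,
    # avoiding the 500MB /tmp limit mentioned in the note below
    for file_name, obj in [("vectorizer.joblib", vectorizer),
                           ("tfidf_matrix.joblib", matrix)]:
        buffer = io.BytesIO()
        joblib.dump(obj, buffer)
        buffer.seek(0)
        session.file.put_stream(buffer, f"{stage}/{file_name}",
                                auto_compress=False, overwrite=True)
    return "SUCCESS"

# The stage holding the artifacts is assumed to exist already
session.sql("CREATE STAGE IF NOT EXISTS TFIDF_MODELS").collect()

session.sproc.register(
    func=build_tfidf,
    name="BUILD_TFIDF",
    packages=["snowflake-snowpark-python", "scikit-learn", "pandas", "joblib"],
    replace=True,
)
```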

Note — we do not use the session.file.put method to save the files to stage, because the /tmp directory is limited to 500MB on a Standard warehouse and our matrix is just a little larger, coming in at 578,889,984 bytes.

With this registered we can call this procedure — doing so took 1m 42s for 1 million rows. On an XS warehouse…pretty impressive!
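
Using the assumed names from the sketch above, the call looks like this:

```python
# Fit the vectorizer and build the matrix over the 1 million n-gram strings
session.call("BUILD_TFIDF", "CUSTOMER_DATA_NGRAMS", "@TFIDF_MODELS")
```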

Let’s also take a look at our Index table and see the converted n-gram value and assigned index — we’ll use this next:
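
For example, using the assumed table name:

```python
# Inspect a few rows of the index table created by the procedure
session.table("TFIDF_INDEX").limit(5).show()
```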

With our model artifacts ready, we can now create a vectorized UDF for batch inference. In this case we’ll load the vectorizer and TF-IDF matrix into memory, and then run cosine similarity against the query (the string we want to match) and get an index value back; this will then be referenced against the index table to provide us with the highest match.

The use of a vectorized UDF here means we can process multiple rows in a single invocation of the function, which will greatly improve performance:
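
Here is a sketch of such a UDF. The names, the choice to return the best index and score together as a single VARIANT value, and the use of staged imports to make the artifacts available inside the UDF are all assumptions rather than the original implementation:

```python
import os
import sys
import joblib
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from snowflake.snowpark.types import (
    PandasDataFrameType, PandasSeriesType, StringType, VariantType,
)

_artifacts = {}

def _load(file_name):
    # Lazily load a staged artifact once per Python process and cache it
    if file_name not in _artifacts:
        import_dir = sys._xoptions.get("snowflake_import_directory")
        _artifacts[file_name] = joblib.load(os.path.join(import_dir, file_name))
    return _artifacts[file_name]

def best_match(batch: pd.DataFrame) -> pd.Series:
    vectorizer = _load("vectorizer.joblib")
    matrix = _load("tfidf_matrix.joblib")

    # Vectorize the incoming n-gram strings and score them against every row
    query_vectors = vectorizer.transform(batch[0].fillna(""))
    similarities = cosine_similarity(query_vectors, matrix)

    # Return [best matching row index, best score] for each input row
    return pd.Series(
        [[int(row.argmax()), float(row.max())] for row in similarities]
    )

session.udf.register(
    func=best_match,
    name="TFIDF_MATCH",
    return_type=PandasSeriesType(VariantType()),
    input_types=[PandasDataFrameType([StringType()])],
    imports=["@TFIDF_MODELS/vectorizer.joblib", "@TFIDF_MODELS/tfidf_matrix.joblib"],
    packages=["scikit-learn", "pandas", "joblib"],
    max_batch_size=100,
    replace=True,
)
```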

Now that the UDF is available, we can query the NEW_DATA table and use our n-gram and matching UDFs to give us the index of the existing record that is the highest match, along with the matching score:
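
A sketch of that query, continuing with the assumed names (new_df is the dataframe with the concatenated MATCH_STRING built earlier):

```python
from snowflake.snowpark.functions import call_udf, col
from snowflake.snowpark.types import FloatType, IntegerType

results = (
    new_df
    .with_column("NGRAMS", call_udf("NGRAMS", col("MATCH_STRING")))
    .with_column("MATCH", call_udf("TFIDF_MATCH", col("NGRAMS")))
    .select(
        col("MATCH_STRING"),
        col("MATCH")[0].cast(IntegerType()).alias("MATCH_INDEX"),
        col("MATCH")[1].cast(FloatType()).alias("MATCH_SCORE"),
    )
)
results.show()
```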

The total runtime for this was…8.33s, compared to around 4 minutes with the previous approach. That’s roughly 28 times faster, and again using an XS warehouse!

So now we have the new data (i.e. bad data) and the matched index from our index table, along with the confidence score:

Joining this data to our index table, and then to our original data set, we can see the new data, the matched record, and the confidence score:
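
A sketch of that join, again using the assumed names:

```python
index_df = session.table("TFIDF_INDEX")
main_ngrams_df = session.table("CUSTOMER_DATA_NGRAMS")

# results is the dataframe produced by the matching query above
matched = (
    results
    .join(index_df, results["MATCH_INDEX"] == index_df["ROW_INDEX"])
    .join(main_ngrams_df, index_df["NGRAMS"] == main_ngrams_df["NGRAMS"])
    .select(
        results["MATCH_STRING"].alias("NEW_DATA"),
        main_ngrams_df["MATCH_STRING"].alias("MATCHED_DATA"),
        results["MATCH_SCORE"].alias("SCORE"),
    )
)
matched.show()
```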

Result:

For better clarity here are some examples:

  • New Data: D Conway 385 Brian Harbor ME
  • Matched Data: Daniel Conway 385 Brian Harbor Karenview ME 76487
  • Score: 0.73564

So with an abbreviated first name, missing city and ZIP — we get a confidence score of 73.5%

  • New Data: MICHAEL HENDRIX 637 Beth Streets RI 41315
  • Matched Data: Michael Hendrix 637 Beth Streets Schmidtview RI 41315
  • Score: 0.80345

In this example, we have a capitalization mismatch on name, and we’re missing the city. This is giving us a score of 80.3%

Considerations

  1. In this demo we built our matrix on 1 million rows, which yielded a matrix size between 500–600MB. For larger data volumes you will likely need to consider using Snowpark-optimized warehouses in order to handle the bigger matrices produced by the fitting process; standard warehouses have a memory limit of 3.5GB, whereas Snowpark-optimized warehouses offer considerably more.
  2. You might be able to improve performance by tweaking the max_batch_size parameter in the vectorized UDF. Obviously moving up warehouse size will also improve speed.
  3. It would be worth investigating a nearest neighbor technique instead of cosine similarity to see whether it offers better matching accuracy and/or performance. I suspect this might be the case, so something to explore in another post.

Summary

With this technique we’ve seen that we’re able to perform a much more performant matching process than using string matching functions that evaluate every combination of new and existing records.

This was intended to be a simple example. In a real-world scenario there would likely be more complexity, where we might match against individual fields; some values probably carry more weight towards a match than others, so for higher accuracy you may consider building TF-IDF matrices for individual fields (or groups of fields).

Finally — shout out to Hex which I used to create this demo, an awesome notebook environment I encourage you to check out.

References

  1. https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
  2. https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html

Updated 11/03/22 — Added the necessary Python imports
