Similarity Matching using Snowflake Cortex (vector embedding)

Aswinee Rath
6 min read · May 10, 2024


Requirement

Finding duplicate names or similar entities is a common requirement in any data analytics project.

[Image: Finding similar or duplicate rows in customer records]

Here are a few use cases I have come across where we had to use similarity-matching algorithms:

  • Duplicate names and addresses in our customer database
  • Finding similar product names or descriptions in product catalogues
  • Finding duplicate or similar parts and descriptions in the manufacturing industry

We can use simple text-matching algorithms for small datasets of a few hundred rows, and most analytics tools provide basic mechanisms to identify duplicates. However, in real life, we usually need to compare our datasets against thousands to millions of records. That’s when we need programmatic algorithms for speed and efficiency.
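
For context, the naive exact-match approach looks something like this (written against the customer_master_data table we build later in this post). It is fast, but it only catches rows that match character for character, so "Jon Smith" versus "John Smith" slips through:

-- naive exact-duplicate check: only finds identical strings
select name, address, count(*) as copies
from customer_master_data
group by name, address
having count(*) > 1;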

Python scripts have become very handy for these complex similarity-matching projects. I have been using the techniques described in my colleague's blog “Entity Matching using TF-IDF and Ngrams.” The challenge was that we had to write a script for every project, since each dataset needed its own model, and my customers had to update those models constantly as new data arrived. Using vector databases and RAG techniques is another popular method, but it requires building infrastructure to manage, or connect to, a cloud-based vector database to vectorize the data.

Snowflake Cortex Solution

Luckily, I work for Snowflake, and all my customers use the most powerful data platform on this earth, Snowflake ❄️❄️❄️

Snowflake introduced Cortex functions, which bring LLMs and ML functions into the platform to handle complex algorithms. Cortex includes vector embedding functions to vectorize data. Many blogs, like this one, give deeper insight into vector embeddings, and Snowflake has published a few Quickstart guides on using vector embeddings and other Cortex functionality.

I converted my similarity or entity-matching projects using Snowflake Cortex. It was amazingly simple and provided much faster results. The best part was that I didn't have to build any complex Python models and constantly manage them. Let's walk through an example to see how simple it is.

Before we dive in, here are the functions available in Snowflake for working with vector embeddings.

EMBED_TEXT_768( <model>, <text> ) -- returns a VECTOR embedding of the text

VECTOR_INNER_PRODUCT( <vector>, <vector> ) -- inner (dot) product of two vectors

VECTOR_L2_DISTANCE( <vector>, <vector> ) -- Euclidean distance between two
vectors: 0 for identical vectors, growing toward infinity as they get farther apart

VECTOR_COSINE_SIMILARITY( <vector>, <vector> ) -- cosine of the angle between two
vectors: 1 for identical, 0 for orthogonal, -1 for opposite vectors

VECTOR( {INT | FLOAT}, <dimension> ) -- vector data type to persist vectors
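
As a quick sanity check, here is a minimal sketch that embeds two made-up name-and-address strings and scores their similarity (the strings are arbitrary; any two short texts work):

-- near-duplicates should score close to 1
SELECT VECTOR_COSINE_SIMILARITY(
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'John Smith 12 Main St Dallas TX 75201'),
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'Jon Smith 12 Main Street Dallas TX 75201')
) AS score;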

An example

In this example, we will create a million customer records, including their name and address. We will compare this data against new data to identify duplicates or similar records.

Let's build a large sample of addresses and compare it against new data for our matching exercise. I will use Snowflake SQL to build this demo. I assume you have created a test database where you can create this data, with the privileges needed to create objects that use the VECTOR data type.

DATA SETUP

Let's create some sample data as our master data. I will use a Python UDF built on the Faker package to generate the sample rows.

CREATE OR REPLACE FUNCTION py_faker(locale STRING, provider STRING, parameters VARIANT)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('faker', 'simplejson')
HANDLER = 'fake'
AS
$$
from faker import Faker
import simplejson as json

def fake(locale, provider, parameters):
    # A SQL NULL arrives as None or a sqlNullWrapper; treat either as "no parameters"
    if parameters is None or type(parameters).__name__ == 'sqlNullWrapper':
        parameters = {}
    fake = Faker(locale=locale)
    # Ask Faker for the requested provider and return it as a JSON-safe value
    return json.loads(json.dumps(fake.format(formatter=provider, **parameters), default=str))
$$;
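
A quick smoke test of the function (first_name is one of Faker's built-in providers):

-- should return a random first name such as "Samantha"
select py_faker('en_us', 'first_name', null)::varchar as sample_name;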

Now, let's create a table that will host our master data. We are creating one million rows.

create or replace table customer_master_data as
select
    uuid_string() id,
    py_faker('en_us','first_name',null)::varchar firstname,
    py_faker('en_us','last_name',null)::varchar lastname,
    py_faker('en_us','street_address',null)::varchar street,
    py_faker('en_us','city',null)::varchar city,
    py_faker('en_us','postcode',null)::varchar zipcode,
    py_faker('en_us','state_abbr',null)::varchar state,
    concat(firstname,' ',lastname) name,
    concat(street,' ',city,' ',state,' ',zipcode) address,
    concat(name,' ',address) full_details
from table(generator(rowcount => 1000000));

Let's check how our data looks.

select * from customer_master_data
limit 10;
[Screenshot: Existing customer data]

As we can see, the full_details column has all the concatenated data we want to match. Let’s create another table to hold the vectorized form of the full_details column. In real life, you would use automated processes such as streams and tasks to vectorize the new data as it lands; a sketch of that follows the next statement.

create or replace table customer_master_details_vector as
select
    id,
    snowflake.cortex.EMBED_TEXT_768('e5-base-v2', full_details) as full_details_vector
from customer_master_data;
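
Here is a minimal stream-and-task sketch of that automation. The warehouse name and schedule are assumptions; adapt them to your environment:

-- Sketch: embed newly landed rows on a schedule (names and schedule are illustrative)
create or replace stream customer_master_stream on table customer_master_data;

create or replace task vectorize_new_customers
    warehouse = compute_wh      -- assumed warehouse name
    schedule = '5 MINUTE'
when system$stream_has_data('CUSTOMER_MASTER_STREAM')
as
insert into customer_master_details_vector
select id, snowflake.cortex.EMBED_TEXT_768('e5-base-v2', full_details)
from customer_master_stream
where metadata$action = 'INSERT';

alter task vectorize_new_customers resume;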

Let's create a table to simulate our new inbound data. This is the data we will compare against master data to find similarities. We will use our master data and randomly change some info for a few rows.

create or replace table new_customer_data as
with rnd_factor as (
    select
        uniform(1,5,random()) as first_name_factor,
        uniform(1,3,random()) as last_name_factor,
        uniform(1,3,random()) as address_factor,
        uniform(1,3,random()) as city_factor,
        uniform(1,5,random()) as state_factor,
        uniform(1,5,random()) as zip_factor,
        *
    from customer_master_data
    limit 10
)
select
    case
        when first_name_factor = 1 then upper(left(firstname,1))
        when first_name_factor = 2 then upper(firstname)
        when first_name_factor = 3 then ''
        else firstname
    end as firstname_new,
    case
        when last_name_factor = 2 then upper(lastname)
        else lastname
    end as lastname_new,
    case
        when address_factor = 1 then upper(street)
        else street
    end as street_new,
    case
        when city_factor = 1 then upper(city)
        else city
    end as city_new,
    case
        when zip_factor = 1 then ''
        else zipcode
    end as zipcode_new,
    case
        when state_factor = 1 then lower(state)
        when state_factor = 2 then ''
        else state
    end as state_new,
    concat(firstname_new,' ',lastname_new) as name,
    concat(street_new,' ',city_new,' ',state_new,' ',zipcode_new) as address_new,
    concat(firstname_new,' ',lastname_new,' ',address_new) as full_details
from rnd_factor;

Let's see how our new data looks.

select * from new_customer_data
limit 10;
[Screenshot: Sample new data]

Let's vectorize the full_details column for our new data.

create or replace table new_customer_data_vector as
select
    *,
    snowflake.cortex.EMBED_TEXT_768('e5-base-v2', full_details) as full_details_vector
from new_customer_data;

LET'S TEST IT

That's it; we have set up our data and created vector embeddings for everything we need for similarity matching.

We will use the Snowflake function VECTOR_COSINE_SIMILARITY to find similarities. Identical vectors have a cosine similarity of 1, two orthogonal vectors score 0, and opposite vectors score -1.

Before we test it, note that the query below does a cartesian join between new_customer_data and customer_master_data, so when testing, compare only a small set of new data. In real life, you will be comparing each batch of newly arrived data against the existing data, so the new-data side of the join should contain only a few rows.

-- identical vectors have a cosine similarity of 1
-- two orthogonal vectors will be 0
-- opposite vectors will be -1
SELECT
    v.id,
    m.full_details AS org_data,
    n.full_details AS new_data,
    VECTOR_COSINE_SIMILARITY(v.full_details_vector, n.full_details_vector) AS score
FROM
    customer_master_details_vector v
    INNER JOIN customer_master_data m ON v.id = m.id,
    new_customer_data_vector n
QUALIFY RANK() OVER (PARTITION BY n.full_details ORDER BY score DESC) = 1;
[Screenshot: Comparing new data with existing data]
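
If you prefer an absolute cutoff to a top-1 rank, you can filter on the score instead. The 0.9 below is purely illustrative; tune it against known duplicates in your own data:

-- Variant: keep only strong matches (0.9 is an illustrative threshold)
SELECT
    m.full_details AS org_data,
    n.full_details AS new_data,
    VECTOR_COSINE_SIMILARITY(v.full_details_vector, n.full_details_vector) AS score
FROM
    customer_master_details_vector v
    INNER JOIN customer_master_data m ON v.id = m.id,
    new_customer_data_vector n
WHERE VECTOR_COSINE_SIMILARITY(v.full_details_vector, n.full_details_vector) > 0.9
ORDER BY score DESC;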

Here is an example of matching a single input provided as a prompt rather than rows from another table. We embed the input text and compare it against our data to find the closest match.

SET prompt = 'Samantha Higgins 31574 RITTER VIA SUITE 500 nj 42546';

SELECT
    v.id,
    c.full_details,
    $prompt,
    VECTOR_COSINE_SIMILARITY(v.full_details_vector,
        snowflake.cortex.EMBED_TEXT_768('e5-base-v2', $prompt)) AS score
FROM
    customer_master_details_vector v
    INNER JOIN customer_master_data c ON v.id = c.id
ORDER BY score DESC
LIMIT 1;
[Screenshot: Finding the closest match for a given prompt]

COST CONSIDERATION

In my test Snowflake account, the comparison queries above took only a few seconds. I used an X-Small warehouse, as most of the work is done by the Cortex vector functions, which run in a GPU-backed cloud service environment. The Cortex documentation lists vector embedding at 0.03 credits per million tokens, so even embedding the full million-row master table cost well under a credit. In a production environment, we can expect small batches of data to be compared frequently, so the process is very cost-effective.
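
As a back-of-the-envelope check, assuming roughly 15 tokens per full_details value (an assumption; verify against your own data):

-- Rough embedding cost: rows * assumed tokens/row / 1M tokens * 0.03 credits
select count(*) * 15 / 1000000 * 0.03 as approx_credits
from customer_master_data;
-- ~1,000,000 rows * 15 tokens ≈ 15M tokens → ≈ 0.45 credits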

CONCLUSION

Snowflake Cortex brings game-changing LLM capabilities, including vector embedding. It’s simple to use, and it's secure and governed, as data never leaves the Snowflake account. Using vector embedding and the other LLM functions under the Cortex umbrella, we can build a complete, practical, end-to-end business solution with a few lines of code.
