String Matching with BigQuery Machine Learning: A Deep Dive into Solving the No-Matching-Key Puzzle

Vikas Pandey
Google Cloud - Community
6 min readSep 5, 2024

In the dynamic world of data analysis, joining tables based on common keys is a fundamental operation. But what happens when the keys you need are misaligned or simply don’t exist? This is a common scenario when dealing with real-world data, where variations in naming conventions, typos, and incomplete information can hinder your analysis.

In this comprehensive guide, we will delve into the powerful capabilities of BigQuery Machine Learning (BQML) to address the string matching problem, even when your tables lack those convenient matching keys. We’ll walk through the process step-by-step, providing insights and considerations to help you achieve accurate and efficient string matching.

The Challenge of Disparate Keys

Imagine you have two tables and have no keys to join :

  • Table A -Master(Customer Address): This table contains valuable customer master details like name, address,mobile etc.
CREATE TABLE `myproject.mydataset.CustomerA`
(
C_NAME STRING,
C_NAME_DISPLAY STRING,
C_URL STRING,
C_STREET STRING,
C_STREET_2 STRING,
C_CITY STRING,
C_STATE STRING,
C_ZIP INT64,
C_COUNTRY STRING,
Mobile STRING
);
CustomerA

Consider the above 3 records in Table A of a customer which has scenarios like full address,partial address,missing address or variation in customer name etc.

  • Table B -Multiple Address(Customer Address): This table captures multiple latest address updated data by users, including customer_name, address and other relevant information.
CREATE TABLE `myproject.mydataset.CustomerB`
(
C_NAME STRING,
C_NAME_DISPLAY STRING,
C_URL STRING,
C_STREET STRING,
C_STREET_2 STRING,
C_CITY STRING,
C_STATE STRING,
C_ZIP INT64,
C_COUNTRY STRING
);
CustomerB

In the above reference Table B, the same customer has the 2 correct updated records of address

Goal is to combine these tables to find the nearest matching address record. However, a roadblock emerges: There is no key exists in both the tables. Also Data discrepancy can arise from various factors:

  • Name Variations: A customer might be listed as “John Smith” in one table and “John A. Smith” or even “J. Smith” in the other.
  • Typos: Human error can introduce misspellings or other inaccuracies in the data entry process.
  • Incomplete Information: One table might contain only partial customer names or nicknames or addresses.

Enter BQML: Text Embedding & Cosine Distance to the Rescue

BigQuery Machine Learning equips you with sophisticated tools to overcome this challenge:

1.Data Preparation: Laying the Foundation

Cleaning data for Table A and Table B with below simple methods :

  • Converting string to lowercase
  • Removing special characters
  • Concatenation & Sorting the search column groups
  • Assigning a HashKey for every record for uniqueness( Will be used for joining both tables for final result)
  • Assigning a RowNumber as a Unique Key

Functions used for Data Cleaning

CREATE OR REPLACE FUNCTION `myproject.mydataset.clean_tokenize_pipe`(a STRING) AS (
ARRAY(
SELECT
tokens
FROM
UNNEST(
SPLIT(

a
, '|'
)
) tokens
)
);
CREATE OR REPLACE FUNCTION `myproject.mydataset.clean_sort_tokenize_pipe`(a STRING) AS (
ARRAY(
SELECT
x
FROM
UNNEST(`myproject.mydataset.clean_tokenize_pipe`(a))
AS x
ORDER BY x)
);

Cleaning Source Table A with below procedure

//Simple Data Cleaning for Table CustomerA

DROP TABLE if exists `myproject.mydataset.CustomerA_CleanedData_1`;
CREATE or REPLACE TABLE `myproject.mydataset.CustomerA_CleanedData_1`

as

SELECT
ROW_NUMBER () OVER (ORDER BY AHashKey) rnum,
Name,
RTRIM(ARRAY_TO_STRING(`myproject.mydataset.clean_sort_tokenize_pipe`(Base_Matching_Text),'')) Base_Matching_Text_Sorted,
Address,
Address2,
City,
State,
Zip,
Country,
AHashKey
FROM (

SELECT *,
CONCAT(
COALESCE(Name,''),'|',
COALESCE(Address,''),'|',
COALESCE(Address2,''),'|',
COALESCE(City,''),'|',
COALESCE(State,''),'|',
COALESCE(Zip,'|'),'|',
COALESCE(Country,'|')
) as Base_Matching_Text,

FARM_FINGERPRINT
(
CONCAT(
COALESCE(Name,''),
COALESCE(Address,''),
COALESCE(Address2,''),
COALESCE(City,''),
COALESCE(State,''),
COALESCE(Zip,''),
COALESCE(Country,'')
)
) as AHashKey

FROM
(
SELECT
REPLACE(REPLACE(regexp_replace(lower(C_NAME),r"[^a-zA-Z0-9\s]+",''),chr(10),'') ,chr(13),'') as Name,
TRIM(regexp_replace(LOWER(C_STREET),r"[^a-zA-Z0-9\s]+",'')) as Address,
TRIM(regexp_replace(LOWER(C_STREET_2),r"[^a-zA-Z0-9\s]+",'')) as Address2,
LOWER(C_CITY) as City,
LOWER(C_STATE) as State,
LOWER(cast(C_ZIP as string)) as Zip,
LOWER(C_COUNTRY) as Country,
LOWER(mobile) as Mobile
FROM `myproject.mydataset.CustomerA` a
)
);

Cleaning Target Table B with below Stored Procedure

//Simple Data Cleaning for Table B

DROP TABLE IF EXISTS `myproject.mydataset.CustomerB_CleanedData_1`;
CREATE or REPLACE TABLE `myproject.mydataset.CustomerB_CleanedData_1`

as

SELECT
ROW_NUMBER () OVER (ORDER BY BHashKey) rnum,
Name,
RTRIM(ARRAY_TO_STRING(`myproject.mydataset.clean_sort_tokenize_pipe`(Base_Matching_Text),'')) Base_Matching_Text_Sorted,
Address,
Address2,
City,
State,
Zip,
Country,
BHashKey
FROM (

SELECT
*,

CONCAT(
COALESCE(Name,''),'|',
COALESCE(Address,''),'|',
COALESCE(Address2,''),'|',
COALESCE(City,''),'|',
COALESCE(State,''),'|',
COALESCE(Zip,'|'),'|',
COALESCE(Country,'|')
) as Base_Matching_Text,

FARM_FINGERPRINT(
CONCAT(
COALESCE(Name,''),
COALESCE(Address,''),
COALESCE(Address2,''),
COALESCE(City,''),
COALESCE(State,''),
COALESCE(Zip,''),
COALESCE(Country,'')
)
) as BHashKey

FROM
(
SELECT
REPLACE(REPLACE(regexp_REPLACE(LOWER(C_NAME),r"[^a-zA-Z0-9\s]+",''),chr(10),'') ,chr(13),'') as Name,
TRIM(regexp_REPLACE( LOWER(C_STREET),r"[^a-zA-Z0-9\s]+",'')) as Address,
TRIM(regexp_REPLACE(LOWER(C_STREET_2),r"[^a-zA-Z0-9\s]+",'')) as Address2,
LOWER(C_CITY) as City,
LOWER(C_STATE) as State,
LOWER(cast(C_ZIP as string)) as Zip,
LOWER(C_COUNTRY) as Country

FROM `myproject.mydataset.CustomerB` a
)
);

2. Creating Embedding on Prepared data

Step 1 : Create an BigQuery remote model external connection

Step 2 : Add IAM permissions to your BigQuery Service account as below :

Service Account Permissions

Step 3 : Create a Remote Model(use stable embedding models)

CREATE OR REPLACE MODEL `myproject.mydataset.textembedding004model`
REMOTE WITH CONNECTION `projects/myproject/locations/us/connections/address_embedding_conn`
OPTIONS (ENDPOINT = 'text-embedding-004');

Step 4 : Create Embedding for Table A

DROP TABLE IF EXISTS `myproject.mydataset.CustomerA_Embeddings_2`;
CREATE TABLE
`myproject.mydataset.CustomerA_Embeddings_2` AS
SELECT
*
FROM
ML.GENERATE_TEXT_EMBEDDING( MODEL `myproject.mydataset.textembedding004model`,
(
SELECT
AHashKey,
Base_Matching_Text_Sorted AS content
FROM
`myproject.mydataset.CustomerA_CleanedData_1`

));

Step 5 : Create Embedding for Table B

DROP TABLE IF EXISTS `myproject.mydataset.CustomerB_Embeddings_2`;
CREATE TABLE
`myproject.mydataset.CustomerB_Embeddings_2` AS
SELECT
*
FROM
ML.GENERATE_TEXT_EMBEDDING( MODEL `myproject.mydataset.textembedding004model`,
(
SELECT
BHashKey,
Base_Matching_Text_Sorted AS content
FROM
`myproject.mydataset.CustomerB_CleanedData_1`

));

Step 6 : Get Cosine Distance

CREATE OR REPLACE TABLE `myproject.mydataset.data_matching_CustomerA_CustomerB_Cluster_1` 
AS

WITH
mld as
(
SELECT
ROW_NUMBER() OVER(PARTITION BY AHashkey ORDER BY AHashkey,distance ASC) AS row_num,*
FROM
(

SELECT
s.AHashkey,
c.BHashkey,
ML.DISTANCE(s.text_embedding,c.text_embedding,'COSINE') AS distance
FROM
`myproject.mydataset.CustomerA_Embeddings_2` s,
`myproject.mydataset.CustomerB_Embeddings_2` c

)

)
SELECT * FROM mld WHERE row_num <=2;

In this query, only top 2 matching records with minimum cosine distance is considered. As per your requirement you can change the parameter.

Step 7 : Create a unified table with top n priority records(n=2 , in this case)

DROP  TABLE IF EXISTS `myproject.mydataset.CustomerA_CustomerB_MDM`;
CREATE or REPLACE TABLE `myproject.mydataset.Customer_A_CustomerB_MDM`
as
SELECT
A.AHashkey as AHashkey ,
B.BHashkey as BHashkey ,
dc1.distance as B_distance_rank1,
1 as Priority,
A.Name as A_Name,
B.Name as B_Name,
A.Address as A_Address,
A.Address2 as A_Address2,
A.City as A_City,
A.State as A_State,
A.Zip as A_Zip,
A.Country as A_Country,
B.Address as B_Address,
B.Address2 as B_Address2,
B.City as B_City,
B.State as B_State,
B.Zip as B_Zip,
B.Country as B_Country,
FROM `myproject.mydataset.CustomerA_CleanedData_1` A
INNER JOIN `myproject.mydataset.data_matching_CustomerA_CustomerB_Cluster_1` dc1 ON dc1.AHashKey=A.AHashkey AND dc1.row_num=1
INNER JOIN `myproject.mydataset.CustomerB_CleanedData_1` b ON dc1.BHashKey=B.BHashkey AND dc1.row_num=1
WHERE dc1.distance <=0.2

UNION ALL

SELECT
A.AHashkey as AHashkey ,
B.BHashkey as BHashkey ,
dc1.distance as B_distance_rank1,
2 as Priority,
A.Name as A_Name,
B.Name as B_Name,
A.Address as A_Address,
A.Address2 as A_Address2,
A.City as A_City,
A.State as A_State,
A.Zip as A_Zip,
A.Country as A_Country,
B.Address as B_Address,
B.Address2 as B_Address2,
B.City as B_City,
B.State as B_State,
B.Zip as B_Zip,
B.Country as B_Country,
FROM `myproject.mydataset.CustomerA_CleanedData_1` A
INNER JOIN `myproject.mydataset.data_matching_CustomerA_CustomerB_Cluster_1` dc1 ON dc1.AHashKey=A.AHashkey AND dc1.row_num=2
INNER JOIN `myproject.mydataset.CustomerB_CleanedData_1` b ON dc1.BHashKey=B.BHashkey AND dc1.row_num=2
WHERE dc1.distance <=0.2;

Note : You can play with a cut-off threshold value for distance for desired results, in this case its 0.2

3. Compare the Final results

SELECT 
A_Name,B_Name,Priority,
CONCAT(

COALESCE(A_Address,''),'|',
COALESCE(A_Address2,''),'|',
COALESCE(A_City,''),'|',
COALESCE(A_State,''),'|',
COALESCE(A_Zip,'|'),'|',
COALESCE(A_Country,'|')) A_Address,
CONCAT(

COALESCE(B_Address,''),'|',
COALESCE(B_Address2,''),'|',
COALESCE(B_City,''),'|',
COALESCE(B_State,''),'|',
COALESCE(B_Zip,'|'),'|',
COALESCE(B_Country,'|')) B_Address

FROM `myproject.mydataset.Customer_A_CustomerB_MDM`
WHERE A_name LIKE 'google%' ORDER BY AHashkey,Priority
Final Output with filling the nearest record

As per our goal, in this final output every record from Table A has 2 nearest matching records from Table B, this can help to create a golden record for any Master Data Management.Finding missing data from a reference table on similarity.Solving use cases for data patching etc.

Conclusion

BigQuery Machine Learning empowers you to conquer the string matching challenge, even when traditional key-based joins fall short. By harnessing the power of string matching and BQML’s capabilities, you can unlock valuable insights hidden within your data, enabling more comprehensive and accurate analysis.

Remember, the key to success lies in experimentation, iteration, and fine-tuning your approach to achieve optimal results for your specific use case. Embrace the power of BQML, and let your data tell its story, even when the keys are elusive!

Feel free to reach out on LinkedIn and Medium, if you have any specific questions or want to explore advanced techniques for string matching in BigQuery ML!

--

--