String Matching with BigQuery Machine Learning: A Deep Dive into Solving the No-Matching-Key Puzzle
In the dynamic world of data analysis, joining tables based on common keys is a fundamental operation. But what happens when the keys you need are misaligned or simply don’t exist? This is a common scenario when dealing with real-world data, where variations in naming conventions, typos, and incomplete information can hinder your analysis.
In this comprehensive guide, we will delve into the powerful capabilities of BigQuery Machine Learning (BQML) to address the string matching problem, even when your tables lack those convenient matching keys. We’ll walk through the process step-by-step, providing insights and considerations to help you achieve accurate and efficient string matching.
The Challenge of Disparate Keys
Imagine you have two tables and have no keys to join :
- Table A -Master(Customer Address): This table contains valuable customer master details like
name
,address
,mobile
etc.
CREATE TABLE `myproject.mydataset.CustomerA`
(
C_NAME STRING,
C_NAME_DISPLAY STRING,
C_URL STRING,
C_STREET STRING,
C_STREET_2 STRING,
C_CITY STRING,
C_STATE STRING,
C_ZIP INT64,
C_COUNTRY STRING,
Mobile STRING
);
Consider the above 3 records in Table A of a customer which has scenarios like full address,partial address,missing address or variation in customer name etc.
- Table B -Multiple Address(Customer Address): This table captures multiple latest address updated data by users, including
customer_name
,address
and other relevant information.
CREATE TABLE `myproject.mydataset.CustomerB`
(
C_NAME STRING,
C_NAME_DISPLAY STRING,
C_URL STRING,
C_STREET STRING,
C_STREET_2 STRING,
C_CITY STRING,
C_STATE STRING,
C_ZIP INT64,
C_COUNTRY STRING
);
In the above reference Table B, the same customer has the 2 correct updated records of address
Goal is to combine these tables to find the nearest matching address record. However, a roadblock emerges: There is no key exists in both the tables. Also Data discrepancy can arise from various factors:
- Name Variations: A customer might be listed as “John Smith” in one table and “John A. Smith” or even “J. Smith” in the other.
- Typos: Human error can introduce misspellings or other inaccuracies in the data entry process.
- Incomplete Information: One table might contain only partial customer names or nicknames or addresses.
Enter BQML: Text Embedding & Cosine Distance to the Rescue
BigQuery Machine Learning equips you with sophisticated tools to overcome this challenge:
1.Data Preparation: Laying the Foundation
Cleaning data for Table A and Table B with below simple methods :
- Converting string to lowercase
- Removing special characters
- Concatenation & Sorting the search column groups
- Assigning a HashKey for every record for uniqueness( Will be used for joining both tables for final result)
- Assigning a RowNumber as a Unique Key
Functions used for Data Cleaning
CREATE OR REPLACE FUNCTION `myproject.mydataset.clean_tokenize_pipe`(a STRING) AS (
ARRAY(
SELECT
tokens
FROM
UNNEST(
SPLIT(
a
, '|'
)
) tokens
)
);
CREATE OR REPLACE FUNCTION `myproject.mydataset.clean_sort_tokenize_pipe`(a STRING) AS (
ARRAY(
SELECT
x
FROM
UNNEST(`myproject.mydataset.clean_tokenize_pipe`(a))
AS x
ORDER BY x)
);
Cleaning Source Table A with below procedure
//Simple Data Cleaning for Table CustomerA
DROP TABLE if exists `myproject.mydataset.CustomerA_CleanedData_1`;
CREATE or REPLACE TABLE `myproject.mydataset.CustomerA_CleanedData_1`
as
SELECT
ROW_NUMBER () OVER (ORDER BY AHashKey) rnum,
Name,
RTRIM(ARRAY_TO_STRING(`myproject.mydataset.clean_sort_tokenize_pipe`(Base_Matching_Text),'')) Base_Matching_Text_Sorted,
Address,
Address2,
City,
State,
Zip,
Country,
AHashKey
FROM (
SELECT *,
CONCAT(
COALESCE(Name,''),'|',
COALESCE(Address,''),'|',
COALESCE(Address2,''),'|',
COALESCE(City,''),'|',
COALESCE(State,''),'|',
COALESCE(Zip,'|'),'|',
COALESCE(Country,'|')
) as Base_Matching_Text,
FARM_FINGERPRINT
(
CONCAT(
COALESCE(Name,''),
COALESCE(Address,''),
COALESCE(Address2,''),
COALESCE(City,''),
COALESCE(State,''),
COALESCE(Zip,''),
COALESCE(Country,'')
)
) as AHashKey
FROM
(
SELECT
REPLACE(REPLACE(regexp_replace(lower(C_NAME),r"[^a-zA-Z0-9\s]+",''),chr(10),'') ,chr(13),'') as Name,
TRIM(regexp_replace(LOWER(C_STREET),r"[^a-zA-Z0-9\s]+",'')) as Address,
TRIM(regexp_replace(LOWER(C_STREET_2),r"[^a-zA-Z0-9\s]+",'')) as Address2,
LOWER(C_CITY) as City,
LOWER(C_STATE) as State,
LOWER(cast(C_ZIP as string)) as Zip,
LOWER(C_COUNTRY) as Country,
LOWER(mobile) as Mobile
FROM `myproject.mydataset.CustomerA` a
)
);
Cleaning Target Table B with below Stored Procedure
//Simple Data Cleaning for Table B
DROP TABLE IF EXISTS `myproject.mydataset.CustomerB_CleanedData_1`;
CREATE or REPLACE TABLE `myproject.mydataset.CustomerB_CleanedData_1`
as
SELECT
ROW_NUMBER () OVER (ORDER BY BHashKey) rnum,
Name,
RTRIM(ARRAY_TO_STRING(`myproject.mydataset.clean_sort_tokenize_pipe`(Base_Matching_Text),'')) Base_Matching_Text_Sorted,
Address,
Address2,
City,
State,
Zip,
Country,
BHashKey
FROM (
SELECT
*,
CONCAT(
COALESCE(Name,''),'|',
COALESCE(Address,''),'|',
COALESCE(Address2,''),'|',
COALESCE(City,''),'|',
COALESCE(State,''),'|',
COALESCE(Zip,'|'),'|',
COALESCE(Country,'|')
) as Base_Matching_Text,
FARM_FINGERPRINT(
CONCAT(
COALESCE(Name,''),
COALESCE(Address,''),
COALESCE(Address2,''),
COALESCE(City,''),
COALESCE(State,''),
COALESCE(Zip,''),
COALESCE(Country,'')
)
) as BHashKey
FROM
(
SELECT
REPLACE(REPLACE(regexp_REPLACE(LOWER(C_NAME),r"[^a-zA-Z0-9\s]+",''),chr(10),'') ,chr(13),'') as Name,
TRIM(regexp_REPLACE( LOWER(C_STREET),r"[^a-zA-Z0-9\s]+",'')) as Address,
TRIM(regexp_REPLACE(LOWER(C_STREET_2),r"[^a-zA-Z0-9\s]+",'')) as Address2,
LOWER(C_CITY) as City,
LOWER(C_STATE) as State,
LOWER(cast(C_ZIP as string)) as Zip,
LOWER(C_COUNTRY) as Country
FROM `myproject.mydataset.CustomerB` a
)
);
2. Creating Embedding on Prepared data
Step 1 : Create an BigQuery remote model external connection
Step 2 : Add IAM permissions to your BigQuery Service account as below :
Step 3 : Create a Remote Model(use stable embedding models)
CREATE OR REPLACE MODEL `myproject.mydataset.textembedding004model`
REMOTE WITH CONNECTION `projects/myproject/locations/us/connections/address_embedding_conn`
OPTIONS (ENDPOINT = 'text-embedding-004');
Step 4 : Create Embedding for Table A
DROP TABLE IF EXISTS `myproject.mydataset.CustomerA_Embeddings_2`;
CREATE TABLE
`myproject.mydataset.CustomerA_Embeddings_2` AS
SELECT
*
FROM
ML.GENERATE_TEXT_EMBEDDING( MODEL `myproject.mydataset.textembedding004model`,
(
SELECT
AHashKey,
Base_Matching_Text_Sorted AS content
FROM
`myproject.mydataset.CustomerA_CleanedData_1`
));
Step 5 : Create Embedding for Table B
DROP TABLE IF EXISTS `myproject.mydataset.CustomerB_Embeddings_2`;
CREATE TABLE
`myproject.mydataset.CustomerB_Embeddings_2` AS
SELECT
*
FROM
ML.GENERATE_TEXT_EMBEDDING( MODEL `myproject.mydataset.textembedding004model`,
(
SELECT
BHashKey,
Base_Matching_Text_Sorted AS content
FROM
`myproject.mydataset.CustomerB_CleanedData_1`
));
Step 6 : Get Cosine Distance
CREATE OR REPLACE TABLE `myproject.mydataset.data_matching_CustomerA_CustomerB_Cluster_1`
AS
WITH
mld as
(
SELECT
ROW_NUMBER() OVER(PARTITION BY AHashkey ORDER BY AHashkey,distance ASC) AS row_num,*
FROM
(
SELECT
s.AHashkey,
c.BHashkey,
ML.DISTANCE(s.text_embedding,c.text_embedding,'COSINE') AS distance
FROM
`myproject.mydataset.CustomerA_Embeddings_2` s,
`myproject.mydataset.CustomerB_Embeddings_2` c
)
)
SELECT * FROM mld WHERE row_num <=2;
In this query, only top 2 matching records with minimum cosine distance is considered. As per your requirement you can change the parameter.
Step 7 : Create a unified table with top n priority records(n=2 , in this case)
DROP TABLE IF EXISTS `myproject.mydataset.CustomerA_CustomerB_MDM`;
CREATE or REPLACE TABLE `myproject.mydataset.Customer_A_CustomerB_MDM`
as
SELECT
A.AHashkey as AHashkey ,
B.BHashkey as BHashkey ,
dc1.distance as B_distance_rank1,
1 as Priority,
A.Name as A_Name,
B.Name as B_Name,
A.Address as A_Address,
A.Address2 as A_Address2,
A.City as A_City,
A.State as A_State,
A.Zip as A_Zip,
A.Country as A_Country,
B.Address as B_Address,
B.Address2 as B_Address2,
B.City as B_City,
B.State as B_State,
B.Zip as B_Zip,
B.Country as B_Country,
FROM `myproject.mydataset.CustomerA_CleanedData_1` A
INNER JOIN `myproject.mydataset.data_matching_CustomerA_CustomerB_Cluster_1` dc1 ON dc1.AHashKey=A.AHashkey AND dc1.row_num=1
INNER JOIN `myproject.mydataset.CustomerB_CleanedData_1` b ON dc1.BHashKey=B.BHashkey AND dc1.row_num=1
WHERE dc1.distance <=0.2
UNION ALL
SELECT
A.AHashkey as AHashkey ,
B.BHashkey as BHashkey ,
dc1.distance as B_distance_rank1,
2 as Priority,
A.Name as A_Name,
B.Name as B_Name,
A.Address as A_Address,
A.Address2 as A_Address2,
A.City as A_City,
A.State as A_State,
A.Zip as A_Zip,
A.Country as A_Country,
B.Address as B_Address,
B.Address2 as B_Address2,
B.City as B_City,
B.State as B_State,
B.Zip as B_Zip,
B.Country as B_Country,
FROM `myproject.mydataset.CustomerA_CleanedData_1` A
INNER JOIN `myproject.mydataset.data_matching_CustomerA_CustomerB_Cluster_1` dc1 ON dc1.AHashKey=A.AHashkey AND dc1.row_num=2
INNER JOIN `myproject.mydataset.CustomerB_CleanedData_1` b ON dc1.BHashKey=B.BHashkey AND dc1.row_num=2
WHERE dc1.distance <=0.2;
Note : You can play with a cut-off threshold value for distance for desired results, in this case its 0.2
3. Compare the Final results
SELECT
A_Name,B_Name,Priority,
CONCAT(
COALESCE(A_Address,''),'|',
COALESCE(A_Address2,''),'|',
COALESCE(A_City,''),'|',
COALESCE(A_State,''),'|',
COALESCE(A_Zip,'|'),'|',
COALESCE(A_Country,'|')) A_Address,
CONCAT(
COALESCE(B_Address,''),'|',
COALESCE(B_Address2,''),'|',
COALESCE(B_City,''),'|',
COALESCE(B_State,''),'|',
COALESCE(B_Zip,'|'),'|',
COALESCE(B_Country,'|')) B_Address
FROM `myproject.mydataset.Customer_A_CustomerB_MDM`
WHERE A_name LIKE 'google%' ORDER BY AHashkey,Priority
As per our goal, in this final output every record from Table A has 2 nearest matching records from Table B, this can help to create a golden record for any Master Data Management.Finding missing data from a reference table on similarity.Solving use cases for data patching etc.
Conclusion
BigQuery Machine Learning empowers you to conquer the string matching challenge, even when traditional key-based joins fall short. By harnessing the power of string matching and BQML’s capabilities, you can unlock valuable insights hidden within your data, enabling more comprehensive and accurate analysis.
Remember, the key to success lies in experimentation, iteration, and fine-tuning your approach to achieve optimal results for your specific use case. Embrace the power of BQML, and let your data tell its story, even when the keys are elusive!
Feel free to reach out on LinkedIn and Medium, if you have any specific questions or want to explore advanced techniques for string matching in BigQuery ML!