String Matching with BigQuery Machine Learning: A Deep Dive into Solving the No-Matching-Key Puzzle

Published in

Google Cloud - Community

6 min readSep 5, 2024

In the dynamic world of data analysis, joining tables based on common keys is a fundamental operation. But what happens when the keys you need are misaligned or simply don’t exist? This is a common scenario when dealing with real-world data, where variations in naming conventions, typos, and incomplete information can hinder your analysis.

In this comprehensive guide, we will delve into the powerful capabilities of BigQuery Machine Learning (BQML) to address the string matching problem, even when your tables lack those convenient matching keys. We’ll walk through the process step-by-step, providing insights and considerations to help you achieve accurate and efficient string matching.

The Challenge of Disparate Keys

Imagine you have two tables and have no keys to join :

Table A -Master(Customer Address): This table contains valuable customer master details like name, address,mobile etc.

CREATE TABLE `myproject.mydataset.CustomerA`
(
C_NAME STRING,
C_NAME_DISPLAY STRING,
C_URL STRING,
C_STREET STRING,
C_STREET_2 STRING,
C_CITY STRING,
C_STATE STRING,
C_ZIP INT64,
C_COUNTRY STRING,
Mobile STRING
);

Consider the above 3 records in Table A of a customer which has scenarios like full address,partial address,missing address or variation in customer name etc.

Table B -Multiple Address(Customer Address): This table captures multiple latest address updated data by users, including customer_name, address and other relevant information.

CREATE TABLE `myproject.mydataset.CustomerB`
(
C_NAME STRING,
C_NAME_DISPLAY STRING,
C_URL STRING,
C_STREET STRING,
C_STREET_2 STRING,
C_CITY STRING,
C_STATE STRING,
C_ZIP INT64,
C_COUNTRY STRING
);

In the above reference Table B, the same customer has the 2 correct updated records of address

Goal is to combine these tables to find the nearest matching address record. However, a roadblock emerges: There is no key exists in both the tables. Also Data discrepancy can arise from various factors:

Name Variations: A customer might be listed as “John Smith” in one table and “John A. Smith” or even “J. Smith” in the other.
Typos: Human error can introduce misspellings or other inaccuracies in the data entry process.
Incomplete Information: One table might contain only partial customer names or nicknames or addresses.

Enter BQML: Text Embedding & Cosine Distance to the Rescue

BigQuery Machine Learning equips you with sophisticated tools to overcome this challenge:

1.Data Preparation: Laying the Foundation

Cleaning data for Table A and Table B with below simple methods :

Converting string to lowercase
Removing special characters
Concatenation & Sorting the search column groups
Assigning a HashKey for every record for uniqueness( Will be used for joining both tables for final result)
Assigning a RowNumber as a Unique Key

Functions used for Data Cleaning

CREATE OR REPLACE FUNCTION `myproject.mydataset.clean_tokenize_pipe`(a STRING) AS (
ARRAY(
  SELECT
    tokens
  FROM
    UNNEST(
      SPLIT(
        
          a
          , '|'
      )
    ) tokens
)
);

CREATE OR REPLACE FUNCTION `myproject.mydataset.clean_sort_tokenize_pipe`(a STRING) AS (
ARRAY(
 SELECT
   x
 FROM
   UNNEST(`myproject.mydataset.clean_tokenize_pipe`(a))
 AS x
 ORDER BY x)
);

Cleaning Source Table A with below procedure

//Simple Data Cleaning for Table CustomerA

DROP TABLE if exists `myproject.mydataset.CustomerA_CleanedData_1`;
CREATE or REPLACE TABLE `myproject.mydataset.CustomerA_CleanedData_1` 

as

SELECT 
ROW_NUMBER () OVER (ORDER BY AHashKey) rnum,
Name,
RTRIM(ARRAY_TO_STRING(`myproject.mydataset.clean_sort_tokenize_pipe`(Base_Matching_Text),'')) Base_Matching_Text_Sorted,
Address,
Address2,
City,
State,
Zip,
Country,
AHashKey
FROM (

        SELECT *,
                CONCAT(
                       COALESCE(Name,''),'|',
                       COALESCE(Address,''),'|',
                       COALESCE(Address2,''),'|',
                       COALESCE(City,''),'|',
                       COALESCE(State,''),'|',
                       COALESCE(Zip,'|'),'|',
                       COALESCE(Country,'|')
                     ) as Base_Matching_Text,
       
            FARM_FINGERPRINT
                    (
                      CONCAT(
                              COALESCE(Name,''),
                              COALESCE(Address,''),
                              COALESCE(Address2,''),
                              COALESCE(City,''),
                              COALESCE(State,''),
                              COALESCE(Zip,''),
                              COALESCE(Country,'')
                            )
                    ) as AHashKey

          FROM
          (
        SELECT
        REPLACE(REPLACE(regexp_replace(lower(C_NAME),r"[^a-zA-Z0-9\s]+",''),chr(10),'') ,chr(13),'') as Name,
        TRIM(regexp_replace(LOWER(C_STREET),r"[^a-zA-Z0-9\s]+",'')) as  Address,
        TRIM(regexp_replace(LOWER(C_STREET_2),r"[^a-zA-Z0-9\s]+",'')) as Address2,
        LOWER(C_CITY) as City,
        LOWER(C_STATE) as State,
        LOWER(cast(C_ZIP as string)) as Zip,
        LOWER(C_COUNTRY) as Country,
        LOWER(mobile) as Mobile
        FROM `myproject.mydataset.CustomerA` a
    )
  );

Cleaning Target Table B with below Stored Procedure

//Simple Data Cleaning for Table B

DROP TABLE IF EXISTS `myproject.mydataset.CustomerB_CleanedData_1`;
CREATE or REPLACE TABLE `myproject.mydataset.CustomerB_CleanedData_1` 

as

SELECT 
ROW_NUMBER () OVER (ORDER BY BHashKey) rnum,
Name,
RTRIM(ARRAY_TO_STRING(`myproject.mydataset.clean_sort_tokenize_pipe`(Base_Matching_Text),'')) Base_Matching_Text_Sorted,
Address,
Address2,
City,
State,
Zip,
Country,
BHashKey
FROM (

        SELECT 
              *,
            
              CONCAT(
                      COALESCE(Name,''),'|',
                      COALESCE(Address,''),'|',
                      COALESCE(Address2,''),'|',
                      COALESCE(City,''),'|',
                      COALESCE(State,''),'|',
                      COALESCE(Zip,'|'),'|',
                      COALESCE(Country,'|')
              ) as Base_Matching_Text,

              FARM_FINGERPRINT(
              CONCAT(
                      COALESCE(Name,''),
                      COALESCE(Address,''),
                      COALESCE(Address2,''),
                      COALESCE(City,''),
                      COALESCE(State,''),
                      COALESCE(Zip,''),
                      COALESCE(Country,'')
                    )
              ) as BHashKey

          FROM
          (
                SELECT
                        REPLACE(REPLACE(regexp_REPLACE(LOWER(C_NAME),r"[^a-zA-Z0-9\s]+",''),chr(10),'') ,chr(13),'') as Name,
                        TRIM(regexp_REPLACE(  LOWER(C_STREET),r"[^a-zA-Z0-9\s]+",'')) as  Address,
                        TRIM(regexp_REPLACE(LOWER(C_STREET_2),r"[^a-zA-Z0-9\s]+",'')) as Address2,
                        LOWER(C_CITY) as City,
                        LOWER(C_STATE) as State,
                        LOWER(cast(C_ZIP as string)) as Zip,
                        LOWER(C_COUNTRY) as Country
        
                FROM `myproject.mydataset.CustomerB` a
    )
  );

2. Creating Embedding on Prepared data

Step 1 : Create an BigQuery remote model external connection

Step 2 : Add IAM permissions to your BigQuery Service account as below :

Step 3 : Create a Remote Model(use stable embedding models)

CREATE OR REPLACE MODEL `myproject.mydataset.textembedding004model`
REMOTE WITH CONNECTION `projects/myproject/locations/us/connections/address_embedding_conn`
OPTIONS (ENDPOINT = 'text-embedding-004');

Step 4 : Create Embedding for Table A

DROP TABLE IF EXISTS `myproject.mydataset.CustomerA_Embeddings_2`;
CREATE TABLE
  `myproject.mydataset.CustomerA_Embeddings_2` AS
SELECT
  *
FROM
  ML.GENERATE_TEXT_EMBEDDING( MODEL `myproject.mydataset.textembedding004model`,
    (
    SELECT
      AHashKey,
      Base_Matching_Text_Sorted AS content
    FROM
      `myproject.mydataset.CustomerA_CleanedData_1`

 ));

Step 5 : Create Embedding for Table B

DROP TABLE IF EXISTS `myproject.mydataset.CustomerB_Embeddings_2`;
CREATE TABLE
  `myproject.mydataset.CustomerB_Embeddings_2` AS
SELECT
  *
FROM
  ML.GENERATE_TEXT_EMBEDDING( MODEL `myproject.mydataset.textembedding004model`,
    (
    SELECT
      BHashKey,
      Base_Matching_Text_Sorted AS content
    FROM
      `myproject.mydataset.CustomerB_CleanedData_1`
   
 ));

Step 6 : Get Cosine Distance

CREATE OR REPLACE TABLE `myproject.mydataset.data_matching_CustomerA_CustomerB_Cluster_1` 
AS  

WITH
mld as 
(
    SELECT
      ROW_NUMBER() OVER(PARTITION BY AHashkey ORDER BY AHashkey,distance ASC) AS row_num,*
         FROM 
         (

              SELECT
                s.AHashkey,
                c.BHashkey,
                ML.DISTANCE(s.text_embedding,c.text_embedding,'COSINE') AS distance
              FROM
                `myproject.mydataset.CustomerA_Embeddings_2` s,
                `myproject.mydataset.CustomerB_Embeddings_2` c

          ) 

)
SELECT * FROM mld WHERE row_num <=2;

In this query, only top 2 matching records with minimum cosine distance is considered. As per your requirement you can change the parameter.

Step 7 : Create a unified table with top n priority records(n=2 , in this case)

DROP  TABLE IF EXISTS `myproject.mydataset.CustomerA_CustomerB_MDM`;
CREATE or REPLACE TABLE `myproject.mydataset.Customer_A_CustomerB_MDM`
as
SELECT
A.AHashkey as AHashkey ,
B.BHashkey as BHashkey ,
dc1.distance             as B_distance_rank1,
1 as Priority,
A.Name                   as A_Name,
B.Name                   as B_Name,   
A.Address                as A_Address,
A.Address2               as A_Address2,
A.City                   as A_City,
A.State                  as A_State,
A.Zip                    as A_Zip,
A.Country                as A_Country,
B.Address                as B_Address,
B.Address2               as B_Address2,
B.City                   as B_City,
B.State                  as B_State,
B.Zip                    as B_Zip,
B.Country                as B_Country,
FROM `myproject.mydataset.CustomerA_CleanedData_1` A
INNER JOIN `myproject.mydataset.data_matching_CustomerA_CustomerB_Cluster_1` dc1 ON dc1.AHashKey=A.AHashkey AND dc1.row_num=1
INNER JOIN `myproject.mydataset.CustomerB_CleanedData_1`                     b   ON dc1.BHashKey=B.BHashkey AND dc1.row_num=1
WHERE dc1.distance <=0.2

UNION ALL

SELECT
A.AHashkey as AHashkey ,
B.BHashkey as BHashkey ,
dc1.distance             as B_distance_rank1,
2 as Priority,
A.Name                   as A_Name,   
B.Name                   as B_Name,
A.Address                as A_Address,
A.Address2               as A_Address2,
A.City                   as A_City,
A.State                  as A_State,
A.Zip                    as A_Zip,
A.Country                as A_Country,
B.Address                as B_Address,
B.Address2               as B_Address2,
B.City                   as B_City,
B.State                  as B_State,
B.Zip                    as B_Zip,
B.Country                as B_Country,
FROM       `myproject.mydataset.CustomerA_CleanedData_1` A
INNER JOIN `myproject.mydataset.data_matching_CustomerA_CustomerB_Cluster_1` dc1 ON dc1.AHashKey=A.AHashkey AND dc1.row_num=2
INNER JOIN `myproject.mydataset.CustomerB_CleanedData_1` b ON dc1.BHashKey=B.BHashkey AND dc1.row_num=2
WHERE dc1.distance <=0.2;

Note : You can play with a cut-off threshold value for distance for desired results, in this case its 0.2

3. Compare the Final results

SELECT 
A_Name,B_Name,Priority,
CONCAT(
        
        COALESCE(A_Address,''),'|',
        COALESCE(A_Address2,''),'|',
        COALESCE(A_City,''),'|',
        COALESCE(A_State,''),'|',
        COALESCE(A_Zip,'|'),'|',
        COALESCE(A_Country,'|')) A_Address,
CONCAT(
        
        COALESCE(B_Address,''),'|',
        COALESCE(B_Address2,''),'|',
        COALESCE(B_City,''),'|',
        COALESCE(B_State,''),'|',
        COALESCE(B_Zip,'|'),'|',
        COALESCE(B_Country,'|')) B_Address

FROM `myproject.mydataset.Customer_A_CustomerB_MDM` 
WHERE A_name LIKE 'google%' ORDER BY AHashkey,Priority

Final Output with filling the nearest record

As per our goal, in this final output every record from Table A has 2 nearest matching records from Table B, this can help to create a golden record for any Master Data Management.Finding missing data from a reference table on similarity.Solving use cases for data patching etc.

Conclusion

BigQuery Machine Learning empowers you to conquer the string matching challenge, even when traditional key-based joins fall short. By harnessing the power of string matching and BQML’s capabilities, you can unlock valuable insights hidden within your data, enabling more comprehensive and accurate analysis.

Remember, the key to success lies in experimentation, iteration, and fine-tuning your approach to achieve optimal results for your specific use case. Embrace the power of BQML, and let your data tell its story, even when the keys are elusive!

Feel free to reach out on LinkedIn and Medium, if you have any specific questions or want to explore advanced techniques for string matching in BigQuery ML!