Master Data Management Simplified: Match & Merge with Generative AI!

Abirami Sukumaran
Google Cloud - Community
12 min read · Apr 16, 2024

It is my dream to accelerate some of the tedious MDM processes with Generative AI, Embeddings, Vector Search and more… Read on.

Image showing a bike station with docked bikes to represent the use case

Introduction

At its core, Master Data Management (MDM) is about creating a single, reliable source of truth for your organization’s most critical data. Think of MDM as a meticulously curated library where every book (data point) is correctly labeled, up-to-date, and easy to find. Implementing robust MDM has always been a strategic necessity, but it often comes with complexities and resource demands. This is where the transformative power of Generative AI, particularly models like Gemini 1.0 Pro, Gemini 1.0 Pro Vision, and Gemini 1.5 Pro, comes into play.

In this article, we will demonstrate how Gemini 1.0 Pro simplifies master data management tasks like enrichment and deduplication for the citibike_stations data available in the BigQuery public dataset. For this, we will use:
1. BigQuery public dataset bigquery-public-data.new_york_citibike.
2. Gemini Function Calling (a Java Cloud Function that gets the address information using the reverse Geocoding API for the coordinates available with the citibike_stations data) that we have already created in one of our previous articles.
3. Vertex AI Embeddings API and Vector Search in BigQuery to identify duplicates.

Master Data Management (MDM): The Foundation for Data-Driven Decisions

The key elements of master data are business entities (customers, products, suppliers, locations, and so on: the nouns your business revolves around), identifiers (unique identifiers that make each entity distinct and traceable across systems), and attributes (the characteristics that describe each entity, e.g. a customer’s address or a product’s price). I am going to spend a little space on this, because understanding the importance of MDM is more paramount in the generative AI era than ever. Why MDM Matters:

  • Adds clarity to data:

Without MDM, businesses struggle with fragmented data scattered across various systems. This leads to inconsistencies, duplicates, and sometimes ambiguous understanding of relationships between data points (e.g. is “Abirami Sukumaran” in one system the same customer as “Abirami S.” in another?).

  • Helps in decision making:

Accurate, unified master data is the foundation of informed business decisions. It answers questions like — Who are our most valuable customers? Which products are underperforming? What suppliers provide the best value?

  • Serves data as a Product:

MDM is essential for treating data as a valuable asset. Clean, integrated, and up-to-date master data forms the basis for building insightful data products that drive business outcomes.

Let me paint a picture to help you understand master data better by comparing it with transactional data: Transactional data captures individual events (a purchase, a shipment etc.). Master data provides the context for those events by defining the entities involved. For example, a sales transaction links to master data for the customer, product, and salesperson. Without MDM, data tends to remain in silos, hindering a holistic view, remaining prone to quality issues, and forcing relatively heavy resource spend on on-demand data reconciliation and consolidation.
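To make the contrast concrete, here is a tiny illustrative sketch. The records, IDs, and the describe_sale helper are all made up for illustration; the point is only that a transaction carries IDs, while master data supplies the context behind them:

```python
# Hypothetical illustration: master data gives context to transactional events.

# Master data: the entities the business revolves around, keyed by unique identifiers.
customers = {"C001": {"name": "Abirami Sukumaran", "city": "New York"}}
products = {"P100": {"name": "City Bike Helmet", "price": 49.99}}

# Transactional data: an individual event that references master entities by ID.
sale = {"txn_id": "T9001", "customer_id": "C001", "product_id": "P100", "qty": 2}

def describe_sale(txn):
    """Resolve a transaction against master data to get a holistic view."""
    c = customers[txn["customer_id"]]
    p = products[txn["product_id"]]
    return f'{c["name"]} bought {txn["qty"]} x {p["name"]} for {txn["qty"] * p["price"]:.2f}'

print(describe_sale(sale))
# → Abirami Sukumaran bought 2 x City Bike Helmet for 99.98
```

If "C001" appears as "Abirami S." in a second system with a different ID, the join above silently breaks; that is exactly the fragmentation MDM exists to fix.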

The powerful synergy of MDM and Generative AI

Below are some of the industry grade applications and advantages of empowering and simplifying MDM with Generative AI:

  • Automated Data Cleansing: Gemini 1.0 Pro can intelligently identify and rectify inconsistencies, duplicates, and errors in master data, significantly reducing manual effort.
  • Intelligent Matching: Generative AI’s language understanding, embeddings, and Vector Search come together to accurately match and merge records from disparate sources, even with variations in formatting or naming conventions.
  • Contextual Enrichment: Generative AI models can generate missing attributes or provide additional context to master data, enhancing its value for analysis and decision-making.
  • Adaptive Learning: You can leverage fine tuning and reinforcement learning to enable generative AI to continuously improve its understanding of your business entities and relationships, leading to more accurate and efficient MDM over time.

High Level Flow Diagram

This diagram represents the flow of data and steps involved in the implementation.

High level flow of the use case

Demo

The steps loosely are as follows:

  1. Create a BigQuery dataset for the use case. Create a landing table with data from the public dataset table bigquery-public-data.new_york_citibike.citibike_stations.
  2. Make sure the Cloud Function that includes Gemini Function Calling for address standardization is deployed.
  3. Store the enriched address data in the landing tables (from 2 sources, for demo purposes).
  4. Invoke Vertex AI Embeddings API from BigQuery on the address data.
  5. Use BigQuery Vector Search to identify duplicate records.

Setup

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
  3. You will use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. From the Cloud Console, click Activate Cloud Shell on the top right corner.
  4. Enable necessary APIs for this implementation if you haven’t already. An alternative to the gcloud command is to enable them through the console using this link.

Step 1: Create BigQuery Dataset and External Connection

A BigQuery dataset is a container for all the tables and objects of your application. A BigQuery connection is used to interact with your Cloud Function: in order to create a remote function, you must create a BigQuery connection. Let’s begin by creating the dataset and the connection.

  1. From the Google Cloud Console, go to the BigQuery page and click the 3 vertical dots icon next to your project id. From the list of options, select “Create data set”.
  2. In the Create data set pop up, enter the data set ID “mdm_gemini” with the region set to the default value “US (multiple regions…)”.
  3. BigLake Connection allows us to connect the external data source while retaining fine-grained BigQuery access control and security, which in our case is the Vertex AI Gemini Pro API. We will use this connection to access the model from BigQuery via the Cloud Function. Follow steps below to create the BigLake Connection:

a. Click ADD on the Explorer pane of the BigQuery page:

ADD button to create connection

b. Click Connections to external data sources in the sources page.

c. Enter external data source details as below in the pop up that shows up and click CREATE CONNECTION:

Create Connection

d. Once the connection is created, go to the connection configuration page and copy the Service account ID for access provisioning:

Connection Info

e. Open IAM and admin page, click GRANT ACCESS, enter the service account id in the new principals tab and roles as shown below and click SAVE.

Grant Access

Step 2: Deploy Function Calling (Java Cloud Function)

  1. Clone the github repo from your Cloud Shell Terminal and replace YOUR_API_KEY and YOUR_PROJECT_ID with your values:
git clone https://github.com/AbiramiSukumaran/GeminiFunctionCalling

2. Go to the Cloud Shell terminal, navigate into the newly cloned project directory “GeminiFunctionCalling” and execute the following command to build and deploy the Cloud Function:

gcloud functions deploy gemini-fn-calling --gen2 --region=us-central1 --runtime=java11 --source=. --entry-point=cloudcode.helloworld.HelloWorld --trigger-http

The result of this deploy command is a REST URL in the following format:

https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling

3. Test this Cloud Function by running the following command from the terminal:

gcloud functions call gemini-fn-calling --region=us-central1 --gen2 --data '{"calls":[["40.714224,-73.961452"]]}'

Response for a random sample prompt:

  '{"replies":["{ \"DOOR_NUMBER\": \"277\", \"STREET_ADDRESS\": \"Bedford Ave\", \"AREA\": null, \"CITY\": \"Brooklyn\", \"TOWN\": null, \"COUNTY\": \"Kings County\", \"STATE\": \"NY\", \"COUNTRY\": \"USA\", \"ZIPCODE\": \"11211\", \"LANDMARK\": null}"]}'

The request and response parameters of this Cloud Function are implemented in a way that is compatible with BigQuery’s remote function invocation, so it can be consumed directly from BigQuery data in place. This means that if your input data (latitude and longitude) lives in BigQuery, you can call the remote function on that data, and the function response can be stored or processed within BigQuery directly.
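The wire format behind that compatibility is simple: BigQuery POSTs a JSON body of the shape {"calls": [[row1 args], [row2 args], ...]} and expects {"replies": [one result per row]} back, in the same order. Here is a minimal Python sketch of that contract; handle_bigquery_request and the geocode stand-in are hypothetical names, and geocode only echoes the coordinates instead of doing the real reverse-geocoding + Gemini work:

```python
import json

def geocode(latlng: str) -> str:
    # Stand-in for the real reverse-geocoding + Gemini Function Calling logic:
    # here we just echo the parsed coordinates back as a JSON string.
    lat, lng = latlng.split(",")
    return json.dumps({"LAT": lat.strip(), "LNG": lng.strip()})

def handle_bigquery_request(body: str) -> str:
    # BigQuery sends one batch: {"calls": [[args for row 1], [args for row 2], ...]}.
    request = json.loads(body)
    # One reply per input row, in the same order the rows arrived.
    replies = [geocode(call[0]) for call in request["calls"]]
    return json.dumps({"replies": replies})

print(handle_bigquery_request('{"calls":[["40.714224,-73.961452"]]}'))
# → {"replies": ["{\"LAT\": \"40.714224\", \"LNG\": \"-73.961452\"}"]}
```

Because the function honors this calls/replies shape, BigQuery can batch rows into it transparently (max_batching_rows in the remote function DDL controls the batch size).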

4. Run the following DDL from BigQuery to create a remote function that invokes this deployed Cloud Function:

CREATE OR REPLACE FUNCTION `mdm_gemini.MDM_GEMINI` (latlng STRING) RETURNS STRING
REMOTE WITH CONNECTION `us.gemini-bq-conn`
OPTIONS (
  endpoint = 'https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling',
  max_batching_rows = 1
);

WORKAROUND

If you do not have the necessary key or Cloud Function deployed, feel free to jump to the landing table directly by loading the data from the CSV into your new BigQuery dataset mdm_gemini using the following command in the Cloud Shell Terminal. Remember to download the file CITIBIKE_STATIONS.csv from the repo into your Cloud Shell project folder and navigate into that folder before executing the command:

bq load --source_format=CSV --skip_leading_rows=1 \
  mdm_gemini.CITIBIKE_STATIONS ./CITIBIKE_STATIONS.csv \
  name:string,latlng:string,capacity:numeric,num_bikes_available:numeric,num_docks_available:numeric,last_reported:timestamp,full_address_string:string

Step 3: Create Table and Enrich Address data

a. If you used the WORKAROUND approach in the last step, you can skip this step, since you have already created the table there. If NOT, proceed to run the following DDL in the BigQuery SQL Editor:

CREATE TABLE mdm_gemini.CITIBIKE_STATIONS as (
select name, cast(latitude as string) || ',' || cast(longitude as string) as latlong,
capacity, num_bikes_available, num_docks_available, last_reported,
'' as full_address_string
from `bigquery-public-data.new_york_citibike.citibike_stations`);

Let’s enrich the address data by invoking the remote function on the latitude and longitude coordinates available in the table. Please note that we will update this only for data reported for the year 2024 and where number of bikes available > 0 and capacity > 100:

update `mdm_gemini.CITIBIKE_STATIONS` 
set full_address_string = `mdm_gemini.MDM_GEMINI`(latlong)
where EXTRACT(YEAR FROM last_reported) = 2024 and num_bikes_available > 0 and capacity > 100;

b. Do not skip this step even if you used the WORKAROUND approach in the last step. This is where we will create a second source of bike station location data for this use case. After all, MDM is about bringing data from multiple sources together and identifying the golden truth.

Run the following statements in the BigQuery SQL Editor to create the second source of location data, mdm_gemini.CITIBIKE_STATIONS_SOURCE2, and insert 2 records into it:

CREATE TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE2 (name STRING(55), address STRING(1000), embeddings_src ARRAY<FLOAT64>);

insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES ('Location broadway and 29','{ "DOOR_NUMBER": "1593", "STREET_ADDRESS": "Broadway", "AREA": null, "CITY": "New York", "TOWN": null, "COUNTY": "New York County", "STATE": "NY", "COUNTRY": "USA", "ZIPCODE": "10019", "LANDMARK": null}', null);

insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES ('Allen St & Hester','{ "DOOR_NUMBER": "36", "STREET_ADDRESS": "Allen St", "AREA": null, "CITY": "New York", "TOWN": null, "COUNTY": "New York County", "STATE": "NY", "COUNTRY": "USA", "ZIPCODE": "10002", "LANDMARK": null}', null);

Step 4: Generate Embeddings for Address Data

Embeddings are high-dimensional numerical vectors that represent a given entity, like a piece of text or an audio file. Machine learning (ML) models use embeddings to encode semantics about such entities to make it easier to reason about and compare them. For example, a common operation in clustering, classification, and recommendation models is to measure the distance between vectors in an embedding space to find items that are most semantically similar. The Vertex AI text-embeddings API lets you create a text embedding using Generative AI on Vertex AI. Text embeddings are numerical representations of text that capture relationships between words and phrases. Read more about Vertex AI Text Embeddings here.
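The "distance in an embedding space" idea can be shown with a toy sketch. The vectors below are invented 3-dimensional stand-ins (real textembedding-gecko vectors have hundreds of dimensions), chosen only so that two renderings of the same address land near each other:

```python
import math

# Made-up toy vectors; real text embeddings have far more dimensions.
emb_a = [0.90, 0.10, 0.20]  # e.g. "277 Bedford Ave, Brooklyn, NY 11211"
emb_b = [0.88, 0.12, 0.19]  # a slightly different rendering of the same address
emb_c = [0.10, 0.90, 0.30]  # an unrelated address

def euclidean(u, v):
    # Straight-line distance between two points in the embedding space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Semantically similar texts map to nearby vectors, so their distance is smaller:
assert euclidean(emb_a, emb_b) < euclidean(emb_a, emb_c)
```

This is exactly what BigQuery Vector Search will do for us at scale in Step 5: compare address embeddings and surface the nearest (most similar) records.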

Run the following DDL to create a remote model for the Vertex AI text embeddings API:

CREATE OR REPLACE MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`
REMOTE WITH CONNECTION `us.gemini-bq-conn`
OPTIONS (ENDPOINT = 'textembedding-gecko@latest');

Now that the remote embeddings model is ready, let’s generate embeddings for the first source and store it in a table. You can store the embeddings result field in the same mdm_gemini.CITIBIKE_STATIONS table as before, but I am choosing to create a new one for clarity:

CREATE TABLE `mdm_gemini.CITIBIKE_STATIONS_SOURCE1` AS (
SELECT *
FROM ML.GENERATE_EMBEDDING(
MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`,
( select name, full_address_string as content from `mdm_gemini.CITIBIKE_STATIONS`
where full_address_string is not null )
)
);

Let’s generate embeddings for address data in table CITIBIKE_STATIONS_SOURCE2:

update `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` a set embeddings_src =
(
SELECT ml_generate_embedding_result
FROM ML.GENERATE_EMBEDDING(
MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`,
( select name, address as content from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` ))
where name = a.name) where name is not null;

This creates embeddings for the second source; note that we have stored them in the embeddings_src field of the same table, CITIBIKE_STATIONS_SOURCE2.

To verify that the embeddings were generated for source tables 1 and 2, run the following queries:

select name,address,embeddings_src from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2`;
select name,content,ml_generate_embedding_result from `mdm_gemini.CITIBIKE_STATIONS_SOURCE1`;

Let’s go ahead and perform vector search to identify duplicates.

Step 5: Vector Search for Flagging Duplicate Addresses

In this step, we will search the address embeddings ml_generate_embedding_result column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE1` table for the top 2 embeddings that match each row of data in the embeddings_src column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` table:

select query.name name1,base.name name2, 
/* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */
distance
from VECTOR_SEARCH(
TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1,
'ml_generate_embedding_result',
(SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2),
'embeddings_src',
top_k => 2
) where query.name <> base.name
order by distance desc;

Base table that we search: mdm_gemini.CITIBIKE_STATIONS_SOURCE1 on the field ‘ml_generate_embedding_result’

Query table whose rows we look up: mdm_gemini.CITIBIKE_STATIONS_SOURCE2 on the field ‘embeddings_src’

top_k: specifies the number of nearest neighbors to return. The default is 10. A negative value is treated as infinity, meaning that all values are counted as neighbors and returned.

distance_type: specifies the type of metric to use to compute the distance between two vectors. Supported distance types are EUCLIDEAN and COSINE. The default is EUCLIDEAN.

The result of the query is as follows:

Result Set

As you can see, we have listed the 2 nearest neighbors (in other words, the closest duplicates) from CITIBIKE_STATIONS_SOURCE1 for each of the 2 rows in CITIBIKE_STATIONS_SOURCE2. Since distance_type is unspecified, it defaults to EUCLIDEAN, and the distance is read as the distance between the address TEXT values (via their embeddings) across the two sources, the lowest being the most similar address texts.

Let’s set distance_type to COSINE:

select query.name name1,base.name name2, 
/* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */
distance
from VECTOR_SEARCH(
TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1,
'ml_generate_embedding_result',
(SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2),
'embeddings_src',
top_k => 2,distance_type => 'COSINE'
) where query.name <> base.name
order by distance desc;

Result:

Result set

Both queries (one for each distance type) are ordered by distance DESCENDING, which means we list the results in order of decreasing distance. But you will notice that the second query’s distance order is reversed. Guess why?

Yes!! You got it right! When we say COSINE, the number behaves like a similarity: the bigger the number, the closer the similarity, and hence the smaller the distance. With EUCLIDEAN, the bigger the number, the farther apart the values (in this case, the address texts) are.

Tips to understand the difference and applications of EUCLIDEAN and COSINE:

Euclidean Distance measures the straight-line distance between two points in a multi-dimensional space and Cosine Similarity measures the cosine of the angle between two vectors.

Scale Invariance: Cosine similarity is not affected by the magnitude (length) of vectors, only their direction. This is useful when comparing documents of different lengths or user preferences with varying intensities.

Curse of Dimensionality: Euclidean distance can become less informative in very high-dimensional spaces, while cosine similarity tends to hold up better.

When to Use Which:

  • Similar scale, absolute differences matter: Euclidean distance
  • Different scales, direction matters more: Cosine similarity

Example:
Imagine two users’ movie ratings:
User 1: [5, 4, 3] (action, comedy, drama)
User 2: [10, 8, 6] (same genres, but rated higher)

Euclidean distance would be large due to the difference in rating scale.
Cosine similarity would be high, indicating similar taste despite the rating difference.
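Working the movie-ratings example through numerically makes the contrast concrete (User 2's vector is exactly 2x User 1's, so the two point in the same direction):

```python
import math

user1 = [5, 4, 3]    # action, comedy, drama
user2 = [10, 8, 6]   # same tastes, rated on a "louder" scale (exactly 2x)

def euclidean(u, v):
    # Straight-line distance: sensitive to the magnitude of the ratings.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_similarity(u, v):
    # Angle between the vectors: sensitive to direction, not magnitude.
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(round(euclidean(user1, user2), 3))          # → 7.071  (looks "far apart")
print(round(cosine_similarity(user1, user2), 3))  # → 1.0    (identical direction)
```

The Euclidean distance is large purely because of the rating scale, while the cosine similarity is a perfect 1.0, capturing that the two users have the same taste.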

Please note that in our example we have used location text similarity; it does not mean we are measuring the actual geographic distance between the two locations. We are only finding their similarity based on the address text values.

I built a UI around the Vector Search data from BigQuery to visualize nearest neighbors based on the similarity of their address strings, NOT THE PHYSICAL DISTANCE, for a slightly larger dataset, in a simple Java Spring Boot application deployed on Cloud Run. To use the sample app, edit the SQL in the controller class to match your distance results table:

A demo of Bike Station Location UI that lists stations with similar addresses (not actual distance)

Conclusion

This project has demonstrated the power of using Gemini 1.0 Pro and Function Calling to transform a few MDM activities into simplified yet powerful, deterministic and reliable generative AI capabilities. Now that you know how, feel free to identify other ways of implementing the same use case or other MDM functionalities. Are there datasets you could validate, information gaps you could fill, or tasks that could be automated with structured calls embedded within your generative AI responses? Here is the link to the repo. For further reading, refer to the documentation for Vertex AI, BigQuery Remote Functions, Cloud Functions, Embeddings, and Vector Search.
