Data augmentation with BigQuery and Google Knowledge Graph

Olejniczak Lukasz
Google Cloud - Community
12 min read · Dec 21, 2022

It seems like everybody is talking about Large Language Models these days, and most people are impressed with how well they use words. However, it does not take much effort to see that the generated content only loosely adheres to the facts. Depending on the context of use, some will simply call it fiction that is pleasant to read, others will classify it as hallucination. Do you ever have the feeling that the number of the latter is growing faster than the number of sources providing facts on which we can realistically build our knowledge and base our decisions? This situation reminds me of an old joke shared by recruiters:

One of the trusted sources of facts that is worth mentioning is Google Knowledge Graph. It contains billions of entries that describe real-world entities like people, places and things. These entities and relations between them form the nodes and edges of a huge knowledge graph.

Source: A Layman’s Visual Guide To Google’s Knowledge Graph Search API

Google Knowledge Graph contains master versions of entities created by reconciling and merging data from multiple sources. Every such master entity has a globally unique identifier called a machine ID, or MID:

Source: https://cloud.google.com/enterprise-knowledge-graph/docs/mid

To search or look up this graph you can use the Google Knowledge Graph Search API, which is available on Google Cloud.

You can choose between two editions: Basic and Advanced. The Basic edition is optimized for simplicity and is a good fit for community applications that are not high-traffic use cases, while the Advanced edition is optimized for enterprise and production use cases (for instance, it has an increased QPS quota).

In this article we will show how to use Google Knowledge Graph Search API to augment (enrich) data you have in BigQuery.

Context:

Let’s assume that we are a global digital marketplace that sells game keys with instant delivery 24/7 — similarly to what our friends from Kinguin.net do.

We want to please our users and provide the most accurate game descriptions.

Let’s assume that we would like to augment our internal database with details collected from https://www.igdb.com/ (IGDB), a publicly available database operated by https://www.twitch.tv.

One of the challenges is that there is no standardization when it comes to game titles, and different sources use slightly different titles to represent the same game. As a result, joining such datasets is not trivial. In fact, this problem is not limited to games; it occurs across all industries.

So how can we address it? One option is to look up Google's Enterprise Knowledge Graph and map game titles from both our internal and IGDB databases to a globally unique MID (Knowledge Graph machine ID). This MID can then be used to join the two datasets. We can also use this solution to fetch additional attributes that are available only in the Knowledge Graph. In other words, we will use Google's Knowledge Graph as master data to standardize and augment our datasets.

High level plan:

Let's baseline what we have:

  • Internal database of games as BigQuery table
  • IGDB database of games loaded into BigQuery table

The first step is to extend both tables with an additional column corresponding to the MID from Google Knowledge Graph:

To communicate with Google Knowledge Graph we will use the Knowledge Graph Search API. Given that we already have our data in BigQuery, we can call this API using a BigQuery remote function. A remote function can be used in SQL queries like any other user-defined function, with the only difference that behind the scenes it calls a Cloud Function where we code all the logic needed to communicate with the Knowledge Graph Search API, look up the graph and return results (at least the MID) to BigQuery:

You may want to contact GCP technical support to increase the limits on the number of queries to Knowledge Graph Search API (by default: 60 queries per minute).

We can apply this mechanism to both tables: the one representing our internal database and the one representing IGDB. When looking up the knowledge graph, the Search API will use any combination of fuzzy text matching, common relationships and entity types to calculate semantic similarity, and as a result the same game with slightly different titles will be mapped to the same MID.

It is worth mentioning that we could also consider another functionality available on Google Cloud to find semantic similarity between records, namely the Entity Reconciliation API.

With Entity Reconciliation we can define Entity Reconciliation jobs that read data directly from BigQuery tables. Given a configuration that maps input table columns to attributes of one of the supported entity types, these jobs handle the following:

  • Knowledge extraction turns the input relational data into a knowledge graph representation (RDF triples).
  • The Google Entity Reconciliation engine then builds a graph to cluster entities into groups (entities in the same group, i.e. with the same cluster ID, are considered matched).
  • The last step outputs the linking result (matched and not matched) to your BigQuery tables in the form of a new unique identifier column (MID).
Source: https://cloud.google.com/enterprise-knowledge-graph/docs/overview#how_does_it_work

Entity Reconciliation would normally be our first choice, but for the moment it only supports the following entity types: Person, Organization and LocalBusiness, while we need VideoGame. Fortunately, additional entity types are expected soon.

When it comes to entity types supported by the Google Knowledge Graph Search API, the list is much longer and already includes VideoGame. Here are other types you will also find:

Let's get our hands dirty and work with the Google Knowledge Graph Search API!

Setup Outline:

  1. Start by enabling the appropriate APIs:
  • Enterprise Knowledge Graph API
  • Cloud Functions API
  • Cloud Build API
  • Logging and Monitoring APIs
  • BigQuery API
  • BigQuery Connection API (allows users to use BigQuery to connect to external data sources and APIs)

2. Create service account for Cloud Function

We need to write and deploy a Cloud Function that will be triggered from a BigQuery job when we run SQL that uses our remote function; it will then call the Knowledge Graph Search API.

Our Cloud Function will be represented by a service account, which must have permissions to call the Knowledge Graph Search API.

In order to create such a service account, navigate to the IAM & Admin > Service Accounts page, click on the “+ Create Service Account” button and provide a name and an ID.

When the service account is created, we can assign it the predefined role named Enterprise Knowledge Graph Viewer.

This role aggregates all the permissions needed to use Knowledge Graph Search API, specifically:

enterpriseknowledgegraph.cloudknowledgegraphentities.search
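If you prefer the command line, a minimal gcloud sketch (the service account name is a placeholder, and the role id roles/enterpriseknowledgegraph.viewer is assumed to correspond to the Enterprise Knowledge Graph Viewer role):

```bash
# Create the service account for the Cloud Function (name is a placeholder)
gcloud iam service-accounts create ekg-lookup-fn \
  --display-name="EKG lookup Cloud Function"

# Grant it the Enterprise Knowledge Graph Viewer role
# (role id assumed to be roles/enterpriseknowledgegraph.viewer)
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:ekg-lookup-fn@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/enterpriseknowledgegraph.viewer"
```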

3. Create Cloud Function

Go to the Cloud Functions dashboard in the Google Cloud console and click the Create button. Give your function a name that satisfies your naming conventions and keep the remaining settings at their defaults:

  • require authentication,
  • memory and timeout values
  • all the connection and security parameters.

For improved performance, deploy your Cloud Function to the same region as the region of your BigQuery datasets.

The only important change is to modify the runtime service account by replacing the default one with the one we created in the previous step.

Click “Next” to finalize this step and start coding communication with Google Knowledge Graph Search API.

To start, switch the runtime environment to Python 3.7, and Google will generate a skeleton for an entry point that is able to handle HTTP requests:
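The generated skeleton typically looks roughly like this:

```python
def hello_world(request):
    """HTTP Cloud Function entry point.

    Args:
        request (flask.Request): the incoming HTTP request object.
    Returns:
        The response text, or any value that can be turned into a
        flask Response object.
    """
    request_json = request.get_json(silent=True)
    request_args = request.args

    if request_args and 'message' in request_args:
        return request_args['message']
    elif request_json and 'message' in request_json:
        return request_json['message']
    else:
        return 'Hello World!'
```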

Let's make some changes. First, rename the entry point from hello_world to ckg_lookup. Then open the requirements.txt file and add these two entries:
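A plausible pair, consistent with the code sketches below, would be google-auth (to obtain the access token) and requests (to call the API):

```
google-auth
requests
```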

Finally, modify main.py and use the following code, which authenticates the service account assigned to our Cloud Function and uses its credentials to obtain an access token that will then be used to authorize requests to the Google Knowledge Graph Search API:
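A minimal sketch of this first version, which only obtains an access token and returns an empty result:

```python
# main.py - first iteration: authenticate and obtain an access token only.
import json

import google.auth
import google.auth.transport.requests


def ckg_lookup(request):
    # Load the default credentials, i.e. the runtime service account
    # attached to this Cloud Function.
    credentials, project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )

    # Refresh the credentials to obtain an OAuth2 access token that will
    # later authorize requests to the Knowledge Graph Search API.
    auth_request = google.auth.transport.requests.Request()
    credentials.refresh(auth_request)
    access_token = credentials.token

    # For now we just return an empty replies array to BigQuery.
    return json.dumps({"replies": []})
```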

At this moment we have a basic Cloud Function that expects HTTP requests.

Our Cloud Function will be triggered from BigQuery when we run a SQL query referencing the remote function. Therefore it is a good moment to jump to the BigQuery console.

4. Import game titles into BigQuery table

In this article we want to augment, or enrich, our gaming data with at least the MID attribute from Google Knowledge Graph. That means we need a BigQuery table that will serve as our gaming dictionary.

Just for this article, let's create a BigQuery table with just one column named title. Tables in BigQuery always belong to a dataset, so if you do not have a BigQuery dataset in your project, create one. Ours is called games_info.

To begin, we should navigate to the BigQuery interface and import our data.

Download the following csv file to instantiate your gaming dictionary:

And then create a new BigQuery table from that CSV file:
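You can do this in the console UI, or with a bq load sketch along these lines (file and table names are placeholders; assumes the CSV has a header row):

```bash
# Load the CSV into a single-column table in the games_info dataset
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  games_info.games \
  ./games.csv \
  title:STRING
```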

Now, when you run a simple SELECT on this table you should see … just a list of imported titles:
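For example (table name assumed from the previous step):

```sql
SELECT title
FROM games_info.games
LIMIT 10;
```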

5. Create BigQuery remote function

A BigQuery remote function enables you to call a Cloud Function from your SQL commands. Every BigQuery remote function needs three elements:

  • A BigQuery remote connection object, which you can think of as a bridge that connects BigQuery with remote resources like Cloud Functions or Cloud Run
  • A service account with the permissions necessary to represent our remote function when invoking the Cloud Function
  • A function definition that specifies the function inputs and outputs and links the remote connection object with the Cloud Function or Cloud Run endpoint/URL.

5.1 Create BigQuery remote connection object

In order to create the BigQuery remote connection object we need to switch to Cloud Shell. Cloud Shell is equipped with all GCP utilities, including bq, which is needed to run the following command:
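A sketch of that command (location and project id are placeholders):

```bash
bq mk --connection \
  --location=US \
  --project_id=YOUR_PROJECT_ID \
  --connection_type=CLOUD_RESOURCE \
  games_ekg_lookup
```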

Now, when we go back to the BigQuery workspace and expand the External connections tree, we should see our newly created remote connection object: games_ekg_lookup. Double-click it to list its properties:

From our point of view, the most important property is the Service account id.

It means that when we asked BigQuery to create a new remote connection object, it also created a new service account for this connection. Now we need to make sure it has the necessary permissions to invoke our Cloud Function.

5.2 Set up permissions to invoke Cloud Function

The service account created for the BigQuery remote connection object needs the cloudfunctions.functions.invoke permission. We can satisfy this requirement by assigning it the Cloud Functions Invoker role:
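For example, with gcloud (function name, region and the connection's service account e-mail are placeholders):

```bash
gcloud functions add-iam-policy-binding YOUR_FUNCTION_NAME \
  --region=YOUR_REGION \
  --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_EMAIL" \
  --role="roles/cloudfunctions.invoker"
```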

5.3 Create BigQuery remote function

To create a remote function, execute the following CREATE FUNCTION statement. Not everybody can create remote functions: you need the bigquery.routines.create permission on the dataset where you create the remote function, and the bigquery.connections.delegate permission (available from the BigQuery Connection Admin role) on the connection used by the remote function.
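A sketch of such a statement (project, dataset, connection and endpoint are placeholders; the user_defined_context key name is an assumption):

```sql
CREATE OR REPLACE FUNCTION `games_info.games_ekg_lookup`(title STRING)
RETURNS JSON
REMOTE WITH CONNECTION `YOUR_PROJECT_ID.US.games_ekg_lookup`
OPTIONS (
  endpoint = 'https://REGION-YOUR_PROJECT_ID.cloudfunctions.net/YOUR_FUNCTION_NAME',
  user_defined_context = [("types", "VideoGame")]
);
```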

Remote function definition specifies:

  • input and output attributes. Note that our function returns JSON, which means we will take advantage of the fact that BigQuery has native support for this data type
  • the remote connection object to be used to invoke the Cloud Function
  • the Cloud Function endpoint/URL
  • so-called user-defined context, which is a way to provide a list of key:value pairs that can be used when coding the Cloud Function logic. We will use this property to let the Google Knowledge Graph Search API know that we want to focus only on entities of type VideoGame

To get the URL of your Cloud Function, go to the Cloud Functions dashboard in the Google Cloud console, click the name of your function and go to the TRIGGER tab. The value of the trigger URL property is the one you want to use in the BigQuery remote function definition.

6. Update Cloud Function logic

If everything works as expected, running a SQL query with our remote function games_ekg_lookup should invoke the Cloud Function.

Let's checkpoint this. Go to the BigQuery workspace and run something like this:
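For instance (table name assumed):

```sql
SELECT
  title,
  games_info.games_ekg_lookup(title) AS ekg_result
FROM games_info.games
LIMIT 5;
```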

Now go to the Cloud Functions dashboard in the Google Cloud console, click the name of your function and go to the LOGS tab. You should see a new entry:

OK, this proves that all parts are integrated. In step 3 we created the Cloud Function but have not yet coded the integration with the Google Knowledge Graph Search API. The corresponding logic will consist of three steps:

  • Read request sent by BigQuery remote function
  • Communicate with Google Knowledge Graph Search API to look up VideoGame entities using game title
  • Return result to BigQuery

To properly code these steps we need to better understand what we receive from BigQuery, what request body the Google Knowledge Graph Search API expects, and what message body BigQuery expects in return.

Let's start by diving deeper into what we receive from BigQuery. The question is where we can catch it. One option is to use the same process we used to checkpoint our integration: send a SQL query with the remote function and see what is logged in Cloud Logging.

If you do so, you will see a request like the following:
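Roughly like this (the titles, ids and the userDefinedContext key name are illustrative):

```json
{
  "requestId": "124ab1c",
  "caller": "//bigquery.googleapis.com/projects/YOUR_PROJECT_ID/jobs/JOB_ID",
  "sessionUser": "user@example.com",
  "userDefinedContext": {
    "types": "VideoGame"
  },
  "calls": [
    ["The Witcher 3: Wild Hunt"],
    ["Cyberpunk 2077"]
  ]
}
```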

The elements of the array named "calls" are the inputs to our remote function (game titles) and correspond to rows from the title column of our BigQuery table. Inputs are sent in batches: a single request will in general include multiple rows, i.e. multiple elements in the calls array.

The userDefinedContext attribute is set to VideoGame, which is no surprise because this is the value we set when defining the remote function in BigQuery (step 5.3).

When it comes to the message format BigQuery expects in return, here is an example:
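A sketch, with illustrative values and field names matching the subset we extract below:

```json
{
  "replies": [
    {"name": "The Witcher 3: Wild Hunt", "mid": "/m/...", "description": "...", "image": "https://..."},
    {"name": "Cyberpunk 2077", "mid": "/g/...", "description": "...", "image": "https://..."}
  ]
}
```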

It expects an array named replies which consists of JSON objects, one per element of the calls array.

Now, given we know what to expect as input and what we should return to BigQuery, we need to modify our Cloud Function with code that consumes the input request from BigQuery, prepares the request to the Google Knowledge Graph Search API, and iterates through the collection of values from the input request, using each value as a lookup key in requests to the Google Knowledge Graph Search API.

The final part of the code builds a JSON response with the collection of objects fetched from the Knowledge Graph. For now this collection is empty, because we would first like to add a few words on what is returned by the Knowledge Graph Search API.

When calling the EKG API, we receive a lot of information back: multiple identification numbers, short and long descriptions, and associated images. The full schema is explained here.

In our example, we want to extract just a subset of the available properties, namely: the game description, the logo, the Google KG MID (machine ID) for future reference, and the original name for validation purposes:

The entire main.py file should look like this.
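A minimal sketch of the whole file; the Knowledge Graph Search endpoint path, query parameters and response field names are assumptions, so verify them against the Enterprise Knowledge Graph API reference:

```python
# main.py - final version (sketch). Endpoint path, query parameters and
# response field names are assumptions; check the Enterprise Knowledge Graph
# documentation for the exact values.
import json

import google.auth
import google.auth.transport.requests
import requests


def ckg_lookup(request):
    request_json = request.get_json()

    # 1. Read the request sent by the BigQuery remote function.
    calls = request_json["calls"]
    user_context = request_json.get("userDefinedContext", {})
    entity_type = user_context.get("types", "VideoGame")  # key name assumed

    # 2. Authenticate as the runtime service account and obtain a token.
    credentials, project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    credentials.refresh(google.auth.transport.requests.Request())
    headers = {"Authorization": f"Bearer {credentials.token}"}

    # Assumed endpoint of the Knowledge Graph Search API.
    url = (
        "https://enterpriseknowledgegraph.googleapis.com/v1/"
        f"projects/{project_id}/locations/global/cloudKnowledgeGraphEntities:Search"
    )

    # 3. Look up each title and keep only the properties we care about.
    replies = []
    for call in calls:
        title = call[0]  # single input column: the game title
        params = {"query": title, "types": entity_type, "limit": 1}
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()
        results = response.json().get("itemListElement", [])

        if results:
            result = results[0].get("result", {})
            reply = {
                "name": result.get("name"),
                "mid": result.get("@id"),  # field holding the MID (assumed)
                "description": result.get("description"),
                "image": result.get("image", {}).get("contentUrl"),
            }
        else:
            reply = None
        replies.append(reply)

    # BigQuery expects {"replies": [...]} with one element per element of
    # "calls", in the same order. Depending on how the return type is
    # declared, each element may need to be a JSON-encoded string instead
    # (i.e. json.dumps(reply)).
    return json.dumps({"replies": replies})
```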

Our Cloud Function is ready! Press the Deploy button and let Google handle the rest for us.

7. Query EKG from BigQuery

We are now ready to go back to the BigQuery workspace and use our BigQuery remote function to look up Google Knowledge Graph and augment game titles with the Knowledge Graph MID, the corresponding game description and the URL of its image. Run the following query:
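A query along these lines (the table name and JSON field names follow the sketches above):

```sql
SELECT
  title,
  JSON_VALUE(ekg, '$.mid') AS mid,
  JSON_VALUE(ekg, '$.description') AS description,
  JSON_VALUE(ekg, '$.image') AS image_url
FROM (
  SELECT
    title,
    games_info.games_ekg_lookup(title) AS ekg
  FROM games_info.games
);
```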

And here come the results!

This article was authored by Lukasz Olejniczak, Customer Engineer at Google Cloud. The views expressed are those of the author and do not necessarily reflect those of Google.

Please clap for this article if you enjoyed reading it. For more about Google Cloud, data science, data engineering, and AI/ML, follow me on LinkedIn.
