Extracting Themes from Content with Gemini Models in BigQuery

In today’s data-driven world, understanding the underlying themes in textual data is crucial for gaining insights and making informed decisions. BigQuery’s integration with advanced language models, such as Google’s Gemini models, provides a powerful toolset for extracting and analyzing themes from large datasets. This guide will walk you through setting up the environment, creating connections, registering a model, and performing theme extraction on repository descriptions from GitHub data.

Setting Up the Environment:

For more details on setting up the environment, refer to the official Google Cloud Blog and my previous article.

Enable BigQuery Connection API
Ensure that the BigQuery Connection API is enabled in your Google Cloud project. You can enable it from the API Library in the console or with gcloud services enable bigqueryconnection.googleapis.com.

Create an External Connection
Use the following bash command to create a connection:

bq mk --connection --location=us --project_id=[PROJECT_ID_MASKED] \
--connection_type=CLOUD_RESOURCE vertex_genai_conn

Retrieve the Service Account ID
After creating the connection, use this command to retrieve the service account ID:

bq show --connection [PROJECT_ID_MASKED].us.vertex_genai_conn

Note the returned service account ID:
bqcx-xxxxxx-qhiz@gcp-sa-bigquery-condel.iam.gserviceaccount.com

Grant Permissions
Grant the “Vertex AI User” role to the connection’s service account using the following command:

gcloud projects add-iam-policy-binding [PROJECT_ID_MASKED] \
--member='serviceAccount:bqcx-xxxxxx-qhiz@gcp-sa-bigquery-condel.iam.gserviceaccount.com' \
--role='roles/aiplatform.user'

Register the gemini-pro Model as a Remote Model in BigQuery

To leverage the advanced capabilities of Vertex AI’s gemini-pro model for language processing, you can register the model as a remote model in BigQuery. This integration allows you to perform sophisticated text analysis directly within the BigQuery environment. Follow the steps below to register the model:

CREATE OR REPLACE MODEL `[PROJECT_ID_MASKED].bqgenai.llm_model`
REMOTE WITH CONNECTION `[PROJECT_ID_MASKED].us.vertex_genai_conn`
OPTIONS (ENDPOINT = 'gemini-pro');

Reference:
Gemini 1.0 Pro Documentation
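
Once the model is registered, it is worth running a quick sanity check before processing the full dataset. The snippet below is a minimal sketch (not part of the original walkthrough) that assumes the model name used above; if it returns a generated result, the remote model, the connection, and the IAM binding are all working:

-- Minimal sanity check: generate text for a single hard-coded prompt.
-- A non-empty ml_generate_text_llm_result confirms the remote model,
-- the connection, and the IAM binding are set up correctly.
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `[PROJECT_ID_MASKED].bqgenai.llm_model`,
  (SELECT 'Extract the main themes from this text: a Python toolkit for training and serving large language models.' AS prompt),
  STRUCT(256 AS max_output_tokens, 0 AS temperature, TRUE AS flatten_json_output));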

Introduction to the Dataset:

The dataset used in this example is sourced from the bigquery-public-data.samples.github_timeline table, which contains over 60,000 records of GitHub repository events. This dataset provides a timeline of actions such as pull requests and comments on GitHub repositories, organized in a flat schema. The dataset was created in May 2012 and can be accessed via the Datasets Link.

Each record includes a repository_description field that provides a brief summary of the repository. To demonstrate the capabilities of the Gemini model without excessive computational load, we will randomly sample 1000 records from this dataset for our theme extraction process.
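
If you want to see what the input text looks like before sampling, a quick preview of the repository_description column (the only field used in this walkthrough) is enough. This is an optional query, not part of the original pipeline:

-- Preview a few non-empty repository descriptions from the public table.
SELECT repository_description
FROM `bigquery-public-data.samples.github_timeline`
WHERE repository_description IS NOT NULL
  AND repository_description != ''
LIMIT 5;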

Perform Theme Extraction
Use the following BigQuery SQL statements to randomly select 1000 rows and perform theme extraction:

-- Create a new table to store randomly selected 1000 rows
CREATE OR REPLACE TABLE `[PROJECT_ID_MASKED].bqgenai.sampled_github_timeline` AS
SELECT *
FROM `bigquery-public-data.samples.github_timeline`
WHERE repository_description IS NOT NULL
ORDER BY RAND()
LIMIT 1000;

-- Perform theme extraction using ML.GENERATE_TEXT function
CREATE OR REPLACE TABLE `[PROJECT_ID_MASKED].bqgenai.github_timeline_themes` AS
WITH PROMPT AS (
  SELECT
    CONCAT("""
Extract the main themes from the following repository description and return in JSON format:
Text:
""", repository_description) AS prompt,
    repository_description
  FROM `[PROJECT_ID_MASKED].bqgenai.sampled_github_timeline`
  LIMIT 1000
),
EXTRACT_THEMES AS (
  SELECT *
  FROM ML.GENERATE_TEXT(
    MODEL `[PROJECT_ID_MASKED].bqgenai.llm_model`,
    (SELECT * FROM PROMPT),
    STRUCT(
      1024 AS max_output_tokens,
      0 AS temperature,
      1 AS top_k,
      0 AS top_p,
      TRUE AS flatten_json_output))
)
SELECT ml_generate_text_llm_result, repository_description, prompt, ml_generate_text_status
FROM EXTRACT_THEMES;

It takes about 16.5 minutes to execute the SQL for the 1,000 sampled rows.

View Results

To inspect the table of extracted themes, run the following query, which selects rows where ml_generate_text_llm_result contains "large language model":

SELECT *
FROM `[PROJECT_ID_MASKED].bqgenai.github_timeline_themes`
WHERE ml_generate_text_llm_result LIKE '%large language model%'

The query retrieves rows where the theme “large language model” is identified, showcasing the model’s ability to accurately annotate key themes in the text. Below is a sample JSON result from the table, illustrating the identified themes:

{
  "main_themes": [
    "large language models",
    "natural language processing",
    "machine learning",
    "artificial intelligence",
    "deep learning",
    "text generation",
    "language understanding",
    "computer vision",
    "speech recognition",
    "natural language translation"
  ]
}

The extracted themes from the repository_description field highlight various aspects of advanced technologies such as large language models, natural language processing, machine learning, artificial intelligence, deep learning, text generation, language understanding, computer vision, speech recognition, and natural language translation. These themes demonstrate the diverse capabilities of the gemini-pro model in identifying and categorizing important topics within textual data. This table provides a comprehensive view of the key areas the model has highlighted, which can be used for further analysis and decision-making.
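
Because the model returns its answer as a JSON string, the analysis can be pushed one step further directly in SQL. The following query is a sketch of that idea rather than part of the original workflow; it assumes the output follows the main_themes structure shown above and that ml_generate_text_status is empty for successful rows, and it counts how often each theme appears across the sample:

-- Unnest the main_themes arrays and count theme frequency across the sample.
SELECT
  JSON_VALUE(theme_json) AS theme,
  COUNT(*) AS occurrences
FROM (
  -- Strip any markdown code fences the model may wrap around its JSON output.
  SELECT REGEXP_REPLACE(ml_generate_text_llm_result, r'```(json)?', '') AS json_result
  FROM `[PROJECT_ID_MASKED].bqgenai.github_timeline_themes`
  WHERE ml_generate_text_status = ''  -- keep only rows the API processed successfully
),
UNNEST(JSON_EXTRACT_ARRAY(json_result, '$.main_themes')) AS theme_json
GROUP BY theme
ORDER BY occurrences DESC
LIMIT 20;

Rows whose output cannot be parsed as a JSON array are simply dropped by the UNNEST, so the counts reflect only well-formed responses.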

Summary

This guide demonstrates how to effectively leverage BigQuery and Vertex AI’s gemini-pro model for theme extraction from large textual datasets. By setting up the environment, creating the necessary connections, and registering the gemini-pro model as a remote model, you can perform sophisticated text analysis directly within the BigQuery environment. Using a sample of 1000 records from a GitHub dataset, we showcased how to extract key themes from repository descriptions, highlighting the model’s ability to identify and categorize important topics. This process not only simplifies the workflow but also provides valuable insights that can drive informed decision-making. The integration of advanced language models within BigQuery underscores the potential of in-platform machine learning solutions in modern data analytics.
