Optimizing Marketing Analytics within BigQuery: The Power of In-Platform LLM Integration

Willy Zhuang
5 min readJan 31, 2024

--

Unlocking the full potential of marketing analytics through BigQuery’s integration with Large Language Models like text-bison, transforms raw data into strategic insights, revolutionizing how we understand and engage with our audience.

Introduction: Enhancing Data Analysis through LLM Integration in BigQuery

The integration of Large Language Models (LLMs), particularly text-bison, within the BigQuery environment, heralds a new era in marketing analytics. This approach emphasizes the convenience and efficiency of conducting complex data analyses without leaving the BigQuery platform. Utilizing BigQuery’s ML.GENERATE_TEXT function, analysts can now process vast datasets, such as ‘bigquery-public-data.imdb.reviews’, within the same environment, streamlining the workflow. This in-platform integration not only simplifies text analysis tasks but also accelerates the extraction of meaningful insights, making it an indispensable tool for data-driven marketing strategies.

The Underlying Model of the ML.GENERATE_TEXT Function

The text-bison model, a variant of the PaLM 2 LLM, serves as the backbone for the ML.GENERATE_TEXT function in BigQuery. It is fine-tuned for natural language understanding and is capable of performing a variety of language tasks such as classification, summarization, and extraction. The model boasts a maximum input of 8192 tokens and can generate up to 1024 output tokens. It is trained on data up to February 2023 and supports supervised learning and Reinforcement Learning from Human Feedback (RLHF), but not distillation. This makes it exceptionally suited for marketing analytics within BigQuery, enabling users to extract complex insights from large textual datasets. The ML.GENERATE_TEXT function leverages this model to analyze data, such as user behavior patterns, without the need for external processing tools. This integrated approach streamlines workflow and enhances the analytical capabilities within the BigQuery environment.

Implementing LLMs in BigQuery: A Step-by-Step Guide

To utilize the Large Language Models (LLMs) within BigQuery, one must first establish the necessary setup. This involves creating a remote model based on the text-bison LLM and using the ML.GENERATE_TEXT function for text generation tasks. The process is as follows:

Setting Up the Environment:

  1. Creating an External Connection:
  • Ensure the BigQuery Connection API is enabled in your Google Cloud project.
  • Create a connection using the following bash command. Remember to replace my_project with your own project ID:
bq mk --connection --location=us --project_id=my_project \
--connection_type=CLOUD_RESOURCE vertex_genai_conn

2. Retrieving the Service Account ID:

fter creating the connection, use this command to retrieve the service account ID. Again, replace my_project with your project ID:

bq show --connection my_project.us.vertex_genai_conn

3. Granting Permissions:

Grant the “Vertex AI User” role to the connection’s service account for using Vertex AI resources. Replace my_project with your project ID and use the service account ID obtained from the previous step:

gcloud projects add-iam-policy-binding my_project \
--member='serviceAccount:<YOUR_SERVICE_ACCOUNT_ID>' \
--role='roles/aiplatform.user'
  • Note: Replace <YOUR_SERVICE_ACCOUNT_ID> with the actual service account ID you retrieved.

4. Registering the PaLM LLM Model as a Remote Model:

Register the PaLM LLM model in BigQuery with this SQL command. Make sure to use your specific Google Cloud project ID and adjust the connection path as per your setup:

CREATE MODEL `my_project.temp.llm_model`
REMOTE WITH CONNECTION `my_project.us.vertex_genai_conn`
OPTIONS (remote_service_type = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1');
  • Note: Replace my_project with your own project ID.

Original Example Context:

The original tutorial demonstrates keyword extraction from the bigquery-public-data.imdb.reviews table, providing a clear example of how to apply the model to analyze text data.

Our Application Scenario:

Our focus shifts to analyzing user behavior using GA4 data in BigQuery, specifically the bigquery-public-data.ga4_obfuscated_sample_ecommerce public dataset. Our code performs the following steps:

  1. Data Aggregation: We compile user data including page views, session durations, and source mediums.
  2. User Summary Creation: We generate a summary for each user, combining their activity data into a readable format.
  3. Tag Generation: Using the ML.GENERATE_TEXT function, we process these summaries through the text-bison model to generate concise, informative tags about user behavior.

Here’s the SQL code snippet for this process:

WITH page_views AS (
SELECT
user_pseudo_id,
STRING_AGG(DISTINCT param.value.string_value, ', ') AS page_titles
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`,
UNNEST(event_params) AS param
WHERE
event_name = 'page_view' AND param.key = 'page_title'
GROUP BY
user_pseudo_id
),
session_durations AS (
SELECT
user_pseudo_id,
(MAX(event_timestamp) - MIN(event_timestamp)) / 1000000 as session_duration
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
GROUP BY
user_pseudo_id
),
source_medium AS (
SELECT
user_pseudo_id,
MAX(CASE WHEN traffic_source.source IS NOT NULL THEN CONCAT(traffic_source.source, ' / ', traffic_source.medium) ELSE '(not set)' END) as source_medium
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
GROUP BY
user_pseudo_id
),
combined_data AS (
SELECT
p.user_pseudo_id,
CONCAT(
'This user from ', s.source_medium,
', had visited ', p.page_titles,
', stayed ', CAST(d.session_duration AS STRING), ' seconds.'
) AS summary
FROM
page_views p
JOIN
session_durations d ON p.user_pseudo_id = d.user_pseudo_id
JOIN
source_medium s ON p.user_pseudo_id = s.user_pseudo_id
)
SELECT
c.user_pseudo_id,
c.summary,
ml_generate_text_result['predictions'][0]['content'] AS tags
FROM
combined_data c,
ML.GENERATE_TEXT(
MODEL `tcloud-data-analysis.bqml_tutorial.llm_model`,
(SELECT CONCAT('Based on summarizing this text, give me a tag about the user behavior. The tag should in 3 words', summary) AS prompt FROM combined_data),
STRUCT(0.2 AS temperature, 100 AS max_output_tokens)
)
WHERE
LENGTH(c.summary) > 0
LIMIT 20;
Bigquery running result

Upon running the SQL query, the output provides insightful results in three columns: user_pseudo_id, summary, and tags. Here's a breakdown of what each column represents:

  1. user_pseudo_id: A unique identifier for each user. This helps in distinguishing between different user sessions and behaviors.
  2. summary: A concise description of the user’s activity, including the source of the traffic, the pages visited, and the duration of the visit. For example, “This user from (direct) / (none), had visited The Google Merchandise Store — Log In, The Google Merchandise Store — My Account, stayed 28.042822 seconds.”
  3. tags: The tags generated by the LLM, which provide quick, valuable insights into the user’s behavior or the nature of the visit. These tags, like “Referral traffic, short visit,” “Organic shopper,” or “Long browsing session,” offer a snapshot of the user’s engagement and intentions.

Conclusion

The output from the text-bison LLM in BigQuery reveals the model’s capability to distill complex user behavior data into succinct, meaningful tags. These tags not only reflect the nature of each visit but also categorize users based on their interaction patterns, providing a rich dataset for targeted marketing strategies and customer behavior analysis. This technology enables marketers to move beyond traditional analytics and embrace a more nuanced, AI-driven approach to understand their audience. Ultimately, the application of LLMs in BigQuery for user behavior analysis represents a significant step forward in harnessing the power of big data for strategic business insights.

--

--