BigQuery as a Vector Database: Leveraging Retrieval-Augmented Generation (RAG)

Introduction

In recent years, the ability to efficiently retrieve and generate relevant information has become increasingly important in various applications. Retrieval-Augmented Generation (RAG) is a powerful technique that combines the strengths of information retrieval and natural language generation to provide more accurate and contextually relevant responses. In this article, we explore how to implement RAG using BigQuery, a widely-used Google Cloud Platform (GCP) database product, as a vector database.

BigQuery offers robust support for handling large datasets and performing advanced analytics, making it an excellent choice for implementing RAG. By leveraging BigQuery’s capability to generate text embeddings and perform vector similarity searches, we can enhance the accuracy and relevance of our information retrieval. In this implementation, we will use documentation from the BigQuery introduction pages as our dataset, demonstrating how BigQuery can be utilized as a vector database to support RAG.

Architecture Overview

In this implementation, our project architecture is divided into three main parts, each playing a crucial role in achieving effective Retrieval-Augmented Generation (RAG) using BigQuery as a vector database.

1. Data Collection

We use Python to scrape GCP documents. This step gathers all the necessary data focused on BigQuery-related content.

2. Operations in BigQuery

  • Embedding Generation: Call the Vertex AI PaLM Text Embedding model to generate embeddings for the collected data.
  • Vector Database Storage: Store these embeddings in BigQuery as a vector database, supporting efficient vector similarity searches.
  • Vector Queries via SQL: Perform vector queries using SQL to find documents that match the query vectors.

3. Using Gemini for RAG

  • Retrieval-Augmented Generation: Use Gemini as the LLM to process retrieved documents and generate responses based on the context provided by the vector search results.

By structuring our implementation in this way, we leverage BigQuery’s capabilities as a vector database, Vertex AI’s embedding generation, and Gemini’s language generation to create an effective RAG system.
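
Before building anything on GCP, the flow can be sketched end to end with stand-in pieces. Every function below is a toy of our own: word-overlap scoring stands in for Vertex AI embeddings and vector search, and a template stands in for Gemini.

```python
def collect_documents():
    # Stage 1: in the article this is a BeautifulSoup scraper; here, a fixed corpus.
    return [
        {'title': 'Migration overview', 'content': 'The BigQuery Migration Service moves data warehouses.'},
        {'title': 'Pricing', 'content': 'BigQuery pricing covers storage and queries.'},
    ]

def embed(text):
    # Stage 2 stand-in: a bag of lowercase words instead of a real embedding vector.
    return set(text.lower().split())

def retrieve(question, docs, k=1):
    # Stage 2 stand-in for a vector search: rank documents by word overlap with the question.
    q = embed(question)
    return sorted(docs, key=lambda d: len(q & embed(d['content'])), reverse=True)[:k]

def generate(question, context):
    # Stage 3 stand-in for Gemini: echo the grounded context back with the question.
    return f"Q: {question}\nA (from context): {context}"

docs = collect_documents()
hits = retrieve('tell me about migration of a data warehouse', docs)
print(generate('tell me about migration', hits[0]['content']))
```

The real implementation below replaces each stand-in with its production counterpart while keeping this same three-stage shape.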

Preparing Your Environment for RAG Using BigQuery and Vertex AI

To begin with, we need to set up a Vertex AI connection to enable machine learning (ML) calls using BigQuery SQL. Follow these steps to configure your environment:

  1. Create a BigQuery Dataset:
  • Open the Google Cloud Console, click on the Navigation Menu (Navigation menu icon), and scroll to the Analytics section. Select BigQuery.
  • In the Explorer pane of the BigQuery Console, click on the three dots next to your Project ID (e.g., qwiklabs-gcp-03-65d426d3b5ec) and select "Create dataset."
  • In the “Create dataset” dialog, set the Dataset ID to bqdocument, ensure the multi-region US location is selected, and leave the other options at their default values. Click "Create dataset."
  2. Create an External Connection for PaLM Text Embeddings:
  • At the top of the Explorer pane (where your list of available BigQuery datasets is displayed), click the + Add button and select the "Connections to External data sources" button.
  • If prompted, enable the BigQuery API and click on “Connections to External data sources.”
  • In the “Connection type” dropdown, select “Vertex AI remote models, remote functions and BigLake (Cloud Resource).”
  • Set the Connection ID to vertex-ai-connection-id and ensure the location is set to Multi-region US. Click "Create Connection."

Grant Permissions to the Service Account:

  • In the Explorer pane, expand “External connections,” and select the us.vertex-ai-connection-id connection you just created.
  • In the Connection info pane, copy the Service account ID property value to the clipboard.
  • Click the Navigation Menu (Navigation menu icon) and go to the “IAM & Admin” service.
  • On the IAM page, click the “Grant Access” button, paste the service account ID into the Principals text box, assign the Vertex AI User and BigQuery Data Owner roles to the account, and click "Save."
Grant 'Vertex AI User' & 'BigQuery Data Owner' for the service account

Enable Vertex AI APIs:

  • Click the Navigation Menu (Navigation menu icon) and navigate to the Vertex AI service.
  • Click the “Enable All Recommended APIs” button.
Enable All Recommended APIs on Vertex AI console

Create a Model in BigQuery:

  • In the BigQuery code editor, run the following command to create a model in your bqdocument dataset that connects to the PaLM Text Embedding model:
CREATE OR REPLACE MODEL `bqdocument.embedding_model`
REMOTE WITH CONNECTION `us.vertex-ai-connection-id`
OPTIONS (ENDPOINT = 'textembedding-gecko@002');

In the Explorer pane, expand the bqdocument dataset and verify that the model has been created.

By completing these steps, you have set up the necessary environment for using BigQuery and Vertex AI to implement a Retrieval-Augmented Generation (RAG) architecture.

Extracting and Preparing Custom Data for RAG Using BigQuery

In this section, we’ll replace the BBC News dataset with our own custom data sourced from the BigQuery introduction page. We will scrape the content of relevant documentation pages and use this data to implement the Retrieval-Augmented Generation (RAG) architecture.

Step-by-Step Code Explanation

We begin by extracting links and content from the BigQuery documentation pages using BeautifulSoup. Below is the optimized code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

def fetch_page_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.text if soup.title else 'No Title Found'
    content = ' '.join([tag.text for tag in soup.find_all(['h2', 'p'])])
    return {'title': title, 'url': url, 'content': content}

clean_links = [
    'https://cloud.google.com/bigquery/docs/query-overview',
    # Include more URLs as needed
]

data = [fetch_page_data(url) for url in clean_links]
df = pd.DataFrame(data)

# Convert DataFrame to JSON and save to file
df.to_json('output.json', orient='records', lines=True)

# Define the JSON schema for the BigQuery table
schema = [
    {'name': 'title', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'url', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'content', 'type': 'STRING', 'mode': 'NULLABLE'}
]

# Save schema to a JSON file
with open('schema.json', 'w') as f:
    json.dump(schema, f, indent=4)

# Display messages to confirm completion
print("DataFrame has been saved to 'output.json'.")
print("Schema has been saved to 'schema.json'.")

We are extracting documentation from the BigQuery introduction page. This involves:

  1. Sending a GET request to the BigQuery introduction page to retrieve its HTML content.
  2. Using BeautifulSoup to parse the HTML and extract all links pointing to BigQuery documentation pages.
  3. Cleaning these links to remove query parameters and duplicates.
  4. Fetching the content of each documentation page by extracting the title and concatenating the text from all <h2> and <p> tags.
  5. Storing the collected data in a pandas DataFrame.
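
The link cleaning in steps 2–3 is not shown in the snippet above (the clean_links list there is hardcoded). A minimal sketch of that step, using only the standard library and assuming hrefs scraped from the introduction page:

```python
from urllib.parse import urljoin, urldefrag, urlparse

def clean_doc_links(base_url, hrefs):
    """Normalize scraped hrefs: resolve relative URLs, drop query strings
    and fragments, keep only BigQuery docs pages, and de-duplicate."""
    seen, cleaned = set(), []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        absolute, _ = urldefrag(absolute)  # strip any #fragment
        parts = urlparse(absolute)
        absolute = f"{parts.scheme}://{parts.netloc}{parts.path}"  # strip ?query
        if '/bigquery/docs/' in parts.path and absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned

links = clean_doc_links(
    'https://cloud.google.com/bigquery/docs/introduction',
    ['/bigquery/docs/query-overview?hl=en',
     '/bigquery/docs/query-overview#section',
     '/bigquery/docs/storage-overview',
     '/pricing'],
)
# links == ['https://cloud.google.com/bigquery/docs/query-overview',
#           'https://cloud.google.com/bigquery/docs/storage-overview']
```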

This custom dataset will replace the BBC News dataset originally used for our RAG implementation. By scraping relevant documentation pages, we ensure our dataset is focused on BigQuery-related content, making it suitable for answering questions specifically about BigQuery.

Part of the dataset
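
The next section queries a table named bqdocument.bigquerydocs, so the scraped data must first be loaded into BigQuery. The article does not show this step; one way to do it (an assumption on our part) is the bq CLI, since output.json is newline-delimited JSON and schema.json describes the table’s three STRING columns:

```shell
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  bqdocument.bigquerydocs \
  output.json \
  schema.json
```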

Creating a Table with Custom Data and Text Embeddings

In this task, we will create a table that contains documentation from the BigQuery introduction pages as well as text embeddings that can be used for a vector similarity search.

Run the following SQL command to create a table named bqdoc_with_embeddings containing the documentation articles from our custom dataset and text embeddings generated with the embedding_model model you created above. Notice that the title and content fields are concatenated into a new field called fullarticle. The ML.GENERATE_TEXT_EMBEDDING function generates the embeddings from the column named content in the input query; the other selected columns, including fullarticle, are passed through to the output table.

CREATE OR REPLACE TABLE `bqdocument.bqdoc_with_embeddings` AS (
  SELECT
    *
  FROM
    ML.GENERATE_TEXT_EMBEDDING(
      MODEL `bqdocument.embedding_model`,
      (
        SELECT
          title,
          url,
          content,
          CONCAT(title, ": ", content) AS fullarticle
        FROM
          `bqdocument.bigquerydocs`
      )
    )
  WHERE ARRAY_LENGTH(text_embedding) > 0
);
Embedding table screenshot

Running a Vector Search Using BigQuery SQL

In this task, you will query the table for relevant documentation articles. Run the query below to return articles about the topic of “migration service.”

An embedding is created for the string “migration service.” This embedding is used in a similarity search, and the rows for the top 5 articles with embeddings closest to the query’s embedding are returned. Examine the results and verify the returned articles are related to the query.

SELECT query.query, base.title, base.content
FROM VECTOR_SEARCH(
  TABLE `bqdocument.bqdoc_with_embeddings`, 'text_embedding',
  (
    SELECT text_embedding, content AS query
    FROM ML.GENERATE_TEXT_EMBEDDING(
      MODEL `bqdocument.embedding_model`,
      (SELECT 'migration service' AS content))
  ),
  top_k => 5, options => '{"fraction_lists_to_search": 0.01}');
Top answer from query
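
Conceptually, what VECTOR_SEARCH does here can be mimicked locally: rank stored vectors by cosine distance to the query’s embedding and keep the top k. A toy sketch with made-up 3-dimensional “embeddings” (real textembedding-gecko vectors have 768 dimensions):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query_vec, table, k=5):
    """table: list of (title, embedding) pairs, like rows of bqdoc_with_embeddings."""
    return sorted(table, key=lambda row: cosine_distance(query_vec, row[1]))[:k]

docs = [
    ('Migration overview', [0.9, 0.1, 0.0]),
    ('Pricing',            [0.0, 0.2, 0.9]),
    ('Query overview',     [0.3, 0.8, 0.1]),
]
query = [0.8, 0.2, 0.1]  # pretend embedding of "migration service"
print([title for title, _ in top_k(query, docs, k=2)])
# ['Migration overview', 'Query overview']
```

BigQuery additionally uses an index-aware approximate search (the fraction_lists_to_search option controls how much of the index is scanned), but the distance-then-rank idea is the same.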

Programming a Q&A System Using BigQuery and Gemini

In this task, we will program a Q&A system using BigQuery and Gemini. Follow the steps below to set up and run the Q&A system in a Colab notebook.

Install Required Libraries & Initialize Vertex AI:

Enter the following code in the first cell and run it:

!pip install --upgrade --user google-cloud-aiplatform google-cloud-bigquery
# Get project ID
PROJECT_ID = ! gcloud config get-value project
PROJECT_ID = PROJECT_ID[0]
LOCATION = "us-central1" # @param {type:"string"}
print(PROJECT_ID)

from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=LOCATION)

print("Initialized")

Define Functions for Q&A System:

  • Add the following function to another new code cell. This function allows you to pass a prompt to the Gemini LLM and get an answer:
import vertexai
from vertexai.preview.generative_models import GenerativeModel

def answer_question_gemini(prompt):
    model = GenerativeModel("gemini-pro")
    response = model.generate_content(
        prompt,
        generation_config={
            "max_output_tokens": 8192,
            "temperature": 0.5,
            "top_p": 0.5,
            "top_k": 10,
        },
        stream=False,
    )
    try:
        return response.text
    except Exception:
        # response.text raises if the response was blocked or empty
        print("An error occurred while reading the model response")
        return "An error occurred while reading the model response"

Define Function to Run Search:

  • Add another code cell and paste the following function. This function takes the user’s question, creates an embedding, and uses a similarity search to find the top 5 articles most relevant to the question:
def run_search(question):
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    SELECT query.query, base.title, base.content
    FROM VECTOR_SEARCH(
      TABLE `bqdocument.bqdoc_with_embeddings`, 'text_embedding',
      (
        SELECT text_embedding, content AS query
        FROM ML.GENERATE_TEXT_EMBEDDING(
          MODEL `bqdocument.embedding_model`,
          (SELECT @question AS content))
      ),
      top_k => 5)
    """

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("question", "STRING", question),
        ]
    )

    query_job = client.query(sql, job_config=job_config)

    data = ""
    for row in query_job:
        data += row.content + "\n"

    return data

Define Function to Build Prompt:

  • Add another code cell and paste the following function. This function uses the question and the data retrieved from BigQuery to build a prompt that will be passed to the model:
def build_prompt(data, question):
    prompt = """
Instructions: Answer the question using the following Context.

Context: {0}

Question: {1}
""".format(data, question)
    return prompt

Helper Function for Q&A:

  • Add another code cell and paste the following function. This helper function combines the steps to retrieve data, build the prompt, and get the answer from Gemini:
from IPython.display import display

def answer_question(question):
    data = run_search(question)
    display("Retrieved Data:")
    display(data)
    display(" . . . ")
    prompt = build_prompt(data, question)
    answer_gemini = answer_question_gemini(prompt)

    return answer_gemini

Test the Q&A System:

  • Add a final code cell to test the Q&A system:
QUESTION = "Tell me about the bigquery migration"
answer_gemini = answer_question(QUESTION)
display("User Question:")
display(QUESTION)
display("--------------------------------")
display("Gemini Answer:")
display(answer_gemini)

This is the answer from Gemini.

Retrieved Data:
This document provides an overview of the BigQuery Migration Service. The BigQuery Migration Service is a comprehensive solution for migrating your data warehouse to BigQuery. It includes free-to-use tools that help you with each phase of migration, including assessment and planning, SQL translation for more than 10 dialects, data transfer, and data validation. Together, these tools help you accelerate migrations and reduce risk, shortening the time to value. In the assessment and planning phase, you can use the BigQuery migration assessment feature to understand your existing data warehouse. Then, you can use the batch SQL translator and the interactive SQL translator to prepare your SQL queries and scripts to work in BigQuery. The batch and interactive SQL translators support translation from a wide range of SQL dialects. When you're ready to move your data, you can use the BigQuery Data Transfer Service to automate and manage the migration from your data warehouse to
. . .
User Question:
Tell me about the bigquery migration
--------------------------------
Gemini Answer:
## BigQuery Migration Service Overview

The BigQuery Migration Service is a comprehensive solution for migrating your data warehouse to BigQuery. It offers a variety of tools to assist you in every step of the migration process, including:

**Assessment and Planning:**

* **BigQuery Migration Assessment:** Analyze your existing data warehouse to understand its structure and data.
* **Batch SQL Translator:** Convert your SQL scripts and queries from various dialects to BigQuery SQL.
* **Interactive SQL Translator:** Translate individual SQL statements to BigQuery SQL.

**Data Transfer:**

* **BigQuery Data Transfer Service:** Automate and manage the transfer of data from your data warehouse to BigQuery.
* **Cloud Storage:** Store your data files before and after the transfer.

**Data Validation:**

* **Data Validation Tool:** Verify the accuracy and completeness of your migrated data.

**Benefits:**

* **Free to use:** The BigQuery Migration Service itself is free o

I also tried another question.

QUESTION = "How can I query a public dataset with the Google Cloud console"
answer_gemini = answer_question(QUESTION)
display("User Question:")
display(QUESTION)
display("--------------------------------")
display("Gemini Answer:")
display(answer_gemini)

Explanation

  1. Install Required Libraries: Installs the necessary Google Cloud libraries.
  2. Initialize Vertex AI: Sets up the project and location for Vertex AI.
  3. Define Functions:
    • answer_question_gemini(prompt): Sends a prompt to the Gemini LLM and retrieves the response.
    • run_search(question): Executes a similarity search in BigQuery using the user's question.
    • build_prompt(data, question): Constructs a prompt using retrieved data and the user's question.
    • answer_question(question): Retrieves data, constructs the prompt, and gets the answer from Gemini.
  4. Test the System: Runs a test query to validate the Q&A system’s functionality.

By following these steps, you have programmed a Q&A system using BigQuery and Gemini, enabling you to ask questions and get relevant answers based on the documentation data stored in BigQuery.

Conclusion

The implementation of Retrieval-Augmented Generation (RAG) using BigQuery as a vector database showcases the powerful capabilities of BigQuery in supporting advanced information retrieval and natural language generation tasks. By leveraging BigQuery’s text embedding generation and vector similarity search functionalities, we can significantly improve the relevance and accuracy of the retrieved information.

While the subsequent Q&A system demonstrated in this article may not fully exploit the potential of RAG, the core functionality of using BigQuery for embedding and retrieval is a strong testament to its capabilities. As a widely-used GCP database product, BigQuery’s ability to support RAG through embedding generation is incredibly powerful, providing users with enhanced tools for data analysis and information retrieval.

In summary, BigQuery’s integration with RAG enables more efficient and contextually relevant information retrieval, making it a valuable asset for developers and data scientists looking to harness the power of machine learning and advanced analytics in their applications.
