BigQuery Vector Search using Python SDK, Gemini and Langchain on GCP

Sid
Google Cloud - Community
Apr 29, 2024

Google recently announced support for vector search and embeddings in BigQuery. This is a big step for LLM-driven applications, since you can now use a data warehouse like BigQuery to serve LLM applications as well.

Working with embeddings in BigQuery is straightforward, no different from any other vector database out there, and arguably simpler if you are using the BigQuery client in Python.

A typical flow for using vector embeddings with an LLM has the following steps:

Step-1: Create embeddings of your dataset

Step-2: Store these embeddings in your vector database

Step-3: Query the vector database by passing the search query and fetch the results

Step-4: Pass the results to an LLM Model to get the final output

In this tutorial, we will use New York City tourism data, which describes places to visit in the city, its museums, their opening hours, and so on.

I will be running the entire code on a Vertex AI Workbench instance to avoid the hassle of installing the dependent libraries and setting up permissions. You can run it on your local machine as well, as long as you have set up the prerequisite of an IAM role with the right permissions and are using it in your local environment.

Source code: https://github.com/sidoncloud/gcp-use-cases/tree/main/bigquery-vectorsearch-python-vertexAI

Step-1: Set up a Vertex AI Workbench

Start by heading over to Vertex AI from your GCP console. From the dashboard, select Workbench from the left navigation and select Create New.

We will create a simple workbench with a Python 3 environment running on a Debian operating system.

Once you've filled out the form options in the first step, click Advanced Options at the bottom.

Here, we will select a smaller instance, e2-standard-2, and set the idle shutdown time to 30 minutes. This is useful because the workbench will automatically shut down after 30 minutes of inactivity.

Click on Create.

Give it a couple of minutes and you will have a Jupyter workbench up and running.

Step-2: Install the necessary libraries

Open JupyterLab and create a new notebook.

First, we will install the libraries below, along with google-cloud-storage:

1: langchain

2: langchain_google_vertexai

3: unstructured

4: pypdf

!pip install --upgrade langchain langchain_google_vertexai
!pip install --upgrade --quiet google-cloud-storage
!pip install --user --quiet unstructured pypdf

Once done, restart your kernel to be able to use these libraries in the subsequent code blocks.
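
If you prefer to do this from code rather than the Jupyter menu, the snippet below (a common pattern in Vertex AI sample notebooks) shuts down the kernel so it restarts with the newly installed packages:

# Restart the kernel so the freshly installed packages are picked up.
# Equivalent to Kernel -> Restart Kernel in the JupyterLab menu.
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)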

Step-3: Create embeddings using Vertex AI Embeddings and store them in BigQuery

First, upload the PDF file newyork-city-tourism.pdf to a GCS bucket.
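
If you want to do the upload from the notebook itself, a minimal sketch with the google-cloud-storage client looks like this (replace the project and bucket names with your own; gsutil cp from a terminal works just as well):

# Upload the tourism PDF to a GCS bucket using the google-cloud-storage client.
from google.cloud import storage

storage_client = storage.Client(project="your-project-id")
bucket = storage_client.bucket("bucket-name")
blob = bucket.blob("newyork-city-tourism.pdf")
blob.upload_from_filename("newyork-city-tourism.pdf")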

We then import the necessary libraries. This code uses the textembedding-gecko model from Vertex AI Embeddings to create embeddings from the data inside the PDF.

from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import BigQueryVectorSearch
from langchain.document_loaders import GCSFileLoader

PROJECT_ID = "your-project-id"

embedding = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest", project=PROJECT_ID
)

GCS_BUCKET_DOCS = "bucket-name"
PDF_BLOB = "newyork-city-tourism.pdf"

Next, we load the PDF using GCSFileLoader and store the result in a variable.

from langchain_community.document_loaders import PyPDFLoader

def load_pdf(file_path):
    return PyPDFLoader(file_path)

loader = GCSFileLoader(
    project_name=PROJECT_ID, bucket=GCS_BUCKET_DOCS, blob=PDF_BLOB, loader_func=load_pdf
)
documents = loader.load()

We then iterate through the documents and rewrite the metadata so that each document carries a clean source and document name.

for document in documents:
    doc_md = document.metadata
    document_name = doc_md["source"].split("/")[-1]
    # derive doc source from Document loader
    doc_source_prefix = "/".join(GCS_BUCKET_DOCS.split("/")[:3])
    doc_source_suffix = "/".join(doc_md["source"].split("/")[4:-1])
    source = f"{doc_source_prefix}/{doc_source_suffix}"
    document.metadata = {"source": source, "document_name": document_name}

print(f"# of documents loaded (pre-chunking) = {len(documents)}")

Now, we split the documents using LangChain's RecursiveCharacterTextSplitter, chunking them according to the chunk_size and chunk_overlap we define.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)
doc_splits = text_splitter.split_documents(documents)

# Add chunk number to metadata
for idx, split in enumerate(doc_splits):
    split.metadata["chunk"] = idx

print(f"# of documents = {len(doc_splits)}")
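
As an optional sanity check before writing anything to BigQuery, you can inspect the first chunk and its metadata:

# Peek at the first chunk: its metadata (source, document_name, chunk) and text.
print(doc_splits[0].metadata)
print(doc_splits[0].page_content[:300])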

We will now create a dataset (schema) in BigQuery, where the table holding the embeddings will be created in the subsequent steps. Head over to BigQuery and create it by running the SQL below.

CREATE SCHEMA vector_db;
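
If you would rather stay in the notebook, the same dataset can be created with the BigQuery Python client instead of SQL (a small sketch, assuming the client library is available in your environment):

# Create the vector_db dataset programmatically instead of via the console.
from google.cloud import bigquery

bq_client = bigquery.Client(project=PROJECT_ID)
dataset = bigquery.Dataset(f"{PROJECT_ID}.vector_db")
dataset.location = "US"
bq_client.create_dataset(dataset, exists_ok=True)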

Next, define the dataset name (created above) and a table name for the embeddings, then create a BigQueryVectorSearch object that will store the embeddings in the table nyc_tourism.

DATASET = "vector_db"
TABLE = "nyc_tourism"

new_york_tourism = BigQueryVectorSearch(
    project_id=PROJECT_ID,
    dataset_name=DATASET,
    table_name=TABLE,
    location="US",
    embedding=embedding,
)

new_york_tourism.add_documents(doc_splits)

Upon successful execution of this code block, you should see the table nyc_tourism created inside your newly created dataset.

Click on Preview to see the data inside this table. You will see the metadata, the content (the actual text), and the corresponding embeddings.
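
You can also peek at a few rows from the notebook instead of using the Preview tab (a small optional check):

# Query a handful of rows from the embeddings table.
from google.cloud import bigquery

bq_client = bigquery.Client(project=PROJECT_ID)
rows = bq_client.query(
    f"SELECT * FROM `{PROJECT_ID}.vector_db.nyc_tourism` LIMIT 5"
).result()
for row in rows:
    print(dict(row.items()))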

Step-4: Fetching similar results and using LangChain

Now let's ask a simple question: "What are some of the attractions one can find in Central Park, New York City?" and look at the results fetched from BigQuery. We will invoke the similarity_search method on our BigQueryVectorSearch vector store.

query = """What are some of the attractions one can find in Central Park, New York City?"""

new_york_tourism.similarity_search(query)

Alright, so far so good. Everything works as expected, and the returned chunks are quite relevant to the question.
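
If you want fewer or more matches back, you can pass the standard k argument that LangChain vector stores accept, for example:

# Return only the top 2 matching chunks instead of the default.
docs = new_york_tourism.similarity_search(query, k=2)
for doc in docs:
    print(doc.metadata.get("chunk"), doc.page_content[:200])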

Now let's create a retriever that points to the vector table in BigQuery and define our LLM. We will use gemini-pro as our LLM here, but it can be any LLM of your choice.

from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA

llm = VertexAI(model_name="gemini-pro")

retriever = new_york_tourism.as_retriever()

Once the retriever is created, we build a chain with the from_chain_type method and call invoke to fetch the results. Here, we use the "stuff" chain type, which simply stuffs all of the retrieved context into the prompt when invoking the LLM.

We will change the question here and ask “What days and times is the American Museum of Natural History open to the public?”

search_query = """What days and times is the American Museum of Natural History open to the public?"""

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever,
    return_source_documents=True
)
results = retrieval_qa.invoke(search_query)

print("*" * 79)
print(results["result"])
print("*" * 79)
for doc in results["source_documents"]:
    print("-" * 79)
    print(doc.page_content)

You can see that the output includes the raw source documents alongside the answer, which is expected since we set return_source_documents=True. Now, let's use RetrievalQAWithSourcesChain to get just the answer. We will stick to the same question as before.

from langchain.chains import RetrievalQAWithSourcesChain

search_query = """
What days and times is the American Museum of Natural History open to the public?
"""
retrieval_qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever
)

retrieval_qa_with_sources({"question": search_query}, return_only_outputs=True)

The response is quite accurate as you can see.

Step-5: Conversation using LangChain memory

If you are building a chatbot that looks up your vector store, it is often important to keep the memory of an ongoing back-and-forth conversation. This is where we make use of LangChain's ConversationBufferMemory.

Let's start with an initial question about the historical significance of the Chrysler Building and then add a follow-up question about its architectural style, without repeating the subject in the follow-up.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversational_retrieval = ConversationalRetrievalChain.from_llm(
    llm=llm, retriever=retriever, memory=memory
)

search_query = """What is the historical significance of the Chrysler Building as described in the New York City tourism guide?"""

result = conversational_retrieval({"question": search_query})
print(result["answer"])

new_query = "how about the architectural style?"
result = conversational_retrieval({"question": new_query})
print(result["answer"])
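
To confirm that the follow-up question really relied on the stored history, you can print what ConversationBufferMemory has buffered so far:

# Inspect the buffered conversation: alternating human and AI messages.
for message in memory.chat_memory.messages:
    print(f"{message.type}: {message.content[:120]}")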

That was all for this tutorial. Feel free to message me if you have any questions or if you want to know more about LLM model deployments on AWS or Google Cloud. :-)
