Talk to a PDF using BigQuery, GPT-4 Turbo & LangChain with memory

Sid · Published in CodeX · 5 min read · May 12, 2024

In this brief article, I am going to show you how to leverage the LangChain framework with OpenAI (GPT-4 Turbo) to work with Google Cloud's BigQuery vector search offering.

We are going to use a PDF file which provides a comprehensive overview of trends in AI research and development as of 2023. It covers various aspects of AI advancements including the growth in AI publications, the evolution of machine learning systems, and significant trends in AI conference attendance and open-source AI software. Key highlights include detailed statistics on AI journal, conference, and repository publications categorized by type, field of study, and geographic area.

This PDF will be converted to text embeddings. I will then show you how to retrieve them using LangChain's ConversationalRetrievalChain with memory: we will create a retriever object that points to the embeddings and eventually talk to the PDF using simple search queries.

So let's begin.

Note: You need an active GCP account for this tutorial; even a trial account will do.

Step-1: Install the necessary modules in your local environment

pip3 install --upgrade langchain langchain_google_vertexai
pip3 install --upgrade --quiet google-cloud-storage
pip3 install pypdf
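Depending on your LangChain version, the ChatOpenAI class used in Step-4 may also require the OpenAI client library (and possibly the langchain-openai package). If it is not already present in your environment:

pip3 install --upgrade openai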

Step-2: Create a BigQuery Schema and download credentials file from GCP Account

Head over to BigQuery, open an editor, and create a schema (dataset). Call it bq_vectordb; this is the schema in which the table that stores our vector embeddings will be created.
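If you prefer to create the dataset from code rather than the console, here is a minimal sketch using the google-cloud-bigquery client (assuming it is installed and your credentials are configured as described below; the project id is a placeholder):

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project id
dataset = bigquery.Dataset(f"{client.project}.bq_vectordb")
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists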

Now, navigate to IAM in the GCP console and select Service Accounts from the left navigation. Here we will create and download the permissions JSON file containing the private key that we will use in the Python script. This JSON file grants our local environment access to the services in our GCP account at the project level.

Click Manage keys, then select ADD KEY followed by Create new key. That's it: select JSON as the key type and a file will be downloaded to your system automatically.

Rename and copy this file to your current working directory.
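As a quick, optional sanity check that the key file works, you can point GOOGLE_APPLICATION_CREDENTIALS at it and list your buckets. A small sketch, assuming google-cloud-storage is installed and the renamed file sits in your working directory:

import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "your-json-filename.json"  # your renamed key file

client = storage.Client()
print([bucket.name for bucket in client.list_buckets()])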

That's it for the environment setup; now we can get to the execution part.

Step-3: Create and Ingest Embeddings using VertexAIEmbeddings, GCSFileLoader & BigQueryVectorSearch

First, we need to create embeddings from the PDF file example.pdf using VertexAIEmbeddings. To do that, we load the file from a GCS bucket using GCSFileLoader from LangChain and use RecursiveCharacterTextSplitter to split it into several chunks with an overlap of 100 characters.

NOTE: Before you execute the code below, make sure to upload example.pdf to a GCS bucket and change the path values accordingly.
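If you have not uploaded the file yet, you can do so from the console, with gsutil, or with a small Python sketch like the one below (the bucket name and paths are placeholders; it assumes the credentials file from Step-2):

import os
from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "your-json-filename.json"

client = storage.Client()
bucket = client.bucket("your-bucket-name")          # destination bucket
blob = bucket.blob("test_data/example.pdf")         # destination path inside the bucket
blob.upload_from_filename("example.pdf")            # local file to upload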

from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import BigQueryVectorSearch
from langchain.document_loaders import GCSFileLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "your-json-filename.json"

PROJECT_ID = "{project-id}"

embedding = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest", project=PROJECT_ID
)

gcs_bucket_name = "your-bucket-name"
pdf_filename = "test_data/example.pdf"

# GCSFileLoader downloads the blob and delegates the PDF parsing to PyPDFLoader.
def load_pdf(file_path):
    return PyPDFLoader(file_path)

loader = GCSFileLoader(
    project_name=PROJECT_ID, bucket=gcs_bucket_name, blob=pdf_filename, loader_func=load_pdf
)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10000,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)
doc_splits = text_splitter.split_documents(documents)

# Tag each chunk with its index so it can be traced back after retrieval.
for idx, split in enumerate(doc_splits):
    split.metadata["chunk"] = idx

print(f"# of documents = {len(doc_splits)}")

Once you have chunked your PDF data, it's time to ingest it into BigQuery vector search.

Define your dataset (created in Step-2) and table name; the table itself will be created at run time. Next, create a BigQueryVectorSearch object and use it to invoke the add_documents method.

DATASET = "bq_vectordb"
TABLE = "bq_vectors"  # You can come up with a more innovative name here

bq_object = BigQueryVectorSearch(
    project_id=PROJECT_ID,
    dataset_name=DATASET,
    table_name=TABLE,
    location="US",
    embedding=embedding,
)

bq_object.add_documents(doc_splits)

You can execute all of the above as a single Python script, e.g. bq_ingest_data.py.

Once the execution is complete, you can head back to BigQuery and refresh your schema. You should see a table bq_vectors populated with your document chunks, metadata, and embeddings. This means your embeddings have been created and are now stored in a BigQuery table.
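As an optional sanity check, you can run a quick similarity search from the same session (assuming bq_object is still in scope; the query text is just an example):

results = bq_object.similarity_search("growth in AI publications", k=2)
for doc in results:
    print(doc.metadata.get("chunk"), doc.page_content[:200])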

Step-4: Retrieve embeddings & use LangChain with OpenAI to chat with your data

Most of the code below is self-explanatory. We import the necessary libraries and use LangChain's ConversationBufferMemory, which retains the chat history across subsequent messages, something that is quite important if you are building a chatbot.

Make sure to substitute your actual values into the script below before executing it.

from langchain_community.vectorstores import BigQueryVectorSearch
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI
import pandas as pd
import os

api_key = "your-openai-api-key"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "json-filename.json"

DATASET = "bq_vectordb"
TABLE = "bq_vectors"
PROJECT_ID = "project-id"

embedding = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest", project=PROJECT_ID
)

# Retains the chat history between calls; output_key tells the memory which
# chain output to store.
memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True, output_key="answer"
)

bq_object = BigQueryVectorSearch(
    project_id=PROJECT_ID,
    dataset_name=DATASET,
    table_name=TABLE,
    location="US",
    embedding=embedding,
)

You can execute this code inside a Jupyter notebook.

We now define our LLM and create a retriever object that points to the embeddings stored in the BigQuery table.

llm_openai = ChatOpenAI(model="gpt-4-turbo-2024-04-09", api_key=api_key)
retriever = bq_object.as_retriever()

conversational_retrieval = ConversationalRetrievalChain.from_llm(
    llm=llm_openai, retriever=retriever, memory=memory, verbose=False
)

Define a function that simply accepts a user query and returns the answer grounded in the BigQuery vector table.

def QaWithMemory(query):
    return conversational_retrieval.invoke(query)["answer"]

Now let's ask a question: "What was the rate of growth in AI research publications from 2010 to 2021, and which type of AI publication saw the most significant increase in this period?"

You can see the response. It's quite accurate if you read the PDF content. You can now ask a follow-up question without giving too many details, such as "and how might this growth impact the future of AI research priorities?"
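For reference, here is what those two calls look like in code; the second call works without restating the context because ConversationBufferMemory carries the chat history forward:

print(QaWithMemory(
    "What was the rate of growth in AI research publications from 2010 to 2021, "
    "and which type of AI publication saw the most significant increase in this period?"
))

# The follow-up can refer to "this growth" thanks to the conversation memory.
print(QaWithMemory("And how might this growth impact the future of AI research priorities?"))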

Alright, that was it for this tutorial. Hope you enjoyed it :-). Stay tuned for more. Cheers!

Full Source code: https://github.com/sidoncloud/gcp-use-cases/tree/main/langchain-bq-openai

