Powering Q&A Chatbots with AlloyDB and GenAI Models

Sudharma Mokashi
Google Cloud - Community
7 min read · Aug 31, 2023

In recent times, there has been a growing interest in the use of embeddings and vector databases for a variety of applications, including natural language processing, machine learning, and recommendation engines. Embeddings are a way of representing data as vectors, which can then be used to calculate similarities and perform other operations. Vector databases are specialised databases that are designed to store and query embeddings efficiently.

AlloyDB is a next-generation managed PostgreSQL database that is well suited for storing and querying embeddings. In particular, AlloyDB supports the pgvector extension, which adds a vector data type along with functions and index types for storing and searching vectors.

In this blog post, we will use AlloyDB's pgvector extension to store vector embeddings of a PDF file in AlloyDB for PostgreSQL, and then build a small Q&A chatbot on top of it.

Before we jump into the implementation, let's first understand some basic information about AlloyDB, embeddings, and vector databases.

What are Embeddings?

Embeddings are a way of representing data as vectors. This can be useful for a variety of tasks, such as natural language processing, machine learning, and recommendation engines.

For example, we can represent a word as a vector of real numbers. The meaning of the word is then represented by the values of the vector. We can then use these vectors to calculate similarities between words, or to perform other operations.
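
As a toy illustration of this idea, here is how cosine similarity between two word vectors can be computed. The numbers below are made up for the example, not from a real model; production models such as textembedding-gecko produce 768-dimensional vectors.

import numpy as np

# Hypothetical 4-dimensional word vectors (illustrative values only).
king = np.array([0.8, 0.1, 0.6, 0.3])
queen = np.array([0.7, 0.2, 0.6, 0.4])

# Cosine similarity: dot product divided by the product of magnitudes.
# Values close to 1 mean the words are semantically similar.
similarity = np.dot(king, queen) / (np.linalg.norm(king) * np.linalg.norm(queen))
print(similarity)  # ~0.99 for these made-up vectors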


What are Vector Databases?

Vector databases are specialised databases that are designed to store and query embeddings efficiently. Vector databases typically have features such as:

  • Support for storing vectors of different sizes
  • Support for indexing vectors for efficient search
  • Support for vector operations such as addition, subtraction, and multiplication
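
With pgvector, for example, these operations map directly onto SQL operators. A small sketch, assuming the vector extension (installed in the steps below):

-- Element-wise arithmetic on vectors
SELECT '[1, 2, 3]'::vector + '[4, 5, 6]'::vector;   -- [5, 7, 9]
SELECT '[1, 2, 3]'::vector - '[4, 5, 6]'::vector;   -- [-3, -3, -3]

-- Distance operators used for similarity search
SELECT '[1, 2, 3]'::vector <-> '[4, 5, 6]'::vector; -- Euclidean (L2) distance
SELECT '[1, 2, 3]'::vector <=> '[4, 5, 6]'::vector; -- cosine distance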

Why Use AlloyDB for Storing Embeddings?

AlloyDB is a next-generation managed PostgreSQL database that offers a number of features that make it well-suited for storing and querying embeddings. These features include:

  • Support for the pgvector extension
  • High performance for storing and querying vectors
  • Scalability to handle large datasets
  • Security and compliance features

Pre-requisites

Before we get into the actual implementation, please read through the following topics:

  • Embedding Model Gecko — This will be used to generate embeddings of the source documents.
  • Text Generation Model Bison — This will be used to generate text while providing answers.
  • Langchain
  • AlloyDB for PostgreSQL Cluster

High Level Architecture Diagram

High-level architecture: Q&A chatbot backed by AlloyDB and pgvector

Implementation

Step 1: Enable the Vertex AI API.

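If you prefer the command line, the API can also be enabled with gcloud (substitute your own project ID):

gcloud services enable aiplatform.googleapis.com --project=<YOUR_PROJECT_ID>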

Step 2: Create the pgvector extension in AlloyDB using the following command:

CREATE EXTENSION IF NOT EXISTS vector;
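
To confirm the extension was created, you can check the pg_extension catalog:

SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';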

Step 3: Create a table to store the contents of the PDF and their respective embeddings:

CREATE TABLE pdf_content_embeddings(
    content_id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(768));
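
Optionally, once the table is populated, an index can speed up similarity search. A sketch using pgvector's IVFFlat index with the cosine operator class (the lists value is a tunable assumption, not a recommendation from this post):

CREATE INDEX ON pdf_content_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);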

Step 4: Grant Vertex AI User/Admin access to the default Compute Engine service account.
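
One way to do this from the command line (substitute your project ID and project number; the default Compute Engine service account follows the pattern shown):

gcloud projects add-iam-policy-binding <YOUR_PROJECT_ID> \
  --member="serviceAccount:<PROJECT_NUMBER>-compute@developer.gserviceaccount.com" \
  --role="roles/aiplatform.user"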

Step 5: Export the following environment variables on your machine:

export DATABASE_NAME=<DATABASE_NAME>
export DB_HOST=<ALLOYDB_HOST_IP_ADDRESS>
export DB_USER=<DATABASE_USER_NAME>
export DB_PASSWORD=<DATABASE_PASSWORD>

Step 6: We will make use of six Python scripts here; copy and paste them using the names provided below.

alloydb_utility.py — This utility script handles the database connection and query calls to the AlloyDB database.

import os

import psycopg2


class AlloyDB:
    """
    Handles connections and queries to the AlloyDB database.
    """
    def __init__(self):
        config = self._read_config()

        self.database_nm = config['database_nm']
        self.user = config['user']
        self.password = config['password']
        self.host = config['host']
        self.port = '5432'

        self.cursor = self._connection()

    def _read_config(self):
        # Read connection details from the environment variables set in Step 5.
        config = {}
        config['database_nm'] = os.environ['DATABASE_NAME']
        config['user'] = os.environ['DB_USER']
        config['password'] = os.environ['DB_PASSWORD']
        config['host'] = os.environ['DB_HOST']
        return config

    def _connection(self):
        try:
            conn = psycopg2.connect(
                database=self.database_nm, user=self.user,
                password=self.password, host=self.host, port=self.port
            )
            conn.autocommit = True
            return conn.cursor()
        except Exception as e:
            print(e)
            raise

    def select_query(self, query):
        self.cursor.execute(query)
        return self.cursor.fetchall()

    def insert_embeddings(self, query, params=None):
        # Parameterised execution avoids quoting problems with free text.
        self.cursor.execute(query, params)
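
A quick sanity check of the class (assuming the environment variables from Step 5 are set):

from alloydb_utility import AlloyDB

db = AlloyDB()
print(db.select_query('SELECT version();'))  # should print the PostgreSQL version string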

custom_vertex_ai_embeddings.py — This script subclasses LangChain's VertexAIEmbeddings class and overrides its embed_documents method to add rate limiting and batching. We will use this class to create vectors from text data.

import time
from typing import List

from langchain.embeddings import VertexAIEmbeddings
from pydantic import BaseModel


class CustomVertexAIEmbeddings(VertexAIEmbeddings, BaseModel):
    requests_per_minute: int
    num_instances_per_batch: int

    def rate_limit(self, max_per_minute):
        # Generator that sleeps just long enough to stay under the QPM limit.
        period = 60 / max_per_minute
        print('Waiting')
        while True:
            before = time.time()
            yield
            after = time.time()
            elapsed = after - before
            sleep_time = max(0, period - elapsed)
            if sleep_time > 0:
                print('.', end='')
                time.sleep(sleep_time)

    # Overriding embed_documents method
    def embed_documents(self, texts: List[str]):
        limiter = self.rate_limit(self.requests_per_minute)
        results = []
        docs = list(texts)

        while docs:
            # Work in batches because the API accepts a maximum of 5
            # documents per request when getting embeddings.
            head, docs = docs[:self.num_instances_per_batch], docs[self.num_instances_per_batch:]
            chunk = self.client.get_embeddings(head)
            results.extend(chunk)
            next(limiter)

        return [r.values for r in results]
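
A minimal usage sketch (the QPM and batch values are illustrative):

from custom_vertex_ai_embeddings import CustomVertexAIEmbeddings

embeddings = CustomVertexAIEmbeddings(requests_per_minute=100, num_instances_per_batch=5)
vectors = embeddings.embed_documents(['AlloyDB supports the pgvector extension.'])
print(len(vectors[0]))  # textembedding-gecko returns 768-dimensional vectors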

document_reader_utility.py — This utility downloads documents from a Google Cloud Storage bucket and reads them.

import json
import os
import shutil

import vertexai
from google.cloud import storage
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


class DocumentReader:
    def __init__(self) -> None:
        config = self._read_config()
        self.gcs_folder_prefix = config['gcs_prefix']
        self.gcp_project_name = config['project_nm']
        self.bucket_name = config['gcs_bucket_nm']
        vertexai.init(project=self.gcp_project_name)

    def _read_config(self):
        with open('config.json') as config_file:
            return json.load(config_file)

    def _document_loader(self):
        # Load every downloaded PDF into LangChain documents.
        documents = []
        for file in os.listdir('./Data/'):
            if file.endswith('.pdf'):
                loader = PyPDFLoader('./Data/' + file)
                documents.extend(loader.load())
        return documents

    def text_splitter(self):
        documents = self._document_loader()
        splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
        return splitter.split_documents(documents)

    def download_file_from_GCS(self):
        dl_dir = 'Data/'
        storage_client = storage.Client()
        bucket = storage_client.get_bucket(self.bucket_name)
        for blob in bucket.list_blobs(prefix=self.gcs_folder_prefix):
            if blob.name.endswith('/'):
                continue  # Skip folder placeholder objects.
            filename = blob.name.replace('/', '_')
            blob.download_to_filename(dl_dir + filename)

    def create_temp_data_dir(self):
        os.makedirs('Data', exist_ok=True)

    def remove_dir(self):
        shutil.rmtree('./Data', ignore_errors=True)

embeddings_utility.py — This wrapper generates vectors from text data and inserts the text, together with its vector embedding, into the pdf_content_embeddings table in AlloyDB.

from alloydb_utility import AlloyDB
from custom_vertex_ai_embeddings import CustomVertexAIEmbeddings


class EmbeddingsUtility:

    def __init__(self, docs):
        self.docs = docs

    def generate_embeddings_client(self):
        EMBEDDING_QPM = 100
        EMBEDDING_NUM_BATCH = 5
        return CustomVertexAIEmbeddings(
            requests_per_minute=EMBEDDING_QPM,
            num_instances_per_batch=EMBEDDING_NUM_BATCH)

    def create_embeddings(self):
        # Parameterised insert; the column is named `embedding`, matching the DDL above.
        insert_statement = 'INSERT INTO pdf_content_embeddings(content, embedding) VALUES (%s, %s)'
        embeddings_client = self.generate_embeddings_client()
        alloydb = AlloyDB()
        for doc in self.docs:
            # embed_documents expects a list of strings.
            embedding = embeddings_client.embed_documents([doc.page_content])[0]
            alloydb.insert_embeddings(insert_statement, (doc.page_content, str(embedding)))

    def get_embeddings(self):
        embeddings_client = self.generate_embeddings_client()
        return embeddings_client.embed_documents([self.docs])[0]

generate_answers.py — This utility performs a similarity search in AlloyDB to find the content most similar to the question, then generates an answer from the relevant descriptions.
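
pgvector's <=> operator returns cosine distance, so 1 - (embedding <=> query) in the query below gives cosine similarity. As a standalone sketch (the 3-dimensional literal here is purely illustrative; real rows in our table hold 768-dimensional vectors, so this exact literal would not run against it):

SELECT content, 1 - (embedding <=> '[0.1, 0.2, 0.3]') AS similarity
FROM pdf_content_embeddings
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;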

from alloydb_utility import AlloyDB
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.llms import VertexAI


class fetchResults:
    def __init__(self, question, question_embeddings):
        self.matches = []
        self.question = question
        self.question_embeddings = question_embeddings
        self.similarity_threshold = 0.7
        self.num_matches = 5

        self.llm = VertexAI(
            model_name='text-bison@001',
            max_output_tokens=256,
            temperature=0.1,
            top_p=0.8,
            top_k=40,
            verbose=True,
        )

        self.map_prompt_template = """
        You will be given a detailed description of BigQuery feature information.
        This description is enclosed in triple backticks (```).
        Using this description only, extract the summary of the description.

        ```{text}```
        SUMMARY:
        """
        self.combine_prompt_template = """
        You will be given a detailed description of BigQuery feature information
        enclosed in triple backticks (```) and a question enclosed in
        double backticks (``).
        Select the description which is most relevant to answer the question.
        Using that selected description, answer the following
        question in as much detail as possible.
        You should only use the information in the description.
        Your answer should be less than 200 words.
        Your answer should be in a bulleted list format wherever required.


        Description:
        ```{text}```


        Question:
        ``{user_query}``


        Answer:
        """

    def similarity_search(self):
        # <=> is pgvector's cosine distance operator, so
        # 1 - (embedding <=> question) is the cosine similarity.
        self.get_matching_contents_qry = f"""
        WITH vector_matches AS (
            SELECT content_id, 1 - (embedding <=> '{self.question_embeddings}') AS similarity
            FROM pdf_content_embeddings
            WHERE 1 - (embedding <=> '{self.question_embeddings}') > {self.similarity_threshold}
            ORDER BY similarity DESC
            LIMIT {self.num_matches}
        )
        SELECT content FROM pdf_content_embeddings
        WHERE content_id IN (SELECT content_id FROM vector_matches)
        """

        results = AlloyDB().select_query(self.get_matching_contents_qry)
        if len(results) == 0:
            raise Exception("Did not find any results. Adjust the query parameters.")
        for r in results:
            # Collect the text content for all the matches to the question.
            self.matches.append(r[0])
        return self.matches

    def form_answer(self):
        matches = self.similarity_search()

        map_prompt = PromptTemplate(template=self.map_prompt_template, input_variables=["text"])
        combine_prompt = PromptTemplate(template=self.combine_prompt_template,
                                        input_variables=["text", "user_query"])
        docs = [Document(page_content=t) for t in matches]
        chain = load_summarize_chain(
            self.llm, chain_type="map_reduce", map_prompt=map_prompt, combine_prompt=combine_prompt
        )
        answer = chain.run(
            {
                "input_documents": docs,
                "user_query": self.question,
            }
        )
        return answer

main.py — Driver program for the utility

import sys

from alloydb_utility import AlloyDB
from document_reader_utility import DocumentReader
from embeddings_utility import EmbeddingsUtility
from generate_answers import fetchResults


if __name__ == '__main__':
    # Establish the database connection up front.
    db = AlloyDB()

    mode = sys.argv[1] if len(sys.argv) > 1 else ''

    if mode == 'build':
        # Source document reader
        document_reader_obj = DocumentReader()
        document_reader_obj.create_temp_data_dir()
        document_reader_obj.download_file_from_GCS()
        docs = document_reader_obj.text_splitter()

        # Generate embeddings for the text data
        embeddings = EmbeddingsUtility(docs)
        embeddings.create_embeddings()

        # Remove the temporary Data folder
        document_reader_obj.remove_dir()

    elif mode == 'run':
        question_counter = 0
        while True:
            if question_counter == 0:
                question = input("Hello, how can I help you?\n\n"
                                 "You can ask me questions related to BigQuery optimization.\n\n"
                                 "Anytime you want to stop, simply say Thank you or Goodbye :) ")
            else:
                question = input("Anything else I can help with? ")
            # Keep answering until the user says goodbye or thank you.
            if 'goodbye' not in question.lower() and 'thank' not in question.lower():
                embeddings_obj = EmbeddingsUtility(question)
                ques_embeddings = embeddings_obj.get_embeddings()
                fetchResults_obj = fetchResults(question, ques_embeddings)
                print(fetchResults_obj.form_answer())
                question_counter += 1
            else:
                print('Thank you! Hope you had a great experience.')
                break

    else:
        print('Please enter a valid input, either build or run, as a command line argument')

requirements.txt — Dependencies for the utility

google-cloud-aiplatform==1.32
shapely==1.8.2
pypdf==3.10.0
google-cloud-storage==2.10.0
langchain==0.0.222
transformers==4.30.1
torch==2.0.1
psycopg2-binary==2.9.7

config.json — The GCS bucket, file prefix, and project configuration are stored in config.json.

{
    "gcs_prefix": "",
    "gcs_bucket_nm": "",
    "project_nm": ""
}

To generate embeddings and store them in AlloyDB, use:

python3 main.py build

To start the chat application, use:

python3 main.py run

This chatbot is created using the BigQuery optimization PDF document, but it can use any document stored in Google Cloud Storage. We have used the following technologies to create the chatbot:

  • textembedding-gecko: This model creates embeddings for the content extracted from the PDF documents.
  • pgvector extension: This PostgreSQL extension allows us to store and query embeddings efficiently.
  • AlloyDB for PostgreSQL: This fully managed, PostgreSQL-compatible database service from Google Cloud provides high performance and scalability for our chatbot.
  • text-bison: This foundation model generates the summaries and final answers from the retrieved content.

We successfully created a Q&A chatbot using AlloyDB and GCP’s GenAI models. I hope you learnt something new from this blog.

Thanks for reading!

Additional References:

If you want to learn more about databases or data & analytics, do check out my other blogs.
