Comparing PDF Question Answering Systems Built with OpenAI and Google VertexAI

Kelvin Lu
11 min read · Jun 12, 2023


LLMs are trained on large amounts of unstructured data, which allows them to generate text and store factual knowledge. However, they have a few limitations:

  • They are not always up-to-date. LLMs are trained on a massive dataset of text and code that can take months or even years to collect, and the training process is very expensive. This means they may not be aware of the latest information, such as recent events or scientific discoveries. For example, the most recent GPT-4 models were trained on data only up to September 2021.
  • They are not always interpretable. LLMs make predictions by looking up information stored in their parameters. This can make it difficult to understand why they make the predictions they do.
  • They are not always effective in domain-specific tasks. LLMs are trained on a general corpus of natural language data. This means that they may not be as effective for tasks that require specialized knowledge, such as medical diagnosis or legal research.

There are currently two ways to reference specific data in LLMs:

  • Fine-tuning: delta-training an LLM on a smaller dataset that is relevant to the task at hand. This can improve the accuracy and interpretability of the LLM for that task. However, the ultimate effect of fine-tuning depends on the quantity and quality of the training data, and it is significantly more expensive than using the base model. Because fine-tuning runs on a schedule, it also cannot capture very recent knowledge. And there is one major problem that makes fine-tuning not always an option: access control. Once the model has been fine-tuned on a certain dataset, that knowledge is visible to everyone who has access to the new model.
  • Insert data as context in the model prompt: this provides information that the model can use while generating the result (a minimal sketch follows this list). However, models come with a limited context size, and including all the documents as context may not fit into the allowed context window of the model.
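
As a minimal sketch of the second approach, the relevant text is simply pasted into the prompt. This is not from the original article; the context and question strings below are placeholders.

# Minimal sketch of the context-insertion approach: paste the relevant text
# straight into the prompt. The context and question strings are placeholders.
context = "Dev and test sets should come from the same distribution ..."
question = "Where should the dev and test sets come from?"

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)
print(prompt)   # this prompt would then be sent to the LLM of your choice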

Retrieval Augmented Generation (RAG) is a more recent technique that can improve the performance of LLMs. RAG retrieves data from outside the language model (non-parametric) and augments the prompts by adding the relevant retrieved data in context. This allows the model to generate more accurate and relevant responses, even when the context is large.

How Does RAG Work?

RAG models were introduced by Lewis et al. in 2020. They are a type of LLM model architecture that has two types of memory: parametric and non-parametric. The parametric memory is a pre-trained language model, and the non-parametric memory is a dense vector index of a knowledge library.

Architecture of Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) has two phases: indexing and retrieval.

  • Indexing

In the indexing phase, the raw unstructured data is first chunked into smaller pieces. These chunks are then converted to high-dimensional vectors and stored in a vector store. This is done using a technique called embedding, which represents text as vectors that capture its meaning. There are quite a few embedding technologies to choose from: word embedding, sentence embedding, and document embedding. Embedding can be done with an LLM or with other, less expensive NLP models. Vector stores are databases optimised for fast vector search, in contrast to traditional databases, which are not.

  • Retrieval

In the retrieval phase, the incoming question is first embedded into a vector, which is then sent to the vector store to find the most relevant chunks. The vector store returns the few chunks that are most similar to the question vector. These chunks are then sent to the LLM as the prompt context, and the LLM generates an answer based on the contextual knowledge in the chunks.

Indexing can be a time-consuming process, but it only needs to be done once. Retrieval, by contrast, is quick even for large datasets, because vector stores are optimised for fast vector search.
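
To make the two phases concrete, here is a toy, self-contained sketch. It fakes the embedding step with a bag-of-words vector (a real system would call an embedding model) and keeps the "vector store" as an in-memory list; the chunks and question are invented examples.

from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

chunks = [
    "L1 regularization adds an absolute-value penalty on the weights.",
    "L2 regularization adds a squared penalty on the weights.",
    "A dev set is used to tune hyperparameters.",
]

# Indexing phase (toy): embed every chunk and keep (vector, chunk) pairs in memory.
vector_store = [(embed(c), c) for c in chunks]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Retrieval phase: embed the question and return the k most similar chunks.
def retrieve(question: str, k: int = 2):
    q = embed(question)
    return sorted(vector_store, key=lambda pair: cosine(q, pair[0]), reverse=True)[:k]

for _, chunk in retrieve("What penalty does L1 regularization use?"):
    print(chunk)   # these chunks would become the LLM's prompt context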

The Data and the Data Processing

I will use Andrew Ng’s freely available machine learning PDF book, Machine Learning Yearning, as my data.

LangChain

LangChain is an open-source software development framework that simplifies the creation of applications that use LLMs. It can be used for tasks such as data loading, document chunking, vector stores, text embedding, and model interaction. While its presence is not required in a RAG application, it can help reduce implementation time and ensure an easily maintainable solution.

In this experiment, we are going to use LangChain to go through the whole process.

Data Loading

LangChain has tens of data loaders for getting raw data into memory. We use PyPDFLoader to load the raw PDF files:

import os
from langchain.document_loaders import PyPDFLoader

# PDF data loading
DATA_FOLDER = '../data'

def pdf_loader(data_folder=DATA_FOLDER):
    # Create one PyPDFLoader per PDF file found in the data folder
    print([fn for fn in os.listdir(data_folder) if fn.endswith('.pdf')])
    loaders = [PyPDFLoader(os.path.join(data_folder, fn))
               for fn in os.listdir(data_folder) if fn.endswith('.pdf')]
    print(f'{len(loaders)} file loaded')
    return loaders

Splitter, Tokeniser, Embedding, and LLM

Once the raw PDF has been loaded into an in-memory list, it is chunked by a splitter, and each chunk is embedded into a high-dimensional vector. Finally, the vectors are stored in a vector database.

Here, we are going to try both OpenAI and Google VertexAI implementations. The code piece is:

from langchain.llms import OpenAI, VertexAI
from langchain.chains import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings, VertexAIEmbeddings
from langchain.vectorstores import Chroma

def build_qa_chain(platform: str = 'openai', chunk_size: int = 1000, chunk_overlap: int = 50) -> RetrievalQA:
    if platform == 'openai':
        embedding = OpenAIEmbeddings()
        splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        # splitter = CharacterTextSplitter(chunk_size=5000, chunk_overlap=0)
        llm = OpenAI(model_name="text-davinci-003",
                     temperature=0.9,
                     max_tokens=256)
    elif platform == 'palm':
        embedding = VertexAIEmbeddings()
        splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        llm = VertexAI(model_name="text-bison@001",
                       project='<your own GCP project_id>',
                       temperature=0.9,
                       top_p=0,
                       top_k=1,
                       max_output_tokens=256)

Here, we will use the OpenAI text-davinci-003 model and the Google PaLM text model text-bison@001 for comparison. Both models have input token length limits. The input token limit for text-davinci-003 is 4097 and 8196 for text-bison@001.
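
Since both models cap the number of input tokens, it can be worth counting tokens before sending a prompt. Below is a small sketch using the tiktoken package (the same tokeniser the splitter above relies on); the prompt string is just an example.

import tiktoken

# Count tokens the way text-davinci-003 does, to check that a prompt
# (question plus retrieved chunks) stays under the model's input limit.
encoding = tiktoken.encoding_for_model("text-davinci-003")
prompt = "What is the difference between L1 and L2 regularization?"
print(len(encoding.encode(prompt)), "tokens")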

Because eventually we will need to send the chunks along with the prompt and question to the LLM, we must limit the size of the sliced chunks. We implement this by using:

CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

We can specify the token length limit and an overlap between chunks, so that no information is lost at the chunk boundaries.
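
A quick way to sanity-check the chunking parameters is to run the splitter on a sample text and look at how many chunks come out; the sample string below is a placeholder.

from langchain.text_splitter import CharacterTextSplitter

# Run the splitter on a sample string and inspect the resulting chunks.
splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=50)
sample_text = "..."   # e.g. the raw text of a few PDF pages
chunks = splitter.split_text(sample_text)
print(len(chunks), "chunks")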

The embedding is done by either OpenAIEmbeddings() or VertexAIEmbeddings(), depending on the chosen environment. For the OpenAI environment, the default embedding model is text-embedding-ada-002, which produces a 1536-dimensional vector. For the GCP Vertex AI environment, the default embedding model is textembedding-gecko@001, which produces a 768-dimensional vector.

Please note that embedding models accept much smaller inputs than the generative LLMs. For example, textembedding-gecko@001 was optimised for inputs of up to 1024 tokens, while the limit for text-embedding-ada-002 is 8191 tokens. This is another important factor to consider when choosing the right chunk size.
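
To confirm the dimensionality of whichever embedding model is in use, you can embed a short query and inspect the vector length. This assumes the relevant API credentials are already configured.

from langchain.embeddings import OpenAIEmbeddings, VertexAIEmbeddings

# Embed a short string and check the vector length for each provider.
print(len(OpenAIEmbeddings().embed_query("hello world")))    # expected: 1536
print(len(VertexAIEmbeddings().embed_query("hello world")))  # expected: 768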

Vector Store

    index = VectorstoreIndexCreator(
        embedding=embedding,
        text_splitter=splitter).from_loaders(loaders)

The code above tells LangChain to create a vector store. By default, the underlying vector store is implemented with ChromaDB, an easy-to-use open-source vector database.

In the vector database, the content of each chunk and its feature vector are saved together. For people familiar with SQL databases, the vectors can be thought of as the keys and the chunks as the data payload; unlike a traditional database, however, a vector database supports similarity search rather than exact key matching.

The code above completes the indexing process: loading the data, splitting the documents into chunks, embedding the chunks, and storing them in a vector database. Here, the data is stored in ChromaDB.
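
Before wiring the store into a QA chain, it can be useful to query it directly and eyeball the retrieved chunks. A short sketch against the index built above:

# Query the Chroma store directly to see which chunks a question pulls back.
docs = index.vectorstore.similarity_search(
    "What is the difference between L1 and L2 regularization?", k=4)
for doc in docs:
    print(doc.metadata.get("page"), doc.page_content[:80])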

RetrievalQA

    return RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",
                                       retriever=index.vectorstore.as_retriever(
                                           search_type="similarity",
                                           search_kwargs={"k": 4}),
                                       return_source_documents=True,
                                       input_key="question")


qa_chain = build_qa_chain('openai', chunk_overlap=0)

The retrieval phase is handled by the agent created in the code above; in LangChain’s terms, it is a RetrievalQA chain. The agent takes the user’s query, embeds it into a vector, retrieves the relevant document chunks from the vector store, sends them to the LLM, and finally passes the LLM’s completion back to the user.

It’s worth inspecting the parameters to understand how the chain was built:

  • llm: defines the LLM model to use.
  • retriever: defines which vector store to retrieve information from and with which policy. It has two additional parameters:
  • - search_type: how to select chunks from the vector store. There are two options: similarity and MMR (Maximal Marginal Relevance). Similarity selects the chunks most similar to the query. MMR also performs a similarity search, but it diversifies the selected chunks rather than returning a very narrow set of results (see the sketch after this list).
  • - search_kwargs.k: the number of chunks to select. In the code above, the retriever uses a similarity search to collect 4 candidates.
  • chain_type: specifies how the RetrievalQA chain should pass the chunks to the LLM.
  • - stuff inserts all candidate chunks into a single prompt sent to the LLM.
  • - map_reduce sends the chunks to the LLM in separate batches and composes the final answer from the answers for each batch.
  • - refine separates the text into batches, feeds the first batch to the LLM, then feeds the resulting answer together with the next batch, refining the answer as it works through all the batches.
  • - map_rerank separates the text into batches, feeds each batch to the LLM, asks for a score of how fully it answers the question, and builds the final answer from the highest-scoring answers.
  • return_source_documents: whether to return the source documents in the result. Including them is helpful for understanding how the system works.
  • input_key: the input is a JSON string; input_key specifies which key holds the query.
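
For contrast, the sketch below builds the same chain with an MMR retriever and the map_reduce chain type; everything else is unchanged from the code above, and the variable name qa_chain_mmr is only for illustration.

# Same chain as above, but with MMR retrieval (diversified chunks) and the
# map_reduce chain type, which queries the LLM per batch and merges the answers.
qa_chain_mmr = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=index.vectorstore.as_retriever(search_type="mmr",
                                             search_kwargs={"k": 4}),
    return_source_documents=True,
    input_key="question")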

Answering

Once the building blocks are in place, running a query takes just one call:

result = qa_chain({'question': 'What is the difference between L1 and L2 regularization?', 'include_run_info': True})
print('Q:', result['question'])
print('A:', result['result'])

After running the code, I got the following output:

['andrew-ng-machine-learning-yearning-1.pdf']
1 file loaded
4
Time span for building index: 5.056659261
Time span for query: 2.923764822999999
Q: What is the difference between L1 and L2 regularization?
A: L1 regularization adds a penalty proportional to the absolute value of the weights to the cost function, while L2 regularization adds a penalty proportional to the square of the weights to the cost function.

The Full Code Piece

import os
from time import perf_counter
from langchain.document_loaders import PyPDFLoader
from langchain.llms import OpenAI, VertexAI
from langchain.chains import RetrievalQA
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings, VertexAIEmbeddings
from langchain.vectorstores import Chroma

DATA_FOLDER = '../data'


def pdf_loader(data_folder=DATA_FOLDER):
    # Create one PyPDFLoader per PDF file found in the data folder
    print([fn for fn in os.listdir(data_folder) if fn.endswith('.pdf')])
    loaders = [PyPDFLoader(os.path.join(data_folder, fn))
               for fn in os.listdir(data_folder) if fn.endswith('.pdf')]
    print(f'{len(loaders)} file loaded')
    return loaders


def build_qa_chain(platform: str = 'openai', chunk_size: int = 1000, chunk_overlap: int = 50) -> RetrievalQA:
    # Pick the embedding model, splitter, and LLM for the chosen platform
    if platform == 'openai':
        embedding = OpenAIEmbeddings()
        splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        # splitter = CharacterTextSplitter(chunk_size=5000, chunk_overlap=0)
        llm = OpenAI(model_name="text-davinci-003",
                     temperature=0.9,
                     max_tokens=256)
    elif platform == 'palm':
        embedding = VertexAIEmbeddings()
        splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        llm = VertexAI(model_name="text-bison@001",
                       project='<your own GCP project_id>',
                       temperature=0.9,
                       top_p=0,
                       top_k=1,
                       max_output_tokens=256)

    # Indexing: load the PDFs, split them into chunks, embed the chunks, and store them
    loaders = pdf_loader()
    index = VectorstoreIndexCreator(
        embedding=embedding,
        text_splitter=splitter).from_loaders(loaders)
    print(len(index.vectorstore.get()))

    # Prepare the pipeline
    return RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",
                                       retriever=index.vectorstore.as_retriever(
                                           search_type="similarity",
                                           search_kwargs={"k": 4}),
                                       return_source_documents=True,
                                       input_key="question")


tick = perf_counter()
qa_chain = build_qa_chain('openai', chunk_overlap=0)
print(f'Time span for building index: {perf_counter() - tick}')

# get reply to our questions
tick = perf_counter()
result = qa_chain({'question': 'What is the difference between L1 and L2 regularization?', 'include_run_info': True})
print(f'Time span for query: {perf_counter() - tick}')

print('Q:', result['question'])
print('A:', result['result'])
print('\n')
print('Resources:', result['source_documents'])

ChatGPT vs. VertexAI

So far, the RAG pipeline works with both the GCP and OpenAI LLMs. Let’s run an experiment to compare their performance.

  • Speed

The indexing phase of a RAG pipeline is much slower than querying because every chunk has to be processed. To my surprise, I found that in both indexing speed and query speed, OpenAI performs much better than Google VertexAI.

+-----------+------------+--------------+
| Operation | OpenAI (s) | VertexAI (s) |
+-----------+------------+--------------+
| Indexing  |        3.8 |         29.3 |
| Querying  |        1.0 |          2.8 |
+-----------+------------+--------------+

Please remember that the indexing and querying phases use different models: indexing speed depends on the performance of the embedding model, while querying speed depends on the performance of the large language model (LLM). In addition, for both OpenAI and VertexAI, the embedding models and the LLMs have rate limits to protect them from being overloaded, and VertexAI’s rate limits are much tighter:

Rate limit of OpenAI models:

(Figure: OpenAI rate limits)

1 TPM equals 1 token per minute for the davinci models and 200 tokens per minute for the ada models (the family we use for the embedding).

Rate limit of VertexAI models:

(Figure: VertexAI embedding model spec)
(Figure: VertexAI PaLM spec)

Comparing the specs of the OpenAI and VertexAI embedding models, we can see that the OpenAI embedding rate limit is roughly 10x higher than VertexAI’s, which aligns with the indexing speed comparison.

Besides the indexing speed gap, which is mainly caused by the rate limits, we can see that the VertexAI LLM also runs slower than its ChatGPT counterpart.
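
When a rate limit is hit, the providers return an error rather than queueing the request, so in practice calls are often wrapped in a retry with exponential backoff. Below is a minimal, provider-agnostic sketch; the helper name and the wrapped call are placeholders, not part of the original article.

import time

def with_backoff(call, retries=5, base_delay=1.0):
    # Retry a provider call with exponential backoff when a rate limit is hit.
    for attempt in range(retries):
        try:
            return call()
        except Exception:   # in practice, catch the provider's rate-limit error class
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# e.g. answer = with_backoff(lambda: qa_chain({'question': question}))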

  • Answering Quality

Let’s compare the answers of five questions:

(Figure: Question answering from ChatGPT and PaLM)

Both ChatGPT and PaLM can extract sensible insights from the PDF document. They answer questions 2 and 3 very well. PaLM’s answer to question 1 is slightly better than ChatGPT’s, while ChatGPT’s answer to question 4 is much better than PaLM’s.

What I like the most is the answers to question 5. Andrew Ng’s book focuses on ML methodology rather than specific models, and he never mentions SVMs in it. From the answers, we can see that PaLM made up an irrelevant response, while ChatGPT simply admitted that no relevant information could be found in the data.

Based on the five questions, I believe ChatGPT is still leading the competition.

Conclusion

The advance of LLMs has given us an incredible new ability to analyze unstructured data. However, the technology is not yet fully ready for massive enterprise applications. There are a few challenges:

  • Throughput: Large language models (LLMs) are very costly and slow, which makes large-scale deployment difficult. This may continue to be the case for a long time, until groundbreaking technologies revolutionize the current architecture. A better solution is to use a set of models to balance speed and cost, as discussed in this post.
  • Varied completion quality: there are many commercial and open-source LLMs available, but they perform differently on different tasks. In real-world application development, we need to evaluate LLM performance against the project requirements, take the size and complexity of the project into account, and choose the right LLM for the task at hand before committing to it.
