Supercharge your knowledge base with a natural language Q&A platform!

Leverage ChatGPT (LLM) and LangChain meaningfully and responsibly

ChatGPT needs no further introduction. It is insanely popular, became the fastest-growing consumer application in history within months of launch, and is still gaining capabilities every day. However, its knowledge is limited to what it was trained on, which is generally publicly available internet data up until 2021. It also cannot access private or recent data, e.g. an internal knowledge base within an organization.

To facilitate the rest of this post, it’s useful to understand that ChatGPT is a chatbot created by OpenAI that runs on Generative Pre-trained Transformers (GPTs), a class of Large Language Models (LLMs). Some common tasks you and I have probably tried on ChatGPT’s interface are text summarization, translation and text generation (think dad jokes).

In this post, I wanted to document some of my thoughts and experiments with using LLMs (in particular, ChatGPT) and LangChain to provide a natural language Q&A platform over a knowledge base, curated from both publicly available data that is constantly being updated and internal private data sources.

Reimagining the Need for a New Solution

Allow me to start with the use case, explain the conundrums I faced, and show how all of this led me to a completely new way of building solutions with LLMs.

I used to develop an app that curates and provides drivers with the latest parking rates across major carparks in Singapore. It started as an experiment and a hobby, but eventually updating the database of parking rates was taking too much of my time and was not sustainable. It amazed me (and still does) how frequently carparks in Singapore change their pricing! If only there were a less resource-intensive way to curate publicly available datasets of carpark rates (e.g. OneMotoring, websites of major malls, hospitals, etc.) and mash them up with privately curated data (e.g. feedback I receive from users, my own visits to various destinations).

When ChatGPT was launched, I thought it could be the solution. While it did generate a response when I tested it, it most certainly did not have the updated rates, and rightly so, since it was trained on internet data up until 2021. To get this to work, I needed a way to “update” ChatGPT with the latest data.

Helping ChatGPT “Learn” to Provide More Accurate Responses

There are generally two approaches to enhance an existing LLM with custom knowledge.

  1. Fine-tuning involves further training the LLM on customized datasets, which can improve its precision and comprehensiveness. However, developing and deploying a custom model requires substantial time and compute resources.
  2. Prompt engineering entails providing a relevant subset of the customized dataset, pertaining to the user’s query, as context during the querying process. This approach is much simpler and more cost-effective, but its effectiveness is constrained by the model’s token limit, i.e. there is a ceiling on the amount of contextual data that can be fed to the model. For this use case, prompt engineering is clearly the simpler and cheaper route (a minimal sketch of the pattern follows this list).
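To make the prompt engineering pattern concrete, here is a minimal sketch of how contextual data gets stuffed into the prompt alongside the user’s question. This is a hypothetical snippet: the mall, rates and question are invented for illustration, and it calls the OpenAI chat completions API (pre-1.0 openai library) directly.

import openai

# Hypothetical chunk of contextual data pulled from our own dataset
context = "Example Mall: $1.20 per hour for the first 2 hours, $0.60 per subsequent 30 minutes."
question = "What are the parking rates at Example Mall?"

# Stuff the context into the prompt so the model answers from our data,
# not just from what it saw during training
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])

The rest of this post essentially automates this pattern: deciding which chunks of our own data to stuff into the prompt for any given question.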

After mucking around with the OpenAI, LangChain, LlamaIndex and Pinecone documentation and a few Colab notebooks, I had an architecture and approach in mind:

  1. Crawl and download content of publicly available websites
  2. Chunk and convert the content from the websites into vector embeddings, and store them into a vector database for fast retrieval to service subsequent user requests
  3. Perform Retrieval Augmented Generation: for each user request, retrieve the chunks whose embeddings are most similar to the request from the vector database and send them as contextual information as part of the query to ChatGPT
  4. ChatGPT provides a response (hopefully, accurate parking rates that the user queried for)

As a visual representation:

Architecture for the natural language Q&A platform for parking rates using ChatGPT, LangChain and Pinecone (diagram by Desmond Loh)

As a sidenote, I recommend LangChain (it’s the one I tried) as an interface for working with LLMs. It abstracts away some of the complexity and provides a consistent API if one ever decides to use a different LLM instead of ChatGPT. It is also a great toolkit in its own right, offering utilities like document loaders and connectors to a variety of vector databases.
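For example, because LangChain wraps different providers behind a common interface, swapping ChatGPT out for another model is largely a one-line change while the rest of the chain stays intact. A sketch follows; the HuggingFaceHub model shown is a hypothetical alternative, not one I have tested here.

from langchain.chat_models import ChatOpenAI
from langchain.llms import HuggingFaceHub

# Both objects expose the same LangChain interface, so prompts,
# chains and retrievers built on top of them don't need to change
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# llm = HuggingFaceHub(repo_id="google/flan-t5-xl")  # hypothetical swap to a different provider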

Implementing Step 1 — Crawl and download content of publicly available websites

As described earlier, there are many sources of publicly available carpark rate data, such as LTA/OneMotoring and the websites of major malls and hospitals. This step is made really simple with the tools provided in LangChain. There isn’t even a need to extract the relevant data (e.g. the table of parking rates) from each webpage; with the power of the LLM, we can simply consume the whole page and let the model “figure it out” later.

from langchain.document_loaders import UnstructuredURLLoader

# URLs to crawl data from
urls = [
    'https://onemotoring.lta.gov.sg/content/onemotoring/home/owning/ongoing-car-costs/parking/parking_rates.1.html',
    'https://onemotoring.lta.gov.sg/content/onemotoring/home/owning/ongoing-car-costs/parking/parking_rates.2.html',
    'https://onemotoring.lta.gov.sg/content/onemotoring/home/owning/ongoing-car-costs/parking/parking_rates.3.html',
    'https://onemotoring.lta.gov.sg/content/onemotoring/home/owning/ongoing-car-costs/parking/parking_rates.4.html',
    'https://onemotoring.lta.gov.sg/content/onemotoring/home/owning/ongoing-car-costs/parking/parking_rates.5.html',
    'https://onemotoring.lta.gov.sg/content/onemotoring/home/owning/ongoing-car-costs/parking/parking_rates.6.html',
    'https://onemotoring.lta.gov.sg/content/onemotoring/home/owning/ongoing-car-costs/parking/parking_rates.8.html'
]

# Crawl and load data from URLs
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
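Before moving on, it is worth sanity-checking what the loader actually pulled down. Each item in data is a LangChain Document carrying the page text and the source URL in its metadata:

# Quick sanity check of the crawled content
print(f"Loaded {len(data)} documents")
print(data[0].metadata["source"])      # the URL this content came from
print(data[0].page_content[:500])      # first 500 characters of the page text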

Implementing Step 2 — Chunk & Convert Content into Vector Embeddings, and Store Them in a Vector Database

In this step, we chunk the data we crawled earlier into smaller fixed-size documents because LLMs generally have a limit on the amount of text they can deal with. Each chunk is then converted into a vector embedding, essentially a numerical vector with many dimensions that captures the meaning of the text. This is done with OpenAI’s embedding model (text-embedding-ada-002), and calling its API costs money based on the number of tokens converted. The embeddings are finally stored in Pinecone (a SaaS vector database) to facilitate fast similarity search and retrieval.

import os
import pinecone

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Defining some constants
EMBEDDING_MODEL = "text-embedding-ada-002"
PINECONE_INDEX = "carparks"
CHUNK_SIZE = 1000

# Initialize the Pinecone client
pinecone.init(
    api_key=os.environ.get("PINECONE_API_KEY"),
    environment="asia-southeast1-gcp"
)

# Chunk data
text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=0)
documents = text_splitter.split_documents(data)

# Generate vector embeddings and index them into Pinecone
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
vectorDB = Pinecone.from_documents(documents, embeddings, index_name=PINECONE_INDEX)
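Since the embedding API is billed by tokens, it can be worth estimating the cost before indexing a large crawl. Below is a rough sketch using tiktoken; the price per 1,000 tokens is an assumption on my part and should be checked against OpenAI’s current pricing page.

import tiktoken

# text-embedding-ada-002 uses the cl100k_base tokenizer
encoder = tiktoken.get_encoding("cl100k_base")

total_tokens = sum(len(encoder.encode(doc.page_content)) for doc in documents)
assumed_price_per_1k = 0.0004  # assumption -- verify against OpenAI's pricing page
print(f"~{total_tokens} tokens, estimated embedding cost ~USD {total_tokens / 1000 * assumed_price_per_1k:.4f}")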

Implementing Step 3 — Perform Retrieval Augmented Generation

We are now ready to accept queries from users and get ChatGPT to reply. When a user asks for the parking rates of a venue, we perform a similarity search in Pinecone to retrieve the most relevant chunks of data and send them, along with the user’s query, as a complete prompt to ChatGPT. For the chat model, I chose the latest one available publicly, gpt-3.5-turbo.

import os
import pinecone

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Defining some constants
CHATGPT_MODEL = "gpt-3.5-turbo"
CHATGPT_MODEL_TEMPERATURE = 0
CHATGPT_MODEL_MAX_TOKENS = 1000
EMBEDDING_MODEL = "text-embedding-ada-002"
PINECONE_INDEX = "carparks"

# Initialize the Pinecone client
pinecone.init(
    api_key=os.environ.get("PINECONE_API_KEY"),
    environment="asia-southeast1-gcp"
)

# Load indexes of vector embeddings from Pinecone
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
vectorDB = Pinecone.from_existing_index(embedding=embeddings, index_name=PINECONE_INDEX)

# Create prompt with context from our crawled data that has been converted to vector embeddings
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_template = """Use the following context to answer the user's question. If you don't know the answer, just say that "I do not have the relevant information", and do not try to make up an answer.
----------------
{summaries}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(model_name=CHATGPT_MODEL, temperature=CHATGPT_MODEL_TEMPERATURE, max_tokens=CHATGPT_MODEL_MAX_TOKENS)
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorDB.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs
)

def print_result(result):
    # Note: 'query' is read from the enclosing scope
    output_text = f"""
[Your Question]:
{query}

[Answer]:
{result['answer']}

[All relevant sources]:
{' '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
"""
    print(output_text)

Implementing Step 4 — ChatGPT Provides a Response

Time to take our bot out for a test drive!

Query 1: Let’s start with something simple.

query = "Parking rates for Bedok Mall?"
result = chain(query)
print_result(result)

Results 1: All the facts have been regurgitated, including the note on the website stating that rates were updated as of Jan 22.

Query 2: Increasing the difficulty by specifying the time period we are interested in.

query = "What are the parking rates for 313@Somerset on Saturday?"
result = chain(query)
print_result(result)

Results 2: Great work. Note, though, that the style of the response is not consistent with the first one. Nonetheless, the results are accurate and so are the sources.

Query 3: Let’s get the bot to contextualize what day it is today.

query = "Parking rates for Goodwood Park Hotel today?"
result = chain(query)
print_result(result)

Results 3: Unfortunately, this doesn’t seem to work as I expected. However, the source has at least been identified correctly.

Query 4: Lastly, let’s ask ChatGPT for something that is not in the dataset.

query = "Parking rates for Jem?"
result = chain(query)
print_result(result)

Results 4: As expected, it replies that it doesn’t know and doesn’t hallucinate. But what’s unexpected is the source being quoted. Perhaps it’s a bug on my end. Will need to figure this one out.

Reflecting on the Exercise and Future Work

Going back to the initial problem I was trying to solve, this is definitely a potential solution (although it might be overkill): publicly available websites and internal datasets can be quickly and loosely curated without the additional effort of extracting, transforming and loading data points into traditional SQL databases.

There’s still a lot more to be done to get this to production quality. Some future work includes:

  1. Have ChatGPT generate more consistent and professional-looking replies. One way could be to incorporate few-shot in-context learning by providing sample queries and responses for ChatGPT to “follow” in subsequent replies.
  2. Have ChatGPT understand the context of weekdays, weekends and “today”. If we can’t rely on ChatGPT for this (I’m still not sure if it’s just a bug I need to fix somewhere), we could handle it at the software/UI layer, e.g. converting “today” into the actual day of the week before passing the query to ChatGPT (see the sketch after this list).
  3. A web/mobile app interface for users to interact with. I have some interest in playing around with Svelte, but let’s see how this goes.
  4. Reducing the cost of using ChatGPT. There are two types of costs, embeddings generation and completions (i.e. ChatGPT’s replies after you send your prompt over), and both are priced based on the number of tokens processed. On the completion side, one way to reduce cost could be to add a caching layer so that repeated queries are answered from a cache instead of going through ChatGPT.
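On point 2, here is a minimal sketch of handling “today” at the application layer before the query ever reaches ChatGPT (a naive string replacement, purely for illustration):

from datetime import date

def resolve_today(query: str) -> str:
    """Naively expand 'today' into the actual day of the week before querying."""
    day_name = date.today().strftime("%A")  # e.g. "Saturday"
    return query.replace("today", f"today ({day_name})")

query = resolve_today("Parking rates for Goodwood Park Hotel today?")
result = chain(query)
print_result(result)

On point 4, LangChain also ships an in-memory LLM cache (set via langchain.llm_cache) that could serve as a starting point for the caching layer.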

Finally, it has been great fun tinkering with the various toolkits and APIs. After understanding the concepts (the more challenging part), it wasn’t too difficult to get started, as a lot of the work was orchestrating the flow of data across different APIs and platforms. However, it would probably take a lot more effort and time to fully understand the nuances of the various parameters we have access to (e.g. temperature, token sizes) and their impact on the accuracy, cost and generation speed of the final response from the LLM.

I would love to hear your comments and suggestions, especially if it could help me shortcut the learning/troubleshooting process :)

Until the next post!
