Integrating ChatGPT with internal knowledge base and question-answer platform

Bring the power of ChatGPT to internal knowledge management

--

ChatGPT is highly adept at providing general information, albeit with some limitations. Meanwhile, internal knowledge management is becoming more and more critical in a post-pandemic world with hybrid working and higher employee turnover, according to Gartner. How can we bring the power of ChatGPT to internal knowledge management?

In this post, I will outline a simple method to achieve this. In our case, we have an internal knowledge base and an internal Q&A platform similar to StackOverflow.

Before we dive in, let me clarify some terms. ChatGPT is a chatbot powered by a series of GPTs, first GPT-3.5 and now GPT-4. GPTs (Generative pre-trained transformers) are language models under the big umbrella of Large Language Models (LLMs), which also include Google’s LaMDA and PaLM models (which power Bard chatbot) and open models like BLOOM or GPT-Neo-X. These are AI models purpose-built for various natural language processing tasks, such as text generation, classification, question-answering, summarisation, and translation.

“Large” refers to the model’s size, which is measured by the number of parameters it contains. GPT-3 has 175 billion parameters, while GPT-4 is expected to have an order of magnitude more. Google’s LaMDA, which powers Bard, has 137 billion while the new model, PaLM, has 540 billion.

The case for customisation

Let’s test ChatGPT with a simple query related to a product from our Singapore Government Tech Stack (SGTS).

Below is the response from ChatGPT (GPT-3.5):

At least it admits that it doesn’t know. Let’s try to provide more context:

ChatGPT’s answer is horrendously wrong. SHIP stands for Secure Hybrid Integration Pipeline, while HATS Hive Agile Testing Solutions. They comprise a suite of CICD tools.

This situation with ChatGPT is known as bot hallucination, where the bot gives a grammatically correct but factually inaccurate answer.

The idea

To enhance an existing Large Language Model with custom knowledge, there are 2 main methods:

  • Fine-tuning: further training the model using the custom dataset.
  • In context learning: supplying the necessary information from the custom dataset related to user query while querying.

Fine-tuning provides a high degree of accuracy and completeness but it requires significant time and resources to train and host the custom model. In-context learning, on the other hand, offers greater flexibility and costs much less, but it’s limited by the model’s token limit.

As of writing, assuming a relatively-small custom dataset of 1M tokens (~200,000 words or ~400 Wikipedia articles) and an average query using 1000 tokens, fine-tuned Davinci costs $30 for training and 12 cents per query. In contrast, in-context learning costs <$1 for embedding, and at most 1 cent per query (assuming 4000 tokens per query due to additional context).

According to some sources, ChatGPT has been trained on ~500 billion words. That is 2.5 million times of the sample size above.

In-context learning is clearly a simpler solution.

Prompt Engineering and Retrieval Augmented Generation

We can already enhance the quality of ChatGPT’s answer simply by providing it additional context. There’s a whole new discipline dedicated to this called prompt engineering, aimed at developing and optimising prompts to use language models efficiently.

Prompt engineering applied to the case of an internal knowledge base means feeding relevant data from the knowledge base to ChatGPT every time we interact with it. You can imagine how troublesome this can quickly become. The process needs to be automated.

This is where Retrieval Augmented Generation workflow comes in.

The idea is simple. Instead of asking a question directly, the process first uses the user question to perform a search to retrieve relevant documents from the internal dataset and then provides these documents together with the question to ChatGPT. With the additional context, ChatGPT can answer as though it has been trained with the internal dataset.

So roughly speaking, instead of simply <query> it’d be:

answer following question given <relevant texts>, <query>

Here’s a simple diagram for this process:

Implementation

Following is an implementation of RAG Q&A in Python. The Jupiter notebook is hosted on Google Colab.

The dataset

For demo purpose, I’m using public data provided by GovTech SGTS team. These are documents for https://www.developer.tech.gov.sg hosted in the GitHub repo https://github.com/GovTechSG/developer.gov.sg.

The workflow (aka chain)

I use the Langchain library, which enables chaining together multiple capabilities including integration with LLMs (e.g. ChatGPT). In this case, it allows me to implement the Retrieval-Augmented Generation Q&A process as described above. Alternatives to Langchain include LlamaIndex (built on top of Langchain) and Haystack.

The document database

The document database we need to build is going to be a vector database instead of a traditional database.

A vector database indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations, metadata filtering, and horizontal scaling. Vector databases excel at similarity search, or “vector search.” Vector search enables users to describe what they want to find without having to know which keywords or metadata classifications are ascribed to the stored objects. Vector search can also return results that are similar or near-neighbor matches, providing a more comprehensive list of results that otherwise may have remained hidden. — https://www.pinecone.io/learn/vector-database/

To store documents as vectors, a vector database requires a process called embedding to convert each word into a vector of hundreds or thousands of different dimensions. For example, OpenAI Ada embedding results in over 1500 dimensions.

This also means that building a database incurs some costs, depending on the size of the database. But this is negligent compared to the training cost of a fine-tuned model. With 1,000,000 tokens, it costs only 40 cents (contrast that with $30,000 for training).

I use the FAISS library as the database, which stands for Facebook AI Similarity Search. I can save the FAISS database to local files and load it later on for querying. This cuts down the cost of building the database. Other than FAISS, Langchain supports Chroma, Pinecone, Weaviate, OpenSearch, and many others.

Building the database

Install required packages

Install langchain openai faiss-cpu packages.

Set up OPEN_API_KEY and necessary variables

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI API key here and hit enter:")

REPO_URL = "https://github.com/GovTechSG/developer.gov.sg" # Source URL
DOCS_FOLDER = "docs" # Folder to check out to
REPO_DOCUMENTS_PATH = "collections/_products/categories/devops/ship-hats" # Set to "" to index the whole data folder
DOCUMENT_BASE_URL = "https://www.developer.tech.gov.sg/products/categories/devops/ship-hats" # Actual URL
DATA_STORE_DIR = "data_store" # Folder to save/load the database

I specify a small subset with the path to keep the database small for testing. In actual application, I include everything under collections/_products.

Clone the GitHub repo

git clone $REPO_URL $DOCS_FOLDER

Load documents and split them into chunks for conversion to embeddings

repo_path = pathlib.Path(os.path.join(DOCS_FOLDER, REPO_DOCUMENTS_PATH))
document_files = list(repo_path.glob(name_filter))

def convert_path_to_doc_url(doc_path):
# Convert from relative path to actual document url
return re.sub(f"{DOCS_FOLDER}/{REPO_DOCUMENTS_PATH}/(.*)\.[\w\d]+", f"{DOCUMENT_BASE_URL}/\\1", str(doc_path))

documents = [
Document(
page_content=open(file, "r").read(),
metadata={"source": convert_path_to_doc_url(file)}
)
for file in document_files
]

text_splitter = CharacterTextSplitter(separator=separator, chunk_size=chunk_size_limit, chunk_overlap=max_chunk_overlap)
split_docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(split_docs, embeddings)

That’s it, we have vector_store as our database. The documents are split into chunks of roughly 1000 tokens each and then sent to OpenAI for embedding before being stored in the FAISS database.

An optional step is to save this to local files for future reuse with: vector_store.save_local(DATA_STORE_DIR)

And reload it using: vector_store = FAISS.load_local(DATA_STORE_DIR, OpenAIEmbeddings())

Querying

Now that we have built the database, this is the fun part where we can query our custom data.

Set up the chat model and specific prompt

from langchain.prompts.chat import (
ChatPromptTemplate,
SystemMessagePromptTemplate,
HumanMessagePromptTemplate,
)

system_template="""Use the following pieces of context to answer the users question.
Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2", use "SOURCES" in capital letters regardless of the number of sources.
If you don't know the answer, just say that "I don't know", don't try to make up an answer.
----------------
{summaries}"""
messages = [
SystemMessagePromptTemplate.from_template(system_template),
HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, max_tokens=256) # Modify model_name if you have access to GPT-4
chain = RetrievalQAWithSourcesChain.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vector_store.as_retriever(),
return_source_documents=True,
chain_type_kwargs=chain_type_kwargs
)

def print_result(result):
output_text = f"""### Question:
{query}
### Answer:
{result['answer']}
### Sources:
{result['sources']}
### All relevant sources:
{' '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}
"""

The main part of the above code is the setup of the RetrievalQAWithSourcesChain object with OpenAI’s gpt-3.5-turbo model as LLM and our vector_store database as the retriever. The prompt can be further customised for different use cases. Note also that we set the model’s temperature to 0 to make it stick to the context.

Use the chain to query

query = "What is SHIP-HATS?"
result = chain(query)
print_result(result)

Result

There we have it, the answer is correct.

In addition, it also provides the source data, which is this page: https://www.developer.tech.gov.sg/products/categories/devops/ship-hats/overview

This is extremely useful for the user to verify the answer when in doubt or to read up and find out more about the search topic. This is an important advantage over plain vanilla ChatGPT.

What happened behind the scene? — OpenAI API call

To understand what happened, it’s useful to look at the payload for the OpenAI API call here:

Relevant chunks of texts and their source have been added to the system message. This is the secret to the context-aware answer from ChatGPT.

The complete Jupiter notebook is hosted on Google Colab.

Further Enhancements

We have had a glimpse at bringing ChatGPT to internal knowledge management. As of writing, I am working on many additional enhancements.

Integration as a Telegram or Slack Bot

Now that we have the main workflow done, we can easily create a bot for it on platforms like Telegram or Slack. The bot is easily accessible by our engineering teams who use either Telegram or Slack as main communication tools.

Here’s an example on Telegram:

Integration with Q&A Platform

In my department, we have an internal Q&A platform codenamed Hivemind, which is similar to StackOverflow. The bot can be trained with additional data from Hivemind.

Furthermore, the system can allow users to select question+answer pairs to be posted to Hivemind and to post questions that the bot fails to answer on Hivemind. Over time, the bot will have a sizeable set of answers and become better as our central knowledge guru.

This diagram describes the integration between the bot and Hivemind:

Enhancing Data Security

A majority of our internal knowledge base is sensitive. As a result, using an LLM service like ChatGPT doesn’t satisfy data security requirements since a subset of the data is sent to OpenAI. We are also considering different ways to host an LLM model in our Government cloud.

What’s next

We have only touched the tip of the iceberg in the process of bringing LLM power to internal knowledge management. There’s much more to do with this, including using different LLMs or locally hosted models for data security as mentioned above, using human-assisted answers to improve accuracy and adding multimodal capabilities to support images, videos, speech.

As demonstrated, LLMs provide the potential to build powerful applications very quickly. We are only at the very beginning of a Cambrian explosion of LLM-powered applications. At the same time, there’re numerous ethical and social risks with language models. We can either stay out of the race or join it to better comprehend the powers and implications so that we can contribute meaningfully to the safe advancement of AI.

Thank you for reading. Do comment below to share your thoughts.

The complete Jupiter notebook is hosted on Google Colab.

--

--

Quy Tang
Government Digital Services, Singapore

A drop in a river, a part in a community, a student of mindfulness and compassion, towards a kinder, wiser global community