Chat with your Confluence

Majid · Published in Badal-io · Aug 3, 2023


The narrative is familiar — businesses across the board grappling with the mammoth task of data retrieval, losing valuable man-hours in the process. The struggle is real, and so is the chatter about it. Countless discussions revolve around using large language models (LLMs) to create Q&A systems for enterprise data as a solution to this age-old problem. However, amidst this echo chamber, our focus is firmly on delivering an impactful, tangible solution at scale and with speed.

In Part 1 of this series, we went over what LangChain and LlamaIndex are, spelled out how to use them, and highlighted their differences and when to use each. In this second installment, we walk through a chatbot that uses your Confluence as a data source to answer questions. We will showcase how to leverage LangChain, vector databases, and LLMs to build a quick, practical Minimum Viable Product (MVP) on Google Cloud Platform (GCP) that promises to redefine your operational efficiency. Part 3 of this blog series will focus on integrating Slack or your team’s chat UI to establish a chatbot-like experience.

This isn’t just a new toy for your tech team; it’s a practical solution designed to streamline your entire business process, freeing your team members — be they new hires, project leads, or HR professionals — from the time-sink of document trawling.

Here are the steps we will walk through:

  1. Retrieving and pre-processing documents from Confluence
  2. Ingesting the documents into Firestore as the document store
  3. Indexing: obtaining and storing embeddings of the documents in GCP’s Matching Engine as the vector database
  4. Using PaLM as the LLM to generate answers from the documents relevant to each question

Retrieving and Pre-processing Confluence

Atlassian offers both a REST API and a Python API to work with your documents programmatically. We use the Python API to retrieve our target space(s) in HTML format:

from atlassian import Confluence

# Initialize Confluence with basic authentication
# (replace username and password with your own)
confluence = Confluence(
    url='https://mywiki.atlassian.net/',
    username=username,
    password=password,
    cloud=True)


# We will index the HR space as an example
space = 'HR'

# Below we set limit=500 to retrieve all the pages at once.
# However, you can write a for loop to batch-load the pages if
# there are a ton of them.
pages = confluence.get_all_pages_from_space(space,
                                            start=0,
                                            limit=500,
                                            expand="body.storage",
                                            content_type='page')


# Each item in `pages` is a dictionary that holds lists, nested dictionaries,
# strings, etc. as metadata for that page (including id, title, etc.)
# The main HTML content of the i-th page is retrieved via this nested dictionary:
print(pages[i]['body']['storage']['value'])

Now that we have access to the HTML content of each page, how do we clean or pre-process it? The strategies vary from org to org, and one can spend several hours perfecting this step (and rightfully so). Some items to consider when parsing your HTML are tables, attachments (images), links, mentions (@person), anchors, unordered and ordered lists, etc. Start simple and evolve your pre-processing one item at a time. Focusing on tables, for example, you can use an LLM to convert the information in a table to natural language so the embedding model in the next step can better encode it into a vector. Similarly for images, an image captioning model can generate captions so that the embedding model can encode them along with the rest of the text. Later, if such a vector is retrieved as relevant to a certain query, we can go back and fetch the corresponding image from our database so it can be included in the answer to the query.

To keep it reasonably simple, we just use the html2text library to extract only the text content, in markdown format. If your Confluence pages are properly populated, this preserves adequate information for our embedding model to work with. It is worth noting that markdown is helpful since it keeps some of the hierarchical structure of the page, which we can leverage to generate higher-quality embeddings down the line.
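For reference, here is a minimal sketch of that conversion applied to the pages we retrieved above; the converter settings (ignore_links, body_width) are illustrative defaults you may want to tweak.

import html2text

# Configure the converter; these settings are illustrative, adjust to taste
converter = html2text.HTML2Text()
converter.ignore_links = False   # keep links as markdown [text](url)
converter.body_width = 0         # don't hard-wrap lines

markdown_pages = []
for page in pages:
    html = page['body']['storage']['value']
    markdown_pages.append({
        'id': page['id'],
        'title': page['title'],
        'markdown': converter.handle(html),
    })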

Loading via LlamaIndex loader

Alternatively, we can use LlamaIndex’s confluence loader. This loader uses the html2text package under the hood to convert the pages to markdown.

from llama_index import download_loader

ConfluenceReader = download_loader('ConfluenceReader')

base_url = "https://mywiki.atlassian.net/wiki"

space_key = "HR"

reader = ConfluenceReader(base_url=base_url)
documents = reader.load_data(space_key=space_key, include_attachments=False)

LlamaIndex offers a host of loaders for a variety of data sources.

Loading via Langchain loader

Another option would be to use LangChain’s Confluence loader. However, last I checked, this loader uses BeautifulSoup’s .get_text() method to dump the text content of a page into a variable. Therefore no structure (such as markdown headings) is preserved, which is sub-optimal.
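For completeness, a minimal sketch of LangChain’s loader is below, assuming an Atlassian Cloud instance and an API token; the keyword arguments follow LangChain’s ConfluenceLoader at the time of writing, so double-check them against the version you install.

from langchain.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url="https://mywiki.atlassian.net/wiki",
    username=username,   # your Atlassian account email
    api_key=api_token,   # an Atlassian API token
)
documents = loader.load(space_key="HR", include_attachments=False, limit=50)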

Some things to consider when creating embeddings for your documents

We need to split our documents into smaller chunks, as embedding models only accept a limited amount of text as input. In other words, you can’t dump an entire book into an embedding model and expect it to generate a vector. Moreover, the context size of the LLM that will later have to deal with these documents is limited (unless you are using Anthropic’s Claude). However, above all this, there is a more crucial reason.

It is important to remind ourselves of the end goal. The entire solution heavily depends on the quality of the vector embeddings. When a query is asked, it is the job of our vector database to retrieve the vectors most relevant to that query. If the vectors stored in our vector DB are shabby, the wrong set of documents will be retrieved. This wrong set will then be fed to the LLM to generate a human-like answer to the query, and the LLM will generate nonsense since it wasn’t fed the right information in the first place. Crisp, clean, accurate embedding vectors are paramount here.

Remember that the input query is encoded into an embedding and this vector is used to search our vector db for the most similar vectors. An input query is usually a short piece of text directed at one topic. Therefore, the vectors revolving around that topic will be fetched. If the documents we index into our vector db are too long and cover a diversity of topics, their embedding will be too generic and therefore will either match all queries or no queries at all. On the other hand, if the documents are too short, a lot of the context will be missing, which will result in poor embeddings as well. Ideally, we are looking to break down our documents so that each piece contains sufficient context and revolves around one or a few similar topics so that the embedding model can generate an accurate and crisp embedding for the document.

One can use a sophisticated topic modeling method to identify coherent sections on each page and split the page based on those sections. Fortunately for Confluence, an on-par heuristic is available, as each page is already divided into headings and subheadings.
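To make this concrete, here is what a hypothetical HR page might look like once converted to markdown; the headings and content are made up purely for illustration.

# Onboarding
Welcome to the team! This page covers everything a new hire needs.

## First week
Your first week includes orientation sessions and setting up your tools.

### Equipment
You will receive a laptop and accessories from IT.

## Benefits
Our benefits package covers health, dental, and a learning budget.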

This tree-like structure is helpful, as each heading/subheading/sub-subheading likely discusses a certain topic. We can split based on level 1 headings, subheadings, or, if we feel the need to get even more granular, sub-subheadings. We use LangChain’s markdown splitter for this purpose. Note that character splitters and token splitters are very popular for a quick-and-dirty POC but preserve neither the structure nor the topics.

Ingesting the documents into Firestore as the document store

First off, let’s clarify why we need backend storage along with a vector DB. Usually, vector databases (with a few exceptions) only store vectors and document IDs.

When a vector is matched, how can we access the document corresponding to it and feed that document to the LLM? You might call the Confluence API with the page id to retrieve the document, but that adds latency and forces you to redo the preprocessing step before the document is sent over to the LLM. Therefore, it is well worth spending a bit more on a document store that holds our pre-processed documents and their metadata.

The code below uses LangChain’s markdown splitter to chunk and store the documents in Firestore. Please feel free to ask any questions about the code.

# Initialize the markdown splitter
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2")
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Instantiate the firestore client
from google.cloud import firestore
db = firestore.Client()


base_url = "https://mywiki.atlassian.net/wiki/spaces/HR/pages/"
for document in documents:
    title = document.extra_info['title']
    content = document.text
    source = 'confluence'
    doc_id = document.doc_id
    url = base_url + doc_id

    md_header_splits = markdown_splitter.split_text(content)
    for i, split in enumerate(md_header_splits):
        # Build the record fresh for each split so header values
        # from a previous split don't leak into this one
        data = {'title': title,
                'source': source,
                'doc_id': doc_id,
                'url': url,
                'Header 1': '',
                'Header 2': '',
                'sub_id': i}
        data.update(split.metadata)

        if not (data['Header 1'] or data['Header 2']):
            data['content'] = f"Introduction to {data['title']}:\n" + split.page_content
        else:
            data['content'] = f"{data['title']}\n\tsubsection:{data['Header 1']}:\n\tsub_subsection:{data['Header 2']}:\n" + split.page_content

        db.collection('mycollection').document(f"conf{doc_id}_{i}").set(data)

Note that in the code above, the text content must be stored in a field named ‘content’ for the following LangChain section to work.

Indexing: obtaining and storing embeddings of the documents in GCP’s Matching Engine as the vector database

Matching Engine is the blazing-fast vector database on GCP, which is now supported by both LangChain and LlamaIndex. All it needs to create an index over your data is a JSON list: a .json file where each line is a JSON object containing the id of a document and the vector embedding of that document:

{"id": 'conf5807875_0', "embedding": ["-0.027", "-0.026", "-0.007", "0.058"]}
{"id": 'conf5807876_1', "embedding": ["-0.009", "-0.010", "-0.016", "0.055"]}
{"id": 'conf2918594_0', "embedding": ["-0.002", "-0.045", "-0.008", "0.006"]}
.
.
.

We prepare this JSON list by stepping through our database. For id, we simply use the primary key of the Firestore data, i.e., conf{doc_id}_{i} in the code above, where doc_id is the id of that Confluence page and i is the subsection id (which heading of that page the chunk came from). Note that our vector DB (Matching Engine) and our document store (Firestore) must share the same primary key so we can use LangChain question answering.

Note that you can tag and label your vectors for filtering purposes just by adding more fields to the json. Read more here.

PaLM as the embedding model

PaLM is Google’s LLM, available through the Vertex AI library. PaLM’s embedding model is optimized for input texts of up to 1024 tokens and generates 768-dimensional vectors. PaLM’s generation model, however, has a context size of 8192 tokens, which is considerably larger than ChatGPT’s. Find more specifications here.

Although we have chunked our Confluence pages into headings and subheadings, some chunks may still be larger than 1024 tokens (however unlikely). One solution could be to break the document down even further, into sub-subheadings and beyond, to ensure each chunk is shorter than 1024 tokens.

Another solution is a sliding window function that generates embeddings by averaging the embeddings of segments of size 1024 with some overlap, kind of like convolutions:

import numpy as np
from tenacity import retry, stop_after_attempt, wait_random_exponential
from vertexai.preview.language_models import TextEmbeddingModel
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")

embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

window_size = 1024  # tokens
overlap_size = 100  # tokens
step = window_size - overlap_size

@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(5))
def embedding_model_with_backoff(text=[]):
    embeddings = embedding_model.get_embeddings(text)
    return embeddings

def get_batch(text, batch_size=5):
    # Slide a window of `window_size` tokens over the text with some overlap
    batch = []
    tokens = tokenizer.encode(text)
    for i in range(0, len(tokens), step):
        window_text = tokenizer.decode(tokens[i:i+window_size])
        batch.append(window_text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def PaLM_embed(text):
    vector_list = []
    text = text.strip()
    if text != '':
        for batch in get_batch(text):
            try:
                vector_objects = embedding_model_with_backoff(batch)
            except Exception as e:
                print(f"error: {e}")
                continue  # skip this batch if it still fails after retries
            vectors = list(map(lambda x: x.values, vector_objects))
            vector_list.extend(vectors)

        return np.mean(vector_list, axis=0)
    else:
        raise ValueError("Empty document can't be embedded")

We can think of ways to improve upon this simple averaging. For instance, instead of a simple average, we can use a weighted average where the beginning and the end of a segment are assigned larger weights. Or some would argue that only the beginning of each segment is material, and therefore we can truncate all segments to their first 1024 tokens. This is one area where, in my opinion, some experimentation can surface the best scheme.
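As an illustration, a weighted variant of the averaging above could look like the sketch below; the weighting scheme (doubling the first and last windows) is an arbitrary choice and would need experimentation.

import numpy as np

def weighted_window_average(window_vectors):
    """Average window embeddings, weighting the first and last windows more heavily.
    The weighting scheme here is purely illustrative."""
    n = len(window_vectors)
    if n == 1:
        return np.asarray(window_vectors[0])
    weights = np.ones(n)
    weights[0] = 2.0   # emphasize the beginning of the document
    weights[-1] = 2.0  # and its end
    return np.average(np.asarray(window_vectors), axis=0, weights=weights)

You would call this in place of np.mean inside PaLM_embed above.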

All that said, I believe that 1024 tokens, which translates to roughly 768 words, is enough for even long sub-sections of a Confluence page. So you can simply skip the above and do:

embedding = embedding_model.get_embeddings([text])[0].values

If your subsections are much longer, you should consider spending some time organizing and cleaning up that confluence space.

Now that we have our embedding function, we can create the .json file for the matching engine:

import json
from google.cloud import firestore
db = firestore.Client()
docs = db.collection('mycollection').get()


with open("Confluence.json", "w") as f:
    embeddings_formatted = [
        json.dumps(
            {
                "id": str(doc.id),
                "embedding": [str(value) for value in PaLM_embed(doc.to_dict()['content'].strip())],
            }
        )
        + "\n"
        for doc in docs if doc.to_dict()['content'].strip()
    ]
    f.writelines(embeddings_formatted)

We need to upload this .json file to a Cloud Storage bucket so Matching Engine can access it:

# your_bucket_name should be a gs:// bucket URI, e.g. gs://my-bucket
EMBEDDINGS_INITIAL_URI = f"{your_bucket_name}/matching_engine/initial/"
! gsutil cp Confluence.json {EMBEDDINGS_INITIAL_URI}

With that, we are ready to create our index. You can opt for gcloud to create your matching engine index as with most GCP resources. Alternatively, you can use the aiplatform python package. First we initialize aiplatform with the correct project name and region:

from google.cloud import aiplatform
aiplatform.init(project=YOUR_PROJECT_ID, location="us-central1")

Then the create_index method is called to process the data in our json file and create an index for it:

tree_ah_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=DISPLAY_NAME,
    contents_delta_uri=EMBEDDINGS_INITIAL_URI,  # the bucket we uploaded our json to
    dimensions=768,  # PaLM embeddings are 768-dimensional
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="My first index for confluence"
)

I should note that Google employs ScaNN as its approximate nearest neighbor search algorithm. You can leave most hyper-parameters at their defaults, at least for the first trials, and get reasonably good results. For more information on the default configuration and hyper-parameters, check here. You might have to wait several minutes for the index creation to finish. Once that’s done, go ahead and create an endpoint to which this index will be deployed.

According to GCP’s documentation, a VPC and peering allow for faster and more secure communication with your index. Therefore, in production, you may want to create your endpoint within a VPC network. To create the VPC and set up the peering, run the following cell with your correct PROJECT_ID:

import os

VPC_NETWORK = "matchingengine"

PEERING_RANGE_NAME = "ann-haystack-range"

if not os.getenv("IS_TESTING"):
    # Create a VPC network
    ! gcloud compute networks create {VPC_NETWORK} --bgp-routing-mode=regional --subnet-mode=auto --project={PROJECT_ID}

    # Add necessary firewall rules
    ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-icmp --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow icmp

    ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-internal --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow all --source-ranges 10.128.0.0/9

    ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-rdp --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow tcp:3389

    ! gcloud compute firewall-rules create {VPC_NETWORK}-allow-ssh --network {VPC_NETWORK} --priority 65534 --project {PROJECT_ID} --allow tcp:22

    # Reserve an IP range
    ! gcloud compute addresses create {PEERING_RANGE_NAME} --global --prefix-length=16 --network={VPC_NETWORK} --purpose=VPC_PEERING --project={PROJECT_ID} --description="peering range"

    # Set up peering with service networking
    # Your account must have the "Compute Network Admin" role to run the following.
    ! gcloud services vpc-peerings connect --service=servicenetworking.googleapis.com --network={VPC_NETWORK} --ranges={PEERING_RANGE_NAME} --project={PROJECT_ID}

Note that the index can also be deployed to a public endpoint but that feature is still in preview.
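If you would rather skip the VPC setup for an MVP, the sketch below shows how a public endpoint could be created; public_endpoint_enabled is the preview flag at the time of writing, so treat the exact parameter as something to verify against the current SDK.

# A public endpoint avoids the VPC/peering setup below (preview feature)
public_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="confluence-public-endpoint",
    description="Public endpoint for the confluence index",
    public_endpoint_enabled=True,  # assumption: preview flag, verify against your SDK version
)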

After the vpc and peering are set up, go ahead and create your endpoint:

# Retrieve the project number
PROJECT_NUMBER = !gcloud projects list --filter="PROJECT_ID:'{PROJECT_ID}'" --format='value(PROJECT_NUMBER)'
PROJECT_NUMBER = PROJECT_NUMBER[0]
# Retrieve the full address of the vpc
VPC_NETWORK = "matchingengine"
VPC_NETWORK_FULL = "projects/{}/global/networks/{}".format(PROJECT_NUMBER, VPC_NETWORK)

# Create the endpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="your_desired_display_name",
    description="Endpoint for matching engine index of confluence",
    network=VPC_NETWORK_FULL,
)

Now that the endpoint is ready, we can deploy the previously created index to this endpoint via the deploy_index method of the endpoint.

my_index_endpoint = my_index_endpoint.deploy_index(
    index=tree_ah_index, deployed_index_id="confluence_index"
)

print(my_index_endpoint.deployed_indexes)

This will likely require some time as well. When ready, we can test our deployment as follows:

index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/5734775069/locations/northamerica-northeast1/indexEndpoints/1446455924850950144')
query = "What is our onboarding process?"
# get_embeddings returns a list of TextEmbedding objects; take the first one
query_embedding = embedding_model.get_embeddings([query])[0]

response = index_endpoint.match(
    deployed_index_id="confluence_index",
    queries=[query_embedding.values],
    num_neighbors=10,
)

print(response)

The response is an ordered list of matches, each containing the id of a document and its dot-product distance to the query embedding. Since you used the same ids as in your database, you can easily fetch the corresponding documents.
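For example, a rough sketch of pulling the matched documents back out of Firestore, assuming the collection and key scheme used above, could look like this:

# The match() response holds one list of neighbors per query
for neighbor in response[0]:
    doc = db.collection('mycollection').document(neighbor.id).get()
    if doc.exists:
        data = doc.to_dict()
        print(neighbor.distance, data['title'], data['url'])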

For more information on index creation, updating your index, and deleting records from your index refer to this notebook.
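As a quick pointer on updates, incremental embeddings follow the same JSON format; the sketch below assumes the update_embeddings method on MatchingEngineIndex and a hypothetical delta folder, so verify both against the notebook referenced above.

# Hypothetical folder holding new/changed records in the same JSONL format
EMBEDDINGS_DELTA_URI = f"{your_bucket_name}/matching_engine/delta/"

# Assumption: update_embeddings applies the delta files to the existing index
tree_ah_index = tree_ah_index.update_embeddings(
    contents_delta_uri=EMBEDDINGS_DELTA_URI,
)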

Using PaLM as the LLM to generate answers from the documents relevant to each question

This is the part where we utilize LangChain to stitch things together. Frankly, this could be achieved without LangChain or any other framework, for that matter. The more you work with LangChain, LlamaIndex, or other “prompting” packages, the more their opinionated design and ways of doing things stand out to you. LangChain piles layers upon layers of abstraction, which can be annoying, especially for simpler tasks, and can result in confusing errors. Sometimes it is well worth it to build your prompts and LLM calls from scratch, especially for a custom, production-level solution.

To keep things quick, we use LangChain’s question-answering chain. The ingredients for question answering are:

  1. An LLM that will be responsible for reading the query along with the set of relevant documents retrieved by the vector db and composing an answer. We use PaLM for this.
  2. A vector database to retrieve the set of relevant documents for the input query. Our deployed matching engine instance is the vector db here.

The vector DB needs to be wrapped in a LangChain wrapper for it to work out of the box. LangChain provides support for many vector DBs, including Pinecone, Weaviate, Qdrant, etc., and has recently integrated Matching Engine as well. However, the backend storage for this integration is Google Cloud Storage, meaning all your indexed documents need to be stored in a GCS bucket. Since we have our documents and metadata stored in Firestore, I created a fork of LangChain and added support for Firestore. Therefore, install LangChain from this fork via the command below:

pip install git+https://github.com/Majidbadal/langchain.git@firestore_supported_matchingengine

Hopefully, my pull request will soon be approved so we can access firestore on the main branch.

Once this version is installed, you can instantiate your matching engine wrapper with the .from_components method:

from langchain.llms import VertexAI
from langchain.embeddings import VertexAIEmbeddings
from langchain.vectorstores.matching_engine import MatchingEngine

# Ingredient 1
# LangChain's wrapper for PaLM
PaLM_llm = VertexAI()

# Ingredient 2
# LangChain's wrapper for Matching Engine
PaLM_embedding = VertexAIEmbeddings()
vector_store = MatchingEngine.from_components(
    project_id=YOUR_PROJECT_ID,  # your GCP project ID
    region="us-central1",
    index_id=YOUR_INDEX_ID,  # printed when you created the index; also visible on the GCP console
    endpoint_id=YOUR_ENDPOINT_ID,  # similarly, output upon creation; also visible on the GCP console
    firestore_collection_name='mycollection',  # the name of your firestore collection
    embedding=PaLM_embedding  # the same model you used to encode your documents
)

Don’t forget to supply the firestore_collection_name argument with your collection name where your documents are stored in firestore.

Finally, we stitch them together in a RetrievalQA chain:

from langchain.chains import RetrievalQA


qa = RetrievalQA.from_chain_type(llm=PaLM_llm,
                                 chain_type="stuff",
                                 retriever=vector_store.as_retriever(search_kwargs={"k": 5}))

Ask your queries like so:

qa.run("What is our onboarding process?")

A note about chain_type

As mentioned, LangChain has a very opinionated way of prompting and doing other things. There are four chain types (stuff, map_reduce, map_rerank, refine) offered by LangChain, which you can read more about in the docs or in the spelled-out version in our article. These are just different schemas for prompting your LLM, and you can and should consider other creative ways that fit your specific case.

Here, I used ‘stuff’, which gave good enough results for the first version. In my experience, stuff is the fastest as it only calls your LLM once; latency increases significantly with the number of LLM calls, so a chain type like refine, which makes one call to the LLM per retrieved document, will create a terrible user experience. Another reason for using stuff is that, at least in this case, it gave better answers with fewer hallucinations. I suspect this is because the more calls you make to an LLM, the higher the probability of hallucinations, and all the other chain types (map_rerank, map_reduce, refine) make multiple calls to an LLM and therefore may accumulate more nonsense. Just make sure you feed the correct number of documents so that:

  1. You don’t violate the context size of your LLM. More and longer documents can push you over the context limit. We use PaLM with its 8192-token context size, which gave us a lot of flexibility in this matter. However, as you can see in the code block above, I decided that only the top five documents usually contain the information required to answer the question, so I set my retriever to return just the top five docs:
vector_store.as_retriever(search_kwargs={"k": 5})

2. Sometimes casting a wide net and returning the top 10 or top 20 docs can cause hallucinations. Your data and your LLMs are not perfect. There may be overlapping or ambiguous content in your data which will confuse the LLM when reading longer texts. If your embedding vectors are high-quality and effective methods are used to fetch the most relevant docs, you should be fine with limiting the LLM to only look at the top few docs. Below, we will discuss some ways to enhance the retrieval’s accuracy.

Some ways to improve

Most improvements come from recognizing the moving parts of our system and experimenting with different ways of doing things. Nothing is set in stone, and no single approach is always the best.

Paraphrasing the query

One way to increase accuracy is to optimize the input query. Your end-user is not an expert googler or prompter and only cares about a swift and accurate response. Their query may not be phrased the best way or may lack context. Subtle changes in the query result in changes in the query embedding, which in turn result in different cosine distances to the documents, ultimately giving different answers. This leaves a lot of room for improvement by pre-processing the input query. After all, the input query determines what will be fed to the LLM.

One solution is, for each input query, to generate multiple queries where each query asks the same question from a different perspective. You can easily prompt an LLM like below to generate these queries for you.

"""You are an AI language model assistant. Your task is to generate five 
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions separated by newlines.
Original question: {question}"""
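As a rough sketch of the manual route, you could simply call PaLM with this prompt and split the output on newlines; the function and variable names below are made up for illustration.

from langchain.llms import VertexAI

# The multi-query prompt shown above, stored as a template string
multi_query_prompt = (
    "You are an AI language model assistant. Your task is to generate five "
    "different versions of the given user question to retrieve relevant documents "
    "from a vector database. Provide these alternative questions separated by newlines.\n"
    "Original question: {question}"
)

PaLM_llm = VertexAI()

def generate_query_variants(question):
    output = PaLM_llm(multi_query_prompt.format(question=question))
    # One alternative question per line, plus the original question itself
    variants = [line.strip() for line in output.split("\n") if line.strip()]
    return [question] + variants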

LangChain, however, provides the MultiQueryRetriever to facilitate this. Once the queries are generated, the union of the documents fetched across all queries is returned. This is supposed to make the system robust to minor changes in the input query.

You can convert your retriever to a MultiQueryRetriever in two lines:

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.llms import VertexAI

# Need an LLM for generating multiple queries
PaLM_llm = VertexAI()

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(), llm=PaLM_llm
)

Now you can use this MultiQueryRetriever the same way you would use your original retriever:

qa = RetrievalQA.from_chain_type(llm=PaLM_llm,
                                 chain_type="stuff",
                                 retriever=multi_query_retriever)

Generating hypothetical answers

Another rather fundamental issue is the way we compare a query to documents for measuring relevance. Currently, the vector store measures the similarity of a question to a bunch of content that may contain the answer. This is sort of like measuring the similarity of an orange to a bunch of apples. What if we first generate a hypothetical answer to that question (even if it is wrong and includes non-factual information) and use that hypothetical answer to find out whether the documents have similar contents? This feels more like comparing apples to apples. A popular method to achieve this is Hypothetical Document Embeddings (HyDE).

To illustrate with an example, think of a query like “What is the onboarding process?”. A hypothetical answer would be something like

“The onboarding process is a smooth set of procedures designed to ease a new employee onto the team and includes:

Meeting with the CEO and the HR manager in order to welcome them aboard.

Providing them with equipment including laptops, printers, and other accessories.

Introducing them to the projects and the ongoing engagements with clients.

Introducing them to the teammates on each project and helping them get started

etc.”

Obviously, this is a totally made-up answer. However, searching our database for a document that contains similar information to this is much better than looking for documents that are similar to “What is the onboarding process?”. In other words, this fake answer provides a template for the vector database and asks for content that roughly matches this template.

Although generating this hypothetical answer is a simple call to an LLM and can be done without LangChain, LangChain has a straightforward way of inserting it into our current workflow:

from langchain.chains import HypotheticalDocumentEmbedder
HyDE = HypotheticalDocumentEmbedder.from_llm(llm=PaLM_llm,
                                             base_embeddings=PaLM_embedding,
                                             prompt_key="web_search")

As you can see, the method requires an LLM (PaLM), an embedding model (the PaLM embedding wrapper), and a prompt key (“web_search”) that selects how the LLM is prompted. Note that the underlying embedding model must be the same as the original model you used to encode your documents. It calls the LLM with the incoming query to generate the hypothetical answer, uses the embedding model to generate an embedding for that hypothetical answer, and sends the embedding over to the vector store to search for the most similar content.

Now simply replace the embedding in the vector store with this hypothetical document embedding:

vector_store = MatchingEngine.from_components(
    project_id=YOUR_PROJECT_ID,
    region="us-central1",
    index_id=YOUR_INDEX_ID,
    endpoint_id=YOUR_ENDPOINT_ID,
    firestore_collection_name='mycollection',
    embedding=HyDE  # *****
)

Note that you can try combining the previous two: generate multiple queries for each input query, generate hypothetical answers for each of those generated queries, and then take the union of the retrieved documents for the next steps.
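A sketch of that combination, assuming the HyDE embedder simply replaces the plain embedding inside the vector store as above, might look like this:

from langchain.retrievers.multi_query import MultiQueryRetriever

# Vector store whose queries are embedded via hypothetical answers (HyDE)
hyde_vector_store = MatchingEngine.from_components(
    project_id=YOUR_PROJECT_ID,
    region="us-central1",
    index_id=YOUR_INDEX_ID,
    endpoint_id=YOUR_ENDPOINT_ID,
    firestore_collection_name='mycollection',
    embedding=HyDE,
)

# Each generated query variant goes through HyDE before hitting the index,
# and the union of the retrieved documents is returned
combined_retriever = MultiQueryRetriever.from_llm(
    retriever=hyde_vector_store.as_retriever(search_kwargs={"k": 5}),
    llm=PaLM_llm,
)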

Using Cross encoders

Let’s think about our current approach to similarity measurement. Depending on the value provided for distance_measure_type it is either a simple dot product or cosine similarity of two vectors. These vectors, although enriched by attention layers of neural networks, are only the output of the final layer of a language model whose sole purpose is not just similarity measurements. Feels a bit shallow.

What if we trained models just for gauging the similarity between two pieces of text? Imagine several deep layers of dot products and cross attentions between two input texts. These trained models will then be experts in measuring similarities. This is the idea behind cross encoders. Simply put, a cross encoder encodes two pieces of text according to one another and thus, captures deeper levels of similarity by infusing them along the deep layers of the network:

cross encoder (on the left) vs a dot product measurement (on the right) adapted from https://www.amazon.science/blog/improving-unsupervised-sentence-pair-comparison

There is no free lunch, and cross-encoders are no exception. The main drawback is the longer processing time: a simple dot product of pre-calculated vectors is much faster than running a deep cross-encoder network on pairs of (hypothetical answer, document). As a result, we break the retrieval down into two steps:

  1. Find an initial large set of relevant documents using the vector database. In this step, we cast a wide net to gather all the documents that we suspect are relevant. We’d rather err on the side of recall.
  2. Use a cross-encoder model to rerank the initial set of docs.

This establishes a good trade-off as the first step is blazingly fast and only burdens the cross-encoder model with a subset of documents to accurately re-order them in terms of relevance.

Now to the implementation. There are many cross-encoding/re-ranking models out there. We have found Cohere’s rerank API to be reliable and easy to integrate, since it is also supported in LangChain.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
import os
# Don't forget to set the environment variable 'COHERE_API_KEY',
# otherwise langchain will throw an error.
os.environ['COHERE_API_KEY'] = 'your_cohere_api_key'


cohere_rerank_compressor = CohereRerank(top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=cohere_rerank_compressor,
    base_retriever=vector_store.as_retriever(
        search_kwargs={"k": 30}
    )
)

To give a little context on the code, Cohere’s rerank is implemented as a compressor in LangChain. A compressor is nothing but a module that compresses the initial set of retrieved documents, sort of like a post-processor. There are compressors that, like rerank, refine the initial set of documents retrieved by the vector store. There are also compressors that go into each document, extract only the most pertinent segments, and weed out the unimportant pieces. You can also create your own custom compressors. Finally, you can sequence as many of these post-processors as you would like, so feel free to get creative and add your own spin on it. All that needs to be done to incorporate this is to wrap your initial vector store retriever in a ContextualCompressionRetriever along with the compressor.

Above, we first retrieve a set of 30 documents (signified by search_kwargs “k”=30), then pass them to rerank to accurately re-order them and return the top 5. These top 5 documents are expected to be much more relevant than the top 5 returned in the previous version.

Now this compression_retriever acts the same as your vector store retriever, meaning you can pass it to your RetrievalQA as is:

qa = RetrievalQA.from_chain_type(llm=PaLM_llm,
                                 chain_type="stuff",
                                 retriever=compression_retriever)

Custom Prompts

Again, LangChain’s way of doing things is only one way. After a quick prototyping session, it is not very difficult to strip away all the rigid scaffolding of LangChain by not using its wrappers and writing your own code. This has the benefit of easier traceability, plus it gives more room for creativity. Even if you can’t justify the extra effort, you can take advantage of LangChain’s flexibility to utilize custom prompts.

The default prompt for LangChain’s stuff method is the following:

"""Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know,
don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""

You may get much better answers by customizing the prompt to your company and set of documents:

"""Badal is a Google partner comprised of professionals with expertise accross
all cloud computing technologies. The following question is asked by a Badal
employee about a HR document in Badal's confluence. Act as an HR correspondent
and answer the question using the following pieces of context.

If you don't know the answer, just say that you don't know,
don't make up an answer.

{context}

Question: {question}
Answer:"""

Sometimes your chatbot may be very sensitive to changes in the prompt, especially if there are gaps in your knowledge source.

Similarly, you should at least experiment with custom chains, custom agents, custom callbacks, custom tools, and other custom classes.

One simple suggestion is to use the load_qa_chain instead of the RetrievalQA chain for better visibility into the intermediate steps (refer to the LangChain intro here). It is as simple as swapping the classes:

from langchain.chains.question_answering import load_qa_chain
QAchain = load_qa_chain(PaLM_llm, chain_type='stuff')

docs = compression_retriever.get_relevant_documents(query)
chat_response = QAchain({"input_documents": docs, "question": query}, return_only_outputs=True)

This gives you a chance to examine the retrieved docs before passing them over to the QAchain. Are they exactly the set of documents you would expect for the given query?

You can further drill down in this step using a tool like Arize AI’s Phoenix, LangSmith (currently in private beta), or W&B to debug your queries and documents and, overall, identify pitfalls and troubling queries.

A note on hallucination

Hallucinations are common and not trivial to solve. There is no one-size-fits-all solution as of yet, and people find different techniques more or less effective in different scenarios. For this use case, you may find that reducing the amount of context fed to the LLM helps with hallucinations. If your retrieval is adequately accurate, you may even get away with passing only the top 2 documents to the LLM.

Another approach is to enforce some sort of structure on the response, like a tabular or JSON structure, to preclude the LLM from going off on tangents. This may be harder to fit to a chatbot with broad, open-ended questions of high variability.
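For instance, a prompt along these lines (purely illustrative) nudges the model to stay within a fixed structure:

"""Answer the question using only the context below. Respond with a JSON object
containing exactly two fields: "answer" (a short factual answer) and "sources"
(the titles of the context documents you used). If the context does not contain
the answer, set "answer" to "I don't know" and "sources" to an empty list.

{context}

Question: {question}
JSON answer:"""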

With a custom prompt, you can ask the LLM to cite the relevant parts of the documents to keep its answers grounded. Better yet, you can try a method like FLARE if the additional latency is not significant.

If you have very many documents covering a broad range of distinct topics, and you know each incoming query can be answered by a single subset, you can try bundling your documents together. Divide your set of documents into subsets and create separate vector stores and QA chains for each. Then you can route the incoming query to the correct QA chain using something like the router chain, as sketched below. Alternatively, if multiple of those chains can answer the query distinctly from different perspectives, route the query to those chains and aggregate the answers into a final, comprehensive response. This method requires you to be familiar enough with the knowledge source and the users’ habits (their common queries).
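A bare-bones version of that routing, done with a plain LLM call instead of LangChain’s router chain, could look like the sketch below; the subset names and the per-subset chains (hr_qa_chain, eng_qa_chain) are hypothetical.

# Hypothetical per-subset QA chains, each built as shown earlier over its own vector store
qa_chains = {
    "hr": hr_qa_chain,            # onboarding, benefits, policies
    "engineering": eng_qa_chain,  # runbooks, architecture docs
}

router_prompt = (
    "You route employee questions to a knowledge base. "
    "Reply with exactly one word, either 'hr' or 'engineering'.\n"
    "Question: {question}\nKnowledge base:"
)

def route_and_answer(question):
    choice = PaLM_llm(router_prompt.format(question=question)).strip().lower()
    chain = qa_chains.get(choice, qa_chains["hr"])  # fall back to a default chain
    return chain.run(question)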

Finally, don’t forget to experiment with the temperature parameter.

A note on privacy and access levels

Typically, in any organization, there are different levels of access and security. Access control lists (ACLs) are a critical component of this search bot’s architecture and should be enforced to ensure the right people can search the right set of documents, so that, for example, a new employee does not receive private information about other employees in the bot’s response.

Conclusion

In this blog, we went over a question-answering solution that features PaLM as the underlying LLM and embedding model, Google’s Matching Engine as the vector database, and Firestore as the document store. Some challenges were discussed and some ideas were explored for improvement. Please feel free to ask any questions in the comments so I can clarify.

As mentioned, you can always refer to our previous blog on LangChain and LlamaIndex to understand the nuances of similarities and differences between the two and how to use each.

Stay tuned for our next blog post where we will show how this can be integrated with Slack and deployed on GCP’s infrastructure.
