Building an Internal Chatbot That Knows All Your Docs: A Step-by-Step Guide
I’m one of the first employees at Taranis, and I’ve been with the company for 8 years. I remember the day we grew to more than one team and finally decided to keep some internal documentation, choosing Confluence.
Most days, I’m interrupted at least 4–5 times with some random question about legacy parts only I remember… So I finally decided to be a responsible adult and write down all those answers.
Guess what? Nobody was reading my documents! It’s not that people were lazy, they just couldn’t find them or didn’t even know they existed.
So a few weeks ago I decided to solve it once and for all. Using an open-source repo, I wrote an internal chatbot that lets everyone in our company simply ask questions, and receive correct answers based on all our internal knowledge.
Finally, I can get some work done :)
The Solution: A Chatbot for Confluence and GitHub
In this article, I’ll walk you through all the implementation steps. The end result is a ChatGPT-like UI, and a simple way to query the bot using a Slack command:
We’re going to cover six steps:
- Initial setup of the project
- Deploying to production — switching to MongoDB Atlas Vector Search
- Adding multiple domains, to allow users to search specific ones
- Adding GitHub README files and Slack public channels as another data source
- Improving the accuracy of the answers by getting answers from a single data source
- Implementing the Slack Chatbot
1. Initial setup of the project
Our starting point is the RAG-Chatbot-with-Confluence project (many thanks to Florian Bastin for his great work!).
The project uses OpenAI’s language models and LangChain, a framework for building AI applications. LangChain’s built-in document loaders let us quickly pull and process documents from both Confluence and GitHub. The project uses Streamlit as the UI for interacting with the chatbot.
We’ll explain in a minute how it works, but first, let’s run it locally:
- Dockerize the project by adding the following Dockerfile:
FROM python:3.10
ARG BUILD_LOCAL
COPY requirements.txt ./
RUN pip install -r requirements.txt
RUN mkdir -p /app
WORKDIR /app
COPY . /app
- Convert all the prompts and UI instructions from French to English.
- Add your OpenAI API key, Confluence API key, email address, space key, and space name to a new .env file.
- Run the following commands:
cd src
streamlit run streamlit.py
And now you can ask questions through the Streamlit UI!
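If you prefer to run the Dockerized version instead, something like the following should work (the image name is arbitrary, and the exact entry command depends on how the project is laid out inside the container):

docker build -t internal-docs-chatbot .
docker run --rm -p 8501:8501 --env-file .env internal-docs-chatbot \
    streamlit run src/streamlit.py --server.port 8501 --server.address 0.0.0.0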
2. Productionization — MongoDB Atlas Vector Search and deployment
Before we dive deeper, let’s take a (short) pause to understand what happens behind the scenes.
The problem with classic GenAI models is that they can only generate answers based on what they’ve been trained on — and training them is VERY expensive.
RAG (Retrieval-Augmented Generation) is a very efficient way to make those models use custom data. It works as follows:
- First, you ‘embed’ your data sources (representing the text as vectors of numbers that capture its meaning) and store them in a vector data store. We used OpenAI’s embedding API.
- Then, when the user submits the query, you embed the query itself, and search your database for the closest embeddings. This finds documents that are similar to the question, even if the exact wording differs a bit.
- Then, you pass both the question and the relevant documents to the external GenAI model, and it provides an answer that is grounded in real, up-to-date information.
A huge advantage of using RAG is that it can provide the source of the answer, so you can easily verify it.
Plus, since the answers are grounded in real data, it helps avoid “hallucinations” — where traditional AI models can generate incorrect or misleading information.
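To make this concrete, here is a toy, self-contained sketch of the retrieve-then-generate flow, using the OpenAI v1 Python client and an in-memory ‘store’ (the model names are just examples; the real project uses LangChain and a proper vector database, as described below):

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

# 1. 'Ingest': embed your documents once and keep the vectors
docs = [
    "To deploy service X, trigger the deploy-x Jenkins job.",
    "Lunch is served at 12:30 on Tuesdays.",
]
doc_vectors = [embed(d) for d in docs]

# 2. Retrieve: embed the question and pick the closest document (cosine similarity)
question = "How do I deploy service X?"
q_vec = embed(question)
scores = [np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v)) for v in doc_vectors]
best_doc = docs[int(np.argmax(scores))]

# 3. Generate: ask the LLM, grounding it in the retrieved document
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"Answer using this context:\n{best_doc}\n\nQuestion: {question}"}],
)
print(completion.choices[0].message.content)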
Ok, back to business.
Florian’s project uses Chroma as the vector data store, which is great for local experimentation on your laptop. To make it ready for production usage by multiple users, you’ll need a production-ready vector data store.
We chose MongoDB Atlas Vector Search for this, as we had good experience with MongoDB in previous projects.
Here is the gist of the code that saves the documents to MongoDB:
from pymongo import MongoClient
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch

DB_CONNECTION_STRING = 'YOUR_MONGO_CONNECTION_STRING'
DB_NAME = 'YOUR_DB_NAME'
COLLECTION_NAME = 'YOUR_COLLECTION_NAME'

def save_to_db(self, splitted_docs, embeddings):
    """Save chunks to MongoDB."""
    client = MongoClient(DB_CONNECTION_STRING)
    collection = client[DB_NAME][COLLECTION_NAME]
    db = MongoDBAtlasVectorSearch.from_documents(
        documents=splitted_docs,
        embedding=embeddings,
        collection=collection,
        index_name='knowledgebase_search_index',
    )
    return db
Here is an example of how the data is stored in MongoDB:
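In code terms, each chunk ends up as a regular MongoDB document, roughly of this shape (the values are illustrative; text and embedding are the vector store’s default field names):

{
    "text": "To deploy the data pipeline, open the Jenkins job and ...",  # the chunk's content
    "embedding": [0.0123, -0.0456],  # truncated here; 1,536 floats for OpenAI's ada-002 model
    "source": "https://your-company.atlassian.net/wiki/spaces/RD/pages/12345",
    "title": "Deploying the data pipeline",
}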
In this diagram, from MongoDB Atlas, you can see the full architecture of the solution:
- Document Ingestion: we ran an ‘ingestion’ pipeline that loaded documents from Confluence and embedded them with OpenAI’s embedding model. We sent the text of the documents to OpenAI’s embeddings API, which converted each piece of text into a high-dimensional vector whose numbers capture different features of its meaning. The documents and their vectors were then stored and indexed in MongoDB.
- Query Processing: When a user asks a question, the chatbot first retrieves relevant documents based on the query using a semantic similarity search. It then uses the OpenAI language model to provide a concise and accurate response.
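On the query side, wiring the vector store to the LLM with LangChain can look roughly like this (a sketch that reuses the connection constants from the snippet above; the project’s actual chain setup may differ):

from pymongo import MongoClient
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

client = MongoClient(DB_CONNECTION_STRING)
vector_store = MongoDBAtlasVectorSearch(
    collection=client[DB_NAME][COLLECTION_NAME],
    embedding=OpenAIEmbeddings(),  # must match the model used at ingestion time
    index_name='knowledgebase_search_index',
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,  # lets us show the user where the answer came from
)

result = qa_chain.invoke({"query": "How do I run the ingestion pipeline?"})
print(result["result"])
print([doc.metadata["source"] for doc in result["source_documents"]])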
3. Allow users to specify data domains to search through
When users started to ask the Chatbot questions, we saw that sometimes it was basing its answers on data from different Confluence spaces.
For example, when a developer asked a question, they got a response that was partially based on a support Confluence page, which was not relevant to developers.
That’s why we decided to add a domain property to the records stored in the vector data store. The data store allows attaching metadata to each record, and with the domain property we could later let users from each department ask the Chatbot to look for answers only in documents from the relevant domain. We added radio buttons in the Streamlit UI so each user can choose which domain to query.
We also added a source property to each record. Since the text from the documents is split into smaller chunks before being saved to the data store, we wanted to know which source document each chunk of text came from. You’ll see how this helps improve the accuracy of the answers later.
Here is a code example of adding these properties:
import re

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import MarkdownHeaderTextSplitter

CONFLUENCE_SPACES_TO_DOMAINS = {
    "RD": "rnd",
    "AAKC": "ag",
    "STA": "support",
    "PT": "product",
    "OA": "ops"
}

def split_docs(self, docs):
    # Split on Markdown headers first, so each chunk stays within one section
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    # Split based on markdown and add the original metadata, plus domain and provider
    md_docs = []
    for doc in docs:
        md_doc = markdown_splitter.split_text(doc.page_content)
        for i in range(len(md_doc)):
            md_doc[i].metadata = md_doc[i].metadata | doc.metadata
            domain_match = re.search(r'/spaces/([^/]+)/', md_doc[i].metadata['source'])
            md_doc[i].metadata['domain'] = CONFLUENCE_SPACES_TO_DOMAINS[domain_match.group(1)] if domain_match else None
            md_doc[i].metadata['provider'] = 'confluence'
        md_docs.extend(md_doc)

    # RecursiveCharacterTextSplitter, with a chunk size big enough to keep context together
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=20,
        separators=["\n\n", "\n", "(?<=\. )", " ", ""]
    )
    splitted_docs = splitter.split_documents(md_docs)
    return splitted_docs
Here, the source property is already part of the ConfluenceLoader output, which is used to load the Confluence documents.
And this is how it looks in MongoDB:
And the Streamlit UI with the ability to choose a domain:
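At query time, the selected domain becomes a metadata pre-filter on the vector search. Roughly (a sketch; vector_store is the store from the earlier snippet, and the exact pre_filter syntax depends on your langchain-mongodb version and on the filter fields defined in the Atlas index):

import streamlit as st

domain = st.radio("Which domain do you want to search?", ["rnd", "ag", "support", "product", "ops"])

retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 3,
        # Only consider chunks whose 'domain' metadata matches the user's choice
        "pre_filter": {"domain": {"$eq": domain}},
    }
)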
4. Adding GitHub README files and Slack public channels as additional data providers
When users started using the Chatbot, I realized that some important knowledge is stored only in our GitHub repositories’ README files, such as how to run each project, specific guidelines, troubleshooting, etc.
We realized that this was another important data provider that can improve the answers of the chatbot.
Lucky for us, LangChain’s community has many ready-to-use document loaders, and one of them is GithubFileLoader.
Here is a minimal code example:
import re
import logging

from github import Github, Auth
from langchain.document_loaders import GithubFileLoader

GITHUB_TOKEN = 'YOUR_GITHUB_TOKEN'

def load_from_github(self):
    auth = Auth.Token(GITHUB_TOKEN)
    g = Github(auth=auth)
    github_documents = []
    for repo in g.get_user().get_repos():
        try:
            loader = GithubFileLoader(
                repo=repo.full_name,  # the repo name, e.g. 'org/repo'
                access_token=GITHUB_TOKEN,
                github_api_url="https://api.github.com",
                branch='master',
                file_filter=lambda file_path: file_path == 'README.md'
            )
            documents = loader.load()
            for document in documents:
                # Rewrite the API URL into a browsable GitHub URL for the 'source' link
                document.metadata['source'] = document.metadata['source'].replace("https://api.github.com/", "https://github.com/")
                document.metadata['domain'] = 'rnd'
                document.metadata['provider'] = 'github'
                filename = re.search(r'/([^/]+)$', document.metadata['source']).group(1)
                document.metadata['title'] = f'{repo.name} {filename}'
            github_documents.extend(documents)
        except Exception as e:
            logging.warning(f"Error while loading repo {repo.full_name}: {e}")
    return github_documents
We decided to take the opportunity to also add Slack public channels as an additional data source, as some surprisingly useful things are discussed solely on Slack.
We created a data dump from Slack and manually uploaded it to MongoDB Atlas Vector Search using LangChain’s SlackDirectoryLoader.
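Loading the export takes only a few lines (a sketch; the zip path and workspace URL are placeholders):

from langchain.document_loaders import SlackDirectoryLoader

# Path to the zip file from Slack's "Export data" page, plus the workspace URL
# so the loader can build links back to the original messages
loader = SlackDirectoryLoader("slack_export.zip", workspace_url="https://your-workspace.slack.com")
slack_documents = loader.load()
for document in slack_documents:
    document.metadata['provider'] = 'slack'
# slack_documents can then go through the same split / save_to_db flow as the other sources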
The beauty of working with RAG is that it doesn’t really matter where the data comes from — you can take any text and embed it into your vector store!
If you use other systems for documentation, you can easily add it as a data provider using a LangChain document loader if it exists, or write a document loader yourself and help the community 😉
A few words on Slack as a data source. There are 2 main issues:
- Each Slack message is a different document by default, which means that when you search for an answer using the Chatbot, the answer might be spread across multiple messages, and using the naive approach of semantic similarity won’t give good results.
- The results might be old Slack conversations that are no longer relevant, so a custom ranking step is needed in which newer messages score higher than older ones, applied on top of the similarity search.
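One simple way to combine the two signals would be to fetch more candidates than needed and decay each similarity score by the message’s age. A rough sketch (the half-life is an arbitrary choice, and it assumes each chunk’s metadata carries the Slack message timestamp):

import time

HALF_LIFE_DAYS = 180  # arbitrary: a message loses half its weight every ~6 months

def recency_weight(timestamp: float) -> float:
    age_days = (time.time() - timestamp) / 86400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def rerank_slack_docs(docs_with_scores, top_k=3):
    """docs_with_scores: (Document, similarity_score) pairs, e.g. from similarity_search_with_score."""
    reranked = sorted(
        docs_with_scores,
        key=lambda pair: pair[1] * recency_weight(float(pair[0].metadata.get("timestamp", 0))),
        reverse=True,
    )
    return [doc for doc, _ in reranked[:top_k]]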
5. Improve accuracy — getting answers from a single data source
In the first iteration, we used semantic search to retrieve the 3 most relevant chunks and provided them as context to the LLM.
The problem we encountered was that these chunks came from different Confluence pages or GitHub README files, so the answer we got from the LLM mixed multiple topics and was not always correct.
To improve the accuracy, we first used the semantic search to retrieve only the most relevant chunk. After that, we queried the vector store again for all chunks that belong to the same source page, using the source property we discussed above.
That way we were able to provide the LLM with a cohesive and full page that is relevant to the question, which improved the answer drastically.
In order to do that, we inherited from the MongoDBAtlasVectorSearch class and overrode the similarity_search method:
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch
from langchain_core.documents import Document
from typing import Optional, Dict, List, Any

class MongoDBAtlasTaranisVectorSearch(MongoDBAtlasVectorSearch):
    def similarity_search(
        self,
        query: str,
        k: int = 4,
        pre_filter: Optional[Dict] = None,
        post_filter_pipeline: Optional[List[Dict]] = None,
        **kwargs: Any,
    ) -> List[Document]:
        # Pop our custom flag so it isn't forwarded to the parent implementation
        single_source = kwargs.pop('single_source', False)
        docs = super().similarity_search(query, k, pre_filter, post_filter_pipeline, **kwargs)
        if not single_source:
            return docs
        if not docs:
            return []
        # Take the most similar chunk and fetch every chunk from the same source page
        most_similar_doc = docs[0]
        source = most_similar_doc.metadata.get("source")
        if not source:
            return []
        matching_docs = self._collection.find({"source": source})
        result_docs = []
        for doc in matching_docs:
            text = doc.pop(self._text_key)
            doc.pop(self._embedding_key, None)  # keep the raw vector out of the metadata
            doc.pop("_id", None)
            result_docs.append(Document(page_content=text, metadata=doc))
        return result_docs
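Using the subclass then looks like a regular similarity search, with the extra single_source flag (a sketch reusing the connection constants from earlier; how the flag is threaded through the chain depends on your setup):

from pymongo import MongoClient
from langchain_openai import OpenAIEmbeddings

client = MongoClient(DB_CONNECTION_STRING)
vector_store = MongoDBAtlasTaranisVectorSearch(
    collection=client[DB_NAME][COLLECTION_NAME],
    embedding=OpenAIEmbeddings(),
    index_name='knowledgebase_search_index',
)

# Retrieve the best-matching chunk, then pull in every other chunk from the same page
docs = vector_store.similarity_search(
    "How do I run the ingestion pipeline?",
    k=3,
    single_source=True,
)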
6. Implementing a Slack Chatbot
While having a chatbot was great, we wanted to make it even more accessible to our developers. Slack is where our team communicates, so it was the perfect platform for this integration.
To make the chatbot accessible in Slack, we implemented the /devask slash command. With this command, developers can now ask questions directly in Slack, and the chatbot responds in real time. This feature has significantly reduced the time spent searching for information, allowing our team to focus more on coding and less on hunting down documentation.
The main challenge was that Slack requires you to respond within 3 seconds, and the RAG process can take a bit longer than that. To solve this, we send an acknowledgement response to Slack immediately and run the RAG process in a FastAPI background task; when it finishes, we send the answer back to Slack.
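Here is a trimmed-down sketch of that flow (the endpoint path and the answer_question helper are placeholders for our internal code):

import requests
from fastapi import FastAPI, BackgroundTasks, Form

app = FastAPI()

def answer_and_reply(question: str, response_url: str):
    # The slow part: run the RAG pipeline, then post the answer back to Slack
    answer = answer_question(question)  # placeholder for the retrieval + LLM code
    requests.post(response_url, json={"response_type": "in_channel", "text": answer})

@app.post("/slack/devask")
async def devask(background_tasks: BackgroundTasks,
                 text: str = Form(...),          # Slack sends slash-command payloads as form data
                 response_url: str = Form(...)):
    # Kick off the heavy work in the background...
    background_tasks.add_task(answer_and_reply, text, response_url)
    # ...and acknowledge immediately, well within Slack's 3-second limit
    return {"response_type": "ephemeral", "text": f"Looking that up for you: '{text}'"}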
And this is how it looks:
And we get the answer after a couple of seconds:
Conclusion
Building a chatbot that could query our internal documents was a challenging yet quite rewarding experience!
By using LangChain, OpenAI’s API, and MongoDB Atlas Vector Search, we were able to quickly create a solution that met our unique needs.
If you’re facing similar challenges with your internal documentation, I encourage you to play around with RAG-based chatbots. Believe me, everyone in the organization will thank you.
If you try to do it in your organization and have any questions — feel free to ask in the comments!