Thoughtworks: e4r™ Tech Blogs

We are a cohort of passionate researchers and engineers specialized in computer and software engineering working for Thoughtworks.

Building Your Own AI Assistant: A Local LLM and RAG Application


The release of Large Language Models (LLMs) like ChatGPT and Gemini marked the start of a new era for text generation, and since then we have been using them for a wide variety of tasks. Although these models have shown tremendous capabilities, ranging from generating human-quality text to writing many kinds of creative content, they come with shortcomings. Some of these models have a knowledge cutoff at a certain point in time, which often leads to confident but inaccurate responses (also called hallucinations). One way to overcome this is to fine-tune these LLMs on use-case-specific data. However, this approach can turn out to be inefficient when the data changes frequently, as fine-tuning an LLM repeatedly may not be resource-optimal. Another way to tackle the issue is to ground the LLM on use-case-specific data, and this is where Retrieval-Augmented Generation (RAG) comes into the picture.

RAG is an AI framework that improves the accuracy and reliability of LLMs by grounding them in external knowledge bases.

In this blog, we will look at our exploration of setting up a RAG application over the documents related to Engineering for Research (e4r™) and its work.

Retrieval Augmented Generation

RAG application for e4r

The goal of this exercise was to explore setting up a RAG application with a locally hosted LLM. For this activity, we used LangChain to create a document retriever and pipeline. A typical RAG application consists of an LLM, a document loader, a vector store, a sentence embedding model, and a document retriever. Let us look at each of them one by one.

The Large Language Model

For the LLM, we used a locally hosted, 4-bit NormalFloat (NF4) quantized Mistral 7B. The details of hosting an LLM on our workstation, and the specifications of the workstation itself, are provided in the post below.

The model was used with random sampling of the next token, enabled by setting the parameter do_sample to True. Apart from this, we changed only the following parameters, keeping all others at their defaults (a sketch of this setup follows the list):

  • temperature: 0.2
  • repetition_penalty: 1.1
  • max_new_tokens: 1000
  • return_full_text: True
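
For reference, a minimal sketch of loading such a quantized model with Hugging Face transformers and bitsandbytes, and wrapping it for LangChain, might look like the following; the checkpoint name and import paths are assumptions, not the exact code we used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain_community.llms import HuggingFacePipeline  # import path varies across LangChain versions

# 4-bit NF4 quantization config, matching the NF4 setup described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Text-generation pipeline with the generation parameters listed above
generate = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    max_new_tokens=1000,
    return_full_text=True,
)

llm = HuggingFacePipeline(pipeline=generate)
```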

Document Loader

The documents available to us were in Microsoft PowerPoint format (pptx), Microsoft Word format (docx), and Portable Document Format (pdf). Each format was loaded with a format-specific LangChain document loader, and metadata was extracted from the documents while loading them.
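
As an illustration, format-specific LangChain loaders could be wired up roughly as follows; the specific loader classes and metadata fields shown here are assumptions, since LangChain offers several alternatives per format:

```python
from pathlib import Path

# Assumed loader choices; LangChain provides several loaders per format
from langchain_community.document_loaders import (
    UnstructuredPowerPointLoader,  # .pptx
    Docx2txtLoader,                # .docx
    PyPDFLoader,                   # .pdf
)

LOADERS = {
    ".pptx": UnstructuredPowerPointLoader,
    ".docx": Docx2txtLoader,
    ".pdf": PyPDFLoader,
}

def load_documents(folder: str):
    """Load every supported document under `folder`, tagging each with its source path."""
    docs = []
    for path in Path(folder).rglob("*"):
        loader_cls = LOADERS.get(path.suffix.lower())
        if loader_cls is None:
            continue
        for doc in loader_cls(str(path)).load():
            doc.metadata.setdefault("source", str(path))
            docs.append(doc)
    return docs
```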

Vector Database

A vector database is a type of database that stores data as high-dimensional vectors for fast retrieval and similarity search. In RAG, the use-case-specific data is converted into high-dimensional vectors, and the relevant data is retrieved based on its similarity to the user's question. For our RAG application, we used a standalone version of Milvus DB running in a Docker container. The vector database had a single collection, called E4RDocs, which stored all the documents as vector embeddings along with their metadata.
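
A rough sketch of connecting LangChain to the standalone Milvus instance might look like this; the host, port, and import path are assumptions, and `embeddings` is the sentence-embedding model described in the next section:

```python
from langchain_community.vectorstores import Milvus  # import path varies across LangChain versions

# Connect to the standalone Milvus instance running in Docker (host/port are assumptions)
vectorstore = Milvus(
    embedding_function=embeddings,   # sentence-embedding model, see next section
    collection_name="E4RDocs",       # single collection holding all document embeddings and metadata
    connection_args={"host": "localhost", "port": "19530"},
)
```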

Sentence embedding

Embedding a document into a high-dimensional vector representation is a crucial part of a RAG application, as this embedding is used to retrieve the relevant documents by finding their similarity with the user query.

In our RAG application, we used the bge-large-en-v1.5 embedding model from the Beijing Academy of Artificial Intelligence (BAAI). We chose this model because, at the time of implementation, it was the top-performing model under 1.5 GB in size on the Massive Text Embedding Benchmark (MTEB) Leaderboard. In particular, we used the embedding model with normalized embeddings. Further, we chose to run it on the CPU, as 6 GB of the 8 GB of GPU VRAM were occupied by the LLM.
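
In LangChain, this setup roughly corresponds to the following; the import path is an assumption and varies across LangChain versions:

```python
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cpu"},                # run on CPU; the GPU is mostly occupied by the LLM
    encode_kwargs={"normalize_embeddings": True},  # normalized embeddings, as described above
)
```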

Document Retriever

A retriever is an interface that retrieves documents for a given query. We used this retriever to store documents too.

Let us first understand how a document retriever helps in a RAG.

In a RAG application, each document page is split into smaller chunks, which are stored in a vector store. When a user asks a question, the query is matched against these chunks, and the most similar ones are retrieved as the relevant context to be passed to the LLM along with the question. However, a page sometimes contains many smaller pieces of distinct information that are best indexed individually but best retrieved all together to form a coherent context. This was our case, so we used a parent document retriever, which indexes multiple small chunks for each document, finds the chunks most similar to the query in embedding space, and then retrieves the whole parent document and returns that rather than the individual chunks.

In a parent document retriever, a document needs to be chunked at two levels: the child level, which produces smaller chunks for the vector store, and the parent level, which produces larger chunks. For both of these, we recursively split the text by character with ["\n\n", "\n", ".", " ", ""] as separators. The other parameters for the parent and child splitters are as follows:

  • Parent splitter:
    - chunk_size: 2000
    - chunk_overlap: 100
  • Child splitter:
    - chunk_size: 400
    - chunk_overlap: 10

The chunk sizes and overlaps above are specified in characters.
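
A sketch of this setup with LangChain's parent document retriever is shown below; the in-memory docstore is an assumption, and `vectorstore` and `docs` refer to the Milvus collection and the loaded documents from the earlier sections:

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

separators = ["\n\n", "\n", ".", " ", ""]
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100, separators=separators)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=10, separators=separators)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,   # Milvus collection storing the small child chunks
    docstore=InMemoryStore(),  # holds the larger parent chunks (assumed store; a persistent one could be used)
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# The retriever is also used to store documents: it splits, embeds, and indexes them.
retriever.add_documents(docs)
```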

We faced two major problems during implementation:

  1. Problem: Whenever a query returned no context from the retriever, the LLM would hallucinate and make up an answer, even though the prompt specifically instructed it to say, “Given the context, I do not know the answer.”
    Solution: We overcame this by adding a runnable lambda after the retrieval step. It checks whether any context was retrieved; if the context is empty, it updates the prompt and question so that the LLM replies, “Given the context, I do not know the answer,” and otherwise it passes the context and question to the LLM unchanged (a sketch follows this list).
  2. Problem: Sometimes the context returned by the retriever was too long to fit into the remaining GPU memory.
    Solution: To solve this, we implemented a map-reduce document chain. This solution involves multiple calls to the LLM; however, as we were running the LLM locally, this did not pose a challenge in terms of cost.
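
Here is a sketch of the empty-context guard from the first point, expressed with LangChain's runnable interface; the prompt wording and chain wiring are assumptions, and `retriever` and `llm` are the components built earlier:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

prompt = PromptTemplate.from_template(
    "Answer the question using only the context below. If the context does not "
    "contain the answer, say: Given the context, I do not know the answer.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def format_docs(docs):
    # Join the retrieved parent documents into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

def guard_empty_context(inputs: dict) -> dict:
    # If retrieval returned nothing, steer the LLM towards the fallback answer
    # instead of letting it hallucinate one.
    if not inputs["context"].strip():
        return {
            "context": "No relevant context was found.",
            "question": "Reply with exactly: Given the context, I do not know the answer.",
        }
    return inputs

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | RunnableLambda(guard_empty_context)
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What does e4r work on?")
```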

Suggested improvements

  1. This version of the RAG application did not remember previous questions and answers, which can be solved by incorporating conversational memory. This can be achieved in two ways:
    ★ Adding the conversation history to the prompt. The challenge is that a growing conversation history will eventually make the prompt too large to fit into GPU memory.
    ★ Storing conversations in a vector store and retrieving the conversation history by matching the user question against the stored conversations. This also has some caveats. First, the question needs to be modified to fetch the relevant conversation from the vector store. Second, the retriever may fetch conversations that do not match the question asked.
  2. Currently, even if a user asks a question that has already been answered, the query still goes to the LLM. To avoid this, a cache could be maintained of questions that have been asked along with their answers. A stored question matching the new question (by full-text match or vector similarity) could then be looked up and its stored answer returned, without invoking the LLM (a minimal sketch follows).
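
A minimal sketch of such a question-and-answer cache with exact-match lookup might look like this; it is purely illustrative, and a vector-similarity lookup could replace the exact match:

```python
answer_cache: dict[str, str] = {}

def ask(question: str) -> str:
    """Answer a question, reusing the stored answer if the exact question was asked before."""
    key = question.strip().lower()
    if key not in answer_cache:
        # Only questions not seen before reach the retriever and the LLM.
        answer_cache[key] = rag_chain.invoke(question)  # `rag_chain` is the chain sketched earlier
    return answer_cache[key]
```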

Conclusion

The exercise helped us gain an understanding of implementing a RAG application and gave us insights into the challenges of doing so on consumer-grade hardware with limited computing resources. This activity also resulted in a simple platform that can be used to ask questions about e4r, the work we have done, and the papers we have published so far. As this exercise showed that LLM-based applications can be deployed with limited resources, other LLM-based applications beyond RAG could also run on consumer-grade hardware.

Disclaimer: The statements and opinions expressed in this blog are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
