RAG basics using a self-hosted OpenAI-compatible LLM server

Nikolay Penkov
9 min read · Feb 8, 2024


Advanced AI language models, such as OpenAI’s ChatGPT and Google’s Gemini, have been at the forefront of driving innovation in various fields lately. This includes products such as chatbots, content generation systems, translation services, and more. Their ability to understand context, generate coherent responses, and perform complex language tasks has paved the way for a new era in artificial intelligence.

Technology, however, is never completely perfect and comes with certain caveats. Large Language Models (LLMs) are no exception, and users must carefully consider a crucial disadvantage of this technology: sometimes LLMs generate content that diverges significantly from the “real” and “expected” facts. At the same time, these models are trained to write very confident-sounding text, and they rarely fail to convince an unsuspecting user of wrong facts.

This phenomenon is commonly referred to as “model hallucinations”. A common practice to mitigate this issue is to prompt the model with additional information that guides it down the right path when generating its output.

Retrieval Augmented Generation (RAG) is currently the standard approach for enhancing the quality and relevance of LLM-generated content. With its help, companies can include their own knowledge base on topics previously unknown to the model and achieve great model performance. Let’s delve deeper in this post and see how RAG is reshaping the landscape of Generative AI.

Visual presentation of the Retrieval Augmented Generation (RAG) concept

Prerequisites

To start with, let us see what toolset we need to build our own RAG system. We are going to use standardized, open-source frameworks and host all of the needed infrastructure locally on our system for maximum cost efficiency.

LangChain

LangChain is a toolbox for building cool stuff powered by large language models. Imagine it as the Swiss Army knife for any GenAI use case. In this tutorial we are going to use it for data loading, data preprocessing, and as an additional automation layer for the other tools discussed in the prerequisites. Visit the official website to learn more about the features that LangChain provides. For now, let’s install it with:

pip install langchain-openai langchain langchain-community

Vector Database

Let’s consider the following situation:

We have a large knowledge base and we want to incorporate it into the knowledge of our chat bot. Our LLM is limited by the number of tokens (roughly, the word count) that it can process. Therefore, we need to find a way to extract only the most relevant information and include it in the prompt so that the model can use it as context.

Let’s imagine a hypothetical situation: a veterinary clinic wants to host an AI chat bot that answers questions about animal healthcare. When users prompt the model with questions about dogs, it won’t make a lot of sense to include information about cats in the context provided to the model.

A vector database solves exactly this issue. A vector database is simply a place where you store vectors. In the RAG case, the stored vectors are mathematical representations of the knowledge base. With their help, and the help of some neat math tricks, information can be found in the knowledge base in a very fast and efficient way.
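If the vector idea feels abstract, here is a toy sketch (not part of the tutorial code, with made-up numbers and document names) that shows the core trick: both the documents and the query become vectors, and the document whose vector is closest to the query wins.

import numpy as np

# Toy 3-dimensional "embeddings" for three documents and one query.
# Real embeddings models produce vectors with hundreds of dimensions.
documents = {
    "dog vaccination schedule": np.array([0.9, 0.1, 0.0]),
    "cat litter box training":  np.array([0.1, 0.9, 0.0]),
    "parrot feather care":      np.array([0.0, 0.2, 0.9]),
}
query = np.array([0.8, 0.2, 0.1])  # e.g. "How often should I vaccinate my dog?"

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The document with the highest similarity to the query is the most relevant one.
best = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print(best)  # -> dog vaccination schedule

A vector database like FAISS does essentially this, but over millions of high-dimensional vectors and with much smarter indexing than a brute-force loop.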

Let’s get our hands dirty and install a vector database. We’ll use the open-source FAISS by Meta:

pip install faiss-cpu

Note: This command installs a CPU-only version, which is significantly slower than the GPU one but ensures it will work on any system. If your system has a GPU and you want to increase the performance of your database, follow the official instructions here.

Embeddings model

As already mentioned, in order to save your knowledge base in the vector database, you need a way to represent text as vectors that capture semantic meaning. This is achieved with the help of an embeddings model.

If you search on Google, you will see that there are numerous embeddings models out there, each with different advantages and disadvantages. Usually, the go-to paid choice is OpenAI embeddings. However, in this tutorial we want to stay in the open-source realm and self-host the all-MiniLM-L6-v2 embeddings model. This is where LangChain has our back, as it supports direct use of the embeddings models hosted on Hugging Face. We only need to install the following library:

pip install sentence_transformers
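If you want to sanity-check the installation, you can embed a couple of sentences directly with the sentence_transformers library (a minimal sketch; later LangChain will call the same model for us behind the scenes):

from sentence_transformers import SentenceTransformer

# Downloads the model on first use and loads it.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Every sentence is turned into a 384-dimensional vector.
vectors = model.encode(["Dogs need yearly vaccinations.",
                        "Cats are obligate carnivores."])
print(vectors.shape)  # (2, 384)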

OpenAI compatible server

Lastly, we are going to need a large language model (LLM), which will act as our reasoning engine. It will generate coherent output based on the context from the knowledge base and the user prompt. OpenAI models such as GPT-3.5 and GPT-4 perform very well and are always a good choice when building RAG systems. If you have a paid membership or free credits, go with them as your language model.

In this tutorial we’ll use our own model running locally on CPU. In addition, we will host it on an OpenAI-compatible server. If you’ve read my previous post, you’ve already seen how to deploy a model locally on CPU. We are going to use the llama-cpp-python library again, but this time we will utilize its built-in OpenAI-compatible server. To begin with, let’s install it with:

pip install "llama-cpp-python[server]"

Also, let’s again use a quantized Llama 2 7B model, which you can download from here.

Note: If you are wondering what the Q prefixes mean and which model to download exactly, check out the documentation here.

Implementing the RAG system

LLM API

Let’s start with the OpenAI-compatible model server. Assuming you’ve installed the llama-cpp-python library and downloaded your model, you can now start a model server simply by running the following in your terminal:

python -m llama_cpp.server --model <path-to-your-model>

Note: By default the model is started with a context window of 512 tokens. You can pass the flag --n_ctx 2048 to increase the context length of the model to 2048 tokens (or more if you want).
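For example, to start the server with a 2048-token context window, the command becomes:

python -m llama_cpp.server --model <path-to-your-model> --n_ctx 2048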

If your server started correctly, it should look something like this.

After your server has started, you can invoke it the same way you invoke the OpenAI API. Now, let’s test the server using the OpenAI quick start example with an updated API URL and access token:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-xxx" \
  -d '{
    "model": "llama-2-chat",
    "messages": [
      {
        "role": "system",
        "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
      },
      {
        "role": "user",
        "content": "Compose a poem that explains the concept of recursion in programming."
      }
    ]
  }'

Note: The API access token for the self hosted model is arbitrary.

Which results in the following output:

{
  "id": "chatcmpl-25aa8564-4ddd-49e2-b60b-a837e78147fe",
  "object": "chat.completion",
  "created": 1707420657,
  "model": "llama-2-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": " In the realm of code, where lines entwine,\nA loop of instructions, a maze divine,\nRecursion's the art, the programmers know,\nTo unravel complexity, as snow does flow.\n\nIt starts with a call, a spark of light,\nA function that's called, and then takes flight,\nInside the loop, it finds its base,\nAnd then it recurses, a wondrous race.\n\nIt digs within itself, a spiral stair,\nA never-ending chain of \"calls\" so fair and square,\nEach iteration adds to the base,\nA tree of solutions, a computational race.\n\nThe roots grow deep, the branches wide,\nA solution for each problem to abide,\nIn every node, a solution's found,\nA recursive solution, never to be bound.\n\nThe programmer's art, a wondrous feat,\nTo solve a problem, so complex to meet,\nRecursion's power, a magic spell,\nTo tame the beast, and make it tell.\n\nIt's a loop within a loop, a dance divine,\nWhere complexity is made to shine,\nIn recursive joy, we find our way,\nTo solve the problems, come what may.\n\nSo here's to recursion, a programmer's guide,\nA tool so grand, to problems to abide,\nIn code we find our solace and peace,\nWith recursion, our problems to cease.",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 53,
    "completion_tokens": 336,
    "total_tokens": 389
  }
}

As you can see, the JSON response has the same structure as the one we get from the OpenAI API. This will be of great help if, later in the project’s development, we decide to replace the locally served model with an OpenAI model.
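As a quick illustration of that interchangeability, here is a sketch of the same kind of call made through the official openai Python client (version 1.x), where only the base_url and api_key differ from what you would use with the hosted OpenAI API:

from openai import OpenAI

# Point the client at the local llama-cpp-python server; the token is arbitrary.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")

completion = client.chat.completions.create(
    model="llama-2-chat",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)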

Creating the knowledge base

Let us mock a knowledge base using a wiki. For the sake of this tutorial we will use the page of the character Mimir in the God of War fandom wiki, because our model doesn’t have internal knowledge on this topic.

To start with, we will load the text data from the website using the web loader from LangChain:

from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://godofwar.fandom.com/wiki/Mimir")
documents = loader.load()

In the next step, we will split the data into smaller chunks. Those chunks correspond to the documents in our knowledge base:

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
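An optional quick sanity check shows how many chunks the splitter produced and what one of them looks like:

# How many chunks did we get, and what does one look like?
print(len(docs))
print(docs[0].page_content[:200])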

Creating the vector store

Now that we have the documents of our knowledge base, let’s use the embeddings model to create vectors and store them in FAISS:


from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = FAISS.from_documents(docs, embeddings)
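If you plan to reuse the index across runs, LangChain’s FAISS wrapper can also write it to disk and load it back. A minimal sketch (the folder name "faiss_index" is arbitrary, and newer LangChain versions may additionally require allow_dangerous_deserialization=True when loading):

# Optional: persist the index so the embeddings aren't recomputed on every run.
db.save_local("faiss_index")

# ...and load it back later with the same embeddings model:
db = FAISS.load_local("faiss_index", embeddings)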

We can now search the database for documents which are semantically similar to the user input. For example, if we want to search for information on the name of the character Mimir, we can do something like:

question = "What is the other name of Mimir?"
docs = db.similarity_search(question)
print(docs[0].page_content)

In the last row we print the document that is contextually closest to our question. The document contains the following information taken from the fandom wiki:

Mimir (Nordic: ᛗᛁᛗᛁᚱ), formerly known as Puck, is a Celtic fae who became Odin’s advisor and the ambassador of the Aesir Gods until Odin imprisoned him for 109 years. After being freed by Kratos and Atreus, he became their ally. He is the tritagonist of God of War (2018) and God of War Ragnarök.

As we can see, the document contains information about the other name of Mimir, but also other information which we do not need. This is expected, because there is no generation step yet; here we simply retrieve information from the existing text.
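If you are curious how close the match actually is, the FAISS wrapper can also return a distance score alongside each document (for FAISS this is an L2 distance, so lower means closer):

# Same search, but with distance scores for the top two documents.
for doc, score in db.similarity_search_with_score(question, k=2):
    print(score, doc.page_content[:80])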

Retrieval Augmented Generation

Now that we have a vector store with the knowledge base about Mimir and an LLM server, we can use both of them together to achieve RAG. We do this by including the documents retrieved from the vector database in the model prompt. In this way, we can selectively filter and output only the information that the user asked about, and achieve a coherent and natural-sounding output. Let’s take a look at how this is done in code.

First we start by creating an object for interaction with our chat model:

from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

chat = ChatOpenAI(model_name="llama-2-chat",
                  openai_api_base="http://localhost:8000/v1",
                  openai_api_key="sk-xxx",
                  max_tokens=2048,
                  temperature=0.7)

Then we create the prompt. We put some instructions for the model’s behavior and the content of the retrieved document in the system message. In the user message we put the same question that was used to retrieve the document from the vector store:

messages = [
    SystemMessage(
        content=f"""You are a helpful assistant that answers questions based on a given context. You use only the current context for your answer. If the question cannot be answered directly from the context then output "N/A".
Current context:
{docs[0].page_content}
"""
    ),
    HumanMessage(content=question),
]

We can then query the model and see what it will output:

response = chat.invoke(messages)
print(response.content)
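For convenience, the retrieve-then-generate steps can be folded into one small helper that reuses the db and chat objects we already created, so arbitrary questions go through the same loop (a sketch, not part of the original code; the example question is just an illustration):

def ask(question: str) -> str:
    # Retrieve the closest document and use it as context for the chat model.
    context = db.similarity_search(question)[0].page_content
    messages = [
        SystemMessage(
            content=f"""You are a helpful assistant that answers questions based on a given context. You use only the current context for your answer. If the question cannot be answered directly from the context then output "N/A".
Current context:
{context}
"""
        ),
        HumanMessage(content=question),
    ]
    return chat.invoke(messages).content

print(ask("Who imprisoned Mimir?"))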

Comparing the results

The output of the RAG-based prompt was:

Based on the provided context, the other name of Mimir is Puck.

In comparison, prompting the model with the same question but without the context (a non-RAG prompt) results in:

Mimir is also known as Mim to some people. It’s a common nickname or shortened form of the name Mimir. So, you can use either “Mimir” or “Mim” interchangeably when referring to this fascinating being from Norse mythology!

As we can see, the model generated something which is true, but it doesn’t relate to the game God of War. This information is not part of our knowledge base, so in certain situations this behavior can be considered a model hallucination. We saw that applying RAG provides better context to the language model; in this way our chat bot can perform significantly better.

Final words

Congrats on finishing another post about generative AI! :)

I hope that this article has helped you gain some initial knowledge on the subject and will act as a starting point for further education. RAG is a very powerful approach and can be used to create complex and robust AI systems. Stay tuned for more advanced posts in the future by following me.

For any questions or feedback use my website to contact me,

or find me on LinkedIn: https://www.linkedin.com/in/penkow/

Happy Coding! :)

Find all of the code from this post here: https://github.com/penkow/rag-basics

