Implementing RAG architecture using Llama 2, Vector Store and LangChain

Published in Infer · 13 min read · Dec 13, 2023

Hudson Buzby, Solutions Architect at Qwak

Learn how to build a chatbot that answers questions about our documentation and generates Qwak-specific code and examples from that documentation.

As we’ve discussed in previous posts about RAG & LLMs, RAG, or Retrieval Augmented Generation, can be a powerful framework for strengthening the capabilities of large language models and providing the context and information necessary to answer questions that are specific to your organization or use case. There are many different components that you can swap out to build a RAG pipeline, and Qwak provides several solutions to make this easy.

In this blog, we’re going to build a Qwak chatbot that answers questions about Qwak documentation and generates Qwak-specific code and examples. To do this, we’ll be using Llama 2 as an LLM, a custom embedding model to translate natural input to vectors, a vector store, and LangChain to wrap the retrieval and generation steps, all hosted and managed within the Qwak platform.

Deploying Llama 2

First we’ll need to deploy an LLM. Any LLM with an accessible REST endpoint would fit into a RAG pipeline, but we’ll be working with Llama 2 7B as it’s publicly available and we can pull the model to run in our environment. To access Llama 2, you can use the Hugging Face client. You’ll need to create a Hugging Face token. You’ll also need to request access to Llama 2 within Hugging Face and agree to Meta’s terms of use for the model (Note: this can sometimes take a few hours to get approved). Once you have your token and access approval, you can store the token as credentials using the Qwak Secret Service. Now, we can get started building and deploying Llama within Qwak.

In a local code editor, you’ll import and create a model class wrapping the Qwak Model Interface. The Qwak Model class has two main functions:

build() — the entrypoint for training our model
predict() — the entrypoint for serving predictions of our model

Since our model is pretrained, our build logic is actually pretty simple. We retrieve our Hugging Face credentials from the Qwak Secret Service Client, and pull the Llama 2 build from Hugging Face. And that’s it!

For our predict function, we’ll need to define how we extract the prompt from the incoming data request. We tokenize the prompt using the model specific tokenizer from Hugging Face, pass that prompt to the model’s generate() function, decode the response, and return it to the user. If there was more information you wanted to include in the request such as a user_id or session token, you could also configure this in the predict() logic, but for now, we’ll keep it simple.

import qwak
from qwak.model.schema import ModelSchema, ExplicitFeature
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import torch
from pandas import DataFrame
from qwak.model.base import QwakModel
from huggingface_hub import login
from qwak.clients.secret_service import SecretServiceClient

class Llama2MT(QwakModel):
   """The model class inherits from the QwakModel base class"""
   def __init__(self):
       self.model_id = "meta-llama/Llama-2-7b-chat-hf"
       self.model = None
       self.tokenizer = None

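   def build(self):
       # Authenticate against Hugging Face with the token kept in the Qwak Secret Service
       secret_service: SecretServiceClient = SecretServiceClient()
       hf_token = secret_service.get_secret("<huggingface-secret-name>")
       login(token=hf_token)
       # Pull the pretrained tokenizer and weights. The returned objects are not
       # kept here; initialize_model() loads the ones used for serving.
       tokenizer = AutoTokenizer.from_pretrained(self.model_id)
       model = AutoModelForCausalLM.from_pretrained(self.model_id)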

   def schema(self):
       model_schema = ModelSchema(
           inputs=[
               ExplicitFeature(name="prompt", type=str),
           ])
       return model_schema

   def initialize_model(self):
       self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
       secret_service: SecretServiceClient = SecretServiceClient()
       hf_token = secret_service.get_secret("<huggingface-secret-name>")
       login(token=hf_token)
       self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
       self.model = AutoModelForCausalLM.from_pretrained(self.model_id)
       self.model.to(device=self.device, dtype=torch.bfloat16)

   @qwak.api()
   def predict(self, df):
       input_text = list(df['prompt'].values)
       input_ids = self.tokenizer(input_text, return_tensors="pt")
       input_ids = input_ids.to(self.device)
       outputs = self.model.generate(**input_ids, max_new_tokens=100)
       decoded_outputs = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
       return pd.DataFrame([{"generated_text": decoded_outputs}])
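
The class above also defines schema() and initialize_model(), which the text does not describe. The following is a rough sketch of how the hooks appear to fit together, assuming that build() runs once at build time, initialize_model() runs once when a serving instance starts, and predict() runs per request. ToyModel and simulate_lifecycle are hypothetical stand-ins with no Qwak imports; they only illustrate the assumed ordering.

```python
class ToyModel:
    """Hypothetical stand-in for a QwakModel subclass; no Qwak imports needed."""

    def __init__(self):
        self.model = None
        self.calls = []  # records the order in which the hooks run

    def build(self):
        # Build time: fetch or train artifacts. Nothing is kept in memory here.
        self.calls.append("build")

    def initialize_model(self):
        # Serving start-up: load the model once so every request can reuse it.
        self.calls.append("initialize_model")
        self.model = str.upper  # placeholder "model"

    def predict(self, prompts):
        # Per request: run inference with the already-loaded model.
        self.calls.append("predict")
        return [self.model(p) for p in prompts]


def simulate_lifecycle(model, requests):
    """Assumed ordering: build once, initialize once, then predict per request."""
    model.build()
    model.initialize_model()
    return [model.predict(batch) for batch in requests]


toy = ToyModel()
responses = simulate_lifecycle(toy, [["hello"], ["rag", "llm"]])
print(toy.calls)
print(responses)
```

Under that assumption, loading the tokenizer and weights in initialize_model() rather than in predict() keeps the expensive load out of the request path.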

Now that our model class is defined, we’ll kick off a build. You can see the commands for building the model below. We aren’t actually training anything here, since the model is pretrained, but running the build process stores the artifact in the Qwak Build Repository so that we can easily serve it as a real-time endpoint.

Once our model has built successfully, we can deploy it as a real-time endpoint. The command to deploy is demonstrated below. We’ll also add some deployment configurations and use an a10.2xl GPU instance, as Llama 2 7B requires quite a bit of resources to run.

You can find a full example of the Llama 2 implementation in the Qwak examples repository here.

Deploying Embedding Model

Next, we’ll create a model that transforms and embeds our Qwak documentation text so that it can be persisted in the Qwak Vector store. For this model, we’ll be using the sentence transformer all-MiniLM-L12-v2 model from Hugging Face. We can use the same credentials from the first step of this tutorial.

Again, we’ll make a wrapper class around the Qwak Model Interface. Like the Llama build, we don’t need any complicated build logic since the model is pretrained. In the predict function, we transform our input text into a list, so that it can be handled by the SentenceTransformer encoding logic, define the batch size, and add a few configuration settings.

We return a DataFrame with a single column, embeddings, that holds one vector per input row.

import qwak
from qwak.model.base import QwakModel
from qwak.model.schema import ModelSchema, ExplicitFeature
from sentence_transformers import SentenceTransformer
from pandas import DataFrame
from helpers import get_device

class SentenceEmbeddingsModel(QwakModel):
   def __init__(self):
       self.model_id = "sentence-transformers/all-MiniLM-L12-v2"
       self.model = None
       self.device = None

   def build(self):
       qwak.log_metric({"val_accuracy": 1})

   def schema(self):
       return ModelSchema(
           inputs=[
               ExplicitFeature(name="input", type=str),
           ]
       )

   def initialize_model(self):
       self.device = get_device()
       print(f"Inference using device: {self.device}")
       self.model = SentenceTransformer(
           model_name_or_path=self.model_id,
           device=self.device,
       )

   @qwak.api()
   def predict(self, df):
       text_embeds = self.model.encode(
           df["input"].values.tolist(),
           convert_to_tensor=True,
           normalize_embeddings=True,
           device=self.device,
           batch_size=128,
       ).tolist()
       return DataFrame({"embeddings": text_embeds})

We’ll also need to build and deploy the embedding model as a real time endpoint so we can call our embedding function when querying the vector store.

You can find an example of this model in our Qwak examples repository and step by step instructions for deployment below.

Clone the repository locally
Make sure you have the Qwak CLI installed and configured.
Go to the sentence_transformers_poetry directory.
Run make build to kick off the training job for this model. You can navigate to the Models -> Builds tab in the Qwak UI and monitor the progress of the build.
Now that the model has been successfully trained and stored in the Qwak Model Repository, you can run make deploy to take this build version and deploy it as a real-time endpoint. You can also monitor the Deployment steps by going to the Models -> Deployments tab in the Qwak UI.
When the Deployment completes, click on the Test Model tab in the upper right hand corner of the platform, and Qwak will generate example inference calls that you can use to call your real time endpoint and test your predictions live!

Managing Embeddings in Vector Store

Collections are a Qwak organizational feature that allows you to structure and manage your various vector groupings across the vector store. Collections allow you to specify the distance metric (cosine, L2) as well as the number of vector dimensions, providing fine-grained control over the grouping and indexing of your data.
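
To make the two metric options concrete, here is a small pure-Python sketch of cosine distance and L2 distance. The helper names and sample vectors are made up for illustration. Because the embedding model in this post sets normalize_embeddings=True, its vectors have unit length, and for unit vectors the two metrics are tied together by the identity L2² = 2 × cosine distance, so they rank neighbors in the same order.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def l2_distance(a, b):
    """Euclidean (straight-line) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = normalize([1.0, 2.0, 3.0])
docs = {
    "close": normalize([1.0, 2.0, 2.5]),
    "far": normalize([-3.0, 0.5, 1.0]),
}

for name, vec in docs.items():
    cos_d = cosine_distance(query, vec)
    l2_d = l2_distance(query, vec)
    # For unit vectors: ||a - b||^2 = 2 - 2*cos(a, b) = 2 * cosine_distance
    assert math.isclose(l2_d ** 2, 2 * cos_d, abs_tol=1e-9)
    print(f"{name}: cosine={cos_d:.4f} l2={l2_d:.4f}")
```

With unnormalized vectors the two metrics can disagree, since L2 is sensitive to vector length and cosine is not.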

You can create collections in the UI, or define the collections as code using the qwak-sdk. For this tutorial, you can see an example collection we’ve created below. We select the cosine metric and 384 dimensions, which matches the output size of the all-MiniLM-L12-v2 embedding model. We also need to select a vectorizer.

A vectorizer is a deployed Qwak model that accepts data as an input and returns a vector in its predict function — just like the model we created in the previous step! Here we select the sentence-transformer model that we just deployed, and the Qwak Vector Store will automatically use this model’s embedding function when preparing input data for insertion or searching the vector store, allowing us to send free text to the Qwak collections API.
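
The idea of a collection that owns its vectorizer can be sketched with a toy in-memory class. This is not how the Qwak Vector Store is implemented: ToyCollection and toy_vectorizer are hypothetical, and the "embedding" is just normalized letter counts. The point is only that the caller sends free text, and the collection applies the same embedding function on both insert and search.

```python
import math
from typing import Callable, Dict, List


def toy_vectorizer(text: str) -> List[float]:
    """Hypothetical embedding: normalized letter counts (a real one is a model)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


class ToyCollection:
    """In-memory stand-in for a collection with an attached vectorizer."""

    def __init__(self, vectorizer: Callable[[str], List[float]]):
        self.vectorizer = vectorizer
        self.rows: Dict[str, dict] = {}

    def upsert(self, ids, natural_inputs, properties):
        # The collection, not the caller, turns text into vectors.
        for id_, text, props in zip(ids, natural_inputs, properties):
            self.rows[id_] = {"vector": self.vectorizer(text), "properties": props}

    def search(self, natural_input: str, top_results: int = 3):
        q = self.vectorizer(natural_input)  # the same function embeds the query
        scored = [
            (sum(a * b for a, b in zip(q, row["vector"])), row["properties"])
            for row in self.rows.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [props for _, props in scored[:top_results]]


toy_collection = ToyCollection(toy_vectorizer)
toy_collection.upsert(
    ids=["1", "2"],
    natural_inputs=["deploy a model", "zzz qqq xxx"],
    properties=[{"chunk_text": "deploy a model"}, {"chunk_text": "zzz qqq xxx"}],
)
print(toy_collection.search("model deployment", top_results=1))
```

Using one embedding function for both paths matters: vectors produced by different models are not comparable, so queries and stored data must share a vectorizer.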

Now that we have our model deployed and our collection in place, we are ready to start inserting our vectors.

Load Vectors

With our LLM, embedding model, and collection in place, we are ready to insert our documentation into the vector store. We’ve collected the documentation as a series of raw markdown files that are stored in a directory.

Once we read the documentation markdown files, we need a way to split the pages so that each piece becomes a usable vector. This is a crucial step, as it largely determines the quality and relevance of the search results that you feed into your LLM. If you parse too finely, by line or every few words for example, each result carries little context and you’ll be forced to pass many records into your RAG prompt. If you parse too coarsely, say by each page of documentation, you run the risk of missing relevant results that are spread over multiple pages. Ultimately, the chunking logic and chunk size require experimentation, knowledge of your underlying data, and often a mix of different sizes or even duplication of vectors to produce the best results.
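
One common way to experiment with chunk size is a fixed-size sliding window with overlap, so that text near a boundary appears in two neighboring chunks. The sketch below is an alternative to the heading-based split used in this post, not part of it; chunk_with_overlap is a hypothetical helper and the sizes are measured in characters for simplicity.

```python
from typing import List


def chunk_with_overlap(text: str,
                       max_chunk_size: int = 500,
                       overlap_size: int = 100) -> List[str]:
    """Split text into windows of max_chunk_size characters that share overlap_size characters."""
    if overlap_size >= max_chunk_size:
        raise ValueError("overlap_size must be smaller than max_chunk_size")
    step = max_chunk_size - overlap_size
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, max_chunk_size=10, overlap_size=4)
print(chunks)
# → ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

A larger overlap reduces the chance of cutting a relevant sentence in half, at the cost of storing and retrieving more duplicated text.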

Since our documentation has relatively concise sections that all begin with titles, we’ll split our text on `##`, the markdown notation for a second-level heading, so each major section will represent one vector.
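
One caveat worth checking on your own data: splitting on the bare string `##` also cuts inside `###` sub-headings, leaving a stray `#` at the start of the next chunk. The sketch below compares that behavior with a regular expression anchored to lines that start with exactly two `#` characters. The sample document and the split_on_h2 helper are hypothetical.

```python
import re
from typing import List


def split_on_h2(markdown: str) -> List[str]:
    """Split on lines that start with exactly two '#' characters followed by whitespace."""
    parts = re.split(r"(?m)^##(?!#)\s+", markdown)
    return [p.strip() for p in parts if p.strip()]


doc = """## Install
pip install something

### Notes
Sub-section text.

## Deploy
Run the deploy step.
"""

naive = [s for s in doc.split("##") if s != ""]
anchored = split_on_h2(doc)
print(len(naive), len(anchored))
# → 3 2
```

Whether sub-sections should become their own vectors is a design choice; the anchored version simply makes that choice explicit instead of accidental.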

After we split our documentation into chunks, we configure the properties that will accompany each vector in the Qwak Vector Store: the chunk_id, the chunk_parent_id (the page title), and the chunk_text.

import os
import tqdm
from typing import List
from uuid import uuid4
from qwak.vector_store import VectorStoreClient

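def transform_record(record: str,
                    title: str,
                    max_chunk_size: int = 500,
                    overlap_size: int = 100
                    ) -> List[dict]:
   """Split one markdown document into heading-delimited chunks with metadata."""

   # max_chunk_size and overlap_size are accepted but not used by this heading-based split
   split = record.split("##")
   # Drop empty and whitespace-only pieces
   chunks = [s for s in split if s.strip()]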
   transformed_records = []
   record_id = str(uuid4())
   for i, chunk in enumerate(chunks):
       chunk_id = f"{record_id}-{i + 1}"
       transformed_records.append({
           'chunk_id': chunk_id,
           'chunk_parent_id': title,
           'chunk_text': chunk
       })
   return transformed_records

def split_list_into_batches(input_list, batch_size):
   for i in range(0, len(input_list), batch_size):
       yield input_list[i:i + batch_size]

def insert_vector_data(chunks_array: List[dict],
                      collection_name: str,
                      batch_size: int = 10):
   client = VectorStoreClient()
   collection = client.get_collection_by_name(collection_name)

   batches = list(split_list_into_batches(chunks_array, batch_size))
   batches = tqdm.tqdm(batches)

   # Ingesting all the records in the vector store
   for batch in batches:
       print("Upserting batch to Vector Store")
       collection.upsert(
           ids=[
               str(uuid4()) for _ in range(len(batch))
           ],
           natural_inputs=[
               c["chunk_text"] for c in batch
           ],
           properties=batch
       )

def insert_text_into_vector_store(input_path: str, collection_name: str):
   """
   Inserts every file in a directory into the vector store
   :param input_path: The path of the directory holding the files to ingest
   :param collection_name: The name of the target collection
   """
   if os.path.isdir(input_path):
       for path in os.listdir(input_path):
           with open(os.path.join(input_path, path), 'r', encoding='ISO-8859-1') as f:
               contents = f.read()
           # Use the file name as the page title stored in chunk_parent_id
           chunk_array = transform_record(record=contents, title=path)
           insert_vector_data(chunks_array=chunk_array,
                              collection_name=collection_name)

insert_text_into_vector_store(
       input_path='data/qwak_docs',
       collection_name="qwak-docs"
)

With our data properly formatted, we’re ready to insert into the Qwak Vector store. We create an instance of the Qwak VectorStoreClient, and fetch the `qwak-docs` collection we created in the previous step. The collection object provides an upsert command that lets you directly insert raw data into the vector store. When raw data is upserted, Qwak will automatically call the sentence-transformer embedding model we defined when configuring the collection to convert the natural input into vectors before insertion and indexing.

The upsert function accepts the following fields:

Ids — A unique identifier for each vector
Properties — a list of dictionaries containing the metadata properties of the vector you want to store
Natural Input — The raw input data that will be converted to vectors
Data (Alternative to Natural Input) — List of vectors to be stored in the DB — if the data is already converted to vectors

For this upsert, we’ll pass in UUIDs for the vector IDs, the article chunk text for the natural input field, and the entire parsed chunk dictionary as the properties field.
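
Random UUIDs mean that re-running the ingestion inserts every chunk again under a new ID. If you would rather have a re-run overwrite existing vectors, one option is to derive the ID deterministically from the chunk itself. The sketch below assumes the store treats an existing ID as an update, which the name upsert suggests but which you should verify; build_upsert_payload is a hypothetical helper, and the sample batch is made up.

```python
from typing import Dict, List
from uuid import NAMESPACE_URL, uuid5


def build_upsert_payload(batch: List[dict]) -> Dict[str, list]:
    """Build the three parallel lists for one upsert call."""
    payload = {
        # uuid5 is deterministic: the same parent + text always yields the same ID
        "ids": [
            str(uuid5(NAMESPACE_URL, f'{c["chunk_parent_id"]}:{c["chunk_text"]}'))
            for c in batch
        ],
        "natural_inputs": [c["chunk_text"] for c in batch],
        "properties": batch,
    }
    lengths = {len(v) for v in payload.values()}
    assert len(lengths) == 1, "ids, natural_inputs and properties must align"
    return payload


batch = [
    {"chunk_id": "a-1", "chunk_parent_id": "intro.md", "chunk_text": "What is Qwak"},
    {"chunk_id": "a-2", "chunk_parent_id": "intro.md", "chunk_text": "Getting started"},
]
first = build_upsert_payload(batch)
second = build_upsert_payload(batch)
print(first["ids"] == second["ids"])
# → True
```

Since the keys match the keyword arguments used in insert_vector_data above, the result could be passed along as collection.upsert(**payload).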

Now that our data is stored in the Vector Store, we can query it to make sure it’s working. You can navigate to the Collections tab in the Qwak Platform, and directly query in the UI.

Additionally, you can search the vector store using the VectorStoreClient:

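### Query Vector Store using Qwak SDK
from qwak.vector_store import VectorStoreClient

# Fetch the collection created earlier before searching it
client = VectorStoreClient()
collection = client.get_collection_by_name("qwak-docs")

search_results = collection.search(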
   natural_input="vectors",
   top_results=5,
   output_properties=["chunk_id","chunk_text"],
   include_distance=True,
   include_vector=False
)
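
Before the retrieved chunks reach the LLM, they have to be joined into a single context string that fits the model's context window. The sketch below shows one possible way to do that. It makes no assumption about the shape of the search result objects: it takes the chunk_text values as plain strings, ordered most relevant first. build_context and the character budget are hypothetical choices.

```python
from typing import Iterable, List


def build_context(chunks: Iterable[str],
                  max_chars: int = 2000,
                  separator: str = "\n---\n") -> str:
    """Join retrieved chunk texts, most relevant first, within a character budget."""
    selected: List[str] = []
    used = 0
    for text in chunks:
        text = text.strip()
        if not text:
            continue  # skip empty hits
        cost = len(text) + (len(separator) if selected else 0)
        if used + cost > max_chars:
            break  # stop before exceeding the budget
        selected.append(text)
        used += cost
    return separator.join(selected)


hits = ["Vectors are stored in collections.", "", "A collection has a metric."]
print(build_context(hits, max_chars=60))
# → Vectors are stored in collections.
```

A character budget is only a rough proxy for tokens; counting tokens with the model's tokenizer would be more precise.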

Build RAG Pipeline with LangChain

Now it’s time to put it all together and implement our RAG model to make our LLM usable with our Qwak documentation. We’ll use LangChain as the RAG implementation framework, and we’ll use Streamlit, a lightweight framework for generating a chat UI, to demo our chat functionality.

The logic for the Streamlit application is below. We implement the get_text() and extract_answer() helper functions to handle the incoming prompt from the user and the output that is returned from the LLM. In the main() method, we create the chain object from LangChain, issue a query against the Qwak Vector Store using the prompt sent in by the user, then pass the results of the vector query along with the initial prompt to the LLM to receive our context-specific answer. Let’s dive a bit further into the chain implementation.

import streamlit as st
from streamlit_chat import message

from chain import llm_chain_response
from vector_store import retrieve_vector_context

QWAK_MODEL_ID = 'llama2'

def get_text() -> str:
   """
   Get the text from the user
   :return: string
   """
   input_text = st.text_input(label="You: ",
                              key="input")
   return input_text

def extract_answer(text):
   split_text = text.split("Answer:")
   return split_text[-1]

# StreamLit UI
left_co, cent_co, last_co = st.columns(3)
with cent_co:
   image = open("qwak.png", "rb").read()
   st.image(image, use_column_width="auto")

st.write("### Vector Store RAG Demo")

if "generated" not in st.session_state:
   st.session_state["generated"] = []

if "past" not in st.session_state:
   st.session_state["past"] = []

def main():
   """
   Streamlit app main method
   """

   # Get a new chat chain to query LLMs
   chat_chain = llm_chain_response(model_id=QWAK_MODEL_ID)

   use_content = st.checkbox(label="Use Vector Store", key="use-vector-store")
   user_input = get_text()

   if user_input:
       with st.spinner("Thinking..."):
           query = user_input
           if use_content:
               context = retrieve_vector_context(query=query,
                                                 collection_name="qwak-docs",
                                                 output_key='chunk_text',
                                                 top_results=10,
                                                 properties=[
                                                     "chunk_text"
                                                 ])
           else:
               context = ""
              
           output = chat_chain({
               "input": query,
               "context": context
           })
           answer = extract_answer(output["text"])

           # Add the responses to the chat state
           st.session_state.past.append(user_input)
           st.session_state.generated.append(answer)

   if st.session_state["generated"]:
       for i in range(len(st.session_state["generated"]) - 1, -1, -1):
           message(st.session_state["generated"][i], key=str(i))
           message(st.session_state["past"][i], is_user=True, key=str(i) + "_user", avatar_style="shapes")

if __name__ == "__main__":
   main()
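
The extract_answer() helper exists because the Llama 2 endpoint defined earlier decodes the full generated sequence, which includes the echoed prompt followed by the completion. The sketch below reproduces the helper and runs it on a made-up raw output to show what it keeps.

```python
def extract_answer(text: str) -> str:
    # Same logic as the app: keep whatever follows the last "Answer:" marker
    return text.split("Answer:")[-1]


# Hypothetical raw output: the prompt is echoed back, followed by the completion
raw = (
    "[INST]\nsome context\nQuestion: What is a collection?[/INST]\n"
    "Answer:\n A grouping of vectors."
)

print(repr(extract_answer(raw)))
# → '\n A grouping of vectors.'
print(repr(extract_answer(raw).strip()))
# → 'A grouping of vectors.'
print(repr(extract_answer("no marker here")))
# → 'no marker here'
```

Two things to keep in mind: the result still carries leading whitespace, so a strip() may be worthwhile, and if the completion itself contains the string "Answer:", only the text after its last occurrence survives.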

The chain is a wrapper object that defines the LLM we’ll be querying and the vector context to be passed in from the vector store query. The chain includes a prompt template that can be used to tailor and focus the context of the query. The chain packages together the initial query and the relevant context returned from the Vector Store to build one prompt with all the information needed to answer our specific questions about the Qwak documentation.

import streamlit as st
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from qwak_llm import Qwak

@st.cache_resource
def llm_chain_response(model_id: str) -> LLMChain:

   prompt_template = """
<<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
Answer the question based on the context below
<</SYS>>

[INST]
{context}
Question: {input}[/INST]
Answer:
"""

   prompt = PromptTemplate.from_template(prompt_template)

   llm = Qwak(
       model_id=model_id
   )

   chain = LLMChain(
       prompt=prompt,
       llm=llm,
   )

   return chain
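
To see what the LLM actually receives, it can help to render the template by hand. The sketch below uses plain str.format as a stand-in for the prompt template, on the assumption that the template uses the default f-string style placeholders. The system block is omitted for brevity, and the context and question are made up.

```python
# Shortened version of the template above; the <<SYS>> block is left out
TEMPLATE = """[INST]
{context}
Question: {input}[/INST]
Answer:
"""


def render_prompt(context: str, question: str) -> str:
    # str.format fills the same {context} and {input} placeholders
    return TEMPLATE.format(context=context, input=question)


with_ctx = render_prompt("Collections group vectors by metric.", "What is a collection?")
without_ctx = render_prompt("", "What is a collection?")
print(with_ctx)
```

When the "Use Vector Store" checkbox in the app is off, the context is an empty string, so the model sees only the question and must answer from its own training data.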

For the LLM, you can choose from a number of different LLM objects available in LangChain, such as OpenAI or Bard for managed LLMs. We’ve implemented our own Qwak LLM object that you can see below. The Qwak LLM object wraps the LLM interface from LangChain and defines a _call() method that specifies how to issue the query to the LLM. In the code below, you can see that we create an instance of the Qwak RealTimeClient, specifying the model_id of the Llama 2 model that we built and deployed in previous steps, and issue a query against that model using the predict() method.

import logging
from typing import Any, Dict, List, Mapping, Optional

import pandas as pd
from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from langchain.llms.utils import enforce_stop_tokens
from langchain.pydantic_v1 import Extra, Field, root_validator
from qwak_inference import RealTimeClient

logger = logging.getLogger(__name__)

class Qwak(LLM):
   """Qwak large language models.

   To use, you should have the ``qwak-inference`` python package installed.

   Any parameters that are valid to be passed to the call can be passed
   in, even if not explicitly saved on this class.

   Example:
       .. code-block:: python

           from qwak_llm import Qwak
           llm = Qwak(model_id="")

   """

   model_id: str = ""
   """Qwak model id to use"""

   model_kwargs: Dict[str, Any] = Field(default_factory=dict)
   """Holds any model parameters valid for `create` call not
   explicitly specified."""

   class Config:
       """Configuration for this pydantic config."""

       extra = Extra.forbid

   @root_validator(pre=True)
   def build_extra(cls, values: Dict[str, Any]) -> Dict[str, Any]:
       """Build extra kwargs from additional params that were passed in."""
       all_required_field_names = {field.alias for field in cls.__fields__.values()}

       extra = values.get("model_kwargs", {})
       for field_name in list(values):
           if field_name not in all_required_field_names:
               if field_name in extra:
                   raise ValueError(f"Found {field_name} supplied twice.")
               logger.warning(
                   f"""{field_name} was transferred to model_kwargs.
                   Please confirm that {field_name} is what you intended."""
               )
               extra[field_name] = values.pop(field_name)
       values["model_kwargs"] = extra
       return values

   @property
   def _identifying_params(self) -> Mapping[str, Any]:
       """Get the identifying parameters."""
       return {
           **{"model_id": self.model_id},
           **{"model_kwargs": self.model_kwargs},
       }

   @property
   def _llm_type(self) -> str:
       """Return type of llm."""
       return "qwak"

   def _call(
           self,
           prompt: str,
           stop: Optional[List[str]] = None,
           run_manager: Optional[CallbackManagerForLLMRun] = None,
           **kwargs: Any,
   ) -> str:
       """Call to Qwak RealTime model"""
       params = self.model_kwargs or {}
       params = {**params, **kwargs}

       columns = ["prompt"]
       data = [[prompt]]
       input_ = pd.DataFrame(data, columns=columns)
       # TODO - Replace this with a POST request so we don't import RealTimeClient
       client = RealTimeClient(model_id=self.model_id)

       response = client.predict(input_)
       try:
           text = response[0]["generated_text"][0]
       except KeyError:
           raise ValueError("LangChain requires 'generated_text' key in response.")

       if stop is not None:
           text = enforce_stop_tokens(text, stop)
       return text
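
The stop argument in _call() lets a caller cut the generation at the first occurrence of any stop sequence. The sketch below is a stand-in that illustrates the idea, not LangChain's own code; cut_at_stop and the sample text are hypothetical.

```python
import re
from typing import List, Optional


def cut_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    """Truncate text at the first occurrence of any stop sequence."""
    if not stop:
        return text
    # Escape each sequence so characters like "." are matched literally
    pattern = "|".join(re.escape(s) for s in stop)
    return re.split(pattern, text, maxsplit=1)[0]


generated = "A collection groups vectors.\nQuestion: and what else?"
print(cut_at_stop(generated, stop=["\nQuestion:", "</s>"]))
# → A collection groups vectors.
```

For a chat-style prompt like the one in this post, a stop sequence such as "\nQuestion:" keeps the model from inventing a follow-up question after its answer.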

We can now deploy the Streamlit/RAG service and use our interface. In a local Python environment, make sure you have the Qwak SDK and Streamlit installed, and run streamlit run app.py. The service should open a browser on localhost and you can begin prompting. You can see an example of the query functionality below.

You can find the full example project of this blog post here.

Conclusion

While RAG architecture is simple conceptually, there are several components that can be challenging to implement and deploy. With Qwak, you can easily build, deploy and orchestrate each of these services — LLM, Vector Store, custom embedding model, RAG service — and manage them all in one platform.

Originally published at https://www.qwak.com.