Ask Your Web Pages Using Mistral-7b & LangChain

Chat with Web Pages Using RAG — Mistral-7b, Hugging Face, LangChain, ChromaDB

Nour Eddine Zekaoui
Nov 2, 2023 · 13 min read

Introduction

How can you augment LLMs with data they weren’t trained on? Retrieval Augmented Generation (RAG) is the way to go. Let me explain what it means and how it actually works.

Let’s say that you’ve got your own dataset for example documents of text from your company. How can you make ChatGPT and other LLMs learn about it and answer questions?

Well, this can easily be done in four steps:

  1. Embedding: Embed your documents with an embedding model like text-embedding-ada-002 from OpenAI or S-BERT. Embedding a document means transforming its sentences or chunks of words into vectors of numbers. The idea is that sentences that are similar to each other should end up close together in vector space, while sentences that are different should be further apart.
  2. Vector Store: Once you have these vectors, you can store them in a vector store like ChromaDB, FAISS, or Pinecone. A vector store is like a database, but as the name says, it indexes and stores vector embeddings for fast retrieval and similarity search.
  3. Query: Now that your document is embedded and stored, when you ask an LLM a specific question, your query is embedded too, and the vector store returns the sentences that are closest to your question, for example in terms of cosine similarity.
  4. Answering Your Question: Once the closest sentences have been found, they are injected into the prompt, and that's it! The LLM can now answer specific questions about data it wasn't trained on, without any retraining or fine-tuning. How cool is that? (A minimal sketch of these four steps follows this list.)
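To make these four steps concrete, here is a minimal, self-contained sketch using the open-source sentence-transformers library. The model name and the toy documents are illustrative assumptions, not part of the original walkthrough:

# A minimal sketch of the four RAG steps, assuming sentence-transformers is installed.
# The model and documents below are illustrative only.
from sentence_transformers import SentenceTransformer, util

documents = [
    "Our company was founded in 2015 and is headquartered in Madrid.",
    "The support team is reachable at support@example.com on weekdays.",
    "Employees get 25 days of paid vacation per year.",
]

# 1. Embedding: turn each document into a vector.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, convert_to_tensor=True)

# 2. Vector store: here just a tensor kept in memory; ChromaDB, FAISS, or Pinecone
#    play this role in a real application.

# 3. Query: embed the question and find the closest document by cosine similarity.
question = "How many vacation days do employees get?"
query_vector = embedder.encode(question, convert_to_tensor=True)
scores = util.cos_sim(query_vector, doc_vectors)[0]
best_doc = documents[int(scores.argmax())]

# 4. Answering: inject the retrieved context into the prompt sent to the LLM.
prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
print(prompt)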

For more information about RAG, please check out this excellent IBM video by Marina Danilevsky, a Senior Research Scientist at IBM.

Pushing Boundaries

In my last blog post, Bring Your Own Data to LLMs Using LangChain & LlamaIndex, we augmented the knowledge of ChatGPT, deployed on Azure, with private data to make it more factual. In other words, we implemented a RAG application using two commercial models from Azure OpenAI: the ada-embeddings-001 model for embeddings and GPT-3.5 Turbo, which generates responses based on user questions and relevant context. That relevant context consists of documents retrieved in a semantic search step.

One unfortunate reality is that this service is not free: you are charged for every token the model generates, which is one of the main drawbacks of commercial, closed-source models. AI should ideally be open-source and democratized, as Clem Delangue, CEO at 🤗, has pointed out:

It is incredibly hard to start without open models & datasets.

Getting Started

We have not said everything yet, so fasten your seat belt. We will now embark on a high-speed coding journey to demonstrate how you can build a completely free RAG system using open-source models hosted on the Hugging Face Model Hub, coding each component of its architecture.

RAG Architecture — Image By Author

Installations

These lines of code install several Python libraries and packages using the pip package manager; the --quiet flag reduces the amount of output displayed during the installation process, making it less verbose.

!pip install gradio --quiet
!pip install xformers --quiet
!pip install chromadb --quiet
!pip install langchain --quiet
!pip install accelerate --quiet
!pip install transformers --quiet
!pip install bitsandbytes --quiet
!pip install unstructured --quiet
!pip install sentence-transformers --quiet

Imports

In the following script, we import a wide range of libraries and modules for advanced natural language processing and text generation tasks. Essentially, we are setting up an environment for working with language models, including Hugging Face models, as well as various tools and utilities for handling and processing text data.

We mainly import PyTorch for deep learning capabilities and Gradio for building interactive ML model interfaces. Additionally, we import modules from the LangChain library, which include templates for creating prompts, various chain models for language understanding and generation, text embeddings, and document loaders. Our code also integrates the powerful Transformers library, which allows for seamless use of Hugging Face’s state-of-the-art models for a wide range of NLP applications.

import torch
import gradio as gr

from textwrap import fill
from IPython.display import Markdown, display

from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

from langchain import PromptTemplate
from langchain import HuggingFacePipeline

from langchain.vectorstores import Chroma
from langchain.schema import AIMessage, HumanMessage
from langchain.memory import ConversationBufferMemory
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredMarkdownLoader, UnstructuredURLLoader
from langchain.chains import LLMChain, SimpleSequentialChain, RetrievalQA, ConversationalRetrievalChain

from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

import warnings
warnings.filterwarnings('ignore')

Base LLM

Mistral-7b, developed by Mistral AI, is taking the open-source LLM landscape by storm. This new open-source LLM outperforms LLaMA-2 on many benchmarks, as illustrated in the following image taken from its paper:

Performance of Mistral 7B and different Llama models on a wide range of benchmarks.

The following code snippet sets up a text-generation pipeline around our base LLM, Mistral-7b-Instruct, an instruction-tuned pre-trained language model from Mistral AI. It configures the quantization settings, the tokenizer, and the generation parameters, and creates a pipeline that can generate text with Mistral-7b under those configurations. Let's break down what's happening:

  • quantization_config = BitsAndBytesConfig(...): Here, a quantization configuration is defined using BitsAndBytesConfig. Quantization is a technique used to reduce the memory and computation requirements of deep learning models, typically by using fewer bits (4 bits in our case) to represent model parameters.
  • tokenizer = AutoTokenizer.from_pretrained(...): This line initializes a tokenizer for the Mistral-7b model, allowing you to preprocess text data for input to the model.
  • model = AutoModelForCausalLM.from_pretrained(...): This initializes the pre-trained language Mistral-7b model for causal language modeling. The model is configured with various parameters, including the quantization configuration, which was set earlier.
  • generation_config = GenerationConfig.from_pretrained(...): A generation configuration is created for the model, specifying various generation-related settings, such as the maximum number of tokens, temperature for sampling, top-p sampling, and repetition penalty.
  • pipeline = pipeline(...): Finally, a text generation pipeline is created using the pipeline function. This pipeline is set up for text generation, and it takes the pre-trained model, tokenizer, and generation configuration as inputs. It's configured to return full-text outputs.

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=quantization_config,
)

generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = 1024
generation_config.temperature = 0.0001
generation_config.top_p = 0.95
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15

pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    generation_config=generation_config,
)

HuggingFacePipeline is a class that allows you to run Hugging Face models locally. It is used to access and utilize a wide range of pre-trained ML models hosted on the Hugging Face Model Hub. In our case, we will use it within our LangChain environment as a local wrapper for interacting with Hugging Face models. When working with HuggingFacePipeline, installing xformers is recommended for a more memory-efficient attention implementation. This is why we installed it above!

llm = HuggingFacePipeline(
    pipeline=pipeline,
)

Let’s have some fun! Before connecting our base LLM, Mistral-7b, to our private data, let's first ask it some general questions. Of course, it will respond based on the general knowledge it acquired during pre-training.

query = "Explain the difference between ChatGPT and open source LLMs in a couple of lines."
result = llm(
    query
)

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

Very powerful! But what would happen if we asked it a question it likely never encountered during its pre-training phase? I am referring to the GenIA Ecosystem implemented by Hiberus.

query = "What is Hiberus GenIA Ecosystem?"
result = llm(
    query
)

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

Disappointing! This is not the expected answer. The GenIA Ecosystem is even cooler than that, and this happens because the Mistral-7b LLM never saw any information about the GenIA Ecosystem during its pre-training. I promise to walk you through obtaining the correct answer in the upcoming sections.

Embeddings

After setting up our base LLM, it is time to choose an embedding model. As you know, each document must be converted into an embedding vector to enable semantic search against the user's query, which is also embedded. To achieve this, we will use the GTE embedding model trained by Alibaba DAMO Academy and hosted on Hugging Face. It's worth noting that this model is both free and powerful. To get our task done, we will use the HuggingFaceEmbeddings class, a local pipeline wrapper for interacting with the GTE model hosted on the Hugging Face Hub.

embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)
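As a quick sanity check (not part of the original post), you can embed a test sentence and inspect the vector size; gte-large produces 1,024-dimensional embeddings:

# Optional sanity check: embed a sample query and inspect the vector length.
sample_vector = embeddings.embed_query("What is the GenIA Ecosystem?")
print(len(sample_vector))  # gte-large returns 1024-dimensional vectors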

Prompt Template

Did you know that we can give our base LLM an identity and make it behave according to our preferences, controlling the model’s output without explicitly specifying everything in the user’s query or prompt? This is achieved through prompt templates, which are pre-defined recipes for generating prompts for language models. In other contexts, giving an LLM an identity can be done through a System Message instead.

We use PromptTemplate to create a structured prompt. A template may include instructions, n-shot examples, and specific context and questions suitable for a particular task.

template = """
[INST] <<SYS>>
Act as a Machine Learning engineer who is teaching high school students.
<</SYS>>

{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

Let’s ask!

query = "Explain what are Deep Neural Networks in 2-3 sentences"
result = llm(prompt.format(text=query))

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

Data Loading

Data Indexing — Image By Author

To obtain an accurate answer to our previous question, "What is the Hiberus GenIA Ecosystem?", we will have to connect our LLM to information about the GenIA Ecosystem.

We’re in luck! There are two web pages that hold the key to understanding the GenIA Ecosystem. These web pages 🌐 can be found right on the Hiberus website. They’re like treasure troves of information, offering in-depth insights into this groundbreaking ecosystem recently launched by Hiberus.

Now, you might be wondering how to proceed with this data-loading mission. Fortunately, we have a script that's up to the task. Let's take a look at it: the UnstructuredURLLoader is your magic wand for obtaining the information you seek. Once you run this script, you'll have a collection of documents at your disposal, each holding a piece of the GenIA puzzle: basically, two documents, one for each link.

urls = [
    "https://www.hiberus.com/expertos-ia-generativa-ld",
    "https://www.hiberus.com/en/experts-generative-ai-ld",
]

loader = UnstructuredURLLoader(urls=urls)
documents = loader.load()

len(documents)
# Output: 2

We’ve got two hefty documents overflowing with data, and that might stretch our Mistral-7b LLM’s context window. To keep everything in check, we break them into 21 smaller documents or chunks, each capped at 1,024 characters. Additionally, we set the chunk overlap to 64 characters to ensure some context continuity between consecutive chunks. Stay tuned for the next step in managing this data adventure!

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts_chunks = text_splitter.split_documents(documents)

len(texts_chunks)
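If you want to verify the split yourself, a short (optional) inspection like the one below prints the size of each chunk and a preview of the first one:

# Optional: inspect the chunk sizes and preview the first chunk.
print([len(chunk.page_content) for chunk in texts_chunks])
print(texts_chunks[0].page_content[:300])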

Data Ingestion

After we’ve got our manageable data chunks, the next step is to embed and index them in ChromaDB, our vector store. The best part? It’s a breeze and can be accomplished with just a single line of code!

db = Chroma.from_documents(texts_chunks, embeddings, persist_directory="db")
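Before plugging the vector store into a chain, you can (optionally) run a direct similarity search to confirm that relevant chunks come back for a sample query. This quick sanity check is not part of the original walkthrough:

# Optional sanity check: retrieve the two most similar chunks for a sample query.
retrieved_docs = db.similarity_search("What is the GenIA Ecosystem?", k=2)
for doc in retrieved_docs:
    print(doc.metadata.get("source"), "->", doc.page_content[:150])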

Once our data is indexed, in the script below, we tweak our prompt template to match our needs and give our RAG model the persona of a Marketing Manager Expert!

Moreover, to combine our LLM with the vector database retrieval capabilities, we use the crucial chaining component RetrievalQA with k=2. This setup ensures that the retriever outputs two relevant chunks, which are then used by the LLM to formulate the answer when a question is presented.

template = """
[INST] <<SYS>>
Act as a Hiberus marketing manager expert. Use the following information to answer the question at the end.
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

Querying

Awesome! Our RAG system is all set to answer your questions. So, let’s dive in and ask it some questions — including the one we missed earlier, just in case you’ve forgotten it. Let’s have some fun!

Example #1

query = "What is GenAI Ecosystem?"
result_ = qa_chain(
    query
)
result = result_["result"].strip()

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

Example #2

query = "Why Hiberus has created GenAI Ecosystem?"
result_ = qa_chain(
    query
)
result = result_["result"].strip()

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

No complaints! We got great answers to both questions, including the one we missed earlier. Note that we can also print the source documents, that is, the reference documents from which the LLM generated its answers. The stage is now yours; consider the following line as the starting point for your exploration.

result_["source_documents"]
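For a more readable view of those references, you could loop over them and print each source URL with a short excerpt. This is a small convenience snippet, not part of the original code:

# Optional: print each reference document's source URL and a short excerpt.
for doc in result_["source_documents"]:
    print(doc.metadata.get("source"))
    print(fill(doc.page_content[:300], width=100))
    print("-" * 100)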

Follow-Up Q/A

In real-world scenarios, follow-up chat is useful, especially with conversational AI assistants. It enables users to engage in natural conversations with the model while retaining chat history in the model’s context, so they can implicitly refer to something discussed in previous messages or bring up topics from earlier in the chat. It’s like having a friendly chat with a helpful AI buddy! 🗨️💬

To make this happen, we first make a few tweaks to the prompt template. Then, we use ConversationBufferMemory to store the conversation in memory and retrieve the messages later on. Finally, we employ the chaining component ConversationalRetrievalChain to combine our LLM, Mistral-7b, with the vector database and the chat history, all in order to enhance the user's conversation experience!

Image By Author

custom_template = """You are a Hiberus Marketing Manager AI Assistant. Given the
following conversation and a follow-up question, rephrase the follow-up question
to be a standalone question. At the end of the standalone question, add this:
'Answer the question in English.' If you do not know the answer, reply with 'I am sorry, I don't have enough information'.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:
"""

CUSTOM_QUESTION_PROMPT = PromptTemplate.from_template(custom_template)

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    memory=memory,
    condense_question_prompt=CUSTOM_QUESTION_PROMPT,
)

Let’s ask again!

query = "Who you are?"
result_ = qa_chain({"question": query})
result = result_["answer"].strip()

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

Another question!

query = "What is GenIA Ecosystem?"

result_ = qa_chain({"question": query})
result = result_["answer"].strip()

display(Markdown(f"<b>{query}</b>"))
display(Markdown(f"<p>{result}</p>"))

If you’re still unsure about your chat history, you can run these code snippets to take a look at your questions in HumanMessages and the model responses in AIMessages. This will give you a clear view of the conversation and help address any doubts you might have. It’s a handy way to keep track of the interaction! 🕵️‍♂️💬

memory.chat_memory.messages
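If you prefer labeled output, a small loop (an optional addition, not in the original code) can tag each stored message as coming from the user or the assistant:

# Optional: label each stored message as user (HumanMessage) or assistant (AIMessage).
for message in memory.chat_memory.messages:
    role = "User" if isinstance(message, HumanMessage) else "Assistant"
    print(f"{role}: {message.content}\n")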

Gradio Chat UI

Gradio is your speedy ticket to demoing your RAG model with a user-friendly web interface that anyone can access from anywhere! Here's how it works: we set up a nifty function called querying(). It takes the query as its main input, along with a second argument called history, which Gradio's ChatInterface passes to the callback even though we don't use it here. When you fire up this function, it returns the response generated by our superstar model, Mistral-7b. It's as simple as that! 🚀

def querying(query, history):
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

    qa_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=db.as_retriever(search_kwargs={"k": 2}),
        memory=memory,
        condense_question_prompt=CUSTOM_QUESTION_PROMPT,
    )

    result = qa_chain({"question": query})
    return result["answer"].strip()

Launch the Gradio chat interface.

iface = gr.ChatInterface(
    fn=querying,
    chatbot=gr.Chatbot(height=600),
    textbox=gr.Textbox(placeholder="What is GenAI Ecosystem?", container=False, scale=7),
    title="HiberusBot",
    theme="soft",
    examples=["Why Hiberus has created GenAI Ecosystem?",
              "What is GenAI Ecosystem?"],
    cache_examples=True,
    retry_btn="Repetir",
    undo_btn="Deshacer",
    clear_btn="Borrar",
    submit_btn="Enviar",
)

iface.launch(share=True)

Your final user interface should look like the following image. Amazing, isn't it? 🚀

Gradio UI — Image by Author

Conclusion

RAG applications are turning the AI landscape upside down, thanks to the leaps made by large language models. Tools like LangChain, LlamaIndex, and similar frameworks are paving the way for swift development of applications that tap into the full potential of LLMs. This includes augmenting and enhancing LLMs’ knowledge with private data like PDFs, URLs, videos, and more, data that they’ve never encountered during their initial training.

Indeed, we haven't yet mentioned that you can also create a RAG application using data from the entire internet, not just a few links or web pages. You can achieve this by first employing a retriever to dynamically fetch relevant web pages from the internet, using Google Search APIs, for instance, or any other alternative. Then, you can use a re-ranker to sort and rank the content from all the retrieved web pages, providing the LLM with the relevant context needed to generate the perfect answer for a given query.
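As a rough illustration of that retrieve-then-re-rank idea, the sketch below assumes a hypothetical web_search() helper (standing in for a Google Search API, SerpAPI, or similar) and uses a cross-encoder from sentence-transformers to re-rank the retrieved passages before stuffing them into the prompt. It is a sketch under those assumptions, not a definitive implementation:

# Rough sketch of retrieve-then-re-rank over web results.
# web_search() is a hypothetical helper standing in for a real search API.
import numpy as np
from sentence_transformers import CrossEncoder

def web_search(query):
    """Hypothetical: call a search API and return the text of the top pages."""
    raise NotImplementedError("Plug in a Google Search API, SerpAPI, or similar here.")

query = "What is the Hiberus GenIA Ecosystem?"
passages = web_search(query)

# Re-rank the retrieved passages with a cross-encoder and keep the top three.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, passage) for passage in passages])
top_passages = [passages[i] for i in np.argsort(scores)[::-1][:3]]

# Stuff the re-ranked context into the prompt and let the LLM answer.
context = "\n\n".join(top_passages)
rag_prompt = f"[INST] Use the following context to answer.\n\n{context}\n\n{query} [/INST]"
print(llm(rag_prompt))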

The exciting part? RAG can also be implemented securely in the cloud. You've got options like Azure OpenAI On Your Data, Amazon Bedrock, and a whole array of services in GCP. It's a revolution in AI with limitless possibilities! 🚀💻

You can access the complete code in my GitHub repository, and you can find the Colab notebook below:

Thanks for reading, if you like it, give the 👏 and follow me to see my AI posts in your feed about state-of-the-art models in the future. Stay tuned, and see you soon! 📻👋😊

References

  1. Lewis, P., et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
  2. Jiang, A. Q., et al. 2023. Mistral 7B. ArXiv, abs/2310.06825.
  3. Li, Z., et al. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. ArXiv, abs/2308.03281.
  4. Venelin Valkov. 2023. Mistral 7B — better than Llama 2? | Getting started, Prompt template & Comparison with Llama 2.
