NADSOFT

Technology implementation showcases

Building an Open Source Multi-Modal RAG System


In this new adventure, we will delve into the process of building a Retrieval-Augmented Generation (RAG) system around an open-source Multi-Modal Large Language Model (MLLM). Notably, we will do this without relying on LangChain or LlamaIndex; instead, we will leverage ChromaDB and the Hugging Face ecosystem.

Let’s embark on this journey to explore how to create an efficient RAG system by combining the power of open-source technologies such as ChromaDB and Hugging Face in the realm of multi-modal large language model applications.

The Map 🗺️

  1. What is RAG (Retrieval-Augmented Generation)?
  2. Why RAG?
  3. What is multi-modality?
  4. What is an MLLM (Multi-Modal Large Language Model)?
  5. Building the RAG pipeline
  6. References and the code

What is RAG?

Retrieval-Augmented Generation (RAG): Enhancing AI Understanding and Output

In the realm of artificial intelligence, Retrieval-Augmented Generation (RAG) stands out as a transformative technique, refining the capabilities of Large Language Models (LLMs). At its essence, RAG enhances the specificity of AI responses by allowing models to dynamically retrieve real-time information from external sources.

Large Language Models, like GPT-3, excel in generating human-like language but face limitations in providing up-to-date or domain-specific information. RAG addresses this by integrating a retrieval mechanism that pulls in relevant facts from external knowledge bases, ensuring responses are both linguistically sound and factually accurate.

The architecture seamlessly combines generative abilities with a dynamic retrieval process, enabling AI to adapt to evolving information in various domains. Unlike extensive retraining, RAG offers a cost-effective solution, allowing AI to stay current and relevant without overhauling the entire model.

In other Words

Imagine you have a super smart robot friend. This robot friend is good at talking and saying smart things, but sometimes it doesn’t know everything. Now, we have a special trick called Retrieval-Augmented Generation, or RAG for short.

RAG helps the robot friend become even smarter by looking up information from a big book of facts when it needs to answer a question or talk about something specific. So, instead of just saying things from its brain, it can now check this big book to make sure it’s giving the best and most accurate answers. It’s like having a cool encyclopedia for the robot friend, making it even more awesome to chat with us!

Why RAG?

  1. Enhanced Accuracy and Reliability:
  • RAG addresses the unpredictability of Large Language Models (LLMs) by redirecting them to authoritative knowledge sources.
  • It mitigates the risk of presenting false or outdated information, ensuring more accurate and reliable responses.

  2. Increased Transparency and Trust:
  • Generative AI models, like LLMs, often lack transparency, making it challenging to trust their outputs.
  • RAG introduces transparency by giving organizations greater control over the generated text output, addressing concerns about bias, reliability, and compliance.

  3. Mitigation of Hallucinations:
  • LLMs are prone to generating hallucinated responses: coherent but inaccurate or fabricated information.
  • RAG helps mitigate this issue by ensuring that responses are grounded in authoritative sources, reducing the risk of misleading recommendations in critical sectors like finance.

  4. Improved Decision-Making in High-Stakes Environments:
  • In sectors like finance, where accuracy, credibility, and timeliness are paramount, RAG significantly enhances performance.
  • Real-time updates and reliance on authoritative sources reduce the chance of catastrophic losses, regulatory issues, or costly mistakes in decision-making processes.

  5. Cost-Effective Adaptability:
  • RAG provides a cost-effective approach to improving AI output without the need for extensive retraining or fine-tuning.
  • Organizations can stay current and relevant by dynamically fetching specific details as needed, ensuring the AI’s adaptability to evolving information.

What is multi-modality?

Dear adventurer, consider this: when you hear someone’s voice, you recognize the person, and when you see them, you also know who they are. In essence, multi-modality involves having two inputs — audio and visual — and producing a single output, allowing for a richer and more comprehensive understanding.

In other detailed words with CLIP as an example

In simple terms, multimodal learning involves teaching computers / AI models to understand and learn from different types of information, like images, text, or speech. This is useful because it allows models to make better predictions, mimicking how humans learn.

The model produces the same (or very similar) embedding vectors for different inputs that represent the same thing. Common multi-modal tasks include:

  1. Image2Text: This part focuses on improving the captioning of complex images using transformer-based architectures.
  2. Text2Image: Here, the idea is to use textual input to generate visual representations. Advances in Natural Language Processing (NLP) enable the encoding of text into embedding vectors, guiding the image generation process.
  3. Images supporting language models: This task focuses on integrating visual elements into purely textual language models. While traditional models infer word meaning from text context alone, this task explores adding a visual dimension to enhance language models.

One good example of such a model is OpenAI’s CLIP.

The CLIP model from OpenAI learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.

In a simple manner, it generates the same (very similar) vector for an image of a cat and the word ‘cat’.
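To make that intuition concrete, here is a minimal sketch using the open_clip library (the same open_clip_torch package we install later). The image path and the pretrained-weights tag are my own placeholder choices, not part of the original article.

import open_clip
import torch
from PIL import Image

# load a CLIP model with its matching image preprocessing and text tokenizer
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open('cat.jpg')).unsqueeze(0)  # placeholder path to a cat photo
texts = tokenizer(['a photo of a cat', 'a photo of a dog'])

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(texts)
    # normalize so the dot product becomes the cosine similarity
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    print(image_emb @ text_emb.T)  # the 'a photo of a cat' text should score highest

Because the image and both texts land in the same embedding space, the similarity score tells you which description matches the picture, which is exactly the property our retrieval step relies on.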

What is an MLLM (Multi-Modal Large Language Model)?

“Yes, you are right, the LLM is capable of vision.”

Like GPT-4V and Gemini Vision.

This is the idea behind multi-modal large language models (MLLMs): models that integrate various data types, including images, text, audio, and more. While large language models (LLMs) like GPT-3, BERT, and RoBERTa excel at text-based tasks, they struggle to understand and process other data types. To address this limitation, multi-modal models combine different modalities, enabling a more comprehensive understanding of diverse data.

Multi-Modal Large Language Models (MLLM): Revolutionizing Data Understanding

Multimodal large language models (MLLM) represent a paradigm shift in natural language processing, extending beyond traditional text-based approaches. These models, exemplified by GPT-4, can seamlessly process diverse data types, including images and text, leading to a more holistic understanding of information. MLLMs address the limitations of pure text models by integrating various modalities, showcasing human-level performance in benchmark tests.

Building the RAG pipeline

We plan to create the RAG pipeline, which involves embedding images and texts using CLIP. Subsequently, we will store this embedded data in the ChromaDB vector database. Finally, we will leverage an MLLM from Hugging Face to run user chat sessions based on the retrieved information.

We will create a flower-expert chatbot using images from Kaggle and information from Wikipedia.

  1. Install needed packages
!pip install -q timm einops wikipedia chromadb open_clip_torch
!pip install -q transformers==4.36.0
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0

2. Preprocess the data. This step is up to you; I simply put the images and their matching text files together in one folder, roughly as sketched below.
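As a rough illustration only (the source path, the per-class file naming, and the number of images kept are my assumptions, not the article’s exact code), one way to build such a folder is to copy each flower class’s images into a single directory and save a Wikipedia article next to them, using the wikipedia package installed in step 1:

import os
import shutil
import wikipedia

SRC = '/kaggle/input/flowers/flowers'   # assumed layout of the Kaggle flowers dataset
DST = '/kaggle/working/all_data'
os.makedirs(DST, exist_ok=True)

for flower in sorted(os.listdir(SRC)):  # e.g. bellflower, rose, tulip ...
    # save the Wikipedia article text for this flower class as <flower>.txt
    with open(os.path.join(DST, f'{flower}.txt'), 'w') as f:
        f.write(wikipedia.page(flower).content)
    # copy a handful of images of this class into the same folder
    for image_name in os.listdir(os.path.join(SRC, flower))[:20]:
        shutil.copy(os.path.join(SRC, flower, image_name),
                    os.path.join(DST, f'{flower}_{image_name}'))

Note that ambiguous titles (for example "daisy" or "rose") can raise wikipedia.exceptions.DisambiguationError, so in practice you may need to pass a more specific page title per class.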

3. Create the vector database; feel free to utilize any tool, but I recommend using ChromaDB.

3.1 First, you need to choose the embedding function. I will use the default one and also show how to create a custom one.

import chromadb

from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader
from chromadb.config import Settings


client = chromadb.PersistentClient(path="DB")

embedding_function = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()  # required if you add or query images from URIs

Custom Embed function

from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # embed the documents (or images) here with your own model
        return embeddings
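For a concrete (hypothetical) example, here is what such a custom function could look like when wrapped around a sentence-transformers text encoder; the class name and model name are my own choices and assume the sentence-transformers package is available.

from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

class SentenceTransformerEmbeddingFunction(EmbeddingFunction):
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        # any encoder works here, as long as __call__ returns one vector per input
        self._model = SentenceTransformer(model_name)

    def __call__(self, input: Documents) -> Embeddings:
        # ChromaDB expects a plain list of float lists
        return self._model.encode(list(input)).tolist()

You could then pass an instance of this class as embedding_function= when creating a collection, exactly like the default one below.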

3.2 We will create two collections: one for texts and another for images.

collection_images = client.create_collection(
    name='multimodal_collection_images',
    embedding_function=embedding_function,
    data_loader=image_loader)

collection_text = client.create_collection(
    name='multimodal_collection_text',
    embedding_function=embedding_function,
)

# Get the Images
import os

IMAGE_FOLDER = '/kaggle/working/all_data'

image_uris = sorted([os.path.join(IMAGE_FOLDER, image_name) for image_name in os.listdir(IMAGE_FOLDER) if not image_name.endswith('.txt')])
ids = [str(i) for i in range(len(image_uris))]

collection_images.add(ids=ids, uris=image_uris)  # now we have the images collection

Since we use CLIP, we can retrieve images using text, like this:

from matplotlib import pyplot as plt

retrieved = collection_images.query(query_texts=["tulip"], include=['data'], n_results=3)
for img in retrieved['data'][0]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()
(There are more retrieved images; they are omitted here for space.)

Or using an image itself:

from PIL import Image
import numpy as np

query_image = np.array(Image.open(f"/kaggle/input/flowers/flowers/daisy/0.jpg"))
print("Query Image")
plt.imshow(query_image)
plt.axis('off')
plt.show()

print("Results")
retrieved = collection_images.query(query_images=[query_image], include=['data'], n_results=3)
for img in retrieved['data'][0][1:]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()

3.3 The Text Collection

# now the text DB
from chromadb.utils import embedding_functions
default_ef = embedding_functions.DefaultEmbeddingFunction()

text_pth = sorted([os.path.join(IMAGE_FOLDER, image_name) for image_name in os.listdir(IMAGE_FOLDER) if image_name.endswith('.txt')])

list_of_text = []
for text_path in text_pth:
    with open(text_path, 'r') as f:
        list_of_text.append(f.read())

ids_txt_list = ['id'+str(i) for i in range(len(list_of_text))]
ids_txt_list

collection_text.add(
    documents=list_of_text,
    ids=ids_txt_list
)

3.4 Retrieve text. Since this collection also uses CLIP embeddings, we can query it by text or by embeddings.

results = collection_text.query(
    query_texts=["What is the bellflower?"],
    n_results=1
)

results
{'ids': [['id0']],
'distances': [[0.6072186183744086]],
'metadatas': [[None]],
'embeddings': None,
'documents': [['Campanula () is the type genus of the Campanulaceae family of flowering plants. Campanula are commonly known as bellflowers and take both their common and scientific names from the bell-shaped flowers—campanula is Latin for "little bell".\nThe genus includes over 500 species and several subspecies, distributed across the temperate and subtropical regions of the Northern Hemisphere, with centers of diversity in the Mediterranean region, Balkans, Caucasus and mountains of western Asia. The range also extends into mountains in tropical regions of Asia and Africa.\nThe species include annual, biennial and perennial plants, and vary in habit from dwarf arctic and alpine species under 5 cm high, to large temperate grassland and woodland species growing to 2 metres (6 ft 7 in) tall.']],
'uris': None,
'data': None}

Or using embeddings:

query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
raw_image = Image.open(query_image)

doc = collection_text.query(
    query_embeddings=embedding_function(query_image),
    n_results=1,
)['documents'][0][0]

A rose is either a woody perennial flowering plant of the genus Rosa (), in the family Rosaceae (), or the flower it bears. There are over three hundred species and tens of thousands of cultivars. They form a group of plants that can be erect shrubs, climbing, or trailing, with stems that are often armed with sharp prickles. Their flowers vary in size and shape and are usually large and showy, in colours ranging from white through yellows and reds. Most species are native to Asia, with smaller numbers native to Europe, North America, and northwestern Africa. Species, cultivars and hybrids are all widely grown for their beauty and often are fragrant. Roses have acquired cultural significance in many societies. Rose plants range in size from compact, miniature roses, to climbers that can reach seven meters in height. Different species hybridize easily, and this has been used in the development of the wide range of garden roses.

4. Now we load the MLLM

I used a small one (LLaVA-3b); according to its repo, this is how to use it:

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
from modeling_llava import LlavaForConditionalGeneration
import torch

model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b")
model = model.to("cuda")
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)

5. Let’s use it

question = 'Answer with organized answers: What type of rose is in the picture? Mention some of its characteristics and how to take care of it?'

query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
raw_image = Image.open(query_image)

doc = collection_text.query(
    query_embeddings=embedding_function(query_image),
    n_results=1,
)['documents'][0][0]

plt.imshow(raw_image)
plt.show()
imgs = collection_images.query(query_uris=query_image, include=['data'], n_results=3)
for img in imgs['data'][0][1:]:
    plt.imshow(img)
    plt.axis("off")
    plt.show()

According to our input image, these are the most similar images we have,

and this is the document with the most relevant information for our question.

Now let's make the inputs ready for the model:

prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant is an exprt in flowers , and gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
{question} Use the following article as an answer source. Do not write outside its scope unless you find your answer better {article} if you thin your answer is better add it after document.<|im_end|>
<|im_start|>assistant
""".format(question='question', article=doc)
inputs = processor(prompt, raw_image, model, return_tensors='pt')

inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
from transformers import TextStreamer

streamer = TextStreamer(tokenizer)

output = model.generate(**inputs, max_new_tokens=300, do_sample=True, top_p=0.5, temperature=0.2, eos_token_id=tokenizer.eos_token_id, streamer=streamer)
print(tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", ""))

A beautiful dark purple rose is in full bloom against a white background. The rose has a velvety texture and is surrounded by green leaves.

The model's answer is not great here; you can try a different prompt or a different model.
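To wrap things up, here is a rough sketch (my own helper, not from the original notebook) that ties the retrieval and generation steps together. It reuses the objects defined above (embedding_function, collection_text, collection_images, processor, model, tokenizer) and assumes a PROMPT_TEMPLATE string holding the chat prompt from the previous step, with {question} and {article} left as placeholders instead of being formatted.

def ask_flower_bot(image_path, question, n_images=3, max_new_tokens=300):
    raw_image = Image.open(image_path)

    # retrieve the most relevant article and the most similar stored images
    article = collection_text.query(
        query_embeddings=embedding_function(image_path),
        n_results=1,
    )['documents'][0][0]
    similar_images = collection_images.query(
        query_uris=image_path, include=['data'], n_results=n_images)['data'][0]

    # build the chat prompt and run the MLLM on the image plus retrieved context
    prompt = PROMPT_TEMPLATE.format(question=question, article=article)
    inputs = processor(prompt, raw_image, model, return_tensors='pt')
    inputs['input_ids'] = inputs['input_ids'].to(model.device)
    inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True,
                            top_p=0.5, temperature=0.2, eos_token_id=tokenizer.eos_token_id)

    answer = tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", "")
    return answer, similar_images

# example call, reusing the rose image and question from step 5
# answer, imgs = ask_flower_bot(query_image, question)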

Follow Me and NADSOFT for more
