Implementing RAG with LangChain and Hugging Face

Akriti Upadhyay
9 min read · Oct 16, 2023


RAG Workflow

Introduction

Retrieval Augmented Generation (RAG) is a pattern that works with pretrained Large Language Models (LLM) and your own data to generate responses.

It combines the powers of pretrained dense retrieval and sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate outputs.
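At a high level, the pipeline we will build looks roughly like this. This is a minimal sketch with hypothetical retrieve and generate helpers (not the actual LangChain API), and it simply concatenates the retrieved text instead of marginalizing over documents:

# Minimal RAG sketch (hypothetical helpers, for intuition only)
def rag_answer(question, retrieve, generate, k=4):
    # 1. Retrieve the k documents most similar to the question
    docs = retrieve(question, k=k)
    # 2. Concatenate the retrieved text into a single context
    context = "\n\n".join(doc.page_content for doc in docs)
    # 3. Generate an answer conditioned on the question and the retrieved context
    return generate(question=question, context=context)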

In continuation of my previous article, which walked you through the theory behind RAG, I am here to introduce the implementation of RAG in code.

Let’s get started with the implementation of RAG using LangChain and Hugging Face!

The Libraries

Before getting started, install the libraries that we are going to need for this implementation.

!pip install -q langchain
!pip install -q torch
!pip install -q transformers
!pip install -q sentence-transformers
!pip install -q datasets
!pip install -q faiss-cpu

Next, import the libraries we are going to use in this implementation.

from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA

You will see why we install and import each of these libraries as we work through the implementation step by step.

Document Loading

The LangChain documentation lists document loaders for all the sources from which you might want to load data.

Document Loaders

But here, I am going to use the Hugging Face dataset databricks-dolly-15k. This is an open-source dataset of instruction-following records generated by Databricks, covering brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

The behavioral categories are outlined in the InstructGPT paper.

Document loaders provide a “load” method to read data from a configured source into memory as documents. Load the data from Hugging Face:

# Specify the dataset name and the column containing the content
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context" # or any other column you're interested in

# Create a loader instance
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

# Load the data
data = loader.load()

# Display the first two entries
data[:2]

Document Transformers

Once the data is loaded, you can transform it to suit your application, or fetch only the relevant parts of the documents. In practice, this means splitting long documents into smaller chunks that fit your model and produce accurate, clear results.

There are several “Text Splitters” in LangChain; choose the one that fits your use case. I chose “RecursiveCharacterTextSplitter”. This text splitter is recommended for generic text. It is parameterized by a list of characters and splits long texts recursively until the chunks are small enough.


# Create an instance of the RecursiveCharacterTextSplitter class with specific parameters.
# It splits text into chunks of 1000 characters each with a 150-character overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# 'data' holds the text you want to split, split the text into documents using the text splitter.
docs = text_splitter.split_documents(data)
docs[0]

The result will be:

Document(page_content="Virgin Australia, the trading name of Virgin Australia
Airlines Pty Ltd, is an Australian-based airline. It is the largest airline
by fleet size to use the Virgin brand. It commenced services on 31 August 2000
as Virgin Blue, with two aircraft on a single route.
It suddenly found itself as a major airline in Australia's domestic market
after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia,
from hubs in Brisbane, Melbourne and Sydney.",
metadata={'instruction': 'When did Virgin Australia start operating?',
'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin
Blue, with two aircraft on a single route.',
'category': 'closed_qa'})

Text Embedding

Embeddings capture the semantic meaning of text, which allows you to quickly and efficiently find other pieces of text that are similar.

The Embeddings class of LangChain is designed for interfacing with text embedding models. You can use any of them; here I have used “HuggingFaceEmbeddings”.


# Define the path to the pre-trained model you want to use
modelPath = "sentence-transformers/all-MiniLM-L6-v2"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,        # Provide the pre-trained model's path
    model_kwargs=model_kwargs,   # Pass the model configuration options
    encode_kwargs=encode_kwargs  # Pass the encoding options
)
text = "This is a test document."
query_result = embeddings.embed_query(text)
query_result[:3]

The first three values of the resulting vector will be:

[-0.038338545709848404, 0.1234646886587143, -0.02864295244216919]
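As a quick sanity check (not part of the main pipeline), you can embed two related sentences and compare them with cosine similarity to see that semantically similar text ends up with similar vectors:

import numpy as np

# Embed two semantically related sentences
v1 = np.array(embeddings.embed_query("The cat sits on the mat."))
v2 = np.array(embeddings.embed_query("A cat is resting on a rug."))

# Cosine similarity close to 1.0 means the sentences are semantically similar
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)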

Vector Stores

We need a database to store these embeddings and search them efficiently, and that is what vector stores are for. Given a query, a vector store retrieves the embedding vectors that are “most similar”; in other words, it does the vector search for you. There are many vector stores integrated with LangChain, but here I have used the “FAISS” vector store.

db = FAISS.from_documents(docs, embeddings)

How long the above code runs depends on the size of your dataset. Mine took 5 minutes and 32 seconds.

With the “SQuAD” dataset, it took 65 minutes. See the difference!

Now, search for your question.

question = "What is cheesemaking?"
searchDocs = db.similarity_search(question)
print(searchDocs[0].page_content)

The results will be:

The goal of cheese making is to control the spoiling of milk into cheese. 
The milk is traditionally from a cow, goat, sheep or buffalo, although,
in theory, cheese could be made from the milk of any mammal.
Cow's milk is most commonly used worldwide.
The cheesemaker's goal is a consistent product with specific characteristics
(appearance, aroma, taste, texture). The process used to make a Camembert will
be similar to, but not quite the same as, that used to make Cheddar.

Some cheeses may be deliberately left to ferment from naturally airborne
spores and bacteria; this approach generally leads to a less consistent
product but one that is valuable in a niche market.

That is the most similar search result!
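If you also want to see how close each match is, the FAISS vector store provides similarity_search_with_score, which returns (document, distance) pairs, where a lower distance means a closer match:

# Retrieve documents together with their distance scores (lower = more similar)
results_with_scores = db.similarity_search_with_score(question, k=3)
for doc, score in results_with_scores:
    print(score, doc.page_content[:80])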

Preparing the LLM Model

You can choose any model from Hugging Face. Start with a tokenizer to preprocess the text and a question-answering model to provide answers based on the input text and question.

I used Intel/dynamic_tinybert, a model fine-tuned for question answering.


# Create a tokenizer object by loading the pretrained "Intel/dynamic_tinybert" tokenizer.
tokenizer = AutoTokenizer.from_pretrained("Intel/dynamic_tinybert")

# Create a question-answering model object by loading the pretrained "Intel/dynamic_tinybert" model.
model = AutoModelForQuestionAnswering.from_pretrained("Intel/dynamic_tinybert")

Create a question-answering pipeline using your pre-trained model and tokenizer and then extend its functionality by creating a LangChain pipeline with additional model-specific arguments.

# Specify the model name you want to use
model_name = "Intel/dynamic_tinybert"

# Load the tokenizer associated with the specified model
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)

# Define a question-answering pipeline using the model and tokenizer
question_answerer = pipeline(
    "question-answering",
    model=model_name,
    tokenizer=tokenizer,
    return_tensors='pt'
)

# Create an instance of the HuggingFacePipeline, which wraps the question-answering pipeline
# with additional model-specific arguments (temperature and max_length)
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={"temperature": 0.7, "max_length": 512},
)

Retrievers

Once the data is in the database, the LLM is prepared, and the pipeline is created, we need to retrieve the data. A retriever is an interface that returns documents given an unstructured query.

A retriever does not store documents; it only returns (retrieves) them. Vector stores form the backbone of retrievers. There are many retriever algorithms in LangChain.

# Create a retriever object from the 'db' using the 'as_retriever' method.
# The retriever fetches the documents most relevant to a query from the vector store.
retriever = db.as_retriever()

Search for documents relevant to the question:

docs = retriever.get_relevant_documents("What is Cheesemaking?")
print(docs[0].page_content)

The results will be:

The goal of cheese making is to control the spoiling of milk into cheese. 
The milk is traditionally from a cow, goat, sheep or buffalo, although,
in theory, cheese could be made from the milk of any mammal. Cow's milk
is most commonly used worldwide. The cheesemaker's goal is a consistent
product with specific characteristics (appearance, aroma, taste, texture).
The process used to make a Camembert will be similar to, but not quite the
same as, that used to make Cheddar.

Some cheeses may be deliberately left to ferment from naturally airborne
spores and bacteria; this approach generally leads to a less consistent
product but one that is valuable in a niche market.

It is the same as the result we obtained from the similarity search.

Retrieval QA Chain

Now, we’re going to use a RetrievalQA chain to find the answer to a question. To do this, we prepared our LLM model with “temperature = 0.7” and “max_length = 512”. You can set the temperature to whatever you prefer.

The RetrievalQA chain combines question-answering with a retrieval step. To create it, we use a language model and a vector store as the retriever. By default, all the retrieved data is put into the prompt in a single batch, which corresponds to the “stuff” chain type. If there is too much information to fit into the prompt at once, we can instead use chain types like “map_reduce”, “refine”, or “map_rerank” (a minimal “stuff” example follows the code below).


# Create a retriever object from the 'db' with a search configuration where it retrieves up to 4 relevant splits/documents.
retriever = db.as_retriever(search_kwargs={"k": 4})

# Create a question-answering instance (qa) using the RetrievalQA class.
# It's configured with a language model (llm), a chain type "refine," the retriever we created, and an option to not return source documents.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever, return_source_documents=False)
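If all the retrieved context fits into a single prompt, you could instead keep the default “stuff” chain type, which simply concatenates the retrieved documents into one call. We continue with the “refine” chain (qa) below.

# Alternative sketch: the default "stuff" chain type packs all retrieved documents into one prompt
qa_stuff = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)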

Finally, we call this QA chain with the question we want to ask.

question = "Who is Thomas Jefferson?"
result = qa({"query": question})
print(result["result"])

The result will be:

Thomas Jefferson (April 13, 1743 – July 4, 1826) was an American statesman, 
diplomat, lawyer, architect, philosopher, and Founding Father who served as
the third president of the United States from 1801 to 1809.
Among the Committee of Five charged by the Second Continental Congress with
authoring the Declaration of Independence, Jefferson was the Declaration's
primary author. Following the American Revolutionary War and prior to becoming
the nation's third president in 1801, Jefferson was the first United States
secretary of state under George Washington and then the nation's second vice
president under John Adams.

Challenges with Open Source Models

When you run your query, you will get the appropriate results, but also a ValueError asking you to pass the input as a dictionary, even though the final code already passes the input as a dictionary.

Following the same methods and processes with the OpenAI API, you will not get an error like this. For more information about how to use OpenAI with LangChain, see this cookbook.
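One way around this, as a rough sketch assuming you only need an extractive answer, is to skip the HuggingFacePipeline wrapper and call the Hugging Face question-answering pipeline directly on the retrieved context:

# Workaround sketch: call the QA pipeline directly with the retrieved documents as context
question = "Who is Thomas Jefferson?"
docs = retriever.get_relevant_documents(question)
context = " ".join(doc.page_content for doc in docs)
answer = question_answerer(question=question, context=context)
print(answer["answer"])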

Conclusion

Despite the challenges of working with open-source libraries, it was fun to get to know RAG by implementing it in code.

LangChain is an open-source developer framework for building large language model (LLM) applications.

It’s particularly useful when you want to ask questions about specific documents (e.g., PDFs, videos, etc) or chat with your own data.

Hugging Face is easiest for those who use it regularly, but it is user-friendly overall. There is also a ready-made RAG model available on the Hugging Face Model Hub.
