Retrieval in LangChain: Part 1 — Document Loaders

Sushmitha
4 min read · Mar 9, 2024


In this new series, we will explore Retrieval in LangChain: the interface for connecting LLMs to application-specific data.

What is RAG?

RAG (Retrieval-Augmented Generation) is a framework that improves an LLM's performance by feeding it facts from external sources, letting it draw on information beyond its training data when generating responses. This makes the generated responses more reliable and helps reduce hallucinations. With RAG, the LLM can connect directly to up-to-date information sources and surface the latest information to the user.

How does RAG work?

  1. The external data is embedded and stored in a vector database.
  2. The user query is converted to a vector representation and matched against the vector database.
  3. The RAG pipeline then augments the user input with the most relevant retrieved content.
  4. The augmented prompt allows the LLM to generate accurate responses to the user's queries.
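The four steps above can be sketched with a toy retriever. This is a minimal illustration only: it uses bag-of-words cosine similarity in place of a real embedding model and vector database, and the corpus, query, and prompt template are made up for the example.

```python
from collections import Counter
from math import sqrt

# Toy "vector database": each document is embedded as a bag-of-words vector.
corpus = [
    "The Eiffel Tower is located in Paris, France.",
    "Python is a popular programming language for data science.",
    "RAG combines retrieval with generation to ground LLM answers.",
]

def embed(text):
    # Stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    # Step 2: embed the query and match it against the store.
    q = embed(query)
    scored = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:k]

# Step 3: augment the user input with the retrieved content.
query = "Where is the Eiffel Tower?"
context = retrieve(query)[0]
augmented_prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
# Step 4: the augmented prompt would now be sent to the LLM.
print(augmented_prompt)
```

A real pipeline swaps `embed` for an embedding model and `corpus` for a vector store, but the data flow is the same.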

RAG has various components, from loading the data to retrieving the relevant information. In this blog, we will explore the first component in the retrieval process: Document Loaders.

The very first step of retrieval is to load the external information/source, which can be either structured or unstructured. LangChain provides the user with loaders for many formats: TXT, JSON, CSV, HTML, PDF, public websites, etc. Let's see how the loaders work.

  1. CSV Loader:
from langchain.document_loaders import CSVLoader

loader = CSVLoader(file_path='../datasets/sns_datasets/titanic.csv') # nothing is read until .load(); use .lazy_load() to stream rows
data = loader.load()

data
> [Document(page_content='survived: 0\npclass: 3\nsex: male\nage: 22.0\nsibsp: 1\nparch: 0\nfare: 7.25\nembarked: S\nclass: Third\nwho: man\nadult_male: True\ndeck: \nembark_town: Southampton\nalive: no\nalone: False', metadata={'source': '../datasets/sns_datasets/titanic.csv', 'row': 0}), Document(page_content='survived: 1\npclass: 1\nsex: female\nage: 38.0\nsibsp: 1\nparch: 0\nfare: 71.2833\nembarked: C\nclass: First\nwho: woman\nadult_male: False\ndeck: C\nembark_town: Cherbourg\nalive: yes\nalone: False', metadata={'source': '../datasets/sns_datasets/titanic.csv', 'row': 1}), ...]

The loader creates a separate document for each of the rows in the CSV.

# Getting the content of the document
print(data[0].page_content)

# Getting the metadata
print(data[0].metadata)
# content of the document
survived: 0
pclass: 3
sex: male
age: 22.0
sibsp: 1
parch: 0
fare: 7.25
embarked: S
class: Third
who: man
adult_male: True
deck:
embark_town: Southampton
alive: no
alone: False

#metadata information
> {'source': '../datasets/sns_datasets/titanic.csv', 'row': 0}

Using the source_column argument, the user can specify a column whose value is used as each document's source in the metadata, instead of the file path.
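As a rough sketch of what the CSV loader does under the hood (standard library only, with a made-up two-row dataset), each row is flattened into "column: value" lines, and a source_column swaps the file path for that column's value in the metadata:

```python
import csv
import io

def load_csv(text, source="inline.csv", source_column=None):
    """Mimic CSVLoader's behavior: one document dict per row."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        # Each row becomes "column: value" lines, like the titanic output above.
        page_content = "\n".join(f"{k}: {v}" for k, v in row.items())
        # source_column overrides the file path as the document's source.
        src = row[source_column] if source_column else source
        docs.append({"page_content": page_content,
                     "metadata": {"source": src, "row": i}})
    return docs

sample = "name,fare\nAlice,7.25\nBob,71.28\n"
docs = load_csv(sample, source_column="name")
print(docs[0]["page_content"])  # name: Alice\nfare: 7.25
print(docs[0]["metadata"])      # {'source': 'Alice', 'row': 0}
```

The dataset and the `load_csv` helper are invented for illustration; the real CSVLoader returns `Document` objects rather than dicts.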

2. HTML Loader:

from langchain.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader('../datasets/harry_potter_html/001.htm')
data = loader.load()
data
> [Document(page_content='A Day of Very Low Probability\n\nBeneath the moonlight glints a tiny fragment of silver, a fraction of a line…\n\n ...

3. Markdown Loader:

from langchain.document_loaders import UnstructuredMarkdownLoader

md_filepath = "../datasets/harry_potter_md/001.md"

loader = UnstructuredMarkdownLoader(file_path=md_filepath)
data = loader.load()
data
  > [Document(page_content='A Day of Very Low Probability\n\nBeneath the moonlight glints a tiny fragment of silver ...

4. PDF Loader:

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader('../datasets/harry_potter_pdf/hpmor-trade-classic.pdf')
data = loader.load()
data
> [Document(page_content='Harry Potter and the Methods of Rationality', metadata={'source': '../datasets/harry_potter_pdf/hpmor-trade-classic.pdf', 'page': 0}),
Document(page_content='', metadata={'source': '../datasets/harry_potter_pdf/hpmor-trade-classic.pdf', 'page': 1}), ...

Besides PyPDFLoader, LangChain offers alternative PDF loaders such as PyPDFium2Loader, PDFMinerLoader, PDFMinerPDFasHTMLLoader, PyMuPDFLoader, and PyPDFDirectoryLoader, each backed by a different parsing library.

5. Wikipedia Loader:

from langchain.document_loaders import WikipediaLoader

loader = WikipediaLoader(query='India', load_max_docs=1)
data = loader.load()
data
> [Document(page_content="India, officially the Republic of India (ISO: Bhārat Gaṇarājya), ...

6. ArXiv Loader:

from langchain_community.document_loaders import ArxivLoader

loader = ArxivLoader(query='2201.03916', load_max_docs=1) # AutoRL paper (article ID -> 2201.03916)
data = loader.load()
data
> [Document(page_content='Journal of Artificial Intelligence Research 74 (2022) ...

7. YouTube Loader (downloads the video's audio and transcribes it with OpenAI Whisper):

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"
loader = GenericLoader(YoutubeAudioLoader([url], save_dir),
                       OpenAIWhisperParser())

data = loader.load()
data[0]

Let’s pass the retrieved information (the ArXiv paper loaded above) to the LLM:

from langchain_openai import ChatOpenAI
from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache

chat = ChatOpenAI()
set_llm_cache(InMemoryCache())  # cache responses so repeated identical calls skip the API
# Setting up the prompt templates

from langchain.prompts.chat import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate

system_template = "You are Peer Reviewer"
human_template = "Read the paper with the title: '{title}'\n\nAnd Content: {content} and critically list down all the issues in the paper"

system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

chat_prompt = ChatPromptTemplate.from_messages(messages=[system_message_prompt, human_message_prompt])
prompt = chat_prompt.format_prompt(title=data[0].metadata['Title'], content=data[0].page_content)
messages = prompt.to_messages()

response = chat(messages=messages)

print(response.content)

Creating a bot that can answer questions based on Wikipedia articles

def qna_article(topic, question):
    chat = ChatOpenAI(max_tokens=500)
    loader = WikipediaLoader(query=topic, load_max_docs=1)
    data = loader.load()
    first_record = data[0]
    title = first_record.metadata['title']
    summary = first_record.metadata['summary']

    human_template = "Read the article with the title: '{title}'\n\nAnd Content: {content} and answer the question {user_question} related to the article"

    human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
    chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])
    prompt = chat_prompt.format_prompt(title=title, content=summary, user_question=question)

    response = chat(messages=prompt.to_messages())
    return response.content
qna_article('India', 'How many languages are being spoken in India?')
> 'India is a multilingual country with a diverse linguistic landscape. There are 22 officially recognized languages in India, as listed in the Eighth Schedule of the Indian Constitution. In addition to these official languages, there are hundreds of other languages spoken by different communities across the country.'

That's all about document loaders.

Thanks for reading.

Reference: https://python.langchain.com/docs/modules/data_connection/document_loaders/
