Chat with Document: Basics and Demonstrations

Data Mastery Series — Episode 20: The Chat with Document and Langchain Series (Part 1)

Donato_TH
Donato Story
Apr 27, 2024


Connect with me and follow our journey: Linkedin, Facebook

Imagine having a conversation with your documents! Chat with Document technology lets you do just that. It uses artificial intelligence (AI) to understand and answer your questions directly from your files. This saves you time searching through text and makes finding information much easier.

Introduction to Chat with Document

Chat with Document technology lets AI systems interact intelligently with text, pulling information directly from documents to answer user questions. This ability changes how we handle data, making it easier and faster to access information. Documents come in two main types:

  • Structured data: Organized formats like tables or spreadsheets, commonly found in Excel or CSV files.
  • Unstructured data: More freeform content like emails or Word documents.

Traditionally, structured data could be queried using tools like Python or SQL, but unstructured data was more challenging to handle. Now, with language models, we can quickly search and extract information from any type of document, improving everything from knowledge management to equipment maintenance.
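For instance, querying structured data has long been straightforward with SQL. Here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical equipment-maintenance table (the table and values are illustrative only):

```python
import sqlite3

# A tiny in-memory database standing in for a structured data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE maintenance (machine TEXT, last_service TEXT)")
conn.executemany("INSERT INTO maintenance VALUES (?, ?)",
                 [("pump-1", "2024-01-15"), ("pump-2", "2024-03-02")])

# Structured queries are precise and easy to express
row = conn.execute(
    "SELECT last_service FROM maintenance WHERE machine = 'pump-2'").fetchone()
print(row[0])  # 2024-03-02
```

Unstructured text offers no such query language, which is exactly the gap that language models fill.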

Introduction to LangChain

LangChain is a framework for developing applications powered by large language models (LLMs). It seamlessly integrates these models with a variety of data sources, enhancing the development of sophisticated and efficient applications. LangChain is highly flexible, allowing customization to meet specific needs and is continuously updated for optimal performance. Its vibrant community provides robust support. With LangChain, users can ask complex questions in natural language, connect to diverse data sources beyond local documents, and build responsive chatbots and virtual assistants to streamline tasks and improve user interaction.

Practical Demonstrations (Coding Time):

After understanding the basics, let’s see how these technologies work in action. We’ll start with a simple code example that demonstrates how Chat with Document can be used to interact with a sample document.

- Step 1: Environment Setup

First, we’ll set up our coding environment by mounting Google Drive and installing necessary libraries.

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Install required libraries
!pip install openai langchain langchain-community faiss-cpu docx2txt tiktoken

# Import necessary modules
from langchain_community.llms import OpenAI
from langchain_community.document_loaders import Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

- Step 2: Initialize the Language Model

Next, we’ll set up an LLM using the ChatGPT model from OpenAI, though this setup is adaptable to other models like those from Google. If you don’t already have an OpenAI API key, you can learn how to obtain one here.

# Set your OpenAI API key
openai_api_key = 'your-api-key-here'

# Initialize the language model
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

- Step 3: Loading Document

Figure 1: Simplified LangChain Framework (Focus on Document Loading)

For ease of understanding, this demonstration focuses on a Microsoft Word file stored in Google Drive. It presents a modified version of the classic fable “The Tortoise and the Hare.” The story retains its core elements but incorporates unique details for a fresh perspective: the setting is changed to “DC Forest,” and the characters’ names are altered to “Donato” (the tortoise) and “Piti” (the hare).

# Load document
loader = Docx2txtLoader('/content/drive/your-folder/Tortoise and the Hare.docx')
documents = loader.load()
print(f"You have {len(documents)} document")
print(f"You have {len(documents[0].page_content)} characters in that document")
# Output from above code
You have 1 document
You have 2559 characters in that document

The modified version of “The Tortoise and the Hare” is presented below:

# Access and print text content
text_content = documents[0].page_content
print(text_content)
# Output from above code

Tortoise and the Hare

Deep within a sun-dappled clearing of the DC Forest, lived a hare named Piti, famed for his lightning speed. He would streak past the other animals, a blur of brown fur that left them breathless in his wake. One crisp morning, Piti was bragging about his agility, puffing out his chest and flicking his tail with unconcealed pride.

"There's no creature in this entire forest faster than me!" he declared, his voice echoing through the trees.

A slow, rumbling voice came from behind a nearby thicket. It was Donato, a tortoise known for his steady pace and unwavering determination.

"Speed isn't everything, Piti," Donato rumbled. "Even the slowest can achieve victory, if they set their mind to it."

Piti burst into laughter. "You? Win a race against me? Donato, that's the most ludicrous notion I've ever heard!"

To everyone's surprise, Donato challenged Piti to a race. The other animals gathered around, buzzing with excitement at the prospect of such an unequal competition. Even the wise old owl hooted in amusement, his amber eyes twinkling with anticipation.

The race began. Piti shot off like a furry bullet, leaving Donato in a cloud of dust. The animals cheered for the hare, certain of his victory. But Piti, brimming with overconfidence, spotted a patch of wildflowers bursting with color. He darted off the track, unable to resist the temptation of a tasty treat.

Meanwhile, Donato plodded on steadily, never stopping, never wavering. He may have been slow, but his determination burned bright.

Back on the track, Piti, feeling sluggish from his snack, decided to take a nap under the shade of a towering oak. "Old Donato won't catch up to me anyway," he thought arrogantly.

He drifted off to sleep, picturing himself crossing the finish line first to a chorus of cheers. But time crawled by for the sleeping hare, while for Donato, it marched on relentlessly.

Donato, inch by inch, made his way towards the finish line. The animals, who had initially mocked him, now cheered him on, their voices echoing through the forest. They were impressed by his unwavering perseverance.

Finally, Donato, with a triumphant plod, crossed the finish line. Piti woke up with a start, his ears twitching in disbelief. He saw, to his utter humiliation, the crowd celebrating Donato's victory.

The hare had lost the race, not to speed, but to slow and steady determination. The cheers of the animals resonated through the DC Forest, a testament to the fact that slow and steady truly does win the race.

Note: LangChain is not limited to processing documents stored in Google Drive. It offers broad flexibility in handling various document types and data sources. We’ll explore this further in a future episode. For further details, please refer to the LangChain website (Document loaders).

- Step 4: Split Document into Smaller Chunks

To manage large texts more effectively, we split the document into manageable parts.

# Split document into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_documents = text_splitter.split_documents(documents)

For demonstration purposes, we’re using a chunk size of 1000 and an overlap of 100. Optimal values for these settings vary based on the specific requirements of your application and the nature of the documents. However, this demonstration utilizes these convenient values for clarity.

  • Chunk Size: The maximum length (in characters) of each resulting chunk
  • Chunk Overlap: The number of characters that consecutive chunks share

Given that our document has 2559 characters (as observed in Step 3), splitting with a chunk size of 1000 and an overlap of 100 should yield roughly three overlapping chunks. In an idealized fixed-window split, the first chunk covers characters 0–1000, the second 900–1900 (note the overlap), and the third 1800 to the end of the document. In practice, RecursiveCharacterTextSplitter prefers to break at natural boundaries such as paragraphs, so the actual chunks come out slightly shorter.
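The sliding-window arithmetic behind those boundaries can be sketched in a few lines. This is a simplified stand-in only; the real RecursiveCharacterTextSplitter additionally prefers natural breakpoints such as newlines and sentence ends:

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Naive fixed-window split: each chunk starts chunk_size - chunk_overlap
    characters after the previous one (unlike LangChain's splitter, this
    ignores sentence and paragraph boundaries)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_chunks("x" * 2559)  # same length as our document
print([len(c) for c in chunks])      # [1000, 1000, 759]
```

The window starts at positions 0, 900, and 1800, reproducing the overlap pattern described above.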

Let’s validate this by calculating the average number of characters per chunk:

# Calculate average characters per document
total_characters = sum([len(doc.page_content) for doc in split_documents])
average_characters = total_characters / len(split_documents)
print(f"There are {len(split_documents)} documents with an average of {average_characters:,.0f} characters each.")
# Output from above code
There are 3 documents with an average of 850 characters each.

Note: For further details, please refer to the LangChain website (Text Splitters). To get a clearer picture of how chunk size and chunk overlap interact, the ChunkViz v0.1 website is very useful.

- Step 5: Set Up Embeddings Engine

Embeddings convert text into numerical vectors, making it possible for AI models to understand and process language. They capture semantic meanings and relationships between words. For further details, please refer to the LangChain website (Text embedding models).

# Set up embeddings engine
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Create a vector store for document search
doc_search = FAISS.from_documents(split_documents, embeddings)
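To build intuition for what the vector store compares, here is a toy cosine-similarity calculation over made-up 3-dimensional vectors (real OpenAI embeddings have on the order of 1536 dimensions; the vectors and words below are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made stand-ins for embedding vectors
tortoise = [0.9, 0.1, 0.2]
turtle   = [0.8, 0.2, 0.1]
rocket   = [0.1, 0.9, 0.8]

print(cosine_similarity(tortoise, turtle))  # high: related meanings
print(cosine_similarity(tortoise, rocket))  # low: unrelated meanings
```

FAISS performs this kind of similarity search efficiently over thousands of high-dimensional vectors, which is how the retriever later finds the chunks most relevant to a question.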

- Step 6: Querying the Document

Now, let’s ask the AI some questions about the document to see how well it can retrieve information based on our setup.

# Initialize Retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=doc_search.as_retriever())

# Define and execute a query about the story
query = "Who won the race? What type of animal won? What is the name of the winner and why did they win?"
qa_result = qa_chain.run(query)
qa_result
# Output from above code
Donato, a tortoise, won the race because of his slow and steady determination.
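Conceptually, the "stuff" chain type does two things: retrieve the most relevant chunks, then stuff them into a single prompt for the LLM. Here is a toy sketch of that flow, with a trivial word-overlap scorer standing in for the vector search (all functions and texts below are illustrative, not LangChain internals):

```python
def retrieve(query, chunks, k=2):
    """Rank chunks by naive word overlap with the query
    (a crude stand-in for embedding similarity search)."""
    q_words = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

def stuff_prompt(query, retrieved):
    """'stuff' chain type: concatenate retrieved chunks into one prompt."""
    context = "\n---\n".join(retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = ["Donato the tortoise crossed the finish line first.",
          "Piti the hare took a nap under an oak tree.",
          "The DC Forest was full of wildflowers."]
top = retrieve("Who won the race to the finish line?", chunks)
print(stuff_prompt("Who won the race?", top))
```

The real chain sends the stuffed prompt to the LLM, which is why the answer above cites details straight from the story.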

Note:

Here are some key resources that I found helpful in learning about Chat with Document.

Thank you for joining me in this exploration of Chat with Document and LangChain technologies. Throughout this episode, we’ve seen how these tools can transform our interaction with data, making it more intuitive and accessible. We hope these demonstrations inspire you to consider how you might integrate these technologies into your own data management and analysis processes.


Thank you for joining me in this exploration. Your engagement and experiences enrich our journey. Please feel free to share your thoughts, questions, or insights below, or connect with me on

Medium: medium.com/donato-story
Facebook: web.facebook.com/DonatoStory
Linkedin: linkedin.com/in/nattapong-thanngam


Donato_TH
Donato Story

Data Science Team Lead at Data Cafe, Project Manager (PMP #3563199), Black Belt-Lean Six Sigma certificate