Chat with Document: A Closer Look at Splitting, Embeddings, and RAG

Data Mastery Series — Episode 21: The Chat with Document and Langchain Series (Part 2)

Donato_TH
Donato Story
5 min read · May 1, 2024


Connect with me and follow our journey: LinkedIn, Facebook

In the first part of our series, we introduced the basics of Chat with Document and Langchain technologies. Now, we delve deeper into the technical aspects of document splitting, embeddings, similarity searches, and the role of Large Language Models (LLMs) in enhancing document interaction.
If you haven’t yet explored part one, we highly recommend starting there for foundational insights. You can find it here: “Chat with Document: Basics and Demonstrations.”

Understanding Document Splitting

Document splitting is crucial for efficiently managing long texts. It involves dividing large documents into smaller, manageable sections, which is essential for processing substantial volumes of text and for staying within the token limits of language models such as OpenAI’s GPT-3.5 or Google’s Bison.

Key Considerations for Effective Document Splitting:

  • Document Structure: Ensure each split respects the document’s inherent structure, such as chapters or sections, to maintain content coherence. Avoid splitting through structured data like tables.
  • Chunk Size: Define the maximum size of each chunk, either in characters or tokens, considering the token limits of the LLM in use.
  • Chunk Overlap: Implement overlapping sections to ensure continuity and preserve context at boundaries.
  • Character Set Customization: Advanced splitters such as the RecursiveCharacterTextSplitter allow you to customize the characters used for splitting (such as spaces, punctuation, and special characters), so that splits fall at natural boundaries.

For this demonstration, we are using a straightforward text splitter set to 300 characters with a 30-character overlap, without customization. This setup will help illustrate potential issues when text is not appropriately split, such as incomplete words or disrupted content flow.
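A minimal sketch of this setup, assuming LangChain’s CharacterTextSplitter and a document_text variable holding the loaded document (both names are illustrative):

# Code (illustrative sketch, assuming LangChain's CharacterTextSplitter)
from langchain.text_splitter import CharacterTextSplitter

# document_text is assumed to hold the full text of the loaded document
text_splitter = CharacterTextSplitter(
    separator="",       # no custom separator: split purely by character count
    chunk_size=300,     # maximum characters per chunk
    chunk_overlap=30)   # characters shared between consecutive chunks
chunks = text_splitter.split_text(document_text)
print("Number of chunks:", len(chunks))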

Figure: Results of Simple Text Splitter (Image by Author)

Note: For more detailed information about text splitters, please refer to the LangChain Text Splitters documentation.

Understanding Text Embeddings

Text embeddings convert written content into numerical vectors, enabling algorithms to process and analyze text. This capability underpins applications such as semantic search, which identifies texts with similar meanings within a vector space.

For this demonstration, we’ll use OpenAIEmbeddings. Below are the results of this embedding process: the first 10 values of the first embedding vector, along with a table summarizing the text, embedding results, and embedding dimensions.

Figure: First 10 Values of the First Embedding and Summary Table (Image by Author)
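A rough sketch of the embedding step, assuming LangChain’s OpenAIEmbeddings wrapper and the chunks list from the splitting step:

# Code (illustrative sketch, assuming OpenAIEmbeddings and the chunks from the splitting step)
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(openai_api_key=openai_api_key)
chunk_vectors = embeddings_model.embed_documents(chunks)

# Each chunk becomes one vector; inspect the first 10 values of the first embedding
print("First 10 values:", chunk_vectors[0][:10])
print("Embedding dimension:", len(chunk_vectors[0]))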

For an in-depth exploration of text embeddings and the models used, please consult the LangChain documentation on “Text Embedding Models” and “Embedding Models.”

Understanding Retrieval-Augmented Generation (RAG)

Having completed the document loading, splitting, and embedding phases and stored the data in a vector database, we now shift our focus to Retrieval-Augmented Generation (RAG).

Figure: Ingestion Phase (Image by Google)

This technique enhances response generation by incorporating relevant context from the database.

1. Retrieval Process:

Retrieval is the first critical step in the RAG process. It involves identifying relevant sections of text that can answer a specific query.

  • Query Embedding: Converts a search query into a vector, using the same text embedding model as for document embedding, ensuring compatibility in vector space. To enhance clarity, I include both the text and the embedding result in the final row (row 10) of the summary table, as depicted in the figure below.
Figure: Updated Summary Table (Including Query in Last Row) (Image by Author)
  • Search for Similarity: We conduct a similarity search to identify the document chunks most relevant to the query. For this demo, we set the top K to 3; the resulting top K includes chunk numbers 4, 0, and 8. A minimal sketch of the query embedding and similarity search appears after this list.
Figure: Summary Table with Distance (Image by Author)
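The sketch below illustrates both steps, assuming the embeddings_model and chunk_vectors from the embedding step and a brute-force cosine similarity in NumPy; a real vector store would typically perform this search for you.

# Code (illustrative sketch: query embedding plus a brute-force cosine similarity search)
import numpy as np

question = "Who won the race? What type of animal won? What is the name of the winner and why did they win?"

# Embed the query with the same model used for the document chunks
query_vector = np.array(embeddings_model.embed_query(question))

# Cosine similarity between the query vector and every chunk vector
chunk_matrix = np.array(chunk_vectors)
similarities = chunk_matrix @ query_vector / (
    np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query_vector))

# Top K = 3: indices of the best-matching chunks, most similar first
top_k = np.argsort(similarities)[::-1][:3]
print("Top chunks:", top_k)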

2. Augmentation Phase:

The augmentation phase involves constructing a prompt that combines the context from the retrieved chunks with the original query, so the LLM can generate a coherent, grounded response.

Example Code for Augmentation:

# Code
# question was defined in the retrieval step; context is assumed to be the
# top K retrieved chunks joined into a single string, e.g.:
# context = " ".join(chunks[i] for i in top_k)
prompt = f"Based on the context: {context} Answer the question: {question}"
print("Prompt for LLM:", prompt)
# Output
Prompt for LLM: Based on the context: nkling with anticipation.

The race began. Piti shot off like a furry bullet, leaving Donato in a cloud of dust. The animals cheered for the hare, certain of his victory. But Piti, brimming with overconfidence, spotted a patch of wildflowers bursting with color. He darted off the track, unable to Tortoise and the Hare

Deep within a sun-dappled clearing of the DC Forest, lived a hare named Piti, famed for his lightning speed. He would streak past the other animals, a blur of brown fur that left them breathless in his wake. One crisp morning, Piti was bragging about his agility, puffing out ato, with a triumphant plod, crossed the finish line. Piti woke up with a start, his ears twitching in disbelief. He saw, to his utter humiliation, the crowd celebrating Donato's victory.

The hare had lost the race, not to speed, but to slow and steady determination. The cheers of the animals res Answer the question: Who won the race? What type of animal won? What is the name of the winner and why did they win?

3. Generation Phase:

Here, the LLM model utilizes the augmented prompt to produce a response based on the provided context.

# Code
from langchain_openai import OpenAI  # assumed import; adjust to your LangChain version

llm = OpenAI(model="gpt-3.5-turbo-instruct",
             max_tokens=200,
             temperature=0.7,
             top_p=0.75,
             openai_api_key=openai_api_key)

# Generate and print the response from the language model
response_text = llm(prompt).strip()  # remove surrounding whitespace from the completion
print("LLM Response:", response_text)
# Output
LLM Response: Donato, the tortoise, won the race. He won because of his slow and steady determination, while Piti, the hare, lost due to his overconfidence and distraction.

Thank you for joining me in this deep dive into the detailed functionalities of Chat with Document and LangChain technologies. Throughout this series, we’ve explored the processes of document splitting, text embeddings, and Retrieval-Augmented Generation, demonstrating how they enhance our ability to interact with documents. There are many aspects we haven’t yet covered, such as different embedding models, the nuances of LLMs, temperature settings, and top_p configurations, which I plan to address in future episodes. I hope you’ll continue to follow this series.


As we conclude this episode, your insights and feedback are invaluable in enriching our discussion and journey. Please feel free to share your thoughts, questions, or insights below, or connect with me on:

Medium: medium.com/donato-story
Facebook: web.facebook.com/DonatoStory
LinkedIn: linkedin.com/in/nattapong-thanngam
