Mastering RAG: A Deep Dive into Text Splitting

Shravan Kumar
8 min read · Aug 18, 2024


In this article we will take a deep dive into the topics that are essential for a successful RAG implementation. Here is a sample RAG architecture.

Sample RAG architecture (image credit: Dipanjan)

Let us start with a concept called Text Splitting.

In a RAG pipeline, the input data is loaded and transformed into chunks using some text splitting method. Let us walk through different text splitting methods, build a separate vector store for each one, and compare them.

At a high level, text splitters work as follows:

  1. Split the text up into small, semantically meaningful chunks (often sentences).
  2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
  3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter (see the short sketch after this list):

  1. How the text is split
  2. How the chunk size is measured
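
To make these two axes concrete, here is a minimal, self-contained sketch (the sample text and parameter values are illustrative only): the separators argument controls how the text is split, while chunk_size together with length_function controls how chunk size is measured.

from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = (
    "Two households, both alike in dignity,\n"
    "In fair Verona, where we lay our scene.\n\n"
    "From ancient grudge break to new mutiny,\n"
    "Where civil blood makes civil hands unclean."
)

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # axis 1: how the text is split
    chunk_size=80,                       # axis 2: the target chunk size...
    chunk_overlap=20,
    length_function=len,                 # ...measured here in characters
)

for i, chunk in enumerate(splitter.split_text(sample_text), 1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk!r}")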

Practical code example with RAG

Import Libraries:

import os

from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
    TextSplitter,
    TokenTextSplitter,
)
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
  • This section imports the necessary modules and classes for the script: text splitters, a document loader, a vector store, and embeddings. The langchain, langchain_community, and langchain_openai packages provide the splitters, the loader and vector store, and the OpenAI embeddings, respectively. Together they are used to load documents, split them into manageable chunks, and create embeddings.

Directory Setup:

# Define the directory containing the text file
current_dir = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(current_dir, "books", "romeo_and_juliet.txt")
db_dir = os.path.join(current_dir, "db")
  • The code defines the current directory of the script, the path to the text file (romeo_and_juliet.txt), and the directory where the vector stores will be saved (db directory).

File Existence Check:

# Check if the text file exists
if not os.path.exists(file_path):
    raise FileNotFoundError(
        f"The file {file_path} does not exist. Please check the path."
    )
  • The code checks if the specified text file exists. If not, it raises a FileNotFoundError, which stops the script and provides an error message.

Loading Text Content:

# Read the text content from the file
loader = TextLoader(file_path)
documents = loader.load()
  • The TextLoader class is used to load the content of the text file into the script. This content is stored in the documents variable.
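
If you want to see what the loader produced, a quick inspection like the following works (the exact metadata depends on the loader; the path in the comment is just an example):

print(len(documents))                   # a plain .txt file typically yields a single Document
print(documents[0].metadata)            # e.g. {'source': '.../books/romeo_and_juliet.txt'}
print(documents[0].page_content[:200])  # first 200 characters of the play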

Define Embedding Model:

# Define the embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)  # Update to a valid embedding model if needed
  • The code sets up the embedding model, which converts the text data into numerical vectors. This specific model (text-embedding-3-small) is used for embedding the text data, and it can be updated if needed.
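
As a quick sanity check, you can embed a single string and look at the resulting vector (this assumes OPENAI_API_KEY is set in your environment and makes a billed API call):

vector = embeddings.embed_query("How did Juliet die?")
print(len(vector))   # text-embedding-3-small produces 1536-dimensional vectors by default
print(vector[:5])    # first few components of the embedding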

Function to Create and Persist Vector Store:

# Function to create and persist vector store
def create_vector_store(docs, store_name):
    persistent_directory = os.path.join(db_dir, store_name)
    if not os.path.exists(persistent_directory):
        print(f"\n--- Creating vector store {store_name} ---")
        db = Chroma.from_documents(
            docs, embeddings, persist_directory=persistent_directory
        )
        print(f"--- Finished creating vector store {store_name} ---")
    else:
        print(
            f"Vector store {store_name} already exists. No need to initialize."
        )
  • This function checks if a vector store already exists in the specified directory. If not, it creates and persists a new vector store using the provided documents and embeddings. The vector store is saved in the db_dir directory with the name provided by store_name.

Now we come to the core of this article: how to split the input corpus. Below we look at several ways to split the data and compare their outputs.

  1. Character-based Splitting:
# 1. Character-based Splitting
# Splits text into chunks based on a specified number of characters.
# Useful for consistent chunk sizes regardless of content structure.
print("\n--- Using Character-based Splitting ---")
char_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
char_docs = char_splitter.split_documents(documents)
create_vector_store(char_docs, "chroma_db_char")

The text is split into chunks of 1000 characters, with a 100-character overlap between chunks. This method ensures consistent chunk sizes regardless of the text’s content structure. The resulting chunks are stored in a vector store named chroma_db_char.
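
One thing to keep in mind: CharacterTextSplitter splits on a single separator ("\n\n" by default), so individual chunks can end up longer than chunk_size when a paragraph is large. A small, optional check like the one below makes the actual chunk sizes visible:

print(f"Number of chunks: {len(char_docs)}")
lengths = [len(d.page_content) for d in char_docs]
print(f"Min / avg / max chunk length: {min(lengths)} / {sum(lengths) // len(lengths)} / {max(lengths)}")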

2. Sentence-based Splitting:

# 2. Sentence-based Splitting
# Splits text into chunks based on sentences, ensuring chunks end at sentence boundaries.
# Ideal for maintaining semantic coherence within chunks.
print("\n--- Using Sentence-based Splitting ---")
sent_splitter = SentenceTransformersTokenTextSplitter(chunk_size=1000)
sent_docs = sent_splitter.split_documents(documents)
create_vector_store(sent_docs, "chroma_db_sent")

The text is split into chunks whose size is measured in tokens of a sentence-transformers model (all-mpnet-base-v2 by default) rather than in characters. This tends to keep the chunks semantically coherent. The resulting chunks are stored in a vector store named chroma_db_sent.
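
Note that with SentenceTransformersTokenTextSplitter the parameter that directly controls chunk size is tokens_per_chunk. A more explicit parameterization would look like this (a sketch; it assumes the sentence-transformers package is installed, and the values are illustrative):

sent_splitter_explicit = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",  # the default model
    tokens_per_chunk=256,  # must fit within the model's maximum sequence length
    chunk_overlap=50,
)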

3. Token-based Splitting:

# 3. Token-based Splitting
# Splits text into chunks based on tokens (words or subwords), using tokenizers like GPT-2.
# Useful for transformer models with strict token limits.
print("\n--- Using Token-based Splitting ---")
token_splitter = TokenTextSplitter(chunk_overlap=0, chunk_size=512)
token_docs = token_splitter.split_documents(documents)
create_vector_store(token_docs, "chroma_db_token")

The text is split into chunks based on tokens (such as words or subwords). This method is particularly useful when working with transformer models that have strict token limits. The resulting chunks are stored in a vector store named chroma_db_token.
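
If you want to verify that every chunk really stays within the 512-token limit, you can count tokens with tiktoken, which TokenTextSplitter uses under the hood (the "gpt2" encoding is its default; this assumes tiktoken is installed):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
token_counts = [len(enc.encode(d.page_content)) for d in token_docs]
print(f"Max tokens per chunk: {max(token_counts)}")  # expected to be <= 512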

4. Recursive Character-based Splitting:

# 4. Recursive Character-based Splitting
# Attempts to split text at natural boundaries (sentences, paragraphs) within character limit.
# Balances between maintaining coherence and adhering to character limits.
print("\n--- Using Recursive Character-based Splitting ---")
rec_char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
)
rec_char_docs = rec_char_splitter.split_documents(documents)
create_vector_store(rec_char_docs, "chroma_db_rec_char")

This method attempts to split the text at natural boundaries (such as sentences or paragraphs) while adhering to a character limit. It balances maintaining coherence with keeping the chunks within the specified size. The resulting chunks are stored in a vector store named chroma_db_rec_char.
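
Under the hood, RecursiveCharacterTextSplitter tries a list of separators in order (roughly ["\n\n", "\n", " ", ""] by default) and only falls back to the next, finer separator when a piece is still larger than chunk_size. You can also pass your own separator list, for example to prefer blank-line boundaries in a play (illustrative values):

rec_char_splitter_custom = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # tried in order, from coarse to fine
    chunk_size=1000,
    chunk_overlap=100,
)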

5. Custom Splitting:

# 5. Custom Splitting
# Allows creating custom splitting logic based on specific requirements.
# Useful for documents with unique structure that standard splitters can't handle.
print("\n--- Using Custom Splitting ---")


class CustomTextSplitter(TextSplitter):
    def split_text(self, text):
        # Custom logic for splitting text
        return text.split("\n\n")  # Example: split by paragraphs


custom_splitter = CustomTextSplitter()
custom_docs = custom_splitter.split_documents(documents)
create_vector_store(custom_docs, "chroma_db_custom")

This section defines a custom text splitter that splits the text based on specific requirements (in this case, by paragraphs). This approach is useful for documents with unique structures that standard splitters may not handle well. The resulting chunks are stored in a vector store named chroma_db_custom.
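
A quick check of the custom splitter on a small string shows the behaviour (illustrative input):

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(CustomTextSplitter().split_text(sample))
# ['First paragraph.', 'Second paragraph.', 'Third paragraph.']

Because split_documents is implemented on top of split_text in the base TextSplitter class, overriding split_text is enough for the custom splitter to work on whole documents.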

Query Vector Store:

# Function to query a vector store
def query_vector_store(store_name, query):
    persistent_directory = os.path.join(db_dir, store_name)
    if os.path.exists(persistent_directory):
        print(f"\n--- Querying the Vector Store {store_name} ---")
        db = Chroma(
            persist_directory=persistent_directory, embedding_function=embeddings
        )
        retriever = db.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"k": 1, "score_threshold": 0.1},
        )
        relevant_docs = retriever.invoke(query)
        # Display the relevant results with metadata
        print(f"\n--- Relevant Documents for {store_name} ---")
        for i, doc in enumerate(relevant_docs, 1):
            print(f"Document {i}:\n{doc.page_content}\n")
            if doc.metadata:
                print(f"Source: {doc.metadata.get('source', 'Unknown')}\n")
    else:
        print(f"Vector store {store_name} does not exist.")

This function queries a specific vector store with a user-defined query. It checks if the vector store exists, retrieves relevant documents based on the query, and then displays the results along with any metadata.
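
The retriever above returns at most one chunk and applies a relevance-score threshold. As a variant (illustrative), you could open one of the persisted stores and run a plain top-k similarity search instead:

db_char = Chroma(
    persist_directory=os.path.join(db_dir, "chroma_db_char"),
    embedding_function=embeddings,
)
retriever_topk = db_char.as_retriever(search_type="similarity", search_kwargs={"k": 3})
for doc in retriever_topk.invoke("How did Juliet die?"):
    print(doc.page_content[:200])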

Query Execution:

# Define the user's question
query = "How did Juliet die?"

# Query each vector store
query_vector_store("chroma_db_char", query)
query_vector_store("chroma_db_sent", query)
query_vector_store("chroma_db_token", query)
query_vector_store("chroma_db_rec_char", query)
query_vector_store("chroma_db_custom", query)

The code defines a query ("How did Juliet die?") and uses it to query each of the previously created vector stores. It calls the query_vector_store function for each store, retrieving and displaying the relevant documents.

Here is the complete code for your reference.

import os

from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
    TextSplitter,
    TokenTextSplitter,
)
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Define the directory containing the text file
current_dir = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(current_dir, "books", "romeo_and_juliet.txt")
db_dir = os.path.join(current_dir, "db")

# Check if the text file exists
if not os.path.exists(file_path):
    raise FileNotFoundError(
        f"The file {file_path} does not exist. Please check the path."
    )

# Read the text content from the file
loader = TextLoader(file_path)
documents = loader.load()

# Define the embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)  # Update to a valid embedding model if needed


# Function to create and persist vector store
def create_vector_store(docs, store_name):
    persistent_directory = os.path.join(db_dir, store_name)
    if not os.path.exists(persistent_directory):
        print(f"\n--- Creating vector store {store_name} ---")
        db = Chroma.from_documents(
            docs, embeddings, persist_directory=persistent_directory
        )
        print(f"--- Finished creating vector store {store_name} ---")
    else:
        print(
            f"Vector store {store_name} already exists. No need to initialize."
        )


# 1. Character-based Splitting
# Splits text into chunks based on a specified number of characters.
# Useful for consistent chunk sizes regardless of content structure.
print("\n--- Using Character-based Splitting ---")
char_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
char_docs = char_splitter.split_documents(documents)
create_vector_store(char_docs, "chroma_db_char")

# 2. Sentence-based Splitting
# Splits text into chunks based on sentences, ensuring chunks end at sentence boundaries.
# Ideal for maintaining semantic coherence within chunks.
print("\n--- Using Sentence-based Splitting ---")
sent_splitter = SentenceTransformersTokenTextSplitter(chunk_size=1000)
sent_docs = sent_splitter.split_documents(documents)
create_vector_store(sent_docs, "chroma_db_sent")

# 3. Token-based Splitting
# Splits text into chunks based on tokens (words or subwords), using tokenizers like GPT-2.
# Useful for transformer models with strict token limits.
print("\n--- Using Token-based Splitting ---")
token_splitter = TokenTextSplitter(chunk_overlap=0, chunk_size=512)
token_docs = token_splitter.split_documents(documents)
create_vector_store(token_docs, "chroma_db_token")

# 4. Recursive Character-based Splitting
# Attempts to split text at natural boundaries (sentences, paragraphs) within character limit.
# Balances between maintaining coherence and adhering to character limits.
print("\n--- Using Recursive Character-based Splitting ---")
rec_char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
)
rec_char_docs = rec_char_splitter.split_documents(documents)
create_vector_store(rec_char_docs, "chroma_db_rec_char")

# 5. Custom Splitting
# Allows creating custom splitting logic based on specific requirements.
# Useful for documents with unique structure that standard splitters can't handle.
print("\n--- Using Custom Splitting ---")


class CustomTextSplitter(TextSplitter):
    def split_text(self, text):
        # Custom logic for splitting text
        return text.split("\n\n")  # Example: split by paragraphs


custom_splitter = CustomTextSplitter()
custom_docs = custom_splitter.split_documents(documents)
create_vector_store(custom_docs, "chroma_db_custom")


# Function to query a vector store
def query_vector_store(store_name, query):
    persistent_directory = os.path.join(db_dir, store_name)
    if os.path.exists(persistent_directory):
        print(f"\n--- Querying the Vector Store {store_name} ---")
        db = Chroma(
            persist_directory=persistent_directory, embedding_function=embeddings
        )
        retriever = db.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"k": 1, "score_threshold": 0.1},
        )
        relevant_docs = retriever.invoke(query)
        # Display the relevant results with metadata
        print(f"\n--- Relevant Documents for {store_name} ---")
        for i, doc in enumerate(relevant_docs, 1):
            print(f"Document {i}:\n{doc.page_content}\n")
            if doc.metadata:
                print(f"Source: {doc.metadata.get('source', 'Unknown')}\n")
    else:
        print(f"Vector store {store_name} does not exist.")


# Define the user's question
query = "How did Juliet die?"

# Query each vector store
query_vector_store("chroma_db_char", query)
query_vector_store("chroma_db_sent", query)
query_vector_store("chroma_db_token", query)
query_vector_store("chroma_db_rec_char", query)
query_vector_store("chroma_db_custom", query)


This code provides a comprehensive approach to document processing, splitting, vectorization, and querying. It allows for experimentation with different text splitting methods and demonstrates how to store and query text embeddings using the Chroma vector store.

References:

  1. https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/
  2. https://python.langchain.com/v0.1/docs/modules/data_connection/text_embedding/
  3. https://python.langchain.com/v0.2/docs/integrations/chat/
  4. https://brandonhancock.io/langchain-master-class
  5. https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

