Retrieval in LangChain: Part 2 — Text Splitters

Sushmitha
4 min read · Mar 17, 2024


Welcome to the second article of the series, where we explore the various elements of the retrieval module of LangChain. In the first article, we learned what RAG is, its framework, how it works, and the first step involved in the process of retrieval — Document Loaders. Let's get started with the second step, Text Splitters — slice like a pro.

Any NLP task involving a long document might need preprocessing or transformation to improve the accuracy or efficiency of the task at hand. Text splitters come in handy for breaking huge documents down into chunks that enable analysis at a more granular level. LangChain gives the user various options to transform documents by chunking them into meaningful portions and then combining the smaller chunks into larger chunks of a particular size, with overlap to retain context. So the focus will be on how the text is split and how the chunk size is measured.

Types of Text Splitters:

1. Character Text Splitter: This is the simplest method: it splits the text on a single character separator, which is computationally cheap and doesn't require any NLP libraries. Here the text split is done on characters and the chunk size is measured by the number of characters.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

docs = text_splitter.create_documents([text])

The chunk_overlap parameter helps retain the semantic context between chunks. Metadata can also be passed along with the documents, as shown below.
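
For instance, create_documents accepts a metadatas list with one dict per input text (a minimal sketch; the texts and source names here are made up):

# Reusing the text_splitter from above; texts and sources are illustrative
texts = ["First document text...", "Second document text..."]
metadatas = [{"source": "doc1.txt"}, {"source": "doc2.txt"}]

docs = text_splitter.create_documents(texts, metadatas=metadatas)
print(docs[0].metadata)  # {'source': 'doc1.txt'}

Every chunk produced from a given text inherits that text's metadata.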

2. NLTK Text Splitter: When we want to focus more on the nature of the context, we might end up using sentence chunking. The naive approach to sentence chunking is text.split("."). However, LangChain has a better approach that uses the NLTK tokenizer to perform the split. Here the text split is done on NLTK sentence tokens and the chunk size is measured by the number of characters.

from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000, chunk_overlap=100)

texts = text_splitter.split_text(text)
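
Under the hood, NLTKTextSplitter relies on NLTK's sentence tokenizer, so the tokenizer data needs to be available; a one-time setup:

pip install nltk

import nltk
nltk.download("punkt")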

3. Spacy Text Splitter: Another alternative to NLTK is the spaCy tokenizer, which offers more sophisticated sentence segmentation that separates the text into chunks while preserving the context in a better way. Here the text split is done on spaCy sentences and the chunk size is measured by the number of characters.

from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000, chunk_overlap=100)

texts = text_splitter.split_text(text)
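
SpacyTextSplitter loads a spaCy pipeline (en_core_web_sm by default) for sentence segmentation, so that pipeline has to be installed first:

pip install spacy
python -m spacy download en_core_web_sm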

4. Recursive Character Text Splitter: This text splitter comes into the picture when the text exceeds the chunk length and no single separator is enough to chunk it. This method takes an ordered list of separators like ["\n\n", "\n", " ", ""] and splits the text into smaller chunks iteratively until the desired chunk size is reached. The resulting chunks will not all be the same size, but they will be of similar sizes. Here the text split is done on the list of characters and the chunk size is measured by the number of characters.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=1200,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

chunk_list = text_splitter.create_documents(texts)

Note: The split first happens at "\n\n". If a chunk still exceeds the chunk size, the splitter moves to the next separator "\n"; if it still exceeds, it moves on to " ", and so on, as the sketch below shows.
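
A quick sketch makes this fallback visible (the sample text is made up): with a deliberately small chunk size, the blank-line separator yields nothing small enough, so the splitter falls back to "\n" and returns the two sentences as separate chunks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = "LangChain text splitters chunk long documents.\nEach chunk keeps some overlap for context."

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=0)
for chunk in splitter.split_text(sample):
    print(repr(chunk))  # prints each sentence as its own chunk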

5. Splitting on tokens: Handling token limits in language models is pivotal for seamless operation and optimal performance, since models truncate or reject inputs that exceed their context window. It is a good practice to count the number of tokens after getting the chunks.

Tiktoken is a fast BPE tokenizer created by OpenAI. It can be used to track the number of tokens used and best suits OpenAI models. Here the text split is done on the characters passed in and the chunk size is measured by the tiktoken tokenizer.

pip install tiktoken

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator="\n\n",
    chunk_size=1200,
    chunk_overlap=100,
    is_separator_regex=False,
    model_name="text-embedding-3-small",
)

doc_list = text_splitter.create_documents([text])

The model_name refers to the model whose tokenizer is used for counting the tokens; when it is given, the matching encoding is inferred, so encoding_name (an encoding such as "cl100k_base", not a model name) is only needed when no model is specified. The split text can also be converted to a list of documents.

from langchain.docstore.document import Document

# line_list holds the chunks returned by split_text; the filepath is illustrative
line_list = text_splitter.split_text(text)
filepath = "data/source.txt"

doc_list = []
for line in line_list:
    curr_doc = Document(page_content=line, metadata={"source": filepath})
    doc_list.append(curr_doc)
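
To double-check that no chunk exceeds a model's context limit, tokens can be counted directly with tiktoken; a minimal sketch, assuming the cl100k_base encoding used by recent OpenAI models:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

for doc in doc_list:
    # Encode each chunk and report its token count
    print(len(encoding.encode(doc.page_content)))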

6. Sentence Transformers Token Text Splitter: This is a specialized text splitter for use with sentence-transformer models; it measures chunk length in the tokens of the chosen model, so each chunk fits within that model's input size.

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    tokens_per_chunk=64,
    chunk_overlap=0,
    model_name="intfloat/e5-base-v2",
)

chunks = splitter.split_text(text=text)
print(chunks)
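
The splitter also exposes a count_tokens helper, which makes it easy to confirm that every chunk stays within the model's sequence limit:

for chunk in chunks:
    print(splitter.count_tokens(text=chunk))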

7. Code Splitter: This type lets you split code, and it comes with multiple language options like Python, Java, LaTeX, HTML, Scala, C, and a lot more.

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# An illustrative sample to split
python_code = "def hello():\n    print('Hello, world!')\n\nhello()\n"

text_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=50,
    chunk_overlap=10,
)

text_splitter.create_documents(texts=[python_code])

We can even inspect the separators used for a given language.

RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
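
For Python, the returned list starts with class and function boundaries ("\nclass ", "\ndef ") before falling back to blank lines, newlines, and spaces, which is why the resulting chunks tend to align with whole definitions.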

And we're not done there yet. LangChain has a lot more options: semantic chunking, token splitters for KoNLPy and Hugging Face tokenizers, and more.

That's all about text splitters. Experimenting with a range of chunk sizes will help strike a balance between preserving context and maintaining accuracy. Keep in mind that no single setting works in every case, so get started with experimentation.

Now that we know how to chunk the text, let's see how we can embed the chunks in our next article. Thanks for reading.

References:

1. https://python.langchain.com/docs/modules/data_connection/document_transformers/

2. https://www.pinecone.io/learn/chunking-strategies/
