Langchain's Character Text Splitter - In-Depth Explanation

Krishna Hariharan
6 min readApr 24, 2024


We live in a time where we use LLM-based applications in one way or another, often without even realizing it. Large Language Models, which boomed last year thanks to OpenAI’s ChatGPT, have opened doors for many entrepreneurs and startups.

But if you are a techie, you must know that a Large Language Model alone won’t make an awesome application. It’s often the knowledge base the LLM has been tuned on that gives any application its extra edge.

Since we are talking about LLMs here, it is understandable that the knowledge base for any LLM is going to be text. Text comes in various forms: different languages, different structures, even different programming languages. All of these still fall under the category of text, and what makes the LLM serve your purpose is the kind of text, and the context of that text, on which you tune it. Example: the Harry Potter book series, or (to spoil your mood) a textbook on Data Structures & Algorithms.

Fig 1 — Dense Text Books

The reason I take the examples of the Harry Potter books or DSA is to make you imagine the volume and density of the text in them. It’s so much that any LLM will fail to take it all in at once as its knowledge base.

Why can’t an LLM handle dense texts?

If you have been through OpenAI’s official website, or any other language model developer’s website, you must have come across a term called the context window, which is usually measured in tokens.

The context window is the maximum size of the context that you can feed an LLM. Basically, every LLM has a context window beyond which it won’t be able to handle input.
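To make the idea concrete, here is a minimal, library-free sketch of a context-window check. The 4-characters-per-token ratio is a common rule of thumb for English text, not an exact tokenizer, and both function names are illustrative assumptions; real limits depend on the model’s own tokenizer.

```python
# A rough context-window check. The 4-chars-per-token ratio is a common
# rule of thumb for English, not an exact tokenizer; the names here are
# illustrative, not from any library.

def rough_token_count(text: str) -> int:
    """Estimate the token count as roughly 1 token per 4 characters."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, context_window: int = 8192) -> bool:
    """Check whether the text likely fits within the context window."""
    return rough_token_count(text) <= context_window

print(fits_in_context("a short prompt"))  # True
print(fits_in_context("x" * 100_000))     # False: ~25,000 estimated tokens
```

Any text that fails a check like this has to be broken down before it can be fed to the model, which is where chunking comes in.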

Fig 2 — OpenAI Context Window Limit

What if I need an LLM tuned on huge text data?

This is exactly where chunking comes to the rescue. Chunking is the process of breaking down a humongous text into small chunks, so that they can be fed easily, as and when needed, to an LLM.
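As a minimal sketch of the idea, here is chunking with overlap in plain Python. The function name and sliding-window approach are illustrative assumptions, not Langchain’s implementation:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slide a window of chunk_size characters over the text,
    stepping forward by (chunk_size - overlap) each time so that
    consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij', 'j']
```

Notice how each chunk repeats the last character of the previous one; that repetition is what preserves context across chunk boundaries.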

There are various chunking strategies available, and each strategy has its pros and cons. As a GenAI engineer, once a document is loaded, you want to transform the text in it according to your application, and you should at least know which chunking strategy would serve your purpose, as chunking plays an important role in deciding the performance and accuracy of the LLM.

Today, let’s dive deep into one of the commonly used chunking strategies, i.e. the Character Text Splitter from Langchain.

Character Text Splitter:

As the name explains itself, in the Character Text Splitter the chunks are segregated based on a specific character. It can be a full stop, a new line, or any other character.

Syntax:

from langchain.text_splitter import CharacterTextSplitter

CharacterTextSplitter(
    separator=".",
    chunk_size=2,
    chunk_overlap=1,
    length_function=len
)

Separator: The separator is the parameter with which you decide which character is used for chunking the text. As said earlier, commonly used separators are a new line (‘\n’) or a full stop (‘.’). The chunks can be split on various characters, depending on the application.
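A simplified, plain-Python sketch of the first stage of this behaviour, splitting on the separator before any merging up to chunk_size takes place (the function name is an assumption; Langchain’s actual implementation does more than this):

```python
def split_on_separator(text: str, separator: str = ".") -> list[str]:
    """Cut the text at every occurrence of the separator,
    dropping empty pieces and surrounding whitespace."""
    return [part.strip() for part in text.split(separator) if part.strip()]

print(split_on_separator("First sentence. Second sentence. Third."))
# ['First sentence', 'Second sentence', 'Third']
```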

It is also important to note that the Character Text Splitter doesn’t discriminate between a white space, a comma, a full stop, or a single word. They are all just characters to it, and it doesn’t really care until you ask it to make chunks based on a specific character.

Let’s say the separator is “.” and commas and quotation marks appear in between as part of the text; the Character Text Splitter doesn’t really care. It counts them as characters and doesn’t split the chunk until the next “.” is met.

Chunk Size: This describes the maximum size a single chunk can be. If a chunk goes beyond this limit, Langchain will throw a warning as well.

Chunk Overlap: This determines the amount of overlap that should exist between two consecutive chunks. This is important because, if there is no overlap between consecutive chunks, there is a good chance the model will lose out on context.

ChunkViz.com

To understand it better, I made up a text string, pasted it into chunkviz.com, and set the chunk size to 25 and the chunk overlap to 5. As you can see in the image above, 29 chunks have been created, and each chunk is represented by a unique colour. The green highlights in between the chunks are the overlaps. As the chunk overlap given is 5, there is an overlap of 5 characters between two chunks. Thanks to chunkviz.com, which lets us visually understand chunk size and chunk overlap.

Length Function: This takes the function by which the length of a chunk is calculated. The commonly used function is len; depending on the application, you can write your own function and pass it to this parameter, and the chunk size will be measured accordingly.
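For instance, length_function just needs to be a callable that maps a string to a number. len counts characters; a hypothetical word-based measure (my own example, not part of Langchain) could look like this:

```python
def word_length(text: str) -> int:
    """Measure chunk size in words instead of characters."""
    return len(text.split())

print(word_length("chunking plays an important role"))  # 5
```

Passing a function like this as length_function would make chunk_size and chunk_overlap be interpreted in words rather than characters.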

Is Separator Regex: This is not a commonly used parameter, but in case you are dealing with raw text that contains regex-style characters like brackets, and you need to use a regular expression as the separator, this parameter needs to be set to True; otherwise it should be False.
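A plain-Python sketch of what a regex separator enables: the pattern below splits on either a full stop or a newline in one pass, something a single literal separator cannot do. This uses Python’s re module directly rather than Langchain itself.

```python
import re

# Split on either a full stop or a newline, dropping empty pieces.
pattern = r"[.\n]"
text = "First part. Second part\nThird part"
parts = [p.strip() for p in re.split(pattern, text) if p.strip()]
print(parts)  # ['First part', 'Second part', 'Third part']
```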

Example:

Now let’s play around with an example so that we can understand it better.

This is going to be our input text, with which we are going to be playing around.

from langchain.text_splitter import CharacterTextSplitter

texts = """This is exactly where chunking comes to the rescue.
Chunking is the process of breaking down the humongous text into small chunks of texts so that it could be fed easily as and when needed to an LLM.
There are various chunking strategies available and each strategy has its pros and cons.
As a GenAI engineer, once a document is loaded, you want to transform the texts in the document according to your application and one should at least know which chunking strategy would serve their purpose, as chunking plays an important role in deciding the performance / accuracy of the LLM."""

text_splitter = CharacterTextSplitter(
    separator=".",
    chunk_size=2,
    chunk_overlap=1,
    length_function=len
)
text = text_splitter.split_text(texts)

Now text will be the list holding all the chunks. We can find the number of chunks formed by measuring the length of the list. In this case there are four sentences separated by full stops, hence the number of chunks should be 4.

Let’s see what is in the first chunk. Ideally, it should be the first sentence of texts, since the separator is given as a full stop.
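If Langchain isn’t installed, a plain-Python split on the full stop approximates what split_text returns here. This is an approximation for illustration, not the library’s exact merging logic, and the sample string is shortened:

```python
texts = (
    "This is exactly where chunking comes to the rescue. "
    "Chunking is the process of breaking down the humongous text "
    "into small chunks of texts."
)

# Approximate the splitter's behaviour: cut on "." and strip whitespace.
chunks = [c.strip() for c in texts.split(".") if c.strip()]
print(len(chunks))  # 2
print(chunks[0])    # This is exactly where chunking comes to the rescue
```

As expected, the first chunk is the first sentence, since the separator is the full stop.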

In conclusion, the Character Text Splitter stands as a formidable ally in the realm of text chunking strategies, offering a dynamic approach to breaking down large textual datasets into manageable chunks. With its ability to segment text at the character level and its customizable parameters for chunk size and overlap, this splitter ensures that contextual integrity is preserved while optimizing the efficiency of text processing tasks.

Pros:

Easy & simple to use

Cons:

The structure of the document is not taken into consideration

Reference:

Fig 1 Image Source: https://www.formaxprinting.com/

Fig 2 https://openai.com/pricing

Fig 4 chunkviz.com
