Getting Started: Chunking Strategy

Chris McKenzie
4 min read · Nov 7, 2023


Optimize Vector Search with the Right Chunking Strategy

Imagine you’re developing an app that helps users search for specific episodes of “The Office” using text queries. You start by downloading transcripts of all the episodes, but how do you build a useful search over them? One method is to convert the transcripts into embeddings and perform a distance search between them and an embedding of the query.

If you decide to go down this path, you’re going to need to have a strategy for how you chunk the data.

Naively, you might think you can grab your favorite vector database — let’s say ChromaDB — and simply add each episode’s transcript as a document. However, you’re likely to encounter two issues:

  • The transcript may exceed the token limit supported by the embedding model.
  • Embedding the entire transcript as a single document blurs together many unrelated scenes, losing important context and making accurate searches challenging.

What is chunking?

Chunking, as the name implies, is the process of breaking a document into smaller, semantically meaningful pieces. Each of these sub-documents carries a more focused context, so its embedding is a better match for specific queries. Chunking also lets you create embeddings for documents that are too long for the embedding model’s token limit.

Looking at The Office example, a good strategy might be chunking the transcript into scenes. This way, the embedding can capture the context of the scene, which is likely what a user will search for.

We could take it further and chunk by each character’s lines, which may or may not yield better results. This could be too fine-grained, potentially losing the conversation’s context. The optimal strategy will vary based on the use case.
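As a concrete sketch, suppose each scene in the transcript is introduced with a bracketed marker like “[Scene: …]” (a hypothetical format; adjust the pattern to whatever your transcripts actually use). Splitting on those markers might look like this:

const transcript = "...";

// A minimal sketch of scene-level chunking. The "[Scene:" marker is an
// assumption about the transcript format, not a real convention.
function splitByScene(transcript) {
  return transcript
    .split(/(?=\[Scene:)/) // lookahead keeps each marker with its scene
    .map((scene) => scene.trim())
    .filter((scene) => scene.length > 0);
}

const scenes = splitByScene(transcript);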

How to chunk?

It’s important to understand your use case before landing on a chunking strategy. Things you’ll want to consider:

  • What is your content? Are you looking to index books, articles, or shorter content like product descriptions or social media posts?
  • What embedding model are you using? Different models have varying maximum token lengths and distinct “sweet spots” for chunk sizes.
  • How do you plan to use the retrieved results in your application? Are they for semantic search, question answering, summarization, or something else?
  • How will your users query your data? The goal is to create as close a match as possible between the query and the retrieved chunks.

Chunking Strategies

The following code examples use langchain.js to create the document chunks. I recommend the library, but it isn’t required; the examples simply illustrate the different strategies.

Fixed-size chunking

Fixed-size chunking is a simple way to divide a file into equally sized segments, and it’s a reasonable default for many common cases. You just need to decide the length of each chunk and whether consecutive chunks should overlap. Generally, you’ll want some overlap so semantic context isn’t lost at chunk boundaries.

import { CharacterTextSplitter } from "langchain/text_splitter";

const text = "...";

const textSplitter = new CharacterTextSplitter({
  chunkSize: 256,
  chunkOverlap: 20,
});

const docs = await textSplitter.createDocuments([text]);
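Note that chunkSize here is measured in characters. If you’d rather size chunks in model tokens, langchain.js also ships a TokenTextSplitter. A minimal sketch (the cl100k_base encoding is an assumption; verify which encoding your embedding model uses):

import { TokenTextSplitter } from "langchain/text_splitter";

// Size chunks by token count rather than characters, which maps more
// directly onto an embedding model's token limit.
const tokenSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base", // assumption: pick the encoding for your model
  chunkSize: 256,
  chunkOverlap: 20,
});

const tokenDocs = await tokenSplitter.createDocuments([text]);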

“Content-aware” Chunking

Content-aware chunking is a collection of chunking methods that use the nature of the content to apply a more precise chunking strategy.

Sentence Splitting

LangChain’s Python implementation ships with built-in sentence splitting. Unfortunately, langchain.js does not yet have this feature, but we can easily implement our own by extending the `TextSplitter` class. The following is a naive implementation, but it’s a good starting point.

import { TextSplitter } from "langchain/text_splitter";

class SentenceTextSplitter extends TextSplitter {
  async splitText(text) {
    // Naively split on whitespace that follows common sentence endings.
    return text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  }
}
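
Using it looks the same as the built-in splitters; since splitText returns whole sentences, each sentence becomes its own document:

const text = "...";

const sentenceSplitter = new SentenceTextSplitter();
const docs = await sentenceSplitter.createDocuments([text]);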

If you want a more robust implementation, you can use a library like natural to handle the sentence splitting.

Recursive Chunking

Recursive chunking splits the document on a list of separators (by default “\n\n”, “\n”, and “ ”) recursively, keeping as much of the semantically relevant context together as possible.

The important options are `chunkSize` and `chunkOverlap`:

  • `chunkSize` is the maximum size of a chunk (default: 1000). If a split is larger than `chunkSize`, it will be split into smaller chunks recursively.
  • `chunkOverlap` is the number of characters shared between adjacent chunks (default: 200). This is useful for preserving context across chunk boundaries.

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const text = "...";

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 256,
  chunkOverlap: 20,
});

const docs = await textSplitter.createDocuments([text]);
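The separators are also configurable, which lets you teach the splitter about your document’s structure before it falls back to paragraphs and words. A sketch for The Office transcripts (the “[Scene:” separator is an assumption about the transcript format):

// Try hypothetical scene boundaries first, then paragraphs, lines, and words.
const sceneAwareSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 100,
  separators: ["[Scene:", "\n\n", "\n", " "],
});

const sceneDocs = await sceneAwareSplitter.createDocuments([text]);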

Specialized Chunking

Another content-aware strategy is to use the underlying format to chunk the text. This is useful for formats that have a clear structure, like Markdown or LaTeX.

Markdown

import { MarkdownTextSplitter } from "langchain/text_splitter";

const text = "...";

const textSplitter = new MarkdownTextSplitter({
  chunkSize: 100,
  chunkOverlap: 0,
});

const docs = await textSplitter.createDocuments([text]);

LaTeX

import { LatexTextSplitter } from "langchain/text_splitter";

const text = "...";

const textSplitter = new LatexTextSplitter({
  chunkSize: 100,
  chunkOverlap: 0,
});

const docs = await textSplitter.createDocuments([text]);
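These splitters are thin wrappers around recursive chunking with format-specific separators. langchain.js exposes the same mechanism for source code through RecursiveCharacterTextSplitter.fromLanguage; a brief sketch for JavaScript:

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const code = "...";

// Split source code on language-aware boundaries (functions, classes, etc.)
// before falling back to lines and characters.
const jsSplitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 100,
  chunkOverlap: 0,
});

const jsDocs = await jsSplitter.createDocuments([code]);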

Final Thoughts

There is no one-size-fits-all strategy. The right choice depends on your content, your users’ queries, and your embedding model. The best way to find it is to experiment with different strategies and measure which one works best for your use case.

I created a playground that’s a good starting point for experimenting with different strategies.
