LlamaIndex: Chunking Strategies for Large Language Models, Part 1

BavalpreetSinghh
19 min read · Mar 3, 2024


In the previous blog, we covered how to build and query document indexes with LlamaIndex. With large input documents, such as PDFs or .txt files, querying those indexes can return poor results. Several factors can be tuned to address this, and one of them is the chunking (node creation) process in LlamaIndex. Before diving into the details, let’s set the stage.


Building and querying indexes is a crucial part of developing RAG applications, so you might have some questions at this point. What exactly is RAG? What does retrieval mean in this context? And how does LlamaIndex tackle the challenges discussed above?

Retrieval Augmented Generation (RAG) is a system that augments a Large Language Model (LLM) by adding extra context or information from another source.

Retrieval is the process of bringing extra information or context to your language models.

LlamaIndex addresses the challenges of scaling language models to large document collections with two key strategies. First, it chunks documents into smaller contexts, such as sentences or paragraphs, which are referred to as Nodes. These Nodes can be processed efficiently by language models. Second, it indexes these Nodes using vector embeddings, enabling fast semantic search.
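To make the two strategies concrete, here is a rough, library-free sketch. It is not how LlamaIndex is implemented: the "embedding" is just a bag-of-words count vector standing in for a real embedding model, and the helper names (`split_into_nodes`, `embed`, `retrieve`) are made up for illustration.

```python
import math
import re
from collections import Counter


def split_into_nodes(text):
    """Split text into sentence-level chunks (a stand-in for Nodes)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


def embed(text):
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, nodes, top_k=1):
    """Return the top_k nodes most similar to the query."""
    q = embed(query)
    ranked = sorted(nodes, key=lambda n: cosine(q, embed(n)), reverse=True)
    return ranked[:top_k]


text = (
    "Cats sleep most of the day. "
    "Vector embeddings map text to points in space. "
    "Paris is the capital of France."
)
nodes = split_into_nodes(text)
best = retrieve("What is the capital of France?", nodes)
print(best)
```

Only the best-matching chunk would be handed to the language model, instead of the whole document.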

By chunking documents and leveraging vector embeddings, LlamaIndex enables scalable semantic search over large datasets (we will discuss this in detail in the next blog, along with other relevant techniques). It does so by retrieving relevant Nodes from the index and synthesizing responses with a language model. In this blog, we focus only on chunking, or node curation.

Let’s begin with text splitting, chunking, and node curation, which all refer to the same idea. We cannot pass unlimited data to the application, for two main reasons:

1. Context limit: Language models have limited context windows.
2. Signal to noise ratio: Language models are more effective when the information provided is relevant to the task.
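A minimal sketch of what respecting a context limit can look like: whole sentences are packed greedily into chunks that stay under a budget. The budget here counts words as a crude stand-in for tokens, and `chunk_by_budget` is a hypothetical helper, not a LlamaIndex function.

```python
import re


def chunk_by_budget(text, max_words=20):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own
    (oversized) chunk; this sketch never cuts inside a sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six seven. Eight nine. Ten eleven twelve thirteen fourteen."
chunks = chunk_by_budget(text, max_words=7)
for chunk in chunks:
    print(len(chunk.split()), "words:", chunk)
```

Smaller budgets give more focused chunks (better signal to noise) at the cost of less surrounding context per chunk.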

The aim is not to chunk for the sake of chunking but to get the data in a format where it can be used for anticipated tasks, and retrieved for value later. Rather than asking “How should I chunk my data?”, the actual question should be “What is the optimal way for me to pass data to my language model that it needs for its task?”

! pip install llama_index

Node Parser

Node parsers break a list of documents down into Node objects. Each node represents a distinct chunk of its parent document and inherits the parent’s attributes, such as its metadata.
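Here is a simplified picture of that idea, using plain dataclasses rather than the real LlamaIndex classes. `Doc`, `Node`, and `parse_nodes` are illustrative names, and the fixed-size character split is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Node:
    node_id: str
    text: str
    metadata: dict
    source_id: str                 # which document this chunk came from
    prev_id: Optional[str] = None  # neighbouring chunks, if any
    next_id: Optional[str] = None


def parse_nodes(doc, chunk_size=40):
    """Cut doc.text into fixed-size character chunks and link them together."""
    pieces = [doc.text[i:i + chunk_size] for i in range(0, len(doc.text), chunk_size)]
    nodes = [
        Node(
            node_id=f"{doc.doc_id}-{i}",
            text=piece,
            metadata=dict(doc.metadata),  # each node gets its own copy of the parent's metadata
            source_id=doc.doc_id,
        )
        for i, piece in enumerate(pieces)
    ]
    # Record previous/next links between adjacent chunks.
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.next_id = nxt.node_id
        nxt.prev_id = prev.node_id
    return nodes


doc = Doc("readme", "a" * 100, {"filename": "README.md", "extension": ".md"})
nodes = parse_nodes(doc)
print([(n.node_id, len(n.text), n.prev_id, n.next_id) for n in nodes])
```

The source, previous, and next links play the same role as the SOURCE, PREVIOUS, and NEXT relationships that show up in the parser outputs in this post.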

Node Parsers — File-Based

To streamline node parsing, several file-based parsers are available, each tailored to a content type such as JSON or Markdown. The simplest approach is to pair the FlatReader with the SimpleFileNodeParser, which automatically selects the appropriate parser for each content type. You can then chain the result with a text-based parser to keep chunk lengths under control.
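The "pick a parser per content type" idea can be pictured as a lookup keyed on the file extension. The sketch below is only an illustration of that dispatch pattern: the three parser functions are placeholders, and the mapping is an assumption rather than a copy of the library's internals.

```python
from pathlib import Path


# Placeholder parsers: each just tags the text with the parser that handled it.
def parse_markdown(text):
    return [("markdown", text)]


def parse_html(text):
    return [("html", text)]


def parse_json(text):
    return [("json", text)]


# Hypothetical extension-to-parser table.
PARSERS = {
    ".md": parse_markdown,
    ".html": parse_html,
    ".json": parse_json,
}


def parse_file(filename, text):
    """Route the text to a parser based on the file extension."""
    parser = PARSERS.get(Path(filename).suffix.lower())
    if parser is None:
        # Unknown type: hand the text back untouched.
        return [("unparsed", text)]
    return parser(text)


print(parse_file("README (1).md", "# Title"))
print(parse_file("notes.txt", "plain text"))
```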

NodeParser — SimpleFile

from llama_index.core.node_parser import SimpleFileNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path

md_docs = FlatReader().load_data(Path("/content/README (1).md"))

parser = SimpleFileNodeParser()
md_nodes = parser.get_nodes_from_documents(md_docs)
md_nodes[0]
# output

{
"id": "1bab03a5-2071-4ea7-ab31-2d6211aac74a",
"embedding": null,
"metadata": {
"Header 1": "Rasa Customer Service Bot",
"filename": "README (1).md",
"extension": ".md"
},
"excluded_embed_metadata_keys": [],
"excluded_llm_metadata_keys": [],
"relationships": {
"SOURCE": {
"node_id": "72319e2c-d36b-470d-9487-a59971ee19ca",
"node_type": "DOCUMENT",
"metadata": {
"filename": "README (1).md",
"extension": ".md"
},
"hash": "3d5f2703485b1f4903690fdb8a57085949b74cdfce9ac9729e483d0fec4831cf"
},
"NEXT": {
"node_id": "9edaadf9-425b-417f-b18a-bb1df7f73cfc",
"node_type": "TEXT",
"metadata": {
"Header 1": "Rasa Customer Service Bot",
"Header 2": "File Structure",
"filename": "README (1).md",
"extension": ".md"
},
"hash": "584ffef592cb306af05bcfcc93d66aa9fbd72078ccaa10de78032a4de3b490ff"
}
},
"text": "Rasa Customer Service Bot\n\nWelcome to the Rasa Customer Service Bot! This bot is designed to assist users from three different counties: Clay County, Utah, and West Hollywood. It provides customer service functionalities tailored to the needs and inquiries specific to each county.",
"start_char_idx": 2,
"end_char_idx": 283,
"text_template": "{metadata_str}\n\n{content}",
"metadata_template": "{key}: {value}",
"metadata_seperator": "\n"
}

NodeParser — HTML

This node parser utilizes Beautiful Soup to parse raw HTML content. By default, it parses a predefined set of HTML tags, but you have the option to customize this selection. The default tags include “p”, “h1” to “h6”, “li”, “b”, “i”, “u”, and “section”.
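To get a feel for tag-based extraction without installing anything, here is a small sketch that uses Python's built-in `html.parser` in place of Beautiful Soup. `TagTextExtractor` is a made-up class, and it handles only the simple case of non-nested target tags.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text inside a chosen set of tags as (tag, text) pairs."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.current = None   # the target tag we are currently inside, if any
        self.buffer = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags and self.current is None:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        # Text outside the selected tags is ignored.
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            text = "".join(self.buffer).strip()
            if text:
                self.chunks.append((self.current, text))
            self.current = None


html_doc = (
    "<html><body><h1>Welcome</h1><div>skip me</div>"
    "<p>First <b>bold</b> para.</p><p>Second.</p></body></html>"
)
extractor = TagTextExtractor(tags=["p", "h1"])
extractor.feed(html_doc)
print(extractor.chunks)
```

Text in tags outside the selection (the `div` here) is dropped, which is why choosing the tag list carefully matters for what ends up in your nodes.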

import requests
from llama_index.core import Document
from llama_index.core.node_parser import HTMLNodeParser

# URL of the website to fetch HTML from
url = "https://www.utoronto.ca/"

# Send a GET request to the URL
response = requests.get(url)
print(response)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Extract the HTML content from the response
    html_doc = response.text

    # Create a Document object with the HTML content
    document = Document(id_=url, text=html_doc)

    # Initialize the HTMLNodeParser with optional list of tags
    parser = HTMLNodeParser(tags=["p", "h1"])

    # Parse nodes from the HTML document
    nodes = parser.get_nodes_from_documents([document])

    # Print the parsed nodes
    print(nodes)
else:
    # Print an error message if the request was unsuccessful
    print("Failed to fetch HTML content:", response.status_code)
# output
<Response [200]>
[TextNode(id_='4316c3a3-8a91-4e5f-aa7f-555237742254', embedding=None, metadata={'tag': 'h1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='b84e192f8243da374c83690615fad3543fa108588df3cb448376a932209fece1'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='0856479e-14cf-45c2-903c-7774deee317c', node_type=<ObjectType.TEXT: '1'>, metadata={'tag': 'p'}, hash='6c1cbb06e3dc64a06ebd226cd7dd1960a8da0950feaf4e221bb997bc1cf46a26')}, text='Welcome to University of Toronto', start_char_idx=2784, end_char_idx=2816, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), TextNode(id_='0856479e-14cf-45c2-903c-7774deee317c', embedding=None, metadata={'tag': 'p'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://www.utoronto.ca/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='b84e192f8243da374c83690615fad3543fa108588df3cb448376a932209fece1'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='4316c3a3-8a91-4e5f-aa7f-555237742254', node_type=<ObjectType.TEXT: '1'>, metadata={'tag': 'h1'}, hash='e1e6af749b6a40a4055c80ca6b821ed841f1d20972e878ca1881e508e4446c26')}, text='Five things to look forward to at Entrepreneurship Week 2024\nYour guide to the U of T community\nThe University of Toronto is home to some of the world’s top faculty, students, alumni and staff. 
U of T Celebrates recognizes their award-winning accomplishments.\nDavid Dyzenhaus recognized with Gold Medal from Social Sciences and Humanities Research Council\nOur latest issue is all about feeling good: the only diet you really need to know about, the science behind cold plunges, a uniquely modern way to quit smoking, the “sex, drugs and rock ‘n’ roll” of university classes, how to become a better workplace leader, and more.\nResearch and Ideas\nYou’ve decided you want to eat better. Now what?\nThere are countless diets to choose from, but one rises above the rest, say U of T nutrition experts\n\nStatement of Land Acknowledgement\nWe wish to acknowledge this land on which the University of Toronto operates. For thousands of years it has been the traditional land of the Huron-Wendat, the Seneca, and the Mississaugas of the Credit. Today, this meeting place is still the home to many Indigenous people from across Turtle Island and we are grateful to have the opportunity to work on this land.\nRead about U of T’s Statement of Land Acknowledgement.\nUNIVERSITY OF TORONTO - SINCE 1827', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]

NodeParser — JSON

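The JSONNodeParser parses raw JSON. As the output below shows, it flattens nested objects into lines that pair each key path with its value.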
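The flattened "key path, then value" lines in this section's output can be reproduced with a short depth-first traversal. This is a sketch of the general idea, not the library's code; `flatten` is a hypothetical helper and the sample JSON is invented for the example.

```python
import json


def flatten(value, path=()):
    """Yield one 'key path + value' line per leaf, depth first."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, path + (str(key),))
    elif isinstance(value, list):
        # List positions are not recorded in the path in this sketch.
        for child in value:
            yield from flatten(child, path)
    else:
        yield " ".join(path + (str(value),))


raw = (
    '{"status": true, "data": {"house_list": ['
    '{"price": "749,000", "tags": ["Sold"]}, {"price": "634,900"}]}}'
)
lines = list(flatten(json.loads(raw)))
print("\n".join(lines))
```

Because every line carries its full key path, each one stays meaningful on its own even when the JSON is later split into smaller chunks.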

from llama_index.core.node_parser import JSONNodeParser

url = "https://housesigma.com/bkv2/api/search/address_v2/suggest"

payload = {"lang": "en_US", "province": "ON", "search_term": "Mississauga, ontario"}

headers = {
'Authorization': 'Bearer 20240127frk5hls1ba07nsb8idfdg577qa'
}

response = requests.post(url, headers=headers, data=payload)

if response.status_code == 200:
    # Create a Document object with the JSON response
    document = Document(id_=url, text=response.text)

    # Initialize the JSONNodeParser
    parser = JSONNodeParser()

    # Parse nodes from the JSON document
    nodes = parser.get_nodes_from_documents([document])

    # Print the parsed nodes
    print(nodes)
else:
    print("Failed to fetch JSON content:", response.status_code)
# output
[TextNode(id_='272b5be3-052c-42da-a77f-66a3b6229804', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://housesigma.com/bkv2/api/search/address_v2/suggest', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='d1865a5da7c9cc78321e84704aa8226ec9278cb3db9c9488d2a2e0a0f006ff9a')}, text='status True\ndata house_list id_listing owJKR7PNnP9YXeLP\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.75M\ndata house_list price 749,000\ndata house_list price_sold 690,000\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.71996\ndata house_list location lat 43.58322\ndata house_list addr 31 Ontario Crt\ndata house_list address 31 Ontario Crt\ndata house_list address_raw 31 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 4+1 Bedroom, 4 Bathroom, 2 Garage\ndata house_list date_preview 2015-03-16\ndata house_list ml_count_text Listed 1 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 31-ontario-crt\ndata house_list id_listing_history_v2 510QqypNo263LGlV\ndata house_list photo_url https://cache18.housesigma.com/Live/photos/FULL/1/115/W3129115.jpg?6cc85981\ndata house_list bedroom_string 4+1\ndata house_list washroom 4\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list id_listing kbEDRYarbz1y1VaB\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.63M\ndata house_list price 
634,900\ndata house_list price_sold 629,000\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72032\ndata house_list location lat 43.58365\ndata house_list addr 28 Ontario Crt\ndata house_list address 28 Ontario Crt\ndata house_list address_raw 28 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 4+2 Bedroom, 4 Bathroom, 2 Garage\ndata house_list date_preview 2009-08-17\ndata house_list ml_count_text Listed 1 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 28-ontario-crt\ndata house_list id_listing_history_v2 6VLaGyGalLA3W1ZD\ndata house_list photo_url https://cache08.housesigma.com/Live/photos/FULL/1/637/W1641637.jpg?ab92e198\ndata house_list bedroom_string 4+2\ndata house_list washroom 4\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list id_listing 0J6Em7brmxL7XBeq\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.45M\ndata house_list price 449,000\ndata house_list price_sold 410,000\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.71896\ndata house_list location lat 43.58448\ndata house_list addr 16 Ontario St W\ndata house_list address 16 Ontario St W\ndata house_list address_raw 16 Ontario St W\ndata house_list id_community 381\ndata house_list community_name 
Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 2+1 Bedroom, 2 Bathroom, 0 Garage\ndata house_list date_preview 2013-04-29\ndata house_list ml_count_text Listed 3 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 16-ontario-st-w\ndata house_list id_listing_history_v2 MB5bO3xqdmpYkWVP\ndata house_list photo_url https://cache19.housesigma.com/Live/photos/FULL/1/534/W2608534.jpg?c46e9414\ndata house_list bedroom_string 2+1\ndata house_list washroom 2\ndata house_list garage 0\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing weQp5yOpz1V7d0ZE\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.61M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.71846\ndata house_list location lat 43.58431\ndata house_list addr 11 Ontario St W\ndata house_list address_raw 11 Ontario St W\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 2 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 11-ontario-st-w\ndata 
house_list id_listing_history_v2 ZNkKJ3J1x5Z7d4V6\ndata house_list photo_url https://cache06.housesigma.com/Live/photos/FULL/1/749/W3429749.jpg?e67c455d\ndata house_list province_abbr ON\ndata house_list id_listing xmZRW7ngVM13EBO9\ndata house_list house_type_in_map D\ndata house_list price_abbr 1M\ndata house_list price 999,990\ndata house_list price_sold 945,000\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.71835\ndata house_list location lat 43.58445\ndata house_list addr 9 Ontario St W\ndata house_list address 9 Ontario St W\ndata house_list address_raw 9 Ontario St W\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 4+1 Bedroom, 4 Bathroom, 2 Garage\ndata house_list date_preview 2019-12-02\ndata house_list ml_count_text Listed 9 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 9-ontario-st-w\ndata house_list id_listing_history_v2 ZEXrx30pQXp3OklN\ndata house_list photo_url https://cache05.housesigma.com/Live/photos/FULL/1/760/W4548760.jpg?a2d2c969\ndata house_list bedroom_string 4+1\ndata house_list washroom 4\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing JKdOYrGzb9Zy54lW\ndata 
house_list house_type_in_map D\ndata house_list price_abbr 1.3M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72076\ndata house_list location lat 43.58335\ndata house_list addr 34 Ontario Crt\ndata house_list address_raw 34 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 3 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 34-ontario-crt\ndata house_list id_listing_history_v2 0J6Em7bLwDjYXBeq\ndata house_list photo_url https://cache17.housesigma.com/Live/photos/FULL/1/078/W5743078.jpg?17bcecbf\ndata house_list province_abbr ON\ndata house_list id_listing a6zqW7dmkmv35eZE\ndata house_list house_type_in_map D\ndata house_list price_abbr 1.4M\ndata house_list price 1,449,000\ndata house_list price_sold None\ndata house_list tags Terminated\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 0\ndata house_list list_status status TER\ndata house_list list_status text Terminated\ndata house_list location lon -79.721\ndata house_list location lat 43.58264\ndata house_list addr 45 Ontario Crt\ndata house_list address 45 Ontario Crt\ndata house_list address_raw 45 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status TER\ndata house_list rooms_text 5 Bedroom, 5 Bathroom, 2 
Garage\ndata house_list date_preview 2015-07-06\ndata house_list ml_count_text Listed 5 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 45-ontario-crt\ndata house_list id_listing_history_v2 knbq6y1d9QPYo9DA\ndata house_list photo_url https://cache06.housesigma.com/Live/photos/FULL/1/263/W3198263.jpg?d6e729f3\ndata house_list bedroom_string 5\ndata house_list washroom 5\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing kbEDRYa8zpQ31VaB\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.75M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72076\ndata house_list location lat 43.58252\ndata house_list addr 43 Ontario Crt\ndata house_list address_raw 43 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 1 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 43-ontario-crt\ndata house_list id_listing_history_v2 VgAaOyLvZ1N3GxMb\ndata house_list photo_url https://cache-e13.housesigma.com/Live/photos/FULL/1/176/W2359176.jpg?67229221\ndata house_list province_abbr ON\ndata house_list address 
(Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing BXeEn7XJREdYrPo8\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.47M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72122\ndata house_list location lat 43.58302\ndata house_list addr 40 Ontario Crt\ndata house_list address_raw 40 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 3 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 40-ontario-crt\ndata house_list id_listing_history_v2 jJKdOYr5ZVZ354lW\ndata house_list photo_url https://cache17.housesigma.com/Live/photos/FULL/1/174/W1629174.jpg?1fba5908\ndata house_list province_abbr ON\ndata house_list id_listing obqB176q16WyZajD\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.54M\ndata house_list price 539,900\ndata house_list price_sold 522,500\ndata house_list tags Sold\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72021\ndata house_list location lat 43.58301\ndata house_list addr 35 Ontario Crt\ndata 
house_list address 35 Ontario Crt\ndata house_list address_raw 35 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list rooms_text 4+1 Bedroom, 4 Bathroom, 2 Garage\ndata house_list date_preview 2010-10-21\ndata house_list ml_count_text Listed 2 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 35-ontario-crt\ndata house_list id_listing_history_v2 ZNkKJ3JaQrx7d4V6\ndata house_list photo_url https://cache09.housesigma.com/Live/photos/FULL/1/279/W1962279.jpg?5b470131\ndata house_list bedroom_string 4+1\ndata house_list washroom 4\ndata house_list garage 2\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing EeVbOYE14XGYx2P0\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.9M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.72117\ndata house_list location lat 43.58278\ndata house_list addr 42 Ontario Crt\ndata house_list address_raw 42 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 2 times\ndata 
house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 42-ontario-crt\ndata house_list id_listing_history_v2 JjAXw7Q4MKMyQOzg\ndata house_list photo_url https://cache16.housesigma.com/Live/photos/FULL/1/062/W2731062.jpg?193f3cce\ndata house_list province_abbr ON\ndata house_list id_listing mLzQ1y5dvvjYqdeK\ndata house_list house_type_in_map D\ndata house_list price_abbr 0.6M\ndata house_list price 599,000\ndata house_list price_sold None\ndata house_list tags Terminated\ndata house_list list_status public 1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 0\ndata house_list list_status status TER\ndata house_list list_status text Terminated\ndata house_list location lon -79.71883\ndata house_list location lat 43.58459\ndata house_list addr 12 Ontario St W\ndata house_list address 12 Ontario St W\ndata house_list address_raw 12 Ontario St W\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status TER\ndata house_list rooms_text 2 Bedroom, 3 Bathroom, 0 Garage\ndata house_list date_preview 2015-01-23\ndata house_list ml_count_text Listed 8 times\ndata house_list type_text Detached\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 12-ontario-st-w\ndata house_list id_listing_history_v2 b1DBW7R6K2Q7qlAp\ndata house_list photo_url https://cache-e14.housesigma.com/Live/photos/FULL/1/380/W3101380.jpg?e7dc9b16\ndata house_list bedroom_string 2\ndata house_list washroom 3\ndata house_list garage 0\ndata house_list province_abbr ON\ndata house_list address (Address display requires sign-in)\ndata house_list rooms_text x Bedroom, x Bathroom, x Garage\ndata house_list bedroom_string -\ndata house_list garage None\ndata house_list washroom None\ndata house_list 
date_preview (Sign-in required)\ndata house_list price xxx,xxx\ndata house_list price_sold xxx,xxx\ndata house_list id_listing oK8OgYBLo4z7JmG2\ndata house_list house_type_in_map V\ndata house_list price_abbr 0.35M\ndata house_list tags Sold\ndata house_list list_status public -1\ndata house_list list_status live 0\ndata house_list list_status s_r Sale\ndata house_list list_status sold 1\ndata house_list list_status status SLD\ndata house_list list_status text Sold\ndata house_list location lon -79.7208\ndata house_list location lat 43.58283\ndata house_list addr 46 Ontario Crt\ndata house_list address_raw 46 Ontario Crt\ndata house_list id_community 381\ndata house_list community_name Streetsville\ndata house_list id_municipality 10205\ndata house_list municipality_name Mississauga\ndata house_list s_r Sale\ndata house_list status SLD\ndata house_list ml_count_text Listed 1 times\ndata house_list type_text Vacant Land\ndata house_list seo_municipality mississauga-real-estate\ndata house_list seo_address 46-ontario-crt\ndata house_list id_listing_history_v2 02Zpj39nW9dYDrK8\ndata house_list photo_url https://cache18.housesigma.com/Live/photos/FULL/1/903/W2269903.jpg?5defed43\ndata house_list province_abbr ON\ndata place_list id owJKR7PNnP9YXeLP\ndata place_list text 31 Ontario Crt, Mississauga - Streetsville, ON\ndata place_list province_abbr ON\ndata place_list id_municipality 10205\ndata place_list seo_municipality mississauga-real-estate\ndata place_list lng -79.71996\ndata place_list lat 43.58322\ndata community_list municipality_name Red Rock Ontario\ndata community_list coordinate lon -88.2573624\ndata community_list coordinate lat 48.9421387\ndata community_list id_municipality 73087\ndata community_list community_name Red Rock Ontario\ndata community_list province_abbr ON\ndata community_list id_community 16245\ndata community_list seo_municipality red-rock-ontario-real-estate\ndata community_list municipality_name Thunder Bay, Ontario\ndata community_list 
coordinate lon -89.2625046\ndata community_list coordinate lat 48.3723145\ndata community_list id_municipality 81843\ndata community_list community_name Thunder Bay, Ontario\ndata community_list province_abbr ON\ndata community_list id_community 43959\ndata community_list seo_municipality thunder-bay-ontario-real-estate\ndata community_list municipality_name Longlac, Ontario\ndata community_list coordinate lon -86.5466461\ndata community_list coordinate lat 49.7723846\ndata community_list id_municipality 81674\ndata community_list community_name Longlac, Ontario\ndata community_list province_abbr ON\ndata community_list id_community 42851\ndata community_list seo_municipality longlac-ontario-real-estate\ndata community_list municipality_name Mississauga\ndata community_list coordinate lon -79.7529144\ndata community_list coordinate lat 43.5792542\ndata community_list id_municipality 10205\ndata community_list community_name Mississauga\ndata community_list province_abbr ON\ndata community_list id_community 15057\ndata community_list seo_municipality mississauga-real-estate\ndata community_list municipality_name Mississauga\ndata community_list coordinate lon -79.6569591\ndata community_list coordinate lat 43.53322\ndata community_list id_municipality 10205\ndata community_list community_name Sheridan\ndata community_list province_abbr ON\ndata community_list id_community 385\ndata community_list seo_municipality mississauga-real-estate\ndata community_list municipality_name Mississauga\ndata community_list coordinate lon -79.612725\ndata community_list coordinate lat 43.621956\ndata community_list id_municipality 10205\ndata community_list community_name Rathwood\ndata community_list province_abbr ON\ndata community_list id_community 404\ndata community_list seo_municipality mississauga-real-estate\ndata community_list municipality_name London\ndata community_list coordinate lon -81.24136\ndata community_list coordinate lat 42.9754\ndata community_list 
id_municipality 10176\ndata community_list community_name London Ontario\ndata community_list province_abbr ON\ndata community_list id_community 8866\ndata community_list seo_municipality london-real-estate\ndata community_list municipality_name London\ndata community_list coordinate lon -81.24184\ndata community_list coordinate lat 42.97614\ndata community_list id_municipality 10176\ndata community_list community_name London, Ontario\ndata community_list province_abbr ON\ndata community_list id_community 8868\ndata community_list seo_municipality london-real-estate\ndata community_list municipality_name Mississauga\ndata community_list coordinate lon -79.6213918\ndata community_list coordinate lat 43.5937139\ndata community_list id_municipality 10205\ndata community_list community_name Mississauga Valleys\ndata community_list province_abbr ON\ndata community_list id_community 398\ndata community_list seo_municipality mississauga-real-estate\ndata community_list municipality_name Elgin\ndata community_list coordinate lon -76.1232702\ndata community_list coordinate lat 44.6624265\ndata community_list id_municipality 12786\ndata community_list community_name Harlem Ontario Rideau Lakes\ndata community_list province_abbr ON\ndata community_list id_community 5889\ndata community_list seo_municipality elgin-real-estate\nerror code 0\nerror message \ndebug API v5.34.4\ndebug environment production\ndebug server_group ovh\ndebug server OVH01', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]

NodeParser — Markdown

The MarkdownNodeParser parses raw markdown text.

from llama_index.core.node_parser import MarkdownNodeParser

md_docs = FlatReader().load_data(Path("/content/README (1).md"))
parser = MarkdownNodeParser()

nodes = parser.get_nodes_from_documents(md_docs)
nodes[0]
# output
TextNode(id_='02165e38-ed7d-4157-ad6f-af8fea5a4b2c', embedding=None, metadata={'Header 1': 'Rasa Customer Service Bot', 'filename': 'README (1).md', 'extension': '.md'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='a7c31f47-74e7-4991-841f-864a681ce53f', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'README (1).md', 'extension': '.md'}, hash='3d5f2703485b1f4903690fdb8a57085949b74cdfce9ac9729e483d0fec4831cf'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='fbf91bad-d53d-43f7-b6e4-a2a5e07f6b34', node_type=<ObjectType.TEXT: '1'>, metadata={'Header 1': 'Rasa Customer Service Bot', 'Header 2': 'File Structure'}, hash='9250da70b13c424cdadb0346b78eabba52f085cc7e8d1856612fe7c6959381b9')}, text='Rasa Customer Service Bot\n\nWelcome to the Rasa Customer Service Bot! This bot is designed to assist users from three different counties: Clay County, Utah, and West Hollywood. It provides customer service functionalities tailored to the needs and inquiries specific to each county.', start_char_idx=2, end_char_idx=283, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
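
To make the header metadata in the output above less mysterious, here is a minimal pure-Python sketch of header-based splitting. It is an illustration under my own assumptions, not the library's actual implementation: the function name `split_markdown_by_headers` and the dictionary layout are hypothetical, and only `#`-style headers are handled.

```python
import re


def split_markdown_by_headers(text):
    """Split markdown into sections, recording the header path of each one."""
    sections = []
    header_path = {}  # e.g. {"Header 1": "Title", "Header 2": "Setup"}
    current_lines = []
    in_code_block = False

    def flush():
        # Emit the lines collected so far as one section, if non-empty.
        body = "\n".join(current_lines).strip()
        if body:
            sections.append({"text": body, "metadata": dict(header_path)})
        current_lines.clear()

    for line in text.split("\n"):
        # Ignore '#' lines that sit inside fenced code blocks.
        if line.lstrip().startswith("```"):
            in_code_block = not in_code_block
        match = re.match(r"^(#+)\s+(.*)", line)
        if match and not in_code_block:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or deeper level, then record this one.
            for key in [k for k in header_path if int(k.split()[-1]) >= level]:
                del header_path[key]
            header_path[f"Header {level}"] = match.group(2).strip()
        current_lines.append(line)
    flush()
    return sections


sample = "# Bot\n\nWelcome.\n\n## File Structure\n\nfiles here\n\n## Usage\n\nrun it"
for section in split_markdown_by_headers(sample):
    print(section["metadata"], "->", repr(section["text"]))
```

Each section carries the chain of headers it sits under, which is the same kind of information you see in the 'Header 1' and 'Header 2' metadata keys of the real nodes.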

These node parsers are each designed for a specific file type, such as HTML, Markdown, or JSON. You can use them directly for specialized tasks, or opt for the SimpleFileNodeParser, which picks the matching parser for each file based on its extension. Give it a try!

Text-Splitters

CodeSplitter

Splits raw code-text based on the language it is written in. Check the full list of supported languages here.

from llama_index.core.node_parser import CodeSplitter
documents = FlatReader().load_data(Path("/content/mnist_utils.py"))
splitter = CodeSplitter(
    language="python",
    chunk_lines=40,  # lines per chunk
    chunk_lines_overlap=15,  # lines overlap between chunks
    max_chars=1500,  # max chars per chunk
)
nodes = splitter.get_nodes_from_documents(documents)
nodes[0]
# output
TextNode(id_='85b26f19-39bb-45c1-99fa-477059f9f7f1', embedding=None, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='9b63f7a5-3c7e-46f8-a684-b865dbfc1976', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, hash='8808ab4e838d76ca2c412ba4c90720ec67c23dea1e45434670022106b5bc254e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='4b8b8c05-abc6-41ea-a9b5-0d2116cd1fe7', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='333ebfda7feae905b19eee9a88674348a2a2cae61a182cbbda347d17c2571d4b')}, text='# mnist_utils.py\n\nimport torch\nimport torch.nn as nn\nimport torch.optim as optim\nimport torchvision\nimport torchvision.transforms as transforms\nimport numpy as np\nimport matplotlib.pyplot as plt', start_char_idx=0, end_char_idx=194, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
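
The key idea behind splitting code is to cut along syntactic boundaries rather than at arbitrary character offsets. As far as I know, CodeSplitter relies on tree-sitter grammars to do this for many languages. The sketch below swaps in Python's built-in `ast` module to show the same idea for Python only; `split_code_by_definitions` is a hypothetical helper, and it only borrows the `max_chars` idea from the snippet above.

```python
import ast


def split_code_by_definitions(source, max_chars=1500):
    """Group top-level statements into chunks, aiming for max_chars per chunk.

    Simplifications: comments and blank lines between statements are dropped,
    decorator lines are not included, and a single statement longer than
    max_chars is kept whole as its own oversized chunk.
    """
    lines = source.splitlines()
    tree = ast.parse(source)
    chunks, current = [], ""
    for node in tree.body:
        # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
        segment = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        candidate = f"{current}\n{segment}" if current else segment
        if current and len(candidate) > max_chars:
            # Adding this statement would overflow: close the current chunk.
            chunks.append(current)
            current = segment
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


source = "import os\nimport sys\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
for chunk in split_code_by_definitions(source, max_chars=30):
    print(repr(chunk))
```

Because the cuts always fall between top-level statements, a function body never ends up split across two chunks unless it is larger than the budget on its own.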

SentenceSplitter

The SentenceSplitter attempts to split text while respecting the boundaries of sentences.

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
nodes[0]
# output
TextNode(id_='c4cb8b40-8130-4351-85d0-d570af4435da', embedding=None, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='9b63f7a5-3c7e-46f8-a684-b865dbfc1976', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, hash='8808ab4e838d76ca2c412ba4c90720ec67c23dea1e45434670022106b5bc254e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='48cee59c-50da-4328-939f-ff94ad38e8ef', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='b2ce1941dd9ca73caa1e6d656d9ea1086acf519ceba4c069c3d6a7633539b950')}, text='# mnist_utils.py\n\nimport torch\nimport torch.nn as nn\nimport torch.optim as optim\nimport torchvision\nimport torchvision.transforms as transforms\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n\ndef load_mnist(batch_size=64):\n """\n Load the MNIST dataset and create data loaders for training and testing.\n\n Parameters:\n - batch_size (int): The batch size for data loaders. Default is 64.\n\n Returns:\n - trainloader (torch.utils.data.DataLoader): Data loader for the training set.\n - testloader (torch.utils.data.DataLoader): Data loader for the test set.\n\n This function loads the MNIST dataset using torchvision.datasets.MNIST. It applies\n transforms to normalize the pixel values to the range [-1, 1]. It then creates data\n loaders for the training and test sets using torch.utils.data.DataLoader. 
The training\n data loader shuffles the data, while the test data loader does not shuffle the data.\n\n Example Usage:\n trainloader, testloader = load_mnist(batch_size=128)\n """\n transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])\n trainset = torchvision.datasets.MNIST(root=\'./data\', train=True, download=True, transform=transform)\n trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)\n testset = torchvision.datasets.MNIST(root=\'./data\', train=False, download=True, transform=transform)\n testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)\n return trainloader, testloader', start_char_idx=0, end_char_idx=1545, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
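
To show what "respecting sentence boundaries" means in practice, here is a small sketch that packs whole sentences into chunks and carries a little overlap forward. It rests on assumptions of mine: sentences are found with a simple regex, sizes are counted in whitespace-separated words rather than real tokens, and `split_by_sentences` is a hypothetical name. The real SentenceSplitter is more elaborate.

```python
import re


def split_by_sentences(text, chunk_size=50, chunk_overlap=10):
    """Pack whole sentences into chunks, aiming for chunk_size words each.

    A single sentence longer than chunk_size is kept whole, so a chunk can
    exceed the budget in that case.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget.
            overlap, overlap_len = [], 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if overlap_len + prev_len > chunk_overlap:
                    break
                overlap.insert(0, prev)
                overlap_len += prev_len
            # Drop the overlap if it would push the next chunk over budget.
            if overlap_len + n_words > chunk_size:
                overlap, overlap_len = [], 0
            current, current_len = overlap, overlap_len
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in split_by_sentences(text, chunk_size=6, chunk_overlap=3):
    print(chunk)
```

Notice that no chunk ever ends in the middle of a sentence, and that the last sentence of one chunk reappears at the start of the next when it fits in the overlap budget.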

SentenceWindowNodeParser

The SentenceWindowNodeParser works like the other node parsers, except that it splits every document into individual sentences. Each resulting node also stores the surrounding “window” of neighboring sentences in its metadata. Note that this metadata is not visible to the LLM or the embedding model. This is most useful for generating embeddings with a very specific scope. Combined with a MetadataReplacementPostProcessor, you can then replace the sentence with its surrounding context before sending the node to the LLM.

import nltk
from llama_index.core.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    # how many sentences on either side to capture
    window_size=3,
    # the metadata key that holds the window of surrounding sentences
    window_metadata_key="window",
    # the metadata key that holds the original sentence
    original_text_metadata_key="original_sentence",
)
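
The mechanics are easy to picture with a toy version. The sketch below is an assumption-laden illustration, not the library code: it uses a simple regex for sentence splitting, plain dictionaries instead of node objects, and two hypothetical helpers, `build_sentence_windows` and `replace_with_window`. The metadata key names mirror the ones in the snippet above.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Create one node per sentence, storing its neighbors in the metadata."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    nodes = []
    for i, sentence in enumerate(sentences):
        # Take up to window_size sentences on either side of sentence i.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,  # this is what would be embedded
                "metadata": {
                    "window": " ".join(sentences[start:end]),
                    "original_sentence": sentence,
                },
            }
        )
    return nodes


def replace_with_window(node, key="window"):
    """Swap the node text for its stored window before it goes to the LLM."""
    return {**node, "text": node["metadata"].get(key, node["text"])}


nodes = build_sentence_windows("A. B. C. D. E.", window_size=1)
print(nodes[2]["text"])
print(replace_with_window(nodes[2])["text"])
```

The split of responsibilities is the point here: retrieval matches against the single sentence, while the answer is generated from the wider window.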

In upcoming blogs, I’ll provide a comprehensive example along with a thorough comparison. To keep this post at a reasonable length and easy to follow, we won’t go too deep into the details here. The aim is to give a clear idea of the topic without overwhelming readers with technicalities.

SemanticSplitterNodeParser

“Semantic chunking” takes a different approach: instead of segmenting text with a fixed chunk size, the semantic splitter adaptively picks breakpoints between sentences based on embedding similarity. The goal is for each “chunk” to contain sentences that are semantically related to each other.

Considerations:

  • The regex is mainly optimized for English sentences.
  • Adjustments to the breakpoint percentile threshold may be necessary.

We will get into the details of it soon (I will update the link over here).

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
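
The percentile threshold is easier to tune once you see what it is applied to. Here is a rough sketch of the breakpoint logic: compute the distance between each pair of adjacent sentences, then cut wherever the distance exceeds the chosen percentile of all distances. Everything here is a stand-in of mine: `toy_embed` is a bag-of-words counter instead of a real embedding model, there is no sentence buffer, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, pct):
    """Linearly interpolated percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, breakpoint_percentile_threshold=95):
    """Cut between adjacent sentences whose distance exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [toy_embed(s) for s in sentences]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, breakpoint_percentile_threshold)
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(" ".join(sentences[start : i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


sentences = [
    "Cats purr softly.",
    "Cats purr loudly.",
    "Stocks fell sharply.",
    "Stocks fell again.",
]
print(semantic_chunks(sentences, breakpoint_percentile_threshold=95))
```

Lowering the percentile produces more, smaller chunks, while raising it produces fewer, larger ones, which is why the threshold usually needs adjusting per dataset.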

TokenTextSplitter

The TokenTextSplitter attempts to split to a consistent chunk size according to raw token counts.

from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=1024,
    chunk_overlap=20,
    separator=" ",
)
nodes = splitter.get_nodes_from_documents(documents)
nodes[0]
# output
TextNode(id_='6747586b-52bd-489e-afb4-8f4fa21f11c3', embedding=None, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='9b63f7a5-3c7e-46f8-a684-b865dbfc1976', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'mnist_utils.py', 'extension': '.py'}, hash='8808ab4e838d76ca2c412ba4c90720ec67c23dea1e45434670022106b5bc254e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='81ee01e6-e91a-4f87-a56e-958d2618ef0c', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='c518236a47be8b69aa783b7bc77b4f59abee3d8f8eac35f5016979931366df23')}, text='# mnist_utils.py\n\nimport torch\nimport torch.nn as nn\nimport torch.optim as optim\nimport torchvision\nimport torchvision.transforms as transforms\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n\ndef load_mnist(batch_size=64):\n    """\n    Load the MNIST dataset and create data loaders for training and testing.\n\n    Parameters:\n    - batch_size (int): The batch size for data loaders. Default is 64.\n\n    Returns:\n    - trainloader (torch.utils.data.DataLoader): Data loader for the training set.\n    - testloader (torch.utils.data.DataLoader): Data loader for the test set.\n\n    This function loads the MNIST dataset using torchvision.datasets.MNIST. It applies\n    transforms to normalize the pixel values to the range [-1, 1]. It then creates data\n    loaders for the training and test sets using torch.utils.data.DataLoader. 
The training\n    data loader shuffles the data, while the test data loader does not shuffle the data.\n\n    Example Usage:\n    trainloader, testloader = load_mnist(batch_size=128)\n    """\n    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])\n    trainset = torchvision.datasets.MNIST(root=\'./data\', train=True, download=True, transform=transform)\n    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)\n    testset = torchvision.datasets.MNIST(root=\'./data\', train=False, download=True, transform=transform)\n    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)\n    return trainloader, testloader\n\n\ndef create_simple_nn():\n    """\n    Create a simple neural network model for image classification.\n\n    Returns:\n    - model (torch.nn.Module): The neural network model.\n\n    This function defines a simple neural network model for image classification tasks.\n    The model consists of a sequence of layers:\n    - Flatten layer: Reshapes the input image tensor into a 1D tensor.\n    - Fully connected (Linear) layer: Converts the flattened input into a hidden representation.\n      The input size is 28*28 (MNIST image size) and the output size is 128.\n    - ReLU activation function: Applies element-wise ReLU activation to introduce non-linearity.\n    - Fully connected (Linear) layer: Converts the hidden representation into class probabilities.\n      The input size is 128 (output of the previous layer) and the output size is 10 (number of classes).\n\n    Example Usage:\n    model = create_simple_nn()\n    """\n    model = nn.Sequential(\n        nn.Flatten(),\n        nn.Linear(28*28, 128),\n        nn.ReLU(),\n        nn.Linear(128, 10)\n    )\n    return model\n\ndef train_model(model, trainloader, optimizer, criterion, epochs=10):\n    """\n    Train a neural network model using the provided data loader, optimizer, and loss 
function.\n\n    Parameters:\n    - model (torch.nn.Module): The neural network model to train.\n    - trainloader (torch.utils.data.DataLoader): Data loader for the training dataset.\n    - optimizer (torch.optim.Optimizer): Optimizer to update the model parameters.\n    - criterion (torch.nn.Module): Loss function to compute the training loss.\n    - epochs (int): Number of epochs for training. Default is 10.\n\n    Returns:\n    - history (dict): Dictionary containing training history (loss and accuracy).\n\n    This function trains a neural network model using the provided data loader, optimizer, and\n    loss function. It iterates over the specified number of epochs and updates the model parameters\n    based on the training data. At each epoch, it computes the average loss and accuracy, and stores\n    them in a dictionary called \'history\'. The \'history\' dictionary contains two lists: \'loss\' and\n    \'accuracy\', which track the training loss and accuracy over epochs, respectively.\n\n    Example Usage:\n    history = train_model(model, trainloader, optimizer, criterion, epochs=10)\n    """\n    history = {\'loss\': [], \'accuracy\': []}\n    for epoch in range(epochs):\n        running_loss', start_char_idx=0, end_char_idx=3962, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
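
Notice how the chunk above stops mid-statement at 'running_loss': a token-based splitter only cares about the count, not about sentence or code structure. A sliding window over tokens captures the idea. In this sketch I treat whitespace-separated words as tokens, which is an assumption made for simplicity (a real tokenizer counts differently), and `split_by_tokens` is a hypothetical helper.

```python
def split_by_tokens(text, chunk_size=1024, chunk_overlap=20, separator=" "):
    """Slide a fixed-size window over tokens, stepping by size minus overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Assumption: a "token" is any non-empty piece between separators.
    tokens = [t for t in text.split(separator) if t]
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(separator.join(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


for chunk in split_by_tokens("a b c d e f g h i j", chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Every chunk except possibly the last has exactly `chunk_size` tokens, and consecutive chunks share `chunk_overlap` tokens.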

Relation-Based Node Parsers

HierarchicalNodeParser

This node parser chunks nodes into a hierarchy, producing multiple levels of chunks at different sizes from a single input. Each node includes a reference to its parent node.

When used alongside the AutoMergingRetriever, this facilitates automatic replacement of retrieved nodes with their parent nodes when a significant portion of children are retrieved. This mechanism enhances the context provided to the LLM for synthesizing responses. This concept warrants a dedicated blog post to craft a detailed example and explain each step thoroughly. I’ll delve into it further in upcoming blogs to provide a comprehensive understanding.

from llama_index.core.node_parser import HierarchicalNodeParser

node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
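
Until that dedicated post is ready, here is a toy sketch of the two ideas: building parent and child chunks at several sizes, and swapping retrieved children for their parent once enough of them are hit. The assumptions are mine: sizes are counted in words, nodes are plain dictionaries, the merge runs a single pass, and `build_hierarchy`, `merge_into_parents`, and the `ratio` parameter are hypothetical names rather than library API.

```python
def build_hierarchy(text, chunk_sizes=(2048, 512, 128)):
    """Recursively chunk text at each size, linking every node to its parent."""
    nodes = []

    def split(segment, level, parent_id):
        size = chunk_sizes[level]
        words = segment.split()
        for start in range(0, len(words), size):
            piece = " ".join(words[start : start + size])
            node_id = len(nodes)
            nodes.append(
                {"id": node_id, "level": level, "parent_id": parent_id, "text": piece}
            )
            # Split this piece again with the next, smaller chunk size.
            if level + 1 < len(chunk_sizes):
                split(piece, level + 1, node_id)

    split(text, 0, None)
    return nodes


def merge_into_parents(retrieved_ids, nodes, ratio=0.5):
    """Replace retrieved children with their parent when more than `ratio`
    of that parent's children were retrieved (single pass, no recursion)."""
    children = {}
    for node in nodes:
        children.setdefault(node["parent_id"], []).append(node["id"])
    result = set(retrieved_ids)
    for parent_id, child_ids in children.items():
        if parent_id is None:
            continue
        hits = result.intersection(child_ids)
        if len(hits) / len(child_ids) > ratio:
            result -= hits
            result.add(parent_id)
    return sorted(result)


text = " ".join(f"w{i}" for i in range(8))
nodes = build_hierarchy(text, chunk_sizes=(8, 4, 2))
print(merge_into_parents([2, 3, 5], nodes))
```

In the example, both children of one mid-level chunk are retrieved, so they are replaced by that parent, while the lone child of the other mid-level chunk is left as it is.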

Codebase for this blogpost — https://shorturl.at/jkswV

In conclusion, we’ve explored various types of node parsers in part 1 and will delve into specific ones in upcoming segments for clearer understanding and implementation details. Pinecone’s blog explains diverse chunking methods using LangChain, offering valuable insights for those learning chunking strategies. While it serves educational purposes well, in industrial settings llama-index is the preferred choice.

Stay tuned!

BavalpreetSinghh

Data Scientist at Tatras Data | Mentor @ Humber College | Consultant @ SL2