LlamaIndex: Enhancing Context with Metadata Replacement and Sentence Window Node Parser

BavalpreetSinghh
11 min read · Mar 4, 2024


In our previous blog post, we explored the various node parsers available in LlamaIndex. In this post, we will delve into one specific parser in greater detail: the SentenceWindowNodeParser. Additionally, we will explore strategies to maximize its utility.

Photo by Anne Nygård on Unsplash

SentenceWindowNodeParser

This component is responsible for parsing documents into individual sentences. It creates a node for each sentence, and each node carries a “window” containing the sentences surrounding it. This means that instead of just having one isolated sentence, you have a context window of sentences around it. During the retrieval process, after sentences are retrieved but before they are passed to the Language Model (LLM), the MetadataReplacementPostProcessor replaces each single sentence with its corresponding window of surrounding sentences. So, instead of just analyzing one sentence in isolation, the LLM gets broader context from the surrounding sentences within the window.

When dealing with large documents or indexes, having this context window approach is advantageous. It allows for retrieving more detailed information and provides better context for understanding the meaning of individual sentences within the larger document or corpus. The default setting for the window size is 5 sentences on either side of the original sentence. So, if you have a target sentence, the window will include 5 sentences before it and 5 sentences after it.
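
To make the window mechanics concrete, here is a minimal sketch (runnable once the packages from the Setup section below are installed) that parses a toy three-sentence document with an illustrative window_size of 1. The toy text is made up, but window and original_text are the parser’s default metadata keys.

from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

# Toy document: three sentences, window of 1 sentence on each side
toy_doc = Document(text="Pi boards the ship. The ship sinks. Pi survives at sea.")
toy_parser = SentenceWindowNodeParser.from_defaults(window_size=1)

for node in toy_parser.get_nodes_from_documents([toy_doc]):
    print("original_text:", node.metadata["original_text"])
    print("window       :", node.metadata["window"])
    print("---")
# The middle node's window should contain all three sentences,
# while its original_text is only "The ship sinks."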

In this setup, chunk size settings are not utilized. Instead, the focus is on following the window settings. This suggests that the window approach is prioritized over chunking the document into smaller parts. Overall, this process enhances the retrieval and analysis of information from large documents by providing a more comprehensive context for individual sentences. With that in mind, let’s proceed to the implementation. During this process, we’ll also delve into the rationale behind each step and how it contributes to our overall benefit.

Overview of process

Setup

!pip install llama-index

%pip install llama-index-embeddings-openai
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-openai

%load_ext autoreload
%autoreload 2

These commands are used in Python environments like Jupyter notebooks to enable automatic reloading of modules. %load_ext autoreload loads the autoreload extension, while %autoreload 2 sets up automatic reloading of all modules before code execution, regardless of previous imports. This feature is useful for quickly incorporating changes made to modules during development without needing to restart the kernel or manually reload modules.

import os
import openai
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

Define your OpenAI API key

os.environ["OPENAI_API_KEY"] = "sk-..."

We will use a window size of 3 (by default it is 5). For a better understanding of the parameters, you can refer to this link.

# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# base node parser is a sentence splitter
text_splitter = SentenceSplitter()

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2", max_length=512
)

Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter

The temperature parameter for text generation is set to 0.1, increasing output determinism. The embed_model variable employs the HuggingFaceEmbedding framework, utilizing the "sentence-transformers/all-mpnet-base-v2" model known for its proficiency in text embeddings. With a maximum input sequence length of 512, the embedding model ensures efficient processing of text inputs.
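
As a quick, optional sanity check (a minimal sketch; the sample sentence is purely illustrative), you can embed a single sentence and confirm the 768-dimensional output of all-mpnet-base-v2.

# Sanity check: embed one sentence and inspect the vector dimensionality
vec = embed_model.get_text_embedding("I survived 227 days at sea.")
print(len(vec))  # all-mpnet-base-v2 produces 768-dimensional embeddings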

Data loading and Index building

Dataset

For this tutorial we are using the PDF version of the book Life of Pi by Yann Martel. The link to the data source is shared in the notebook as well.

So here, we build an index using the full PDF of the book.

documents = SimpleDirectoryReader(
    input_files=["/content/Life-of-Pi-by-Yann-Martel-pdfdrive.com.co.pdf"]
).load_data()

Node extraction

We gather a group of nodes that we want to keep in the VectorIndex. This includes nodes processed by both the sentence window parser and the standard parser.

nodes = node_parser.get_nodes_from_documents(documents)
base_nodes = text_splitter.get_nodes_from_documents(documents)
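
Before building the indexes, it is worth peeking at what each parser produced. Here is a small sketch (node index 10 is arbitrary) that prints a sentence node’s original text alongside the window stored in its metadata.

# How many nodes did each parser produce?
print(len(nodes), len(base_nodes))

# Inspect an arbitrary sentence-window node (index 10 chosen for illustration)
sample = nodes[10]
print("Original sentence:", sample.metadata["original_text"])
print("Window:", sample.metadata["window"])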

Index building

We will be building both the sentence index and the “base” index (with default chunk sizes). Note that creating sentence_index will take some time, so be patient.

sentence_index = VectorStoreIndex(nodes)
base_index = VectorStoreIndex(base_nodes)
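
Because sentence_index takes a while to build, you may want to persist both indexes to disk and reload them in later sessions instead of re-embedding the whole book. Below is a hedged sketch using LlamaIndex’s standard storage utilities (the directory names are arbitrary).

from llama_index.core import StorageContext, load_index_from_storage

# Persist both indexes to disk (directory names are arbitrary)
sentence_index.storage_context.persist(persist_dir="./sentence_index_storage")
base_index.storage_context.persist(persist_dir="./base_index_storage")

# In a later session, reload them without re-embedding the whole book
sentence_index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./sentence_index_storage")
)
base_index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./base_index_storage")
)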

Querying

Querying With MetadataReplacementPostProcessor

Here, we use the MetadataReplacementPostProcessor to substitute each sentence within every node with its surrounding context.

query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
window_response = query_engine.query(
    "What is the significance of the number 227 in 'Life of Pi'?"
)
print(window_response)
#output
The significance of the number 227 in 'Life of Pi' is that it represents the duration of the protagonist's trial, lasting over seven months, during which he survived.

Additionally, we have the ability to examine the original sentence retrieved for each node, along with the surrounding window of sentences that were forwarded to the LLM.

window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]

print(f"Window: {window}")
print("------------------")
print(f"Original Sentence: {sentence}")
#output
Window: Owen Chase, whose
account of the sinking of the whaling ship Essex by a whale inspired Herman Melville, survived eighty-three
days at sea with two mates, interrupted by a one-week stay on an inhospitable island. The Bailey family
survived 118 days. I have heard of a Korean merchant sailor named Poon, I believe, who survived the Pacific
for 173 days in the 1950s.
I survived 227 days. That's how long my trial lasted, over seven months.
I kept myself busy. That was one key to my survival.
------------------
Original Sentence: I survived 227 days.

Comparing it with a normal VectorStoreIndex

We will use base_index here and create a query engine from it with a similarity top-k value of 2, and then compare the results we receive from the base index with those from sentence_index (obtained in the cell above).

query_engine = base_index.as_query_engine(similarity_top_k=2)
vector_response = query_engine.query(
    "What is the significance of the number 227 in 'Life of Pi'?"
)
print(vector_response)
#output
The number 227 in 'Life of Pi' represents the total number of days Pi spent at sea on the lifeboat with Richard Parker.

Comparing Answer A (from sentence_index) with Answer B (from base_index): Answer A emphasizes the duration of Pi's trial and his survival over seven months, while Answer B states it more narrowly as the total number of days Pi spent at sea on the lifeboat with Richard Parker. Both interpretations capture distinct facets of the number 227's significance in the novel's context. Thus, if we require a broader perspective, Answer A's approach should be considered.

Analysis

The combination of SentenceWindowNodeParser and MetadataReplacementPostProcessor emerges as the preferred choice due to its ability to capture more nuanced information. By using this combination, we retrieve not only individual sentences but also their surrounding context, providing a richer understanding of the text. This allows for a more comprehensive analysis, especially in larger documents where contextual information is crucial. Additionally, embeddings generated at the sentence level offer a higher level of granularity, capturing finer details within the text. The SentenceWindowNodeParser and MetadataReplacementPostProcessor combo therefore stands out for its capacity to preserve context and capture detailed information effectively.
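
To see the replacement step in isolation, without involving the LLM, you can run the retriever and the postprocessor manually. This is a minimal sketch, assuming the sentence_index built earlier.

from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Retrieve raw sentence nodes, then apply the metadata replacement by hand
retriever = sentence_index.as_retriever(similarity_top_k=2)
retrieved = retriever.retrieve(
    "What is the significance of the number 227 in 'Life of Pi'?"
)
print("Embedded/retrieved sentence:", retrieved[0].node.metadata["original_text"])

postprocessor = MetadataReplacementPostProcessor(target_metadata_key="window")
replaced = postprocessor.postprocess_nodes(retrieved)
print("Text the LLM actually sees:", replaced[0].node.get_content())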

We can also compare the retrieved chunks for each index!

for source_node in window_response.source_nodes:
    print(source_node.node.metadata["original_text"])
    print("--------")
#output
I survived 227 days.
--------
I believe the answer lies in something I
mentioned earlier, that measure of madness that moves life in strange but saving ways.
--------

Here, we can see that the sentence window index easily retrieved two nodes relevant to the number 227 (the first contains it verbatim). Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!
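
You can also check that the window text is kept out of the embedding by inspecting which metadata keys the parser marks as excluded; a small sketch, assuming the parser’s default exclusion behaviour.

# The parser marks the window (and original_text) metadata as excluded from
# embeddings, so only the bare sentence is embedded
print(nodes[0].excluded_embed_metadata_keys)
# expected with the parser's defaults: ['window', 'original_text']
print(nodes[0].get_content())  # the single sentence that gets embedded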

Now, let’s try to dissect why the naive vector index behaved that way.

for node in vector_response.source_nodes:
    print("227 mentioned?", "227" in node.node.text)
    print("--------")
#output
227 mentioned? False
--------
227 mentioned? False
--------
227 mentioned? False
--------
227 mentioned? False
--------
227 mentioned? False
--------

We knew that 227 was present in the document, but here it might be buried in the middle chunk. With LLMs, it is often observed that text in the middle of the retrieved context is ignored or less useful. A recent paper, “Lost in the Middle”, discusses this here.

So, what did this text actually look like? You can inspect it using the code below.

print(vector_response.source_nodes[2].node.text)

Evaluation

To conduct a thorough evaluation of the sentence window retriever against the base retriever, you can establish or load an evaluation benchmark dataset and run various evaluations on it. Be aware that this process may incur significant costs, particularly when using GPT-4, so exercise caution and adjust the sample size to fit your budget. I will show the noticeable improvement you get when you increase the sample size and other parameters; the rest you can modify according to your use case.

from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset

from llama_index.llms.openai import OpenAI
import nest_asyncio
import random

nest_asyncio.apply()
len(base_nodes)
#output
216
num_nodes_eval = 10
# there are 216 nodes total; sample from the first 25 to generate questions
sample_eval_nodes = random.sample(base_nodes[:25], num_nodes_eval)
# NOTE: run this if the dataset isn't already saved
# generate questions from the largest chunks (1024)
dataset_generator = DatasetGenerator(
    sample_eval_nodes,
    llm=OpenAI(model="gpt-4"),
    show_progress=True,
    num_questions_per_chunk=2,
)

eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()
eval_dataset.save_json("lifeofpie_dataset.json")
eval_dataset = QueryResponseDataset.from_json("lifeofpie_dataset.json")

This code segment generates an evaluation dataset of questions (with reference answers) from the text data, which we will use to assess retriever performance. Here’s a breakdown of what each part of the code does:

  1. num_nodes_eval = 10: Specifies the number of nodes (segments of text) to evaluate. In this case, it's set to 10.
  2. sample_eval_nodes = random.sample(base_nodes[:25], num_nodes_eval): Selects a random sample of nodes from the first 25 nodes in the base_nodes list. These nodes will be used to generate questions.
  3. dataset_generator = DatasetGenerator(...): Initializes a DatasetGenerator object responsible for generating questions from the selected nodes. It takes parameters such as the sample nodes, the language model (LLM) to use (in this case, GPT-4), and the number of questions to generate per chunk.
  4. eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(): Calls the agenerate_dataset_from_nodes() method of the DatasetGenerator object to generate the evaluation dataset asynchronously. This dataset will contain questions generated from the selected nodes.
  5. eval_dataset.save_json("lifeofpie_dataset.json"): Saves the generated evaluation dataset to a JSON file named "lifeofpie_dataset.json".
  6. eval_dataset = QueryResponseDataset.from_json("lifeofpie_dataset.json"): Loads the evaluation dataset from the saved JSON file into a QueryResponseDataset object for further analysis.

Results Comparison

# Import necessary modules
import asyncio
import nest_asyncio
from collections import defaultdict
import pandas as pd

# Apply nest_asyncio to allow nested asyncio event loops
nest_asyncio.apply()

# Import evaluation classes and utilities
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner

# Initialize and configure evaluators for assessing language model performance
# CorrectnessEvaluator evaluates the correctness of generated responses
evaluator_c = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))
# SemanticSimilarityEvaluator assesses semantic similarity between responses and references
evaluator_s = SemanticSimilarityEvaluator()
# RelevancyEvaluator evaluates relevancy of generated responses to input questions
evaluator_r = RelevancyEvaluator(llm=OpenAI(model="gpt-4"))
# FaithfulnessEvaluator assesses faithfulness of generated responses to input context
evaluator_f = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))
# PairwiseComparisonEvaluator can be uncommented if needed

# Set maximum number of samples for evaluation
max_samples = 6

# Extract questions and reference response strings from the evaluation dataset
eval_qs = eval_dataset.questions
ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]

# Set up query engines for base index and sentence window index
# Base query engine
base_query_engine = base_index.as_query_engine(similarity_top_k=2)
# Sentence window query engine
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # The target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
import numpy as np

# Retrieve responses from the base query engine for evaluation
base_pred_responses = get_responses(
    eval_qs[:max_samples],  # Subset of evaluation questions
    base_query_engine,  # Query engine for the base index
    show_progress=True,  # Show progress while retrieving responses
)

# Retrieve responses from the sentence window query engine for evaluation
pred_responses = get_responses(
    eval_qs[:max_samples],  # Subset of evaluation questions
    query_engine,  # Query engine for the sentence window index
    show_progress=True,  # Show progress while retrieving responses
)

# Convert responses to strings for easier comparison
pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]
# Define a dictionary of evaluators for different aspects of model performance
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "relevancy": evaluator_r,
    "semantic_similarity": evaluator_s,
}

# Initialize a BatchEvalRunner to evaluate responses in parallel
batch_runner = BatchEvalRunner(
    evaluator_dict,  # Dictionary of evaluators
    workers=2,  # Number of worker processes for parallel evaluation
    show_progress=True,  # Show progress during evaluation
)

# Evaluate responses from the sentence window query engine
eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],  # Subset of evaluation questions
    responses=pred_responses[:max_samples],  # Responses from the sentence window query engine
    reference=ref_response_strs[:max_samples],  # Reference response strings
)

# Evaluate responses from the base query engine
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],  # Subset of evaluation questions
    responses=base_pred_responses[:max_samples],  # Responses from the base query engine
    reference=ref_response_strs[:max_samples],  # Reference response strings
)

This code initializes a BatchEvalRunner to evaluate responses in parallel using multiple evaluators. It then evaluates responses from both the sentence window query engine and the base query engine, comparing them to reference responses.
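
If you want more than the aggregate numbers, you can inspect the per-question results directly; eval_results is a dictionary keyed by evaluator name, and each entry is a list of EvaluationResult objects. A small illustrative sketch:

# Inspect per-question correctness scores and feedback (illustrative)
for question, result in zip(eval_qs[:max_samples], eval_results["correctness"]):
    print(question)
    print("score:", result.score)
    print("feedback:", result.feedback)
    print("---")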

The code below evaluates the performance of the sentence window retriever and base retriever under different conditions and generates DataFrames to display the evaluation results.

# Evaluate performance with num_nodes_eval = 10, base_nodes[:25] (25 nodes), max_samples = 6
# Generate a DataFrame to display evaluation results
results_df = get_results_df(
    [eval_results, base_eval_results],  # List of evaluation results
    ["Sentence Window Retriever", "Base Retriever"],  # Labels for different retrievers
    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],  # Metrics to include in the DataFrame
)
# Display the DataFrame
display(results_df)
Output depicting improved results of SWR

Let’s also have a look at the output with the initial parameters.

# Evaluate performance with num_nodes_eval = 10, base_nodes[:25] (25 nodes), max_samples = 5
# Generate a DataFrame to display evaluation results
results_df = get_results_df(
    [eval_results, base_eval_results],  # List of evaluation results
    ["Sentence Window Retriever", "Base Retriever"],  # Labels for different retrievers
    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],  # Metrics to include in the DataFrame
)
# Display the DataFrame
display(results_df)
Initial output

Codebase — https://github.com/Bavalpreet/MediumBlogs/blob/main/MetadataReplacementDemo.ipynb

In conclusion, our evaluation of the Sentence Window Retriever and Base Retriever sheds light on the effectiveness of these retrieval methods in generating responses from text data. Despite limitations in sending requests to GPT-4 imposed by our OpenAI plan, we observed notable improvements in performance with increased max_samples, base_nodes, and num_nodes_eval.

Through our evaluation, it became evident that the Sentence Window Retriever outperforms the Base Retriever, especially when considering a larger number of base nodes and evaluation nodes. This finding underscores the importance of context in text retrieval, as the Sentence Window Retriever leverages surrounding context to provide more accurate and relevant responses.

Therefore, for tasks requiring nuanced understanding and retrieval of information from large documents, such as extracting fine-grained details or capturing broader aspects of the text, the Sentence Window Retriever proves to be the superior choice.
