Evaluating Chain of Density Method for Better LLM Summarization

Deepak Jangra
Yugen.ai Technology Blog
11 min read · Sep 13, 2024

By Deepak Jangra, Akshay Singh

Context

In the first blog of the summarization series, we discussed Retrieval Augmented Generation (RAG) architecture as a method for retrieving and summarizing the relevant information from a lengthy document. We concluded that summaries generated by LLMs have two limitations — non-reproducibility and entity sparsity. Here is a quick overview of these limitations:

  • Non-Reproducibility: When an LLM summarizes a document, it generates a slightly different summary each time.
  • Entity-sparsity: This occurs when an LLM generates an overly simple summary with few relevant entities and facts.

In the second blog of this series, we discussed the Majority Voting technique to address non-reproducibility. This technique generates multiple raw responses and merges them using an LLM to produce multiple merged responses. The most reproducible merged response is then chosen as the final output. We observed that merging the raw responses significantly reduced non-reproducibility; however, the merged responses were not exhaustive and missed some entities from the raw responses.

Objective

In this blog, we’ll explore a prompt engineering technique, Chain of Density (Adams et al., 2023), that targets the second limitation of LLM-based summarization (i.e., entity-sparse summaries), with the following two objectives:

  • To evaluate the effectiveness of the Chain of Density (CoD) technique in generating more entity-dense summaries.
  • As a secondary objective, we’ll analyze whether the summaries generated by the CoD technique are reproducible.

Solution

Before deep diving into the solution pipeline, here is a quick overview:

  • The LLM first generates an initial document summary, which is typically simple and entity-sparse.
  • Then, through iterative improvement, the LLM generates an improved summary of the same length by integrating relevant entities that were missing in the previous summary.
  • The LLM is instructed to maintain the same length for the new summary while increasing the number of entities through fusion and compression of the content, rather than omitting meaningful content from previous summaries.
  • This improvement is repeated multiple times, and the last improved summary is returned as the final output.

Iterative Improvement

To summarize a document, the CoD technique implements an iterative improvement of the initial summary. Here is how the CoD technique works:

  • Each iteration of the CoD technique involves multiple Internal Steps.
  • First, the LLM generates an initial summary of the provided document, which is typically simple and sparse. This initial summary is referred to as Internal Step Response — 1 (ISR-1).
  • To create a new summary, the LLM is instructed to identify the entities that are relevant to the overall theme of the document but were missing from the previous (most recent) summary.
  • After the LLM identifies the missing entities, it is asked to integrate those entities into the new and improved summary, thereby enriching the previous summary.
  • The LLM is provided with the immediately previous ISR and the original document to generate each subsequent ISR.
  • This process repeats until all internal steps are executed.

Evaluation Metrics

We use two different sets of metrics: one to measure the information richness, and another to measure the reproducibility of final responses.

Measuring Information Richness

The following metrics are used to measure the effectiveness of CoD’s iterative improvement:

  • Number of Entities
  • Number of Tokens
  • Entity-Token Density

The number of entities is computed for each response using spaCy (model — en_core_web_sm). All entities except those labeled as ‘CARDINAL’ are considered.
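
As a minimal sketch of this computation (assuming spaCy and the en_core_web_sm model are installed; the helper name count_entities is ours), the entity count per response can be obtained as follows:

import spacy

# Load the small English pipeline used for named entity recognition
nlp = spacy.load("en_core_web_sm")

def count_entities(text: str) -> int:
    """Count named entities, excluding those labeled 'CARDINAL'."""
    doc = nlp(text)
    return sum(1 for ent in doc.ents if ent.label_ != "CARDINAL")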

The number of tokens and entity-token density is calculated as follows:

Number of Tokens = Number of Characters / 4

Entity-Token Density = Number of Entities / Number of Tokens
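
As a small sketch, both metrics follow directly from these definitions (the 4-characters-per-token heuristic is OpenAI’s rule of thumb; count_entities is the helper sketched above):

def num_tokens(text: str) -> float:
    # OpenAI's rule of thumb: one token is roughly 4 characters
    return len(text) / 4

def entity_token_density(text: str) -> float:
    # Entities per token, using the spaCy-based helper from above
    return count_entities(text) / num_tokens(text)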

Measuring Consistency

We use Cosine Similarity to measure Consistency (or Reproducibility). The following vectorization methods are used when computing Cosine Similarity:

  • TF-IDF
  • Word Embeddings

We use both because embeddings capture semantic similarity between synonymous phrasings across summaries, while TF-IDF weights the importance of unique terms, such as the names of people and places, in the text.
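
Here is a minimal sketch of both similarity computations, assuming scikit-learn for TF-IDF and LangChain’s OpenAIEmbeddings wrapper for embeddings (the helper names are ours):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from langchain_openai import OpenAIEmbeddings

def tfidf_cosine(summary_a: str, summary_b: str) -> float:
    # Fit TF-IDF on the pair of summaries and compare their vectors
    vectors = TfidfVectorizer().fit_transform([summary_a, summary_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

def embedding_cosine(summary_a: str, summary_b: str) -> float:
    # Embed both summaries, then take the cosine of the angle
    # between the two embedding vectors
    embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
    a, b = (np.array(v) for v in embedder.embed_documents([summary_a, summary_b]))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))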

CoD Hyperparameters

The following are the two hyperparameters in the CoD technique:

  • Word Count: The number of words in the summary. The LLM will be instructed on this through the prompt.
  • Internal Steps Count: The total number of times the iterative (sequential) improvement happens in one iteration of the CoD method.

Setting up the hyperparameters:

  • Internal Steps Count is set to 5.
  • Word Count is set to 1000.

Model

  • OpenAI’s GPT-4o is used for generating the responses. The model parameter seed is set to a constant, and the temperature is set to 0 so that the LLM tries to generate deterministic outputs (a minimal code sketch of this configuration follows the list).
  • To compute Cosine Similarity using embeddings, we use OpenAIEmbeddings() with the text-embedding-ada-002 model.
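
As a minimal sketch of this configuration using the OpenAI Python client (the seed value below is illustrative; any constant works):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        seed=42,        # constant seed for best-effort determinism
        temperature=0,  # greedy-leaning decoding
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content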

Data

The NVIDIA Annual Report (10-K) for FY 2023–24 is used for summarization.

Pipeline

Here are the pipeline steps for implementing the Chain of Density technique (a minimal code sketch of the full loop follows the list):

  1. Generate Initial Summary: Use the prompt and the document to generate an initial summary, supplying the target word count through the prompt.
  2. Generate Subsequent Summaries: Use the previous summary, the prompt, and the document to generate an improved summary. The LLM is instructed to fuse entities missing from the previous summary into a new summary of a fixed, predefined length. This iterative improvement repeats 4 times (since n = 5), generating Internal Step Responses ISR-2 to ISR-5.
  3. Return the Final Output: The summary obtained from the last step is the final output for the document. Therefore, ISR-5 is the Final Response (or FR).
  4. Analysis: Compute and analyze each ISR’s information richness and reproducibility metrics.
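
As referenced above, here is a minimal sketch of the full CoD loop. INITIAL_PROMPT and REFINE_PROMPT are stand-ins for the prompt templates shown in the next section, and generate is the helper sketched under Model:

INTERNAL_STEPS = 5
WORD_COUNT = 1000

def chain_of_density(document: str) -> str:
    # Step 1: ISR-1, the initial (typically entity-sparse) summary
    summary = generate(INITIAL_PROMPT.format(
        document=document, word_count=WORD_COUNT))
    # Step 2: ISR-2 to ISR-5, fusing missing entities at a fixed length
    for _ in range(INTERNAL_STEPS - 1):
        summary = generate(REFINE_PROMPT.format(
            document=document, previous_summary=summary,
            word_count=WORD_COUNT))
    # Step 3: the last ISR is the Final Response (FR)
    return summary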

Prompts

The standard prompt template for this method is given in the original paper. The important portions of the prompts for generating the initial summary and the subsequent summaries are as follows.

Initial Summary

Your task is to fetch all details and insights like current investments, 
news, insights, plans, strategy, regulations, current and future partnerships,
etc. related to technology, data science, artificial intelligence,
machine learning, digital transformation, etc.
Then summarize and re-organize that information for use by investors and
consultants.
Note that the summary should contain 1000 words.

Subsequent Summaries

A Missing named Entity is a named entity such as the name of a person, 
organization, place, etc. which is:
- Relevant: important to the overall document and relevant for the summary.
- Specific: descriptive yet concise (5 words or fewer).
- Novel: not in the previous summary.
- Faithful: present in the given document.
- Anywhere: located anywhere in the document.

Guidelines:
- Make every word count: rewrite the previous summary to improve
flow and make space for additional entities.
- Make space with fusion, compression, and removal of uninformative phrases
like "the document discusses".
- The summaries should become highly dense and concise yet self-contained,
e.g., easily understood without the document.
- Missing entities can appear anywhere in the new summary.
- Never drop entities from the previous summary.
If space cannot be made, add fewer new entities.

Results

To generalize the findings on the performance and reproducibility of the CoD method, results from 5 different iterations were obtained.

Number of Entities

The following graph shows the number of entities in each ISR of each iteration:

Figure 1: Number of Entities for 5 Iterations of the CoD technique (each having 5 Internal Steps)

Figure 1 shows that:

  • In Iterations 1, 3, and 5, the Number of Entities (NoE) either increases or remains the same over subsequent internal steps.
  • In Iterations 2 and 4, the NoE decreases once and then remains the same.
  • The final response from Iteration 3 has the highest number of entities among all iterations.
  • The largest increase in the NoE occurred in Iteration 5, likely because this iteration started with a very low NoE, compared to other iterations.
  • The NoE in the final responses ranges from 34 to 47, which shows that they are likely not reproducible.

Observations about the number of entities over subsequent Internal Steps (IS):

  • From IS1 to IS2, the NoE increased in 2 iterations, decreased in 2 iterations, and remained unchanged in 1 iteration.
  • From IS2 to IS3, the NoE remained unchanged in all iterations.
  • From IS3 to IS4, the NoE increased in 2 iterations and remained unchanged in 3 iterations.
  • From IS4 to IS5, the NoE remained unchanged in all iterations.

Therefore, Figure 1 shows that the NoE does not improve with every internal step; it even decreases in some iterations. A general saturation in NoE occurs at IS2, and a complete saturation occurs at IS4.

To understand why the LLM drops some entities, we study Iteration 4. The following table shows the entity and token dynamics for Iteration 4 over its internal steps:

Figure 2: Entity and Token dynamics in Iteration 4, compared to the previous Internal Step

Figure 2 shows that:

  • The initial summary (Internal Step Response — 1 or ISR-1) consists of 1283.25 tokens and 40 entities.
  • From IS1 to IS2, the LLM drops 9 entities and adds 3 entities, resulting in 34 entities in ISR-2.
  • In ISR-2, some of the removed entities are reintroduced with different phrasing, for example, ‘1%’ and ‘fiscal year 2025’. However, other entities like ‘GPU’ and ‘DGX Cloud’ are removed from ISR-2 entirely.
  • From IS1 to IS2, the LLM reduces the summary length to 899.5 tokens by removing redundant words, resulting in a higher Entity-token density (see the worked calculation after this list).
  • After IS2, the summary remains unchanged.
  • Therefore, the LLM is not adhering to the instructions to retain all current entities and to generate an improved summary of a pre-defined constant length.
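
These numbers also explain the density gain despite the dropped entities:

Entity-Token Density (ISR-1) = 40 / 1283.25 ≈ 0.031

Entity-Token Density (ISR-2) = 34 / 899.5 ≈ 0.038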

To understand what type of missing entities the LLM integrates, we study Iteration 5. The following table shows the entity and token dynamics for Iteration 5 over its internal steps:

Figure 3: Entity and Token dynamics in Iteration 5

Figure 3 shows that:

  • In Iteration 5, the initial summary (Internal Step Response — 1 or ISR-1) consists of 959.5 tokens and 24 entities.
  • From IS1 to IS2, the LLM drops 1 entity and adds 5 new entities, resulting in 28 entities in ISR-2. The entities added by the LLM in ISR-2 are the names of places and individuals. The summary length is reduced to 812.25 tokens by removing redundant words, resulting in a higher Entity-token density.
  • From IS2 to IS3, the summary length is increased from 812.25 to 830.25. However, no entities are added or dropped.
  • From IS3 to IS4, the LLM adds 10 entities (a few of which consist of multiple comma-separated words), resulting in 38 entities. The new entities added by the LLM in ISR-4 are the names of geographical regions, investment funds, and corporations. The summary length increases from 830.25 to 889.75 tokens.
  • From IS4 to IS5, there is no change in the summary.
  • Therefore, in Iteration 5, the LLM improves the summary to a certain extent by fusing new information related to diverse types of entities: geographical regions, investment funds, individuals, and corporations.

Entity-Token Density

The Entity-Token Density (ETD) for 5 different iterations (each having 5 Internal Steps) is shown as follows:

Figure 4: Entity-Token Density for 5 Iterations of the CoD technique (each having 5 Internal Steps)

Figure 4 shows that:

  • The Final Response from Iteration 5 has the highest value of ETD among all Final Responses.
  • The largest increase in ETD occurred in Iteration 5, which started with the initial summary having the lowest entity density.

Observations about ETD over subsequent Internal Steps:

  • From IS1 to IS2, the ETD increased in all iterations, except in Iteration 3.
  • From IS2 to IS3, the ETD decreased in 1 iteration and remained unchanged in 4 iterations.
  • From IS3 to IS4, the ETD increased in 2 iterations and remained unchanged in 3 iterations.
  • From IS4 to IS5, the ETD remained unchanged in all 5 iterations.

Therefore, Figure 4 shows that the ETD does not improve with every internal step; it even decreases in one instance. A general saturation in ETD occurs at IS2, and a complete saturation occurs at IS4.

Cosine Similarity

To evaluate the reproducibility of final responses generated by CoD, the cosine similarity was computed among the Final Responses of the 5 iterations. The cosine similarity matrices using TF-IDF and Embeddings for the Final Responses (FR) from all iterations are as follows:

Figure 5: Cosine Similarity Matrices for Final Responses (FR) from the 5 Iterations of CoD

Figure 5 shows that:

  • The Cosine Similarity (TF-IDF) among final responses varies from 0.78 to 0.92.
  • The Cosine Similarity (Embeddings) among final responses varies from 0.79 to 0.98.
  • Since most of the cosine similarity values fall well below the desired value of 1.00, the final summaries are considered non-reproducible.

Token Usage

The estimated token usage for one iteration of the Chain of Density method, for NVIDIA’s annual report of 90,000 tokens, is shown below:

Figure 6: Token Usage for Iteration-1 of CoD (having 5 Internal Steps)

Figure 6 shows that CoD is computationally expensive, as the whole document is processed multiple times.
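
As a rough check on this figure: with 5 internal steps, the roughly 90,000-token document alone is sent to the model 5 times, contributing about 5 × 90,000 = 450,000 input tokens per iteration, before counting the prompt templates and the previous summaries passed at each step.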

Challenges

The following are the challenges in implementing the CoD technique for effective summarization; most of these limitations are attributable to the LLM rather than to the technique itself.

Non-Adherence

The LLM does not follow the instructions given in the prompt:

  • The LLM generates summaries significantly shorter than the instructed word count of 1000 words.
  • While finding missing entities from the previous summary, the LLM does not comply with the instruction to fetch only a few of the most relevant missing entities.

The prompt instructs the LLM to ‘Never drop entities from the previous summary. If space cannot be made, add fewer new entities’. However, it still drops some entities in subsequent steps in some iterations.

Early Saturation in NoE and ETD

The ETD does not improve after a certain number of internal steps, and in some iterations the NoE even decreases between internal steps. Therefore, this method is not very effective at generating entity-dense summaries.

Non-Reproducible Responses

The summaries generated by the CoD method across multiple iterations vary significantly in NoE and ETD, and the Final Responses are not reproducible.

Challenges with Lengthy Documents

Since the summaries have to be improved sequentially using previous summaries and the whole document, this method can’t be used (in its current form — without using RAG) if the document is lengthier than the LLM’s input context limit.

Latency

In the CoD method, the summaries are improved sequentially over multiple internal steps. This may cause latency issues and hamper the user experience. Also, since each summary must be improved using the immediately previous summary, parallel processing via asynchronous API calls cannot be used.

API Cost

To generate a final output, the API request is sent multiple times, and in each internal step, both the document and the previous summary must be supplied. This solution can be expensive when working at scale. Open-source models can be explored to address this challenge.

Conclusion

We did not observe promising results for the NVIDIA document with the current hyperparameter values, largely due to challenges attributable to the LLM rather than to the technique itself. Therefore, further experiments may be performed with different hyperparameter values, LLMs, and document types for an exhaustive evaluation of this technique. The upcoming blog will explore two more summarization methods: Map Reduce and Refine.

If you’re interested in exploring the use cases for Generative AI and LLMs, we’d be happy to connect on LinkedIn and talk. Please share your insights and thoughts in the comments below!

References

  1. Adams, G., Fabbri, A., Ladhak, F., Lehman, E., & Elhadad, N. (2023). From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2309.04269
  2. What are tokens and how to count them? (n.d.). OpenAI. Retrieved August 30, 2024, from https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
  3. Yugen.ai. (2024, August 30). Navigating Indeterminism: Improving Reproducibility in LLMs. Medium. https://medium.com/yugen-ai-technology-blog/navigating-indeterminism-improving-reproducibility-in-llms-945362d3912c
  4. Yugen.ai. (2024, August 14). Unlocking Financial Insights: RAGs for Precise Information Retrieval from 10-K reports. Medium. https://medium.com/yugen-ai-technology-blog/unlocking-financial-insights-rags-for-precise-information-retrieval-from-10-k-reports-2b631cf7c70a
