Evaluating LLM Summarizations

An overview of different approaches for evaluating summaries generated by LLMs

Yaron Vazana
DarrowAI
May 21, 2024


LLMs have stepped into almost every field of NLP (Natural Language Processing), some would say even bringing us closer to full AI (a.k.a. AGI). Amongst these fields, text summarization is one that has received a lot of attention and made significant progress thanks to the major advancements in Generative AI, making us rethink our existing approaches and recalibrate our workflows.

LLMs for Text Summarization

Text Summarization in a nutshell

Imagine you are given a legal case document containing hundreds of pages, thousands of sections and subsections, and lots of facts that, when put together, describe the overall course of events in the case.

Reading the entire document may be inefficient and unnecessary, as only parts of it contain information relevant to the main topics. What we would like is a way to briefly skim a condensed, yet reliable, version of the original document, one that makes us confident we haven't missed any important information.

To put it differently, we need a well-structured, cohesive, reliable and short summary of our original document.

Summarization with LLMs

Recent advancements in GenAI have made it possible to prompt an LLM with a simple request to summarize any input text and get back a shorter version of it that, according to Microsoft, preserves the same meaning.

When AI like ChatGPT receives the command to summarize or paraphrase a block of text, it expresses the same ideas while using different words and sentence structures. The meaning is the same, but the content is put together in a new way. The AI might change words around, substitute synonyms, or entirely restructure a block of text. (Microsoft)

Add to that the fact that new LLMs will soon support context windows of 1M tokens (roughly 750K words, or 1,500 pages), and they become a natural fit for summarizing extremely long texts such as textbooks, magazines, papers, or, in our case, legal case documents.

As we roll out the full 1 million token context window, we’re actively working on optimizations to improve latency, reduce computational requirements and enhance the user experience. (Google)
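For concreteness, here is a minimal sketch of the summarization step itself, using the OpenAI Python client; the model name and prompt wording are placeholders rather than an exact production setup, and any provider with a chat-completion API would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; a real one would encode your structural and domain requirements.
SUMMARY_PROMPT = (
    "Summarize the following legal case document in 3-5 paragraphs. "
    "Preserve the chronological order of the facts and the names of the parties.\n\n"
    "{document}"
)

def summarize(document: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM for a condensed version of a long document."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SUMMARY_PROMPT.format(document=document)}],
    )
    return response.choices[0].message.content
```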

Evaluating LLM Summaries

There’s no doubt LLMs have made the summarization task easier, but what about the evaluation? After all, in most cases, you can’t blindly trust the text produced by the LLM and present it as is to your users.

Let’s review two interesting methods for evaluating LLM-generated summaries that were presented at the EACL 2024 conference, held in Malta earlier this year.

Since Darrow is a legal tech company, driven to create a world where justice is frictionless and accessible to all, the following examples use texts taken from our in-house legal dataset, which contains more than 100k past and ongoing class-action legal cases.

Method 1: Evaluating Using Guideline Questions

In this approach, before sending our documents to the LLM for summarization, we prepare a set of distilled yes/no questions that together verify all the quality aspects of the generated summary. Each question on its own asks about a very specific criterion that measures a single aspect of the summary.

For simplicity, we can divide the questions into three groups:

  • Structural-related questions — here we ask about the structure of the summary and make sure it follows our standards.
  • Content-related questions — here we ask how good the generated content is and whether it contains all of our must-have information.
  • Domain-related questions — here we ask more specific questions that can distinguish between a general good summary and a domain-specific good summary.

Examples of such questions:

  1. (Structural) Is the summary both well-organized with headings or sections for cohesive presentation and free from significant omissions that might impact a comprehensive understanding?
  2. (Content) Do the facts of the summary still follow the correct order chronologically?
  3. (Domain) Does the summary provide a clear overview of the main issue or legal violation?
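A minimal sketch of how this could be wired up, again assuming an OpenAI-style chat client; the question list, prompt wording, and JSON answer format below are illustrative choices, not the exact prompts we use.

```python
import json
from openai import OpenAI

client = OpenAI()

# A few guideline questions, one per category (see the examples above).
GUIDELINE_QUESTIONS = [
    "Is the summary well-organized with headings or sections?",
    "Do the facts of the summary follow the correct chronological order?",
    "Does the summary provide a clear overview of the main issue or legal violation?",
]

EVAL_PROMPT = """You are reviewing a summary of a legal case document.
Answer each question with "yes" or "no" only, returned as a JSON list of strings.

Summary:
{summary}

Questions:
{questions}
"""

def evaluate_with_guidelines(summary: str, questions=GUIDELINE_QUESTIONS,
                             model: str = "gpt-4o-mini") -> dict:
    """Return a {question: bool} mapping based on the LLM's yes/no answers."""
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": EVAL_PROMPT.format(summary=summary, questions=numbered)}],
    )
    # Production code should validate the output and retry on malformed JSON.
    answers = json.loads(response.choices[0].message.content)
    return {q: a.strip().lower() == "yes" for q, a in zip(questions, answers)}
```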

In practice, we noticed that a few (4–5) questions from each category did a good job and captured most of the requirements we had for the summary.

In addition, you can use these questions to further improve your summarization prompt and be more specific about your desired output.

Also, to overcome position bias, where different orderings of the questions lead to different results, you can prompt the LLM multiple times, shuffling the questions between runs, and then average the results for each question.
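A small sketch of that shuffling-and-averaging loop; it takes the evaluation function as a parameter (for instance, the hypothetical evaluate_with_guidelines helper from the previous snippet).

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List

def evaluate_with_shuffling(summary: str, questions: List[str],
                            evaluate: Callable[..., Dict[str, bool]],
                            n_runs: int = 5) -> Dict[str, float]:
    """Run the yes/no evaluation several times with shuffled question order
    and return, per question, the fraction of runs that answered "yes"."""
    votes = defaultdict(list)
    for _ in range(n_runs):
        shuffled = random.sample(questions, k=len(questions))
        answers = evaluate(summary, questions=shuffled)
        for question, passed in answers.items():
            votes[question].append(1.0 if passed else 0.0)
    return {question: sum(v) / len(v) for question, v in votes.items()}
```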

Moreover, instead of binary yes/no questions, you can prompt the LLM with a request to rate the result on a 1–5 scale; in such cases, we noticed that a description of each score on the scale should also be included in the prompt.

Lastly, we ran some experiments using different LLMs (a mixture of experts), other than the one that generated the summary, to answer these questions and score the summary, but that’s a topic for a different post.

Method 2: Evaluating Using Fact-Checking

Looking at legal case documents, one can argue that at the base of each one lies a complex legal story, usually describing some violations that were committed and consisting of many facts that, presented in chronological order, describe the overall sequence of events in the case.

In this approach to evaluating the summary, we will focus on the most important facts that appear in the original text and make sure they also appear in the generated summary.

The advantage of this approach is twofold: (1) it gives us the ability to compute standard ML metrics such as precision, recall, and F1 by counting the facts we were able to identify in the summary, making the optimization objective clear and concise; (2) it focuses on making the content as reliable as possible by making sure as many facts as possible are covered.

ML metrics for fact-checking in a text summary
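As a quick sketch, once facts have been extracted from both documents and matched (the extraction step is assumed to have already happened), the metrics reduce to simple counting; the helper below is illustrative.

```python
def fact_coverage_metrics(n_source_facts: int, n_summary_facts: int, n_matched: int) -> dict:
    """Precision, recall, and F1 computed from fact counts.

    n_source_facts  - important facts extracted from the original document
    n_summary_facts - facts extracted from the generated summary
    n_matched       - summary facts that match a fact from the original document
    """
    precision = n_matched / n_summary_facts if n_summary_facts else 0.0
    recall = n_matched / n_source_facts if n_source_facts else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 20 facts in the source, 12 in the summary, 10 of them matched.
print(fact_coverage_metrics(20, 12, 10))  # {'precision': 0.833..., 'recall': 0.5, 'f1': 0.625}
```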

Matching Facts

Given an original long document and a corresponding summary generated with an LLM, to compute the metrics mentioned above we still need a way to match facts in the original document to facts in the summary and count how many are covered (measuring the quality of the facts is also important but won’t be discussed here).

Here are a few options for fact matching (a short sketch of both follows the list):

  1. Basic Similarity:
    With countless ways to embed short texts into vectors, you can take each candidate pair of facts and embed them into a lower-dimensional space, then compute their similarity and decide on a threshold that optimizes your metrics.
  2. NLI (Natural Language Inference):
    Given two facts, use an existing NLI model to classify the pair into one of the categories <entailment, contradiction, neutral>, then use the entailed pairs to set a threshold that gives satisfying results.
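Both options can be sketched briefly with the sentence-transformers and Hugging Face transformers libraries; the model names and thresholds below are illustrative defaults rather than tuned choices.

```python
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# --- Option 1: embedding similarity -----------------------------------------
# The model name is just one common choice; any sentence-embedding model would do.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_match(source_fact: str, summary_fact: str, threshold: float = 0.75) -> bool:
    """Embed both facts and compare cosine similarity against a tuned threshold."""
    embeddings = embedder.encode([source_fact, summary_fact], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold

# --- Option 2: NLI -----------------------------------------------------------
# Any off-the-shelf NLI checkpoint works; this one is a common public baseline.
nli_name = "roberta-large-mnli"
nli_tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)

def nli_match(source_fact: str, summary_fact: str, threshold: float = 0.8) -> bool:
    """Treat the source fact as the premise and the summary fact as the hypothesis;
    count the pair as a match if the entailment probability clears the threshold."""
    inputs = nli_tokenizer(source_fact, summary_fact, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli_model(**inputs).logits, dim=-1)[0]
    label_to_idx = {v.lower(): k for k, v in nli_model.config.id2label.items()}
    return probs[label_to_idx["entailment"]].item() >= threshold
```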

Epilogue

Evaluating summaries has always been a challenging task, for humans as well. Even if you take several domain experts in a specific field, you will likely get disagreements that make the review process difficult. Leveraging LLMs and other ML techniques to help skim through summaries and evaluate them can reduce your human labeling effort, which is itself nontrivial, and make your pipelines more automated over time.

If you’d like to hear more about what we’ve been working on, don’t hesitate to contact me.

Yaron
