Pick your Suffering: On the Fragility of Generative Models in Summarization

Adam Tomkins
Dec 27, 2022


Back when I was choosing between remaining in academia and continuing my research in a corporate environment, I read a timely book with one piece of advice that stuck with me: pick your suffering.

That is, when choosing between two options, it’s natural to compare the best bits of each. Picking your suffering, in contrast, means comparing the worst parts of both solutions and picking the least bad.

In life, you may want to be less conservative and defensive when picking your challenges. When it comes to developing solutions and deploying software, however, the battles you pick will define the success of your product.

Today’s battle is Text Summarization in the Life Sciences.

When choosing between generative and extractive models for summarization, we may want to take the same advice: pick your suffering.

When a generative model works at its best, it can be incredible: far more readable and intuitive than extractive summarization.

However, if we look at the suffering we are choosing when we go generative over extractive, we see a very different picture, and depending on your use case, this may be the straw that breaks the camel’s back.

In a previous proof-of-concept, I built a system for “truth-constrained summarization” as an attempt to bring Grice’s maxim of quality to the generative world.

The maxim of quality says that one should try to be truthful and should not give information that is false or unsupported by evidence.

In our “truth-constrained summarization” attempt, we use a combination of generative models and entailment to ensure that the generative model isn’t creating text that cannot be substantiated by the document we are trying to summarise.
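As a rough illustration (not the original proof-of-concept), a minimal sketch of the idea might look like the following: a summarizer proposes text, and an entailment check against the source document filters out any sentence the source does not support. The checkpoints, the prompt wording, and the naive sentence split are assumptions for the example.

# Minimal sketch of a "truth-constrained" summarization loop (illustrative only).
# Assumes Hugging Face transformers; the checkpoints are stand-ins and the
# sentence split is deliberately naive.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
entailment = pipeline("text2text-generation", model="google/flan-t5-large")

def is_supported(premise: str, hypothesis: str) -> bool:
    """Ask the entailment model whether the source text supports a summary sentence."""
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis?"
    )
    answer = entailment(prompt, max_new_tokens=10)[0]["generated_text"]
    return answer.strip().lower() == "yes"

def truth_constrained_summary(document: str) -> str:
    candidate = summarizer(document, max_length=120)[0]["summary_text"]
    # Keep only the generated sentences that the entailment check says are
    # substantiated by the source document.
    sentences = [s.strip() for s in candidate.split(".") if s.strip()]
    kept = [s for s in sentences if is_supported(document, s)]
    return ". ".join(kept) + ("." if kept else "")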

This is only required because we know that generative models do hallucinate and add text that was not in the original. While the approach was promising, the underlying assumption is that the entailment model, in itself a generative model, is reliable, whereas generative summarization is not.

This was a fair assumption for a proof of concept, but the ramifications are quite important.

Firstly, we can look at the reason we made this assumption. The model in question, Flan-T5, has a prompt set-up that is not prone to the random generation of text: it is constrained to three outputs, “yes”, “no”, and “it is not possible to tell”. This is perfect, as it means we no longer require natural language understanding to interpret our model output.
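Because the output space is constrained to those three strings, “understanding” the answer collapses into a string comparison. A minimal sketch of that mapping (the label names below are only for illustration):

# With a constrained prompt, interpreting the model output is a string match,
# not a natural language understanding problem.
def parse_entailment_answer(generated: str) -> str:
    text = generated.strip().lower()
    if "not possible to tell" in text:
        return "unknown"
    if text.startswith("yes"):
        return "entailed"
    if text.startswith("no"):
        return "not_entailed"
    return "unknown"  # fall back for anything unexpected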

By choosing a prompt-training structure that removes random generation and interpretation, we can stop worrying about the meaning of the generated text and instead ask the more familiar questions: is the output reliably correct? Can we trust it to do the heavy lifting of truth constraints in text generation?

The answer, unsurprisingly, is no. The same fragility persists, even within the well-defined, prompt-constrained space that we are using.

Here we show that entailment models are fallible and can give contradictory outputs for the same input, depending on the final form of the prompt.

Take this example prompt (you can try it out here):

Premise: The man has a cat. The cat has a hat. It is red.
Hypothesis: The cat is red.
Does the premise entail the hypothesis? Explain your answer

>>> It is not possible to tell that the cat is red from the information given. The hat on the cat is red. So the answer is it is not possible to tell.

Here we see that the answer is correct. The “It” from “It is red” is correctly resolved to mean the hat. This is a triumph of reasoning from natural text, and a lot of nuance goes into producing the correct answer. The only downside is that to use this output in a larger system, we must then interpret the answer as a yes, a no, or an “it is not possible to tell”.

Luckily, we can alter the prompt to get this constrained output by not asking for an explanation of the reasoning:

Premise: The man has a cat. The cat has a hat. It is red.
Hypothesis: The cat is red.
Does the premise entail the hypothesis?

>>> yes

Wait, yes?

Here we’ve got our red flag. First, it’s wrong. We can forgive that; every model is wrong sometimes. What is more problematic is that the model is unreliable. The sophisticated reasoning displayed above is not a reliable indicator of what is happening under the hood.

Intuitively, asking for an explanation should not, under any circumstances, change your answer, only elaborate on it. But that is not what we see.
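One way to make this concrete is to run the same premise and hypothesis through prompt variants that should be logically equivalent and check whether the verdict stays the same. A minimal sketch, assuming a Flan-T5 checkpoint (the checkpoint size is an assumption):

# Probe prompt sensitivity: logically equivalent prompts should not change the
# verdict, only how it is expressed.
from transformers import pipeline

entailment = pipeline("text2text-generation", model="google/flan-t5-large")

premise = "The man has a cat. The cat has a hat. It is red."
hypothesis = "The cat is red."

questions = [
    "Does the premise entail the hypothesis?",
    "Does the premise entail the hypothesis? Explain your answer",
]

for question in questions:
    prompt = f"Premise: {premise}\nHypothesis: {hypothesis}\n{question}"
    answer = entailment(prompt, max_new_tokens=60)[0]["generated_text"]
    print(f"{question!r} -> {answer!r}")

# If the two runs disagree, as in the outputs quoted above, the form of the
# prompt, not the underlying reasoning, is deciding the answer.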

This is incredibly problematic, as the simple prompt, which returns either “yes”, “no”, or “it is not possible to tell”, is precisely the constrained output we want when we use entailment within a larger system, such as summarization.

This is troubling and underlines a few things to be wary of with current generative models. In their raw form, they are incredibly fragile and fail in unexpected ways that are hard to guard against. While we could find other prompts that circumvent this particular issue, can we trust that they don’t have their own unreliable failure modes?

This inability to easily find and correct failure modes in generative models may be the perfect illustration of “pick your suffering”. If your use case depends on the accuracy of your text and the reliability of your systems, choosing the path of policing a hallucinating, unpredictable generative model may be choosing a rabbit hole of suffering.

Policing the output of a generative model requires exactly the natural language understanding capabilities that the generative model was supposed to provide.

Does extractive summarization always preserve meaning?

On the flip side, in a recent discussion I had about summarization and generative models, the following was mentioned as a reasonable argument for choosing extractive over generative models in truth-dependent domains.

We cannot afford for a summary to misrepresent the underlying text, so we extract from the core text instead of generating new text for a summary.

Generally, this is a fair thing to say: generative models are known to hallucinate, and policing them is hard. However, it is worth examining the truth of this sentiment. Are extractive models immune to misrepresenting text? Absolutely not. Let us use the example above and look at some potential failings of a naive extractive approach:

The man has a cat. The cat has a hat. It is red.

An extractive method that picks out the top n sentences may decide that the cat having a hat is less important than the colour of the hat, and as such produce the following summarization:

The man has a cat. It is red.

We can see here that, because the sentence anchoring the pronoun was dropped, co-reference now resolves “It” to the cat, and we have completely changed the meaning of the text.
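To see how little the extractor knows about the text, here is a deliberately naive top-n extractor of the kind described above. The frequency-based scoring is an assumption for illustration; the point is only that sentence-level ranking is blind to co-reference.

# A deliberately naive top-n extractive summarizer: rank sentences by average
# word frequency and keep the best n in their original order. Nothing in this
# scoring knows what "It" refers to, so any ranking that drops the sentence
# anchoring a pronoun leaves that pronoun pointing at the wrong thing.
from collections import Counter
import re

def naive_extract(text: str, n: int = 2) -> str:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    keep = set(sorted(sentences, key=score, reverse=True)[:n])
    return " ".join(s for s in sentences if s in keep)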

While this is a small edge case and a trivial example, it is at least instructive. For one, tackling this issue is considerably better defined than fact-checking the generative case. For instance, to reduce this edge case, we could:

  1. Fall back to co-reference resolution pre-extraction, as a way to minimise this occurrence (see the sketch after this list).
  2. Use entailment models to check for any introduced incongruities (reintroducing the main problem of fragility).
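A sketch of option 1, reusing the naive extractor from above. The resolve_coreferences helper is hypothetical and stands in for a real co-reference resolution library.

# Option 1 sketched: resolve co-references before extraction so that whichever
# sentences survive the cut still carry their antecedents.
# `resolve_coreferences` is a hypothetical helper; plug in a real co-reference
# resolution library here.
def resolve_coreferences(text: str) -> str:
    # e.g. "The man has a cat. The cat has a hat. It is red."
    #   -> "The man has a cat. The cat has a hat. The hat is red."
    raise NotImplementedError("replace with a co-reference resolution library")

def safe_extract(text: str, n: int = 2) -> str:
    resolved = resolve_coreferences(text)
    # Reuses naive_extract from the sketch above; extraction now happens on
    # text where pronouns have been replaced by their antecedents.
    return naive_extract(resolved, n)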

Here, the suffering we have picked can be dealt with much more reliably, using an established set of tools and methods, without reintroducing the very issues we are trying to solve.

In the end, the product may be less flashy, but it will be more reliable, explainable and maintainable. Choose the suffering you can best handle, and deliver the product you can best iterate on.

Conclusion

While neither the generative nor the extractive approach is perfect, in this case we can see that chasing down the edge cases of extractive summarisation is a much more constrained problem than chasing down the edge cases of generative models.

In the end, it depends on your final use case, and improvements to both approaches are continuously being developed. Whatever you choose, one thing is certain: it’s going to be an exciting year ahead in AI and NLP.

