Summarization @ EMNLP 2022

Varun Nathan
The Observe.AI Tech Blog
9 min read · Feb 27, 2023

This post aims to provide a concise overview of current trends in summarization research by offering a synopsis of several papers accepted at EMNLP 2022. The primary objective is to succinctly capture the problem space, research objectives, experimental design, results, and core ideas of the methodology adopted in each paper, without delving too deeply into technical intricacies. The underlying motivation is to give readers a clear understanding of the latest developments in summarization research.

Paper highlights

Paper #1 — Correcting Diverse Factual Errors in Abstractive Summarization via Post-Editing and Language Model Infilling by Vidhisha Balachandran, Hannaneh Hajishirzi, William W. Cohen, Yulia Tsvetkov

Motivation

  • Abstractive summarization models often generate inconsistent summaries containing factual errors or hallucinated content
  • Post-editing correction models are typically trained on adversarial non-factual summaries constructed with heuristic rules, which often do not generalize well to actual model errors

Objective

To train a robust fact-correction model that post-edits generated abstractive summaries to improve their factual consistency

Approach (FACTEDIT)

  • The idea is to generate hard, representative synthetic examples of non-factual summaries via infilling language models and to use this data to train a more robust fact-correction model that post-edits generated summaries
  • The pipeline has three key components: (1) training an infilling data generator that takes a masked sentence and its relevant context as input and generates a correct phrase to fill the masked span; (2) using this generator for adversarial data generation; and (3) training the correction model on the parallel adversarial data generated in the previous step (a minimal sketch follows this list)
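
To make the adversarial data generation concrete, here is a minimal sketch that approximates the infilling generator with an off-the-shelf fill-mask model. The paper trains its own infilling model; `corrupt_summary` is a hypothetical helper for illustration, not the authors' code.

```python
# Minimal sketch of FACTEDIT-style adversarial data generation; an
# off-the-shelf fill-mask model stands in for the paper's trained infiller.
from transformers import pipeline

infiller = pipeline("fill-mask", model="roberta-base")

def corrupt_summary(reference: str, span: str, top_k: int = 5):
    """Mask `span` in the reference summary and let the infilling model
    propose alternative fillers; any filler other than the original span
    yields a synthetic non-factual summary."""
    masked = reference.replace(span, infiller.tokenizer.mask_token, 1)
    candidates = infiller(masked, top_k=top_k)
    return [c["sequence"] for c in candidates
            if c["token_str"].strip() != span]

# Each (corrupted, reference) pair becomes a training example for the
# seq2seq fact-correction model, conditioned on the source document.
reference = "The company reported profits of 5 million dollars."
for bad in corrupt_summary(reference, "profits"):
    print(bad, "->", reference)
```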

Experiments

  • The datasets used for the experiments include CNN/DM, XSum, and the FRANK benchmark (Pagnoni et al., 2021)
  • Factual consistency was evaluated using FactCC (Kryscinski et al., 2020) and Ent-DAE (Goyal and Durrett, 2021), alongside ROUGE
  • A model trained on adversarial data generated via the heuristic rules of Cao et al. (2020) was used as the baseline

Results

FACTEDIT vastly outperforms prior methods in correcting erroneous summaries, improving factuality scores by ~11 points on CNN/DM and ~31 points on XSum on average across multiple summarization models

Paper #2 — Generating Multiple-Length Summaries via Reinforcement Learning for Unsupervised Sentence Summarization by Dongmin Hyun, Xiting Wang, Chanyoung Park, Xing Xie, Hwanjo Yu

Motivation

  • Recent unsupervised models are extractive; extractive approaches are less flexible than abstractive summarization
  • Even so, recent extractive models have outperformed their abstractive counterparts
  • The downside of extractive models is that they can only select words from the input text, so they cannot generate new words that may be effective for sentence summarization
  • The summary quality of existing abstractive models is sometimes worse than that of a simple baseline

Objective

To devise an abstractive summarization model based on reinforcement learning without ground-truth summaries for the task of sentence summarization

Approach

  • Titled Multi-Summary based Reinforcement learning with Pre-training (MSRP)
  • Employs Reinforcement Learning (RL) for unsupervised abstractive summarization
  • RL enables the model to learn to summarize from rewards even though the rewards are non-differentiable
  • The model is trained to generate high-quality summaries by rewarding both the semantic similarity between the generated summary and its input text and the fluency of the generated summary (a sketch of such a reward follows this list)
  • A multi-summary learning mechanism is developed that generates multiple summaries with varying lengths for a given text
  • To obtain well-initialized model parameters for the RL training, the model is pre-trained on data augmented by applying word-level perturbations and inserting length prompts.
  • Model is pre-trained to reconstruct the original text from the augmented one, which makes the model learn to summarize and control the output length
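
As an illustration of the reward described above, the sketch below combines semantic similarity (SentenceBERT) with fluency (GPT2 log-likelihood). The weighting scheme and model checkpoints are assumptions, not the paper's exact configuration.

```python
# A minimal sketch of an MSRP-style reward: semantic similarity between the
# input and its summary plus the summary's fluency under a language model.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def reward(source: str, summary: str, alpha: float = 0.5) -> float:
    # Semantic similarity term (cosine similarity, in [-1, 1]).
    emb = sbert.encode([source, summary], convert_to_tensor=True)
    sim = util.cos_sim(emb[0], emb[1]).item()
    # Fluency term: mean per-token negative log-likelihood under GPT2.
    ids = lm_tok(summary, return_tensors="pt").input_ids
    with torch.no_grad():
        nll = lm(ids, labels=ids).loss.item()
    # The combined reward is non-differentiable w.r.t. the summarizer,
    # hence the policy-gradient (RL) update used in the paper.
    return alpha * sim - (1 - alpha) * nll

print(reward("The quick brown fox jumps over the lazy dog near the river.",
             "A fox jumps over a dog."))
```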

Experiments

  • Gigaword dataset was used for training and evaluation while DUC2004 was used only for evaluation
  • Evaluation metrics include ROUGE scores and human evaluation. In addition, the fidelity of the generated summaries was measured using SentenceBERT and their fluency with GPT2.
  • Models were grouped based on the average length of the generated summaries for fair comparison
  • Both abstractive and extractive models were compared

Results

  • Achieves the best ROUGE and fidelity scores, with competitive fluency
  • Inference time is competitively short compared to the state-of-the-art baseline models

Paper #3 — Improving abstractive summarization with energy-based re-ranking by Diogo Pernes, Afonso Mendes, André F.T. Martins

Motivation

  • Omission of relevant information and hallucination are two key challenges in abstractive summarization
  • Detecting hallucinations is hard, as ROUGE scores fail to distinguish factual from non-factual summaries

Objective

To learn to re-rank summaries according to one or a combination of summarization metrics to create quality-aware abstractive summarizers

Approach

  • A set of candidate summaries is sampled and then a re-ranking model is used to choose the best one
  • To ensure diverse candidates, the authors experiment with diverse beam search, a modification of traditional beam search that adds a term to the scoring function penalizing repetition across different beams
  • The goal of `energy-based re-ranking` is to find a reference-free function E : X × Y → ℝ with parameters θ such that, for two candidate summaries ŷ and ŷ′ for the same document x with reference summary y, E(x, ŷ; θ) < E(x, ŷ′; θ) if and only if φ(x, y, ŷ) > φ(x, y, ŷ′)
  • The intuition is that `E` should assign low energy wherever p(y | x) is high and high energy wherever p(y | x) is low, but does not need to be normalized as a proper density
  • At inference time, the energy model scores are used to re-rank candidate summaries previously obtained from a baseline summarization model
  • The training set comprises the source document, the reference summary, and k candidate summaries sampled from a baseline summarization model, such as BART or PEGASUS
  • The ListMLE ranking loss (Xia et al., 2008) is used as the training objective (a minimal sketch follows this list)
  • Adopted metrics for training energy based re-ranking model include CTC Score (Deng et al., 2021), QuestEval (Scialom et al., 2021), ROUGE-L and FactCC (Kryscinski et al., 2020)
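
For concreteness, here is a minimal sketch of the ListMLE objective adapted to energies, where a lower energy should mean a better summary. It assumes the k candidates are already sorted by decreasing metric score φ.

```python
# ListMLE (Xia et al., 2008) over candidate energies E(x, ŷ_k; θ);
# the loss is the negative Plackett-Luce log-likelihood of the
# metric-induced ground-truth ordering.
import torch

def listmle_loss(energies: torch.Tensor) -> torch.Tensor:
    """energies: shape (k,), sorted so the metric-preferred candidate is first."""
    scores = -energies  # lower energy -> higher score
    # Sum over positions of the log-softmax of each score over its suffix.
    suffix_lse = torch.logcumsumexp(scores.flip(0), dim=0).flip(0)
    return (suffix_lse - scores).sum()

# Three candidates: small loss when the energies agree with the ordering,
# larger loss when they are reversed.
print(listmle_loss(torch.tensor([0.1, 0.7, 1.5])))
print(listmle_loss(torch.tensor([1.5, 0.7, 0.1])))
```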

Experiments

  • Two datasets were used, viz. CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018)
  • BART model (Lewis et al., 2020) was used as the baseline
  • The model was compared against BRIO by Liu et al. (2022), CLIFF by Cao and Wang (2021), DAE by Goyal and Durrett (2021), FASum by Zhu et al. (2021) and SummaReranker by Ravaut et al. (2022)
  • Factual consistency and relevance were measured via CTC scores; human evaluation was also conducted

Results

  • Yields improvements over the usual beam search on a baseline model and demonstrates the ability to distill target metrics
  • Human evaluation results suggest that re-ranking according to these metrics, while competitive, may yield lower quality summaries than those obtained by state-of-the-art abstractive systems trained with augmented data and contrastive learning

Paper #4 — STRUDEL: Structured Dialogue Summarization for Dialogue Comprehension by Meta AI

Motivation

  • Abstractive Dialogue Summarization (ADS) has been viewed as an important standalone task in NLP
  • No previous work has explored the possibility of whether ADS can be used as a means to boost performance on other important dialogue comprehension tasks
  • When performing language understanding, human beings try to summarize the main content of a piece of text, usually from multiple perspectives each focusing on a different aspect of the text

Objective

To utilize a dialogue summarization task to pre-train language models to better understand dialogues and improve their performance on dialogue comprehension tasks

Proposal

  • Novel type of dialogue summarization task — STRUctured DiaLoguE Summarization (STRUDEL) that can help pre-trained language models to better understand dialogues and improve their performance on important dialogue comprehension tasks
  • Collect human annotations of STRUDEL summaries over 400 dialogues sampled from two widely used dialogue comprehension datasets
  • Introduce a new STRUDEL dialogue comprehension modelling framework that integrates STRUDEL into a graph-neural-network-based dialogue reasoning module, layered over transformer encoder language models, to improve their dialogue comprehension abilities

Definition of Structured Dialogue Summarization (STRUDEL)

  • It is the task of generating a systematic and abstractive multi-entry dialogue summarization organized in a structured form that represents a comprehensive multi-aspect understanding and interpretation of a dialogue’s content
  • A complete STRUDEL summarization of a dialogue contains a set of 7 different entries, viz. Relationship, Purpose/Theme, Task/Intention (S1), Task/Intention (S2), Problem/Disagreement, Solution, and Conclusion/Agreement

Approach

  • STRUDEL can be viewed as an upstream auxiliary NLU task and can be used to train language models before they are further fine-tuned over specific downstream dialogue comprehension tasks
  • Transformer encoder language models are trained to generate semantic vector embeddings of the entries' contents, rather than the entries' actual textual outputs as token sequences
  • A graph representation of the STRUDEL embeddings, called the dialogue semantic graph, is constructed
  • The QA-GNN architecture (Yasunaga et al., 2021) is adopted as the GNN reasoning module to perform context-aware structured reasoning over the generated STRUDEL embeddings (a minimal sketch of this step follows the list)
  • Model is trained in two ways viz. Multi-Task Post-Training and Single-Task Fine-Tuning
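
To give a feel for the reasoning step, the sketch below treats the dialogue embedding and the seven entry embeddings as nodes of a small graph and applies one round of message passing. The `StrudelReasoner` module, graph wiring, and dimensions are illustrative assumptions; the paper uses a QA-GNN-based module.

```python
# Minimal sketch of GNN reasoning over STRUDEL embeddings (not the
# paper's QA-GNN implementation; shapes and wiring are assumptions).
import torch
import torch.nn as nn

class StrudelReasoner(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.msg = nn.Linear(dim, dim)      # message transform
        self.update = nn.GRUCell(dim, dim)  # node state update
        self.head = nn.Linear(dim, 2)       # e.g., answer/response scoring

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (n, dim) embeddings (1 dialogue node + 7 entry nodes);
        # adj: (n, n) 0/1 adjacency; mean-aggregate neighbor messages.
        neigh = (adj @ self.msg(nodes)) / adj.sum(dim=1, keepdim=True).clamp(min=1)
        nodes = self.update(neigh, nodes)
        return self.head(nodes[0])  # read out from the dialogue node

nodes = torch.randn(8, 768)   # 1 dialogue node + 7 STRUDEL entry nodes
adj = torch.ones(8, 8)        # fully connected, for the sketch
print(StrudelReasoner()(nodes, adj))
```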

Experiments

  • Datasets: DREAM (Sun et al., 2019) was used for the task of Dialogue question answering while MuTual (Cui et al., 2020) was used for the task of Dialogue response prediction
  • BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) were used as encoders

Results

On both dialogue comprehension tasks, STRUDEL's performance is consistently higher than that of the corresponding backbone transformer encoder models

Paper #5 — Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation by Melanie Sclar, Peter West, Sachin Kumar, Yulia Tsvetkov, Yejin Choi

Motivation

  • Prior work in sentence summarization assumed access to large-scale text-summary paired datasets, which are costly to create
  • Naturally-occurring summarization datasets are noisy and not easily found in other domains
  • Unsupervised or self-supervised methods lead to less fluent summaries
  • Real world applications require controlling for summary length
  • Prior works on knowledge distillation assume having a model trained for the desired task and with access to its full distribution of token logits
  • Prior works on knowledge distillation aim to mimic the teacher model’s distribution

Objective

  • To demonstrate that reference-free, controlled sentence summarization is feasible
  • To demonstrate how iterative knowledge distillation can lead to considerably smaller, but better summarizers with sharper controllability

Research question

Is it possible to learn a much smaller but better summarizer starting from GPT3?

Approach

  • Using symbolic knowledge distillation, summaries are generated that are shorter than, yet of similar quality to, those produced by the original GPT3 model
  • Smaller summaries are iteratively generated by summarizing the summaries generated in the previous iteration
  • The key to selecting the most desirable summaries is the filtering process, which is based on three criteria, viz. a fidelity filter, a length filter, and a context filter
  • The symbolic knowledge distillation process works by generating data from the teacher model (GPT3), filtering to keep only the desirable summaries, and fine-tuning the student model on those generations; this step is called “Referee-Distill”. The process is repeated iteratively, with the fine-tuned student model acting as the teacher at every step (a skeletal sketch follows this list)
  • This yields a family of models that create summaries at varied compression ratios
  • Building on this, adding control codes to the same fine-tuning process produces a single model that can simultaneously and fluently compress to any target length; this model, called “Referee-control”, significantly outperformed GPT3 (the teacher model)
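
A skeletal sketch of this loop is shown below. All helpers are hypothetical stand-ins that only make the control flow concrete (the actual pipeline prompts GPT3, filters with a WANLI-based NLI model, and fine-tunes a GPT2-scale student); the context filter is omitted for brevity.

```python
def generate(student, src):
    # Stand-in generator (hypothetical): keep the first half of the words.
    words = src.split()
    return " ".join(words[: max(1, len(words) // 2)])

def nli_entailment(src, summ):
    # Stand-in for the NLI-based fidelity filter (hypothetical).
    return 1.0 if all(w in src.split() for w in summ.split()) else 0.0

def finetune(student, pairs):
    # Stand-in for the Referee-Distill fine-tuning step (hypothetical).
    return student

def distill(student, sentences, rounds=2, max_ratio=0.8, min_entail=0.9):
    for _ in range(rounds):
        kept = []
        for src in sentences:
            summ = generate(student, src)
            if (len(summ.split()) <= max_ratio * len(src.split())  # length filter
                    and nli_entailment(src, summ) >= min_entail):  # fidelity filter
                kept.append((src, summ))
        student = finetune(student, kept)
        # The next round summarizes the previous round's summaries,
        # yielding progressively shorter outputs.
        sentences = [summ for _, summ in kept]
    return student

distill("student", ["symbolic knowledge distillation transfers a skill from a large teacher to a small student"])
```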

Experiments

  • Dataset: RealNews (Zellers et al., 2019)
  • WANLI (Liu et al., 2022a) was used to create the NLI filter
  • GPT2-Large fine-tuned on Gigaword (Napoles et al., 2012) was used as the supervised baseline
  • Evaluation was done based on the Compression & Fidelity Statistics and Human Evaluation

Results

The final student models vastly outperform the much larger GPT3-Instruct model in terms of controllability of compression ratios, without compromising the quality of the resulting summaries

Paper #6 — Curriculum Prompt Learning with Self-Training for Abstractive Dialogue Summarization by Changqun Li, Linlin Wang, Xin Lin, Gerard de Melo, Liang He

Motivation

  • Low information density, topic drift and insufficient training data are some of the key challenges associated with Abstractive Dialogue Summarization (ADS)
  • Low information density: Salient information is often scattered across multiple utterances by different interlocutors
  • Topic drift: The topics being discussed can vary during the progression of a conversation
  • Insufficient training data: 137 meetings vs 312K articles in CNN/DM

Objective

To develop a model for the task of ADS using prompt learning with high-level semantic understanding, and to exploit relevant signals from unlabeled data so that the model remains effective in low-resource settings

Approach

  • Employs Transformer based encoder-decoder architecture with heterogeneous prompts
  • Prompts are heterogeneously constructed via curriculum learning to help the model learn essential features for dialogue understanding, and the generalization of the soft prompts is improved via additive perturbations
  • Topic-aware prompts are used for planning, which helps better control the generation of summaries
  • These prompts are then optimized with self-training in order to harness unlabeled dialogues (a skeletal sketch follows this list)
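
A skeletal sketch of the self-training step is given below. The helpers are hypothetical stand-ins, not the authors' code; they only make the pseudo-labeling control flow concrete.

```python
def train_prompts(model, pairs):
    # Stand-in (hypothetical): update only the soft prompts on (dialogue, summary) pairs.
    return model

def summarize_with_score(model, dialogue):
    # Stand-in (hypothetical): return a summary and a confidence score.
    return dialogue[:40], 0.9

def self_train(model, labeled, unlabeled, rounds=3, threshold=0.7):
    model = train_prompts(model, labeled)  # supervised warm-up on labeled data
    for _ in range(rounds):
        # Pseudo-label unlabeled dialogues, keeping only confident outputs.
        pseudo = []
        for dialogue in unlabeled:
            summary, confidence = summarize_with_score(model, dialogue)
            if confidence >= threshold:
                pseudo.append((dialogue, summary))
        model = train_prompts(model, labeled + pseudo)  # refresh the prompts
    return model

self_train("model", [("a labeled dialogue", "its summary")], ["an unlabeled dialogue"])
```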

Experiments

  • AMI, ICSI, and SAMSum were the datasets used in the experiments
  • Automatic evaluation was done using ROUGE-1, ROUGE-2, and ROUGE-L, while human evaluation covered fluency, informativeness, and relevance

Results

  • Performs better on both automatic and human evaluations
  • Achieves a new SOTA in dialogue summarization
  • Also outperforms the previous SOTA (BART-large) in few-shot settings
