Natural Language Generation (NLG) @ EMNLP 2022

Varun Nathan
The Observe.AI Tech Blog
10 min read · Feb 27, 2023

This post presents a succinct overview of recent research trends in natural language generation (NLG) through brief synopses of papers accepted at EMNLP 2022. The objective is to capture the problem space, research objectives, experimental design, obtained results, and the fundamental ideas underlying each paper's methodology without delving into technical details. The primary motivation is to offer readers a broad understanding of recent advancements in NLG research.

Paper highlights

Paper #1 — RankGen: Improving Text Generation with Large Ranking Models by Kalpesh Krishna, Yapei Chang, John Wieting, Mohit Iyyer

Motivation

  • Repetition and lack of faithfulness are the two primary challenges associated with text generation using large language models
  • Generation is very sensitive to the decoding algorithm. Despite the low perplexity of language models (LMs), text decoding is a challenge
  • LMs need to be trained on much larger data and should be much bigger to make them agnostic to the decoding strategy used
  • Wrong training objective — we want to generate multiple tokens of text at inference, but we train only on next-word prediction
  • Wrong decoding / search strategy

Approach

  • Termed “RankGen”, the approach can be thought of as “k-word language modelling”
  • A new LM training objective is proposed which can be used as a decoding strategy with any existing causal LM
  • The idea is to improve the ability of current LMs for the task of suffix ranking
  • An experiment was carried out to evaluate GPT-3 and other LMs against humans on the task of suffix ranking for a given prefix of 256 tokens; GPT-3 scored 78.2% on the task while humans scored 94.5%
  • RankGen is built by prepending a special token — “PRE” for prefixes and “SUF” for suffixes — and encoding both by passing them through an encoder
  • The encoder is trained via contrastive learning on generated positive / negative suffix pairs (a rough sketch follows this list)
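
A minimal sketch of that setup (shapes, the temperature value, and function names are illustrative assumptions, not the released RankGen code): the prefix embedding is trained to score its true suffix higher than negatives, and the same dot-product score then ranks candidate continuations over-generated by any causal LM.

```python
import torch
import torch.nn.functional as F

def contrastive_suffix_loss(prefix_emb: torch.Tensor,
                            suffix_embs: torch.Tensor,
                            positive_index: int,
                            temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss: prefix_emb (d,) should match the positive suffix
    among suffix_embs (num_candidates, d)."""
    logits = (suffix_embs @ prefix_emb) / temperature
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([positive_index]))

def rank_suffixes(prefix_emb: torch.Tensor,
                  suffix_embs: torch.Tensor) -> torch.Tensor:
    """At decoding time, rank over-generated continuations (e.g., from nucleus
    sampling) by the same dot-product compatibility score, best first."""
    return torch.argsort(suffix_embs @ prefix_emb, descending=True)
```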

Experiments

  • The Project Gutenberg / Wikipedia dataset was used for the experiments
  • T5-XL was used as the encoder
  • Model sizes experimented with include Base — 110.2 M, Large — 342.3 M and XL — 1.3 B
  • MAUVE was used for evaluation
  • Nucleus and ancestral sampling were used
  • Applications: code generation, retrieval-augmented generation and supervising LMs

Results

RankGen significantly outperforms decoding algorithms like nucleus, top-k, and typical sampling, as well as contrastive decoding and search, on both automatic metrics (85.0 vs 77.3 MAUVE over nucleus sampling) and human evaluation with English writers (74.5% human preference over nucleus sampling).

Paper #2 — Memsizer: Linearizing Transformer with Key-Value Memory by Yizhe Zhang, Deng Cai

Motivation

Despite the great success of the Transformer model, which has become the de facto standard for almost all NLP tasks, it has several limitations:

  • Computationally expensive — computation of the attention mechanism scales quadratically with the sequence length
  • Limits the efficient deployment of large-scale pre-trained models such as GPT-3

The factors that limit the computational efficiency of monolithic Transformers include:

  • Time complexity — quadratic in sequence length (vs. linear in RNNs)
  • Memory footprint — linear in sequence length (vs. constant in RNNs)
  • Generation cannot be parallelized (unlike training)

Research question

Can we have RNN-like efficiency while retaining the performance of Transformers?

Approach

  • Termed “Memsizer”, the key idea is to replace the standard attention (SA) module in the vanilla Transformer with a different memory mechanism that achieves recurrent inference computation and thus linear complexity
  • The proposed memory mechanism comes with a different specification of query, key and value than SA
  • The proposed key-value memory mechanism is unbalanced, which enables learning better input-dependent values to match against input-independent keys
  • The key-value memory layer in Memsizer contains k memory slots
  • The value matrix summarizes the source information into a fixed-size space R^(k×d) regardless of the source length M. This is done by applying a linear kernel (XᵀX) to the source input X_s to cancel out the length dimension M, so that M does not need to be preset
  • The model can be made more expressive with a multi-head specification, where the value matrix (V) is shared across the (r) heads while each head uses a distinct key matrix (K); a rough sketch of such a layer follows this list
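
The following is a rough PyTorch sketch of such an unbalanced key-value memory layer, written from the description above rather than from the authors' code; names, scaling choices and the head-averaging at the end are assumptions for illustration.

```python
import torch
import torch.nn as nn

class KeyValueMemory(nn.Module):
    """Illustrative key-value memory layer in the spirit of Memsizer:
    input-independent keys (one k x d matrix per head) and input-dependent
    values built from a linear kernel that collapses the source length M."""
    def __init__(self, d_model: int, num_slots: int, num_heads: int = 4):
        super().__init__()
        # Learned, input-independent keys: (num_heads, num_slots, d_model)
        self.keys = nn.Parameter(torch.randn(num_heads, num_slots, d_model) / d_model ** 0.5)
        # Value projection shared across heads
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x_tgt: torch.Tensor, x_src: torch.Tensor) -> torch.Tensor:
        # x_tgt: (B, N, d) target states, x_src: (B, M, d) source states
        _, M, d = x_src.shape
        # Linear kernel X^T X collapses the length dimension M -> (B, d, d)
        kernel = torch.einsum("bmd,bme->bde", x_src, self.w_v(x_src)) / M
        outputs = []
        for head_keys in self.keys:                                   # (num_slots, d)
            values = torch.einsum("kd,bde->bke", head_keys, kernel)   # (B, k, d)
            attn = torch.softmax(
                torch.einsum("bnd,kd->bnk", x_tgt, head_keys) / d ** 0.5, dim=-1)
            outputs.append(torch.einsum("bnk,bke->bne", attn, values))
        return torch.stack(outputs).mean(dim=0)                       # (B, N, d)
```

Because the kernel and the per-head values have shapes independent of M, inference can cache them and proceed recurrently, which is where the linear complexity comes from.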

Experiments

  • Data: CNN/DM, XSum and Wikitext103
  • Evaluation task: Machine Translation, Summarization and Language modeling
  • Evaluation metrics: Decoding Speed in tokens/sec, Peak memory usage, Model Size and Performance measured in ROUGE-1, ROUGE-2 & ROUGE-L
  • Models compared against — Linear and Recurrent Transformer variants

Results

  • Decoding speed measured in tokens/sec is significantly higher for Memsizer compared to the vanilla transformer
  • Performance of Memsizer is broadly comparable to that of the vanilla transformer
  • Model size of Memsizer is 26% smaller than that of the vanilla transformer
  • Memory utilization of Memsizer is 30% lower than that of the vanilla transformer

Advantages of Memsizer

  • Linear time complexity
  • O(1) memory complexity
  • Higher decoding speed without a drop in performance. Model size is also lower.
  • Recurrent style computation

Paper #3 — On the Limitations of Reference-Free Evaluations of Generated Text by Daniel Deutsch, Rotem Dror and Dan Roth

Motivation

  • Automatically evaluating the quality of generated texts is essential for the development of natural language generation systems
  • Reference texts are expensive to collect or entirely unavailable
  • Need to estimate the quality of text in real time
  • Increased interest in reference-free metrics
  • High correlations of reference-free metrics with human judgments suggest that this is a promising direction for future research

Experiments

  • Reference-free metrics analyzed in this work: Prism-src (Thompson and Post, 2020), COMET-QE (Rei et al., 2021) and QuestEval (Scialom et al., 2021)
  • Reference-based metrics used in experiments: BLEU, ROUGE, BERTScore, QAEval and BLEURT
  • Dataset: WMT’19 metrics shared task was used for Machine Translation and SummEval (Fabbri et al., 2021) & REALSumm (Bhandari et al., 2020) were used for Summarization
  • Metric optimization methods: Direct optimization, Greedy Optimization for Extractive Summarization and Reranking

Findings

  • Reference-free evaluation metrics are inherently biased and limited in their ability to evaluate generated text
  • Reference-free metrics should not be used to measure progress on tasks like machine translation or summarization
  • Reference-free metrics are equivalent to using one generation model to evaluate another, which has several limitations
  • Limitation 1: Reference-free metrics can be optimized at test time to find the approximate best-possible output (e.g., by reranking many candidates against the metric, as sketched after this list)
  • Limitation 2: Reference-free metrics can be biased against higher-quality outputs, including those written by humans
  • Recommendation: Reference-free metrics should be used as diagnostic tools for analyzing and understanding model behavior instead of measures of how well models perform a task
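
As a toy illustration of that first limitation (the `metric` callable below is a hypothetical stand-in for any reference-free metric, not one of the metrics studied in the paper), a system can simply over-generate candidates and keep whichever one the metric scores highest, inflating the metric without any guarantee of genuine quality gains:

```python
from typing import Callable, List

def rerank_with_reference_free_metric(source: str,
                                      candidates: List[str],
                                      metric: Callable[[str, str], float]) -> str:
    """Return the candidate that a reference-free metric (source, output) -> score
    likes best; since no reference is needed, the metric itself is being optimized."""
    return max(candidates, key=lambda candidate: metric(source, candidate))
```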

Paper #4 — FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue by Alon Albalak et al.

Motivation

Task transfer, which deals with transferring knowledge contained in related tasks, is useful for reducing the quantity of labeled data required to fine-tune language models

Objective

  • Explores conversational task transfer by introducing FETA: a benchmark for FEw-sample TAsk transfer in open-domain dialogue
  • FETA enables intra-dataset task transfer — task transfer without domain adaptation

Key Contribution

  • Creates the first large-scale benchmark for task transfer in dialogue, with 132 source-target task pairs
  • Extensive experimentation on FETA in both the single-source and multi-source settings, and an in-depth analysis comparing models, learning algorithms, sample sizes, and task types, finding new and non-intuitive results
  • A readily extensible transfer learning framework (https://github.com/alon-albalak/TLiDB) that allows for rapid experimentation and an online leaderboard (https://alon-albalak.github.io/feta-website/) to encourage deeper research into task transfer

Experiments

  • Dialogue sources of FETA datasets: DailyDialog (Li et al., 2017) and Friends (Chen and Choi, 2016)
  • Variables considered for evaluation include source task, target task, model, and learning algorithm. Performance is studied across these four variables, using average and top-1 raw scores as well as average and top-1 score ∆s as evaluation metrics.
  • Task transfer algorithms: pre-train/fine-tune, multitask, and multitask/fine-tune (contrasted in the sketch after this list)
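
A minimal sketch contrasting the three learning algorithms; the `train` helper is a hypothetical placeholder for one fine-tuning run, not code from the TLiDB framework.

```python
def train(model, data):
    """Hypothetical stand-in for a standard supervised fine-tuning run."""
    ...  # a real training loop would go here
    return model

def pretrain_finetune(model, source_data, target_data):
    model = train(model, source_data)                  # stage 1: source task only
    return train(model, target_data)                   # stage 2: target task only

def multitask(model, source_data, target_data):
    return train(model, source_data + target_data)     # single stage on the task mixture

def multitask_finetune(model, source_data, target_data):
    model = train(model, source_data + target_data)    # stage 1: joint training
    return train(model, target_data)                   # stage 2: target task only
```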

Results

  • Scores are averaged across both dialogue datasets and they find that pre-train/fine-tune gets an average score of 42.85, multitask 42.84, and multitask/fine-tune 44.07
  • Multitask/fine-tune achieves the best average score for all models and datasets, and its average score is a 2.8% improvement over the other algorithms
  • Trends vary depending on the model. Fine-tuning on the target task always benefits the T5 model, while it does not always work for BERT and GPT-2, which achieve better scores from multitasking than from pre-train/fine-tune
  • Trends on individual tasks also vary depending on the model
  • Multiple-choice tasks give the greatest benefit as source tasks, especially when the ratio of source-to-target samples is low
  • GPT-2 and T5 have opposite trends in relation to sample size: ∆s for GPT-2 increase with more target samples and decrease with more source samples, while the exact opposite trend is observed for T5, suggesting that T5 is more sample efficient than both GPT-2 and BERT

Paper #5 — CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning by Google Research

Motivation

  • Passage retrieval for conversational question answering (CQA) poses new challenges in understanding the user question as it’s contextual
  • It is expensive to retrain well-established retrievers such as search engines that are originally developed for non-conversational queries
  • Prior works on query rewriting (QR) for conversational passage retrieval focused on using human-rewritten queries to train a supervised QR model. This does not necessarily align with the retrieval goal, since annotators are usually instructed to rewrite conversational queries to be unambiguous to a human outside the dialogue context rather than to optimize retrieval performance. In addition, supervised QR models are agnostic to the downstream retriever, as they are trained separately

Objective

To develop a query rewriting model (CONQRR) that rewrites a conversational question in context into a standalone question, in order to facilitate the use of well-established retrievers such as search engines

Proposal

  • Reinforcement learning (RL) — based model CONQRR
  • Directly optimizes the rewritten query towards retrieval performance, using only weak supervision from retrieval
  • They adopt a novel reward function that computes an approximate but effective retrieval performance metric on in-batch passages at each training step (a rough sketch follows this list)
  • Their proposed reward function does not assume any specific retriever model design, and is generic enough for CONQRR to adapt to any off-the-shelf retriever
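
A rough sketch of such an in-batch reward (reciprocal rank is used here purely for illustration; the paper's exact reward formulation is not reproduced): given embeddings from any off-the-shelf dual-encoder retriever, the rewritten query is rewarded for ranking its gold passage highly among the passages in the current batch.

```python
import torch

def in_batch_retrieval_reward(query_emb: torch.Tensor,
                              passage_embs: torch.Tensor,
                              gold_index: int) -> float:
    """query_emb: (d,), passage_embs: (num_in_batch_passages, d).
    Reward = reciprocal rank of the gold passage under dot-product scores."""
    scores = passage_embs @ query_emb
    rank = int((scores > scores[gold_index]).sum().item()) + 1
    return 1.0 / rank
```

For a sparse retriever such as BM25, the same idea applies with BM25 scores in place of dot products, which is why the reward does not assume any specific retriever design.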

Experiments

  • Data: QReCC (Anantha et al., 2021)
  • Retriever models: BM25 and T5 Dual Encoder (DE)
  • Evaluation Metrics: Mean Reciprocal Rank (MRR), Recall@10 and Recall@100
  • Compared Systems: GPT2 with weak supervision (WS) (Yu et al., 2020), T5QR (Lin et al., 2020) and Transformer++

Results

  • CONQRR outperforms existing QR models on a recent large-scale open-domain CQA dataset QReCC (Anantha et al., 2021) by over 12% and 14% for BM25 and a neural dual encoder retriever model (Ni et al., 2021) respectively, averaging over three retrieval metrics
  • CONQRR trained with no human rewrite supervision provides better retrieval results than strong baselines trained with full supervision, and is robust to out-of-domain dialogues, topic shifts and long dialogue contexts
  • The authors conduct a novel quantitative study to analyze the limitations and utility of human rewrites for retrieval performance, which are largely unexplored in prior work

Paper #6 — EdiT5: Semi-Autoregressive Text Editing with T5 Warm-Start by Google Research

Motivation

  • Model sizes have been growing, and so have inference latency and cost
  • Large models are deployed by distilling knowledge into student models
  • Student networks are often scaled down teachers

Research question

Can inductive bias make inference quicker without compromising quality?

What is Text Editing

Generates natural language by applying edit operations to the input text to produce the target text efficiently and effectively

Text Editing Requirements

  • Most NLG tasks are monolingual
  • Sources and targets often overlap, which implies that generating the target from scratch is wasteful and that most of the target can be reconstructed from the source via basic operations like KEEP, DELETE and INSERT (illustrated in the sketch after this list)
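
A toy illustration of the idea (this is not EdiT5's exact tagging scheme): per-token KEEP/DELETE tags reconstruct most of the target from the source, with insertions attached to positions for the few genuinely new tokens.

```python
from typing import List, Tuple

def apply_edits(source_tokens: List[str],
                tags: List[Tuple[str, List[str]]]) -> List[str]:
    """Each tag is ("KEEP" or "DELETE", tokens_to_insert_at_this_position)."""
    output: List[str] = []
    for token, (op, insertions) in zip(source_tokens, tags):
        if op == "KEEP":
            output.append(token)          # reuse the source token as-is
        output.extend(insertions)         # INSERT: new tokens attached to a position
    return output

# Example grammar correction: "she go to school" -> "she goes to school"
print(apply_edits(
    ["she", "go", "to", "school"],
    [("KEEP", []), ("DELETE", ["goes"]), ("KEEP", []), ("KEEP", [])],
))
```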

NLG Tasks for which Text Editing is Suitable

  • Sentence fusion
  • Grammatical error correction
  • Summarization — maybe
  • Machine Translation — no, as there is very little overlap between source and target text

Motivation for EdiT5

  • Data efficiency: EdiT5 needs less training data
  • Latency: Offers >25x faster inference
  • Faithfulness: Minimizes the number of new tokens
  • Pre-training: Initialized from T5 checkpoints. Further trained with text-editing specific pre-training

What is EdiT5

  • Single end-to-end text editing model initialized with T5
  • Fast and productionizable
  • Supports reordering and uses an open vocabulary — can generate any output
  • A non-autoregressive tagger and re-ordering network produces the bulk of the text
  • An autoregressive decoder generates the tokens not found in the source text

Latency savings by EdiT5

  • Reduces the number of decoder steps as EdiT5 only decodes new tokens and reorders non-autoregressively if needed
  • Reduces the latency per step as EdiT5 uses a single layer decoder

Results

  • EdiT5 exhibits comparable or better results across data conditions for the task of Sentence Fusion. EdiT5 is better than T5-base in the low resource setting. In addition, T5 latency is 53 ms with 41 tokens inserted on average while EdiT5 latency is 2 ms with 6 tokens inserted on average
  • EdiT5 exhibits better results than T5-base with significantly better latency for the task of Decontextualization. T5 latency is 75 ms with 49 tokens inserted on average while EdiT5 latency is 2 ms with 7 tokens inserted on average
  • EdiT5 exhibits comparable results across model sizes for the task of Grammatical Error Correction. T5 inserts 25 tokens on average while EdiT5 inserts 5 tokens on average.

Paper #7 — Towards a Unified Multi-Dimensional Evaluator for Text Generation (UniEval) by Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, Jiawei Han

Motivation

  • Multi-dimensional evaluation is the dominant paradigm for human evaluation in NLG
  • Automatic evaluation in NLG is still dominated by similarity-based metrics
  • A reliable framework for a more comprehensive evaluation of advanced models is lacking

Objective

To develop a unified evaluator that can evaluate all dimensions in a generation task

UniEval Framework

  • It has two key stages: intermediate multi-task learning, and unsupervised learning on multiple dimensions of evaluation tasks
  • The motivation for the first stage is to incorporate external knowledge related to evaluation and make the model familiar with the Boolean QA format (a sketch of this style of scoring follows this list)
  • Four types of intermediate tasks were used in the paper: NLI, linguistics-related tasks, self-supervised tasks and generic QA
  • In the second stage, pseudo data is first constructed by treating the reference texts as positive samples and applying rule-based transformations, such as deletion, shuffling and repetition, to create negative samples
  • After curating the pseudo data, models are trained for the different dimensions (fluency, coherence, consistency and relevance) via two strategies: multi-task learning and continual learning
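
A minimal sketch of Boolean-QA-style scoring with a T5 backbone (the prompt wording is an assumption, not the paper's exact template, and the raw checkpoint named below only becomes a meaningful evaluator after UniEval's two training stages): the score for a dimension is the probability the model assigns to "Yes" versus "No" as the first decoded token.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-large")
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-large")

def boolean_qa_score(question: str, context: str) -> float:
    """Probability of answering "Yes" to a dimension-specific Boolean question."""
    inputs = tokenizer(f"question: {question} context: {context}",
                       return_tensors="pt", truncation=True)
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

coherence = boolean_qa_score(
    "Is this a coherent summary of the document?",
    "summary: ... document: ...",   # illustrative placeholder inputs
)
```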

Experiments

  • Datasets: SummEval (Fabbri et al., 2021) was used for the summarization task, Topical-Chat (Mehri and Eskenazi, 2020) was used for the dialogue response generation task and SFRES & SFHOT (Wen et al., 2015) was used for the data-to-text task
  • Backbone Model — “google/t5-v1_1-large” version of T5
  • Number of pseudo samples for each dimension: 30k
  • Baseline Evaluation Metrics: BERTScore (Zhang et al., 2019), MoverScore (Zhao et al., 2019), CTC (Deng et al., 2021), BARTScore (Yuan et al., 2021) and USR (Mehri and Eskenazi, 2020)

Results

  • UniEval correlates substantially better with human judgments than existing metrics
  • Achieves a 23% higher correlation on text summarization, and over 43% on dialogue response generation
  • Demonstrates a strong zero-shot learning ability for unseen evaluation dimensions and tasks
