Natural Language Generation (NLG) @ EMNLP 2022

Varun Nathan
The Observe.AI Tech Blog
10 min read · Feb 27, 2023

This post presents a succinct overview of recent research trends in natural language generation (NLG) through brief synopses of papers accepted at EMNLP 2022. The objective is to capture the problem space, research objectives, experimental design, obtained results, and the fundamental ideas underlying each paper's methodology without delving into technical details. The primary motivation is to offer readers a broad understanding of recent advancements in NLG research.

Paper highlights

Paper #1 — RankGen: Improving Text Generation with Large Ranking Models by Kalpesh Krishna, Yapei Chang, John Wieting, Mohit Iyyer

Motivation

  • Repetition and lack of faithfulness are the two primary challenges associated with text generation using large language models
  • Generation is very sensitive to the decoding algorithm. Despite the low perplexity of language models (LMs), text decoding is a challenge
  • LMs need to be trained on much larger data and should be much bigger to make them agnostic to the decoding strategy used
  • Wrong training objective — we want to generate multiple tokens of text at inference, but we train only on next-word prediction
  • Wrong decoding / search strategy

Approach

  • Termed “RankGen”, the approach can be thought of as “k-word language modelling”
  • A new LM training objective is proposed which can be used as a decoding strategy with any existing causal LM
  • The idea is to improve the ability of current LMs for the task of suffix ranking
  • An experiment was carried out to evaluate GPT-3 and other LMs against humans on the task of suffix ranking for a given prefix of 256 tokens; GPT-3 scored 78.2% on the task while humans scored 94.5%
  • RankGen is built by prepending a special token — “PRE” for prefixes and “SUF” for suffixes — and encoding both by passing them through an encoder
  • The encoder is trained via contrastive learning on generated positive / negative suffix pairs (a rough sketch follows this list)
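
A minimal sketch of that setup (shapes, the temperature value, and function names are illustrative assumptions, not the released RankGen code): the prefix embedding is trained to score its true suffix higher than negatives, and the same dot-product score then ranks candidate continuations over-generated by any causal LM.

```python
import torch
import torch.nn.functional as F

def contrastive_suffix_loss(prefix_emb: torch.Tensor,
                            suffix_embs: torch.Tensor,
                            positive_index: int,
                            temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss: prefix_emb (d,) should match the positive suffix
    among suffix_embs (num_candidates, d)."""
    logits = (suffix_embs @ prefix_emb) / temperature
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([positive_index]))

def rank_suffixes(prefix_emb: torch.Tensor,
                  suffix_embs: torch.Tensor) -> torch.Tensor:
    """At decoding time, rank over-generated continuations (e.g., from nucleus
    sampling) by the same dot-product compatibility score, best first."""
    return torch.argsort(suffix_embs @ prefix_emb, descending=True)
```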

Experiments

  • The Project Gutenberg / Wikipedia dataset was used for the experiments
  • T5-XL was used as the encoder
  • Model sizes experimented with include Base — 110.2 M, Large — 342.3 M and XL — 1.3 B
  • MAUVE was used for evaluation
  • Nucleus and ancestral sampling were used
  • Applications: code generation, retrieval-augmented generation and supervising LMs

Results

RankGen significantly outperforms decoding algorithms like nucleus, top-k, and typical sampling, as well as contrastive decoding and search, on both automatic metrics (85.0 vs 77.3 MAUVE over nucleus sampling) and human evaluation with English writers (74.5% human preference over nucleus sampling).

Paper #2 — Memsizer: Linearizing Transformer with Key-Value Memory by Yizhe Zhang, Deng Cai

Motivation

Despite the great success of the Transformer model, which has become the de facto standard for almost all NLP tasks, it has several limitations:

  • Computationally expensive — computation of the attention mechanism scales quadratically with the sequence length
  • Limits the efficient deployment of large-scale pre-trained models such as GPT-3

The factors that limit the computational efficiency of monolithic Transformers include:

  • Time complexity — quadratic in sequence length (vs. linear in RNNs)
  • Memory footprint — linear in sequence length (vs. constant in RNNs)
  • Generation cannot be parallelized (unlike training)

Research question

Can we have RNN-like efficiency while retaining the performance of Transformers?

Approach

  • Termed “Memsizer”, the key idea is to replace the standard attention (SA) module in the vanilla Transformer with a different memory mechanism that achieves recurrent inference computation and thus linear complexity
  • The proposed memory mechanism comes with a different specification of query, key and value than SA
  • The proposed key-value memory mechanism is unbalanced, which enables learning better input-dependent values to match against input-independent keys
  • The key-value memory layer in Memsizer contains k memory slots
  • The value matrix summarizes the source information into a fixed-size space R^(k×d) regardless of the source length M. This is done by applying a linear kernel (XᵀX) to the source input X_s to cancel out the length dimension M, so that M does not need to be preset
  • The model can be made more expressive with a multi-head specification, where the value matrix (V) is shared across the (r) heads while each head uses a distinct key matrix (K); a rough sketch of such a layer follows this list
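
The following is a rough PyTorch sketch of such an unbalanced key-value memory layer, written from the description above rather than from the authors' code; names, scaling choices and the head-averaging at the end are assumptions for illustration.

```python
import torch
import torch.nn as nn

class KeyValueMemory(nn.Module):
    """Illustrative key-value memory layer in the spirit of Memsizer:
    input-independent keys (one k x d matrix per head) and input-dependent
    values built from a linear kernel that collapses the source length M."""
    def __init__(self, d_model: int, num_slots: int, num_heads: int = 4):
        super().__init__()
        # Learned, input-independent keys: (num_heads, num_slots, d_model)
        self.keys = nn.Parameter(torch.randn(num_heads, num_slots, d_model) / d_model ** 0.5)
        # Value projection shared across heads
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x_tgt: torch.Tensor, x_src: torch.Tensor) -> torch.Tensor:
        # x_tgt: (B, N, d) target states, x_src: (B, M, d) source states
        _, M, d = x_src.shape
        # Linear kernel X^T X collapses the length dimension M -> (B, d, d)
        kernel = torch.einsum("bmd,bme->bde", x_src, self.w_v(x_src)) / M
        outputs = []
        for head_keys in self.keys:                                   # (num_slots, d)
            values = torch.einsum("kd,bde->bke", head_keys, kernel)   # (B, k, d)
            attn = torch.softmax(
                torch.einsum("bnd,kd->bnk", x_tgt, head_keys) / d ** 0.5, dim=-1)
            outputs.append(torch.einsum("bnk,bke->bne", attn, values))
        return torch.stack(outputs).mean(dim=0)                       # (B, N, d)
```

Because the kernel and the per-head values have shapes independent of M, inference can cache them and proceed recurrently, which is where the linear complexity comes from.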

Experiments

  • Data: CNN/DM, XSum and Wikitext103
  • Evaluation task: Machine Translation, Summarization and Language modeling
  • Evaluation metrics: Decoding Speed in tokens/sec, Peak memory usage, Model Size and Performance measured in ROUGE-1, ROUGE-2 & ROUGE-L
  • Models compared against — Linear and Recurrent Transformer variants

Results

  • Decoding speed measured in tokens/sec is significantly higher for Memsizer compared to the vanilla transformer
  • Performance of Memsizer is broadly comparable to that of the vanilla transformer
  • Model size of Memsizer is 26% smaller than that of the vanilla transformer
  • Memory utilization of Memsizer is 30% lower than that of the vanilla transformer

Advantages of Memsizer

  • Linear time complexity
  • O(1) memory complexity
  • Higher decoding speed without a drop in performance. Model size is also lower.
  • Recurrent style computation

Paper #3 — On the Limitations of Reference-Free Evaluations of Generated Text by Daniel Deutsch, Rotem Dror and Dan Roth

Motivation

  • Automatically evaluating the quality of generated texts is essential for the development of natural language generation systems
  • Reference texts are expensive to collect or entirely unavailable
  • Need to estimate the quality of text in real time
  • Increased interest in reference-free metrics
  • High correlations of reference-free metrics with human judgments suggest that this is a promising direction for future research

Experiments

  • Reference-free metrics analyzed in this work: Prism-src (Thompson and Post, 2020), COMET-QE (Rei et al., 2021) and QuestEval (Scialom et al., 2021)
  • Reference-based metrics used in experiments: BLEU, ROUGE, BERTScore, QAEval and BLEURT
  • Dataset: WMT’19 metrics shared task was used for Machine Translation and SummEval (Fabbri et al., 2021) & REALSumm (Bhandari et al., 2020) were used for Summarization
  • Metric optimization methods: Direct optimization, Greedy Optimization for Extractive Summarization and Reranking

Findings

  • Reference-free evaluation metrics are inherently biased and limited in their ability to evaluate generated text
  • Reference-free metrics should not be used to measure progress on tasks like machine translation or summarization
  • Reference-free metrics are equivalent to using one generation model to evaluate another, which has several limitations
  • Limitation 1: Reference-free metrics can be optimized at test time to find the approximate best-possible output (e.g., by reranking many candidates against the metric, as sketched after this list)
  • Limitation 2: Reference-free metrics can be biased against higher-quality outputs, including those written by humans
  • Recommendation: Reference-free metrics should be used as diagnostic tools for analyzing and understanding model behavior instead of measures of how well models perform a task
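
As a toy illustration of that first limitation (the `metric` callable below is a hypothetical stand-in for any reference-free metric, not one of the metrics studied in the paper), a system can simply over-generate candidates and keep whichever one the metric scores highest, inflating the metric without any guarantee of genuine quality gains:

```python
from typing import Callable, List

def rerank_with_reference_free_metric(source: str,
                                      candidates: List[str],
                                      metric: Callable[[str, str], float]) -> str:
    """Return the candidate that a reference-free metric (source, output) -> score
    likes best; since no reference is needed, the metric itself is being optimized."""
    return max(candidates, key=lambda candidate: metric(source, candidate))
```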

Paper #4 — FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue by Alon Albalak et al.

Motivation

Task transfer, which deals with transferring knowledge contained in related tasks, is useful for reducing the quantity of labeled data required to fine-tune language models

Objective

  • Explores conversational task transfer by introducing FETA: a benchmark for FEw-sample TAsk transfer in open-domain dialogue
  • FETA enables intra-dataset task transfer — task transfer without domain adaptation

Key Contribution

  • Creates the first large-scale benchmark for task transfer in dialogue, with 132 source-target task pairs
  • Extensive experimentation on FETA in both the single-source and multi-source settings, and an in-depth analysis comparing models, learning algorithms, sample sizes, and task types, finding new and non-intuitive results
  • A readily extensible transfer learning framework (https://github.com/alon-albalak/TLiDB) that allows for rapid experimentation and an online leaderboard (https://alon-albalak.github.io/feta-website/) to encourage deeper research into task transfer

Experiments

  • Dialogue sources of FETA datasets: DailyDialog (Li et al., 2017) and Friends (Chen and Choi, 2016)
  • Variables considered for evaluation include source task, target task, model, and learning algorithm. Performance is studied across these four variables, using average and top-1 raw scores as well as average and top-1 score ∆s as evaluation metrics.
  • Task transfer algorithms: pre-train/fine-tune, multitask, and multitask/fine-tune (contrasted in the sketch after this list)
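
A minimal sketch contrasting the three learning algorithms; the `train` helper is a hypothetical placeholder for one fine-tuning run, not code from the TLiDB framework.

```python
def train(model, data):
    """Hypothetical stand-in for a standard supervised fine-tuning run."""
    ...  # a real training loop would go here
    return model

def pretrain_finetune(model, source_data, target_data):
    model = train(model, source_data)                  # stage 1: source task only
    return train(model, target_data)                   # stage 2: target task only

def multitask(model, source_data, target_data):
    return train(model, source_data + target_data)     # single stage on the task mixture

def multitask_finetune(model, source_data, target_data):
    model = train(model, source_data + target_data)    # stage 1: joint training
    return train(model, target_data)                   # stage 2: target task only
```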

Results

  • Scores are averaged across both dialogue datasets and they find that pre-train/fine-tune gets an average score of 42.85, multitask 42.84, and multitask/fine-tune 44.07
  • Multitask/fine-tune achieves the best average score for all models and datasets, and its average score is a 2.8% improvement over the other algorithms
  • Trends vary depending on the model. Fine-tuning on the target task always benefits the T5 model, while it does not always work for BERT and GPT-2, which achieve better scores from multitasking than from pre-train/fine-tune
  • Trends on individual tasks also vary depending on the model
  • Multiple-choice tasks give the greatest benefit as source tasks, especially when the ratio of source-to-target samples is low
  • GPT-2 and T5 have opposite trends in relation to sample size: ∆s for GPT-2 increase with more target samples and decrease with more source samples, while the exact opposite trend is observed for T5, suggesting that T5 is more sample efficient than both GPT-2 and BERT

Paper #5 — CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning by Google Research

Motivation

  • Passage retrieval for conversational question answering (CQA) poses new challenges in understanding the user question as it’s contextual
  • It is expensive to retrain well-established retrievers such as search engines that are originally developed for non-conversational queries
  • Prior works on query rewriting (QR) for conversational passage retrieval focused on using human-rewritten queries to train a supervised QR model. This does not necessarily align with the retrieval goal, since annotators are usually instructed to rewrite conversational queries to be unambiguous to a human outside the dialogue context rather than to optimize retrieval performance. In addition, supervised QR models are agnostic to the downstream retriever, as they are trained separately

Objective

To develop a query rewriting model (CONQRR) that rewrites a conversational question in context into a standalone question, in order to facilitate the use of well-established retrievers such as search engines

Proposal

  • Reinforcement learning (RL) — based model CONQRR
  • Directly optimizes the rewritten query towards retrieval performance, using only weak supervision from retrieval
  • They adopt a novel reward function that computes an approximate but effective retrieval performance metric on in-batch passages at each training step (a rough sketch follows this list)
  • Their proposed reward function does not assume any specific retriever model design, and is generic enough for CONQRR to adapt to any off-the-shelf retriever
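
A rough sketch of such an in-batch reward (reciprocal rank is used here purely for illustration; the paper's exact reward formulation is not reproduced): given embeddings from any off-the-shelf dual-encoder retriever, the rewritten query is rewarded for ranking its gold passage highly among the passages in the current batch.

```python
import torch

def in_batch_retrieval_reward(query_emb: torch.Tensor,
                              passage_embs: torch.Tensor,
                              gold_index: int) -> float:
    """query_emb: (d,), passage_embs: (num_in_batch_passages, d).
    Reward = reciprocal rank of the gold passage under dot-product scores."""
    scores = passage_embs @ query_emb
    rank = int((scores > scores[gold_index]).sum().item()) + 1
    return 1.0 / rank
```

For a sparse retriever such as BM25, the same idea applies with BM25 scores in place of dot products, which is why the reward does not assume any specific retriever design.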

Experiments

  • Data: QReCC (Anantha et al., 2021)
  • Retriever models: BM25 and T5 Dual Encoder (DE)
  • Evaluation Metrics: Mean Reciprocal Rank (MRR), Recall@10 and Recall@100
  • Compared Systems: GPT2 with weak supervision (WS) (Yu et al., 2020), T5QR (Lin et al., 2020) and Transformer++

Results

  • CONQRR outperforms existing QR models on a recent large-scale open-domain CQA dataset QReCC (Anantha et al., 2021) by over 12% and 14% for BM25 and a neural dual encoder retriever model (Ni et al., 2021) respectively, averaging over three retrieval metrics
  • CONQRR trained with no human rewrite supervision provides better retrieval results than strong baselines trained with full supervision, and is robust to out-of-domain dialogues, topic shifts and long dialogue contexts
  • The authors conduct a novel quantitative study to analyze the limitations and utility of human rewrites for retrieval performance, which are largely unexplored in prior work

Paper #6 — EdiT5: Semi-Autoregressive Text Editing with T5 Warm-Start by Google Research

Motivation

  • Model sizes have been growing, and so have inference latency and cost
  • Large models are deployed by distilling knowledge into student models
  • Student networks are often scaled down teachers

Research question

Can inductive bias make inference quicker without compromising quality?

What is Text Editing

Generates natural language by applying edit operations to the input text to produce the target text efficiently and effectively

Text Editing Requirements

  • Most NLG tasks are monolingual
  • Sources and targets often overlap, which implies that generating the target from scratch is wasteful and that most of the target can be reconstructed from the source via basic operations like KEEP, DELETE and INSERT (illustrated in the sketch after this list)
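
A toy illustration of the idea (this is not EdiT5's exact tagging scheme): per-token KEEP/DELETE tags reconstruct most of the target from the source, with insertions attached to positions for the few genuinely new tokens.

```python
from typing import List, Tuple

def apply_edits(source_tokens: List[str],
                tags: List[Tuple[str, List[str]]]) -> List[str]:
    """Each tag is ("KEEP" or "DELETE", tokens_to_insert_at_this_position)."""
    output: List[str] = []
    for token, (op, insertions) in zip(source_tokens, tags):
        if op == "KEEP":
            output.append(token)          # reuse the source token as-is
        output.extend(insertions)         # INSERT: new tokens attached to a position
    return output

# Example grammar correction: "she go to school" -> "she goes to school"
print(apply_edits(
    ["she", "go", "to", "school"],
    [("KEEP", []), ("DELETE", ["goes"]), ("KEEP", []), ("KEEP", [])],
))
```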

NLG Tasks for which Text Editing is Suitable

  • Sentence fusion
  • Grammatical error correction
  • Summarization — maybe
  • Machine Translation — no, as there is very little overlap between source and target text

Motivation for EdiT5

  • Data efficiency: EdiT5 needs less training data
  • Latency: Offers >25x faster inference
  • Faithfulness: Minimizes the number of new tokens
  • Pre-training: Initialized from T5 checkpoints. Further trained with text-editing specific pre-training

What is EdiT5

  • Single end-to-end text editing model initialized with T5
  • Fast and productionizable
  • Supports reordering and uses an open vocabulary — can generate any output
  • A non-autoregressive tagger and re-ordering network produces the bulk of the text
  • An autoregressive decoder generates the tokens not found in the source text

Latency savings by EdiT5

  • Reduces the number of decoder steps as EdiT5 only decodes new tokens and reorders non-autoregressively if needed
  • Reduces the latency per step as EdiT5 uses a single layer decoder

Results

  • EdiT5 exhibits comparable or better results across data conditions for the task of Sentence Fusion. EdiT5 is better than T5-base in the low resource setting. In addition, T5 latency is 53 ms with 41 tokens inserted on average while EdiT5 latency is 2 ms with 6 tokens inserted on average
  • EdiT5 exhibits better results than T5-base with significantly better latency for the task of Decontextualization. T5 latency is 75 ms with 49 tokens inserted on average while EdiT5 latency is 2 ms with 7 tokens inserted on average
  • EdiT5 exhibits comparable results across model sizes for the task of Grammatical Error Correction. T5 inserts 25 tokens on average while EdiT5 inserts 5 tokens on average.

Paper #7 — Towards a Unified Multi-Dimensional Evaluator for Text Generation (UniEval) by Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, Jiawei Han

Motivation

  • Multi-dimensional evaluation is the dominant paradigm for human evaluation in NLG
  • Automatic evaluation in NLG is still dominated by similarity-based metrics
  • A reliable framework for a more comprehensive evaluation of advanced models is lacking

Objective

To develop a unified evaluator that can evaluate all dimensions in a generation task

UniEval Framework

  • It has two key stages: intermediate multi-task learning, and unsupervised learning on multiple dimensions of evaluation tasks
  • The motivation for the first stage is to incorporate external knowledge related to evaluation and make the model familiar with the Boolean QA format (a sketch of this style of scoring follows this list)
  • Four types of intermediate tasks were used in the paper: NLI, linguistics-related tasks, self-supervised tasks and generic QA
  • In the second stage, pseudo data is first constructed by treating the reference texts as positive samples and applying rule-based transformations, such as deletion, shuffling and repetition, to create negative samples
  • After curating the pseudo data, models are trained for the different dimensions (fluency, coherence, consistency and relevance) via two strategies: multi-task learning and continual learning
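
A minimal sketch of Boolean-QA-style scoring with a T5 backbone (the prompt wording is an assumption, not the paper's exact template, and the raw checkpoint named below only becomes a meaningful evaluator after UniEval's two training stages): the score for a dimension is the probability the model assigns to "Yes" versus "No" as the first decoded token.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-large")
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-large")

def boolean_qa_score(question: str, context: str) -> float:
    """Probability of answering "Yes" to a dimension-specific Boolean question."""
    inputs = tokenizer(f"question: {question} context: {context}",
                       return_tensors="pt", truncation=True)
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

coherence = boolean_qa_score(
    "Is this a coherent summary of the document?",
    "summary: ... document: ...",   # illustrative placeholder inputs
)
```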

Experiments

  • Datasets: SummEval (Fabbri et al., 2021) was used for the summarization task, Topical-Chat (Mehri and Eskenazi, 2020) was used for the dialogue response generation task and SFRES & SFHOT (Wen et al., 2015) was used for the data-to-text task
  • Backbone Model — “google/t5-v1_1-large” version of T5
  • Number of pseudo samples for each dimension: 30k
  • Baseline Evaluation Metrics: BERTScore (Zhang et al., 2019), MoverScore (Zhao et al., 2019), CTC (Deng et al., 2021), BARTScore (Yuan et al., 2021) and USR (Mehri and Eskenazi, 2020)

Results

  • UniEval correlates substantially better with human judgments than existing metrics
  • Achieves a 23% higher correlation on text summarization, and over 43% on dialogue response generation
  • Demonstrates a strong zero-shot learning ability for unseen evaluation dimensions and tasks
