Seq2seq NLG: The Good, the Bad and the Boring

Nadjet Bouayad-Agha · Published in Analytics Vidhya · Dec 29, 2019 · 12 min read
Clint Eastwood in Sergio Leone’s movie “The Good, the Bad, and the Ugly”

In Natural Language Generation (NLG), domain-dependent customized architectures and template-based approaches have long reigned, because input data is heterogeneous and requires cognitively and linguistically informed processing to transform it into messages fit for communication. In addition, target texts, let alone pairs of inputs and target texts, are hard to come by.

Taking after Neural Machine Translation, sequence-to-sequence (seq2seq) models for NLG have come along, promising to generate, end-to-end, grammatical texts that preserve the style of the target texts whilst still conveying the input content.

The 2017 end-to-end (e2e) NLG challenge is a good illustration of that NLG seq2seq enthusiasm. The baseline itself is a seq2seq approach and many of the systems submitted for the competition or proposed afterwards are based on this model. A good synthesis of these approaches together with a detailed analysis of the dataset and evaluation can be found here.

In this article, I first introduce the key elements of that challenge, that is, its dataset, baseline, evaluation, and some of the seq2seq strategies proposed to improve on the baseline. I then introduce an implementation of basic seq2seq NLG using the Fastai code provided for Neural Machine Translation in the Fastai course on NLP, which achieves better scores than the baseline on some of the word-based metrics. The Fastai code has the advantage of being ready to use and of implementing some state-of-the-art solutions to the problem of seq2seq learning, so one can focus on experimenting with different ideas.

The code and notebooks for this Fastai seq2seq NLG model and for pre-processing the results for the evaluation are available on github here.

Dataset

The e2e NLG competition dataset consists of around 50k pairs of Meaning Representations (MRs) and short Natural Language (NL) restaurant descriptions, less than 10% of which make up the validation set. More than 6k of the MRs are unique, which means that one MR can be realized in different ways. The dataset has been organized so that the inputs are not repeated across the validation and test sets, even after removing the venue names.

Here is an example MR with its 6 different possible verbalizations:

MR:

name[The Eagle],
eatType[coffee shop],
food[English],
priceRange[more than £30],
customerRating[high],
area[riverside],
kidsFriendly[yes],
near[Burger King]

NLs:

(1) The Eagle is a highly rated, children friendly coffee shop near Burger King in riverside. It has English food and the price range is more than £30.

(2) The Eagle is an expensive family coffee shop that serves British food. It is located near a Burger King.

(3) There is a children friendly coffee shop called The Eagle located near the Burger King in the Riverside area, it receives high customer ratings and is moderately priced at around £30.

(4) The highly rated The Eagle English coffee shop is located in riverside near Burger King and has a price range of more than £30.

(5) There is an English coffee shop called The Eagle. It is near Burger King in the riverside area. It is children friendly and in the above £30 price range.

(6) The Eagle is in the Northern City Centre, on the South side of the River, near Burger King.

The meaning representation for this example expresses all the 8 possible attributes an MR can have. Two of these attributes are free text (name and near) and the rest have between two and six possible values.
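To make the format concrete, here is a minimal parsing sketch of my own (not part of the challenge tooling) that turns a bracketed MR string into a Python dictionary; the regular expression and function name are purely illustrative:

```python
import re

def parse_mr(mr: str) -> dict:
    """Parse a bracketed MR like 'name[The Eagle], eatType[coffee shop]'
    into an attribute -> value dictionary."""
    return {attr.strip(): value.strip()
            for attr, value in re.findall(r"([^,\[\]]+)\[([^\]]*)\]", mr)}

mr = ("name[The Eagle], eatType[coffee shop], food[English], "
      "priceRange[more than £30], customerRating[high], "
      "area[riverside], kidsFriendly[yes], near[Burger King]")
print(parse_mr(mr))
# {'name': 'The Eagle', 'eatType': 'coffee shop', ..., 'near': 'Burger King'}
```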

The reference texts above have a very rich vocabulary and structure, ranging from one to three sentences. Also, many do not verbalize all the content. In fact, sentence (6) presents information that is not even in the MR.

In this regard, it is interesting to understand how the data was obtained, as explained by the organizers of the challenge: it was crowdsourced, with 80% of cases presented using bracketed MRs as above (in random order so as not to influence the order in which information is presented in the text) and 20% using pictorial map MRs (see picture below). So the extra information in sentence (6) looks like it might have been obtained from the pictorial/map representation. Indeed, in a pre-study comparing pictorial vs textual MRs, the authors report that, compared to textual MRs, “pictorial MRs elicit texts that are significantly less similar to the underlying MR in terms of semantic text similarity”.

Textual/Logical vs Pictorial/Map Meaning Representations (source)

According to the authors, these variations introduce noise that NLG systems must cope with. However, is there a criterion to decide whether to verbalize some piece of meaning? For example, the price of more than £30 is mentioned in 5 out of the 6 texts above, proximity to Burger King in all of them, and the high customer rating in only 3. Also, the provenance of each text (author and pictorial vs textual MR) is not specified, which makes it difficult to exploit this information.

Evaluation

To benchmark the systems, the e2e NLG competition organizers provide a baseline, an attention-based seq2seq model that performs data augmentation and beam search with reranking, using a classifier that penalizes outputs that stray from the input. So the baseline itself is quite strong.

For evaluation of the output texts, three types of measures are used, namely, word-overlap metrics, textual complexity and diversity metrics, and human judgment. Below we briefly describe each of them, although we will only provide word-based metrics as our evaluation.

Word-overlap metrics

For evaluation of the output texts against the reference texts, a scoring script is made available which computes a number of word-overlap metrics, namely BLEU, Meteor, Rouge-L, CIDEr, and Nist. Here is a brief explanation of what they are:

  • BLEU: computes the harmonic mean of n-gram precision (where n is up to 4), lowered by a brevity penalty if the output is shorter than the reference. n-gram matching is performed against any of the reference texts.
  • CIDEr: computes the average cosine similarity between the system output and the reference sentences on the level of n-grams, n∈{1,…,4} weighted by the tf*idf score.
  • Meteor: measures precision and recall of unigrams wrt each of the human generated outputs of a given MR, selecting the best-matching one. In addition to exact word matches, it uses fuzzy matching based on stemming and WordNet synonyms.
  • Rouge-L: measures an f-score based on the precision and recall of the Longest Common Subsequence against any human references.
  • Nist: is a version of BLEU that gives more weight to rarer n-grams and less penalty to brevity.

So, some of the metrics such as BLEU, Meteor or Rouge-L factor in the multiple reference texts of a given meaning representation.
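As a rough illustration of how multiple references enter the computation, here is a small sketch using nltk's sentence-level BLEU; the challenge provides its own scoring script, so this is not the official implementation, and the smoothing choice and toy sentences are mine:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# All human references for a given MR count as valid matches.
references = [
    "the eagle is a highly rated , children friendly coffee shop near burger king .".split(),
    "there is an english coffee shop called the eagle near burger king .".split(),
]
hypothesis = "the eagle is a children friendly coffee shop near burger king .".split()

score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```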

Textual complexity and diversity metrics

Measures of lexical and syntactic complexity on the reference texts show that those texts exhibit a high lexical and syntactic complexity and diversity. Lexical measures such as type-token ratio and entropy were calculated using the Lexical Complexity Analyzer. Degrees of syntactic complexity were measured using the D-Level Analyzer, which assigns sentences a value on an 8-point scale of syntactic complexity. The results showed that whilst 46% of sentences are simple, 15% fall in the two highest levels of complexity. On the other hand, it was shown that seq2seq systems generally have low syntactic complexity and poor lexical diversity.
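For intuition, here is a crude sketch of the kind of lexical statistics involved (type-token ratio and word entropy over whitespace tokens); this is a simplification of what the Lexical Complexity Analyzer actually computes:

```python
import math
from collections import Counter

def lexical_stats(texts):
    """Rough type-token ratio and word entropy over a list of texts
    (a simplification of the Lexical Complexity Analyzer's measures)."""
    tokens = [tok.lower() for t in texts for tok in t.split()]
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens)                      # distinct words / total words
    probs = [c / len(tokens) for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)      # word entropy in bits
    return ttr, entropy

ttr, ent = lexical_stats(["The Eagle is a coffee shop near Burger King.",
                          "There is an English coffee shop called The Eagle."])
print(f"type/token ratio = {ttr:.2f}, entropy = {ent:.2f} bits")
```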

Human judgement

Quantitative evaluation has its limits and is generally advised only for comparing systems against one another. However, to evaluate the output quality of a text generation system, human judgment is essential, although costly and not without its difficulties. For an enlightening view, please read Ehud Reiter’s blog article “How to do an NLG Evaluation: metrics”.

For this competition, two separate text judgment tasks were given to crowd workers. On the one hand, they were asked to judge the quality of the texts presented together with their input MRs, answering the following question: “How do you judge the overall quality of the utterance in terms of its grammatical correctness, fluency, adequacy and other important factors?”. On the other hand, crowd workers were asked to judge the naturalness of the output texts, without their MRs, answering the question: “Could the utterance have been produced by a native speaker?”.

A box of Tools and Tricks

Looking through some of the rather large body of literature that this competition has produced, I found that seq2seq systems propose to improve on the baseline by either (1) tweaking the input dataset, (2) tweaking the architecture, or (3) tweaking the outputs, or any combination of the three.

Choosing the right tool

Tweaking the input

The first thing that most systems do, seq2seq or otherwise, is to delexicalize the input, that is, to replace the free-text values (i.e., the current and nearby venue names) with dummy placeholders.
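A minimal delexicalization sketch, assuming the MR has already been parsed into a dictionary; the placeholder tokens are my own choice, not a convention from the challenge:

```python
def delexicalize(mr: dict, text: str):
    """Replace free-text attribute values (name, near) with placeholder
    tokens in both the MR and the reference text."""
    for attr in ("name", "near"):
        if attr in mr:
            placeholder = f"x{attr}"              # e.g. 'xname', 'xnear'
            text = text.replace(mr[attr], placeholder)
            mr = {**mr, attr: placeholder}
    return mr, text

mr = {"name": "The Eagle", "near": "Burger King", "food": "English"}
text = "The Eagle serves English food near Burger King."
print(delexicalize(mr, text))
# ({'name': 'xname', 'near': 'xnear', 'food': 'English'},
#  'xname serves English food near xnear.')
```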

Another common approach, performed by many systems, is data augmentation, which takes many forms, such as permuting the order of the input attributes so that each MR appears with several orderings (the variant I use later on).

Tweaking the architecture

Many architectural variations have been proposed, such as adding attention or replacing the recurrent encoder-decoder with a transformer.

Tweaking the output

This is typically done by using beam search to generate the k-best outputs and then reranking them, either with a classifier or with length and coverage penalties that penalize outputs that do not verbalise all of the input.
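For illustration, here is a sketch of such a reranking step using a length penalty and a coverage penalty; the scoring formula (GNMT-style length normalization plus a coverage bonus) is an assumption of mine, not the challenge baseline's exact recipe:

```python
def rerank(candidates, mr_values, alpha=0.6, beta=1.0):
    """Rerank beam-search candidates: each candidate is a (text, log_prob)
    pair; reward coverage of the MR values and normalize for length.
    (Illustrative scoring only, not the baseline's exact formula.)"""
    def score(text, log_prob):
        words = text.lower().split()
        length_penalty = ((5 + len(words)) / 6) ** alpha      # GNMT-style normalization
        coverage = sum(v.lower() in text.lower() for v in mr_values) / len(mr_values)
        return log_prob / length_penalty + beta * coverage
    return max(candidates, key=lambda c: score(*c))

best = rerank([("xname is a coffee shop near xnear .", -4.2),
               ("xname is a coffee shop .", -3.1)],
              mr_values=["coffee shop", "xnear", "English"])
print(best)
```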

Implementation

As mentioned in the introduction, I reused the Fastai implementation of seq2seq presented during the Fastai NLP course. This comprises:

  • Fastai text bunch loading and preprocessing. This uses the spacy tokenizer and a minimum word frequency set to 3.
  • Input and output embeddings. In this implementation, I mapped each word in the meaning representation to a 300-dimensional embedding with the Fasttext pretrained model.
  • A 1-layer GRU encoder and decoder architecture with some default dropout values. I found that a 1-layer architecture performed better than a 2-layer one, presumably because it involves fewer parameters. The number of neurons in the hidden layer was set to 128 and the batch size to 32. Contrary to most seq2seq solutions to the e2e NLG competition, I did not find that attention added value to the final results compared to the vanilla version.
  • Teacher forcing, which at training time feeds the decoder at step t with the true observation from step t-1 instead of the model’s own prediction, in order to avoid error propagation. This is mitigated with scheduled sampling, which feeds a random mix of true observations and predictions, with the proportion of true observations decreasing as training progresses (a minimal sketch of this encoder-decoder and sampling schedule follows this list).
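Below is a minimal sketch of such a 1-layer GRU encoder-decoder with scheduled sampling, written in plain PyTorch rather than with the Fastai course code; the hidden size (128), the input/output vocabulary sizes (56 and 1216) and the 300-dimensional embeddings match the numbers reported in this article, while the dropout value, toy batch and teacher-forcing ratio are placeholders:

```python
import random
import torch
import torch.nn as nn

class Seq2SeqGRU(nn.Module):
    """Minimal 1-layer GRU encoder-decoder with scheduled sampling."""
    def __init__(self, in_vocab, out_vocab, emb_dim=300, hidden=128, p_drop=0.1):
        super().__init__()
        self.emb_in = nn.Embedding(in_vocab, emb_dim)
        self.emb_out = nn.Embedding(out_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(hidden, out_vocab)

    def forward(self, src, tgt, teacher_forcing=1.0):
        _, h = self.encoder(self.drop(self.emb_in(src)))    # final encoder state
        inp = tgt[:, :1]                                     # BOS token column
        logits = []
        for t in range(1, tgt.size(1)):
            dec_out, h = self.decoder(self.drop(self.emb_out(inp)), h)
            step_logits = self.out(dec_out)
            logits.append(step_logits)
            # Scheduled sampling: feed the gold token with prob. teacher_forcing,
            # otherwise feed the model's own prediction.
            use_gold = random.random() < teacher_forcing
            inp = tgt[:, t:t+1] if use_gold else step_logits.argmax(-1)
        return torch.cat(logits, dim=1)

# Toy usage: batch of 2 MRs (length 8) and target texts (length 12).
model = Seq2SeqGRU(in_vocab=56, out_vocab=1216)
src = torch.randint(0, 56, (2, 8))
tgt = torch.randint(0, 1216, (2, 12))
out = model(src, tgt, teacher_forcing=0.7)   # decay this ratio as training progresses
print(out.shape)                             # torch.Size([2, 11, 1216])
```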

I also delexicalized the reference texts and meaning representations, and performed some basic cleaning of the input meaning representations before feeding them to the encoder, such as putting spaces around dashes and before punctuation marks.
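As a sketch, cleaning of this kind can be done with a couple of regular expressions; the rules below are illustrative rather than the exact ones used:

```python
import re

def clean_mr(mr: str) -> str:
    """Basic cleaning before feeding the MR to the encoder:
    put spaces around dashes and before punctuation marks."""
    mr = re.sub(r"-", " - ", mr)
    mr = re.sub(r"\s*([.,!?;:])", r" \1", mr)
    return re.sub(r"\s+", " ", mr).strip()

print(clean_mr("priceRange[20-25], customerRating[3 out of 5]"))
# 'priceRange[20 - 25] , customerRating[3 out of 5]'
```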

After preprocessing, cleaning and uploading of the MRs and texts, I end up with 56 unique tokens for the input and 1216 for the output.

I experimented with 4 different settings:

  1. plain seq2seq with greedy search
  2. template-based approach using the same basic template as provided here
  3. plain seq2seq with greedy search and data augmentation
  4. plain seq2seq with beam search and reranking

For data augmentation, I used 3 random permutations of the input for each pair, as done here (without forgetting to raise the minimum word frequency from 3 to 9 for the Fastai tokenize preprocessor so as to keep the same tokens).
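A sketch of this kind of augmentation, assuming each MR is held as a list of attribute[value] strings; the function name and defaults are only illustrative:

```python
import random

def augment(mr_attrs, text, n=3):
    """Create n extra training pairs by randomly permuting the order of
    the MR attributes, keeping the reference text unchanged."""
    pairs = [(mr_attrs, text)]
    for _ in range(n):
        shuffled = mr_attrs[:]
        random.shuffle(shuffled)
        pairs.append((shuffled, text))
    return pairs

mr = ["name[xname]", "eatType[coffee shop]", "food[English]", "near[xnear]"]
for attrs, _ in augment(mr, "xname is an English coffee shop near xnear .", n=3):
    print(", ".join(attrs))
```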

For beam search, instead of a fixed k, I used nucleus sampling with a variable k conditioned on a probability p=0.3, as I found it gave more grammatical output than top-k (when p is kept low). For good measure, I also varied the softmax temperature (T=1). Both approaches (nucleus sampling and temperature) are known to produce more varied output.
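Here is a minimal sketch of nucleus (top-p) sampling with a temperature parameter for a single decoding step; it is not tied to the Fastai decoding code, and the toy logits stand in for one step of the decoder output over the 1216-word vocabulary:

```python
import torch
import torch.nn.functional as F

def nucleus_sample(logits, p=0.3, temperature=1.0):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability exceeds p, then sample from that set."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative <= p
    keep[..., 0] = True                          # always keep the most likely token
    sorted_probs = sorted_probs * keep.float()
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx.gather(-1, choice)

logits = torch.randn(1, 1216)                    # one decoding step over the output vocab
print(nucleus_sample(logits, p=0.3, temperature=1.0))
```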

To perform reranking, I used a multi-label classifier I developed on text–MR pairs using the Fastai ULMFiT approach. This classifier achieves an F1-score of 90% on the test set for individual MR attribute-value assignment. Nucleus sampling and reranking are only applied if, according to the classifier, the greedy output does not realize all the MR attribute-values, and the sampled outputs are only considered if their F1 score is above that of the greedy output (see the sketch below).
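The selection logic can be sketched as follows; `classify` stands in for the ULMFiT multi-label classifier (assumed to return the set of attribute-values it detects in a text) and is a placeholder for illustration:

```python
def f1(predicted: set, gold: set) -> float:
    """F1 between the attribute-values the classifier detects in a
    generated text and those present in the input MR."""
    if not predicted and not gold:
        return 1.0
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def choose_output(greedy, sampled_candidates, mr_slots, classify):
    """Only fall back to sampled candidates if the greedy output misses
    part of the MR, and only keep candidates that beat its F1."""
    greedy_f1 = f1(classify(greedy), mr_slots)
    if greedy_f1 == 1.0:                         # greedy already covers the MR
        return greedy
    best, best_f1 = greedy, greedy_f1
    for cand in sampled_candidates:
        cand_f1 = f1(classify(cand), mr_slots)
        if cand_f1 > best_f1:
            best, best_f1 = cand, cand_f1
    return best
```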

Results

The two tables below show the evaluation of the seq2seq model against the competition’s baseline and the template-based approach, using greedy search with and without data augmentation, for the development and test sets.

Evaluation on the Development Set
Evaluation on the Test Set

The results show that, on the development set, the basic Fastai seq2seq implementation outperforms both the template-based approach and the baseline on all measures. Data augmentation does not improve the results on the development set; quite the contrary.

On the test set, the basic Fastai seq2seq implementation outperforms the baseline on Meteor and Rouge-L. Greedy search with data augmentation gives better Bleu, Nist and Rouge-L scores than without it, and outperforms the baseline on Bleu and Rouge-L.

Here are some sample outputs (before “retokenization”) obtained with seq2seq and greedy search:

Sample outputs from seq2seq with greedy search from development set

The sample shows that not all attribute-values are realized in the output column: row 688 is missing the price, row 1211 is missing the area. Also, some attribute values are not realized correctly: row 497 mentions a “high customer rating” when the customer rating in the input is “1 out of 5”.

Unfortunately, performing beam search with reranking using our classifier did not improve the results; in fact, it made them worse. Applying the classifier to all the greedy outputs of the development set reveals that over 93% of them achieve 100% precision and 67% achieve 100% recall. Given that many target texts do not verbalize all of the input, improving recall might not have much effect on the scores. Also, verbalizing all the content tends to make the texts longer, which might also penalize performance.

Last words

All in all, I found it pretty impressive that I could generate reasonable-sounding short texts that verbalize most of the input using a sequential model, without any of the usual NLG paraphernalia (ordering, sentence planning, lexicalization, linguistic realization, etc.).

However, the resulting texts were not linguistically very varied (the number of different words in the greedy output for the development set, according to LCA, was 53, which is close to the number of input tokens), contrary to the target texts of our domain. I could obtain more variety with a higher probability for nucleus sampling or a higher softmax temperature, but that degraded the grammaticality and correctness of the output, i.e., led to what is referred to as neural hallucinations.

Thus, to go back to the title of the article, we got “the good” target texts, “the bad” gobbledygook (love that word) outputs and “the boring” seq2seq outputs. This flat style of output is especially problematic for dialogue-based systems, and some works have tried to remedy it by incorporating style and stylistic features into the training.

Finally, a bit of honesty here: I spent a lot of time trying different options to get better results, such as:

  • different models (vanilla seq2seq, attention, transformers),
  • reranking with nucleus sampling at different values of p and experimenting with different softmax temperatures,
  • teacher forcing with different ratios of true observations vs predictions,
  • selective subsampling (e.g., training with single sentence texts) or oversampling by just duplicating the data.

I also spent a fair amount of time modifying hyperparameters, as the validation loss, although decreasing, was much higher than the training loss, which might have been a sign of overfitting. For this latter problem, I tried different dropout values, numbers of units in the hidden layer, batch sizes, and weight decays.
