The Generative Style Transformer

Akhilesh Sudhakar
Dec 19, 2019

This post explains our paper on style transfer, “Transforming Delete, Retrieve, Generate Approach for Controlled Text Style Transfer” (presented at EMNLP 2019) in a more accessible manner.

The paper can be found here. Our GitHub repo contains code and instructions for replicating our results.

The What

Our paper proposes a method to perform text style transfer. The ‘style’ of a text is a term that the Natural Language Processing (NLP) community has borrowed from socio-linguistics, and uses rather loosely. Style transfer involves re-writing text of a certain style into a new target style, such that only the style of the text changes while its core meaning (or style-independent content) is retained. We show the following types of style transfer in the paper: a) re-writing reviews from positive to negative sentiment and vice versa (YELP and AMAZON datasets), b) re-writing factual sentences into humorous and romantic ones (CAPTIONS dataset), c) re-writing Democrat-written political social media posts into Republican-written ones and vice versa (POLITICAL dataset), and d) re-writing reviews written by females into male-written ones and vice versa (GENDER dataset). Thus, in our work, the notion of style ranges from sentiment, to humor and romance, to political slant, to the gender of the author.

The Why (Research POV)

Why is the NLP community interested in style transfer?

The NLP research community is interested in style transfer methods for a variety of applications. One set of these is targeted at anonymizing identity online, for instance, by obfuscating the author’s gender in social media posts. Doing so could prevent targeted gender-based advertising, abuse of privacy, and discrimination. The same approach can be extended to other attributes of the user, such as age, race, and geography. Another set of applications is in dialogue generation for conversations. To convey the same message during a conversation, humans introduce variations based on the context. This is modeled as style transfer from a message to its appropriate variation. Other works use style transfer to generate poetry and to re-write formal texts as informal ones and vice versa.

The Why (Agara POV)

Why are we at Agara interested in style transfer?

We build conversational systems that hold conversations on different customer-support topics, across diverse call contexts. Style transfer is important to us for two reasons.

One is that Natural Language Generation (NLG) is itself a hard problem, and state-of-the-art NLG systems (such as those for summarization and dialogue generation) are nowhere close to human level when they attempt to generate text from scratch. Given that these systems are data-hungry deep learning models (hello, big Transformers), data sparsity is one of the main reasons, if not THE main reason, that these models perform poorly.

The question we ask is: do our conversation models really need to generate utterances from scratch? What if we could a) ‘retrieve’ an utterance spoken in a similar context by a human agent, similar to what we intend to generate, and b) re-write (a.k.a. style transfer!) the retrieved utterance to the desired utterance?

This would result in better-quality generated utterances, since the model now has to learn the easier task of only making edits to the retrieved utterance rather than generating it from scratch. Also, since a good part of the retrieved human utterance will be retained as is, there is less scope for generating malformed utterances.

The second reason style transfer is useful to us is that it allows us to control how our conversational agent conveys the same message, according to the context of the call. Does the customer sound vexed about a problem they have been facing for a while, and need to be reassured that a solution will be arrived at? Are they in a situation where they need to be spoken to empathetically? Or do they need to be informed politely that their request cannot be fulfilled at the moment? The same message to be conveyed can be adapted to these different cases using style transfer.

The Data

Back to the paper.

Since no dataset is available that pairs each sentence with versions of itself in other styles, i.e., a parallel corpus, we approach this task in an unsupervised fashion in our paper. What we do have, though, are datasets that contain a set of example sentences for each style. We leverage these datasets.

The How

First, we note that the style of a sentence is localized to a small subset of its words. We call these words attributes and the rest of the words content. Let’s use a running example from here onwards to make things simpler.

  • Source sentence: the pizza was tasty
  • Content: the pizza was
  • Attribute(s): tasty

We model style transfer in the Delete-Retrieve-Generate framework (Li et al., 2018). This framework:

  1. Deletes only the set of attribute words from the source sentence to give the content (the pizza was ̶t̶a̶s̶t̶y̶)
  2. Retrieves attributes from the target style corpus (horrible), and
  3. Generates the output sentence from the content and retrieved attributes (the pizza was horrible)

For 1., we train a Delete Transformer (DT), which is a BERT classifier trained to predict the style of a given sentence. This training is possible because our datasets provide example sentences for each style. For a given source sentence (the pizza was tasty), we use the attention weights of the classifier to decide which words are attribute words. Since attribute words contribute to deciding the style of the sentence, the trained classifier pays higher attention to them. Hence, words that receive high attention weights when the sentence is fed to the classifier can be treated as attributes (tasty). These attributes are removed from the source sentence to give the content (the pizza was).
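To make the delete step concrete, here is a minimal sketch that assumes we already have one attention weight per token from the classifier. The weights below are made up for illustration (they are not real BERT output), and the `top_k` selection rule and the function name are hypothetical simplifications rather than the exact procedure from the paper.

```python
def split_content_and_attributes(tokens, attention, top_k=1):
    """Treat the top_k most-attended tokens as attributes; the rest is content."""
    if len(tokens) != len(attention):
        raise ValueError("need exactly one attention weight per token")
    # Rank token positions from highest to lowest attention weight.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    attr_idx = set(ranked[:top_k])
    # Preserve the original word order in both outputs.
    content = [t for i, t in enumerate(tokens) if i not in attr_idx]
    attributes = [t for i, t in enumerate(tokens) if i in attr_idx]
    return content, attributes


tokens = ["the", "pizza", "was", "tasty"]
attention = [0.05, 0.20, 0.05, 0.70]  # assumed weights, for illustration only
content, attributes = split_content_and_attributes(tokens, attention)
print(content)     # ['the', 'pizza', 'was']
print(attributes)  # ['tasty']
```

In practice the number of attribute words varies per sentence, so a fixed `top_k` is only a stand-in for whatever selection rule is actually used.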

For 2., we find the closest sentence in the target style’s dataset (the sushi was horrible), based on how similar its content (the sushi was) is to the source’s content (the pizza was). The attributes of this closest sentence from the target corpus are our retrieved attributes (horrible).
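The retrieve step can be sketched as a nearest-neighbour lookup over contents. This sketch assumes the target corpus has already been split into (content, attributes) pairs, and it uses Jaccard word overlap purely as a stand-in similarity measure; the function names and the tiny corpus are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token lists (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_attributes(source_content, target_corpus):
    """Return the attributes of the target sentence whose content is closest.

    target_corpus is a list of (content_tokens, attribute_tokens) pairs.
    """
    best = max(target_corpus, key=lambda pair: jaccard(source_content, pair[0]))
    return best[1]


# Toy target-style (negative) corpus, already split into content and attributes.
target_corpus = [
    (["the", "sushi", "was"], ["horrible"]),
    (["our", "waiter", "seemed"], ["rude"]),
]
retrieved = retrieve_attributes(["the", "pizza", "was"], target_corpus)
print(retrieved)  # ['horrible']
```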

For 3., we propose and train the Generative Style Transformer (GST). GST starts out with the same architecture and pre-training as GPT, and is thus initially a powerful decoder-only language model.

The Models

GST is then trained in 2 variants — B-GST (Blind GST) and G-GST (Guided GST) — 2 separate models that are independently trained and whose outputs are compared with each other. Both take as input the source sentence’s content (the pizza was) and generate style-transferred output (the pizza was horrible). They differ in the additional inputs each of them takes, apart from the source’s content:

  1. B-GST additionally takes as input: the target style (<negative>).
  2. G-GST additionally takes as input: the retrieved attributes (horrible).

The figure below shows the generation using G-GST. B-GST is similar except that it does not have a retrieve component. [ATTRS] is a special token representing the start of the attributes section, [CONT_START] represents the start of the content section, and [START] indicates to the model that it has to start generating the output.

Figure 1 from our paper. Generation using G-GST, with an example of sentiment transfer.
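The input layouts of the two variants can be sketched as plain string construction. The special tokens are the ones named above, but the exact ordering of the sections, the spelling of the style tag (`<negative>`), and the helper names are assumptions made for illustration.

```python
def g_gst_input(content, attributes):
    """G-GST prompt: retrieved attributes, then content, then the start marker."""
    return " ".join(["[ATTRS]", *attributes, "[CONT_START]", *content, "[START]"])


def b_gst_input(content, target_style):
    """B-GST prompt: a target-style tag in place of the attributes section."""
    return " ".join([f"<{target_style}>", "[CONT_START]", *content, "[START]"])


content = ["the", "pizza", "was"]
print(g_gst_input(content, ["horrible"]))
# [ATTRS] horrible [CONT_START] the pizza was [START]
print(b_gst_input(content, "negative"))
# <negative> [CONT_START] the pizza was [START]
```

In both cases the model is expected to continue the sequence after [START] with the style-transferred sentence.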

Why These Variants?

B-GST is a model that generates output by placing the most appropriate target attributes in the given content, which it learns on its own from the distribution of the dataset. This variant is useful when there aren’t enough examples or enough attribute-coverage in the target style to find the closest match to ‘retrieve’ attributes from.

G-GST, on the other hand, is a model that generates output by placing the given attributes in the given content. This variant is useful when explicit control is required over the attributes to be used in the output, and when a good ‘retrieval’ mechanism can make it simpler for GST to generate good outputs.

How are the GSTs trained if there’s no parallel corpus?

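By reconstruction. Since there’s no parallel corpus to train the GSTs on, they are simply trained to reconstruct the original sentence from their inputs. This works, and generalizes well to inference time, because both GSTs take only the source content as input rather than the entire source sentence: to reconstruct the sentence, they must learn to fill the style-bearing words back in, which is the same thing they are asked to do at inference time.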
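A minimal sketch of how a reconstruction training pair could be built from a single, non-parallel sentence. It assumes that during training B-GST is conditioned on the sentence’s own style and G-GST on the sentence’s own (deleted) attributes; that assumption, the input layout, and the function name are illustrative rather than taken verbatim from the paper.

```python
def reconstruction_example(tokens, attributes, style, variant):
    """Build an (input, target) pair; the target is the original sentence itself."""
    content = [t for t in tokens if t not in attributes]
    if variant == "B-GST":
        prefix = [f"<{style}>"]  # the sentence's own style tag
    elif variant == "G-GST":
        # The sentence's own attributes, kept in their original order.
        prefix = ["[ATTRS]", *[t for t in tokens if t in attributes]]
    else:
        raise ValueError(f"unknown variant: {variant}")
    model_input = " ".join([*prefix, "[CONT_START]", *content, "[START]"])
    target = " ".join(tokens)
    return model_input, target


tokens = ["the", "pizza", "was", "tasty"]
b_pair = reconstruction_example(tokens, {"tasty"}, "positive", "B-GST")
g_pair = reconstruction_example(tokens, {"tasty"}, "positive", "G-GST")
print(b_pair)  # ('<positive> [CONT_START] the pizza was [START]', 'the pizza was tasty')
print(g_pair)  # ('[ATTRS] tasty [CONT_START] the pizza was [START]', 'the pizza was tasty')
```

At inference time, the same input layout would be filled with the target style (B-GST) or the retrieved attributes (G-GST) instead of the sentence’s own.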

Sample Outputs

Below are sample outputs from the paper. SRC is the source sentence. SE, D, and D&R are previous state-of-the-art models explained in the paper. B-GST and G-GST are our models. YELP, AMAZON, CAPTIONS, POLITICAL, and GENDER are the datasets, each used for a different style transfer task.

Table 6 from our paper. Sample outputs of our models compared with previous state-of-the-art models.

From these examples, you can see that our models perform style transfer better than the previous state-of-the-art models do (the paper provides metrics to justify this quantitatively too). Our models’ output sentences are more fluent and natural-sounding, retain the content of the input sentence better, and match the target style better.

What makes the GSTs produce better outputs than previous models?

Learnings and takeaways.

For one, the GSTs use the architecture and pre-training of GPT, combining the power of the Transformer with that of massively pre-trained language models. Previous models that used LSTMs, for instance, fail at generating longer and more complex sentences.

Second, using BERT’s attention weights to delete attribute words works better than the deletion-based methods attempted in previous work. This could be because BERT also combines the power of Transformers and massive pre-training (as a Masked Language Model).

Lastly, all the datasets we used require only localized edits to transform an input sentence into the output. Most previous works don’t exploit this observation, whereas we do by using the delete-retrieve-generate framework.

What next?

Where are we at Agara headed with this line of work?

To improve the quality of style transfer itself, we’re working on using reinforcement learning. We’re also looking at shaping the attention weights of the Delete Transformer so that they become better indicators of attributes. Further, we will adopt the delete-retrieve-generate mechanism in our conversational agents, as mentioned earlier in this post.

More updates on all of the above will follow in our future blog posts.

Do check out our paper for more detailed explanations of our work!