Transformers to the rescue

Niloy Purkait
4 min read · Jul 27, 2020


Data-to-text language generation with neural networks

In my last post, I elaborated upon my modelling approach for the use case of RDF-to-text generation. The task is part of a larger open source project to which I am contributing for the Google Summer of Code program (GSoC 2020).

Links to previous articles and resources:

As a reminder, this project consists of transforming a knowledge base, represented by a set of RDF triples, into natural language text using various neural architectures trained in an end-to-end fashion. This post serves as a quick update on the current state of the project, while highlighting key ideas, inspirations, and obstacles faced during the course of my experimentation.
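To make the task concrete, here is a small hand-made illustration; the entities and phrasing are hypothetical, not taken from the project's dataset:

```python
# Hypothetical example of the data-to-text task: a set of RDF triples
# (subject, predicate, object) is mapped to a fluent verbalization.
triples = [
    ("John_Doe", "birthPlace", "London"),
    ("John_Doe", "occupation", "Engineer"),
]

# A model trained end-to-end should learn to produce something like:
reference = "John Doe, who was born in London, works as an engineer."
```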

Attention is indeed all you need!

Out of the successfully implemented models proposed in the previous blog post, the best performing neural architecture proved to be the Transformer. The winning model was composed of an embedding layer of size 128, followed by 8 layers (4 for encoding and 4 for decoding), each equipped with a multi-head attention size of 128 units and 8 attention heads. Dropout regularization is applied during training, and layer normalization is applied to the output of each encoder and decoder layer. The models were trained for 10 epochs, after which their generated outputs were evaluated using BLEU and METEOR scores (standard in machine translation and, by extension, in text generation). Upon visual inspection of the output, the transformer produced on average very decent verbalizations, such as this one, with a triple set size of 3:

Transformer’s generated output, with 3 input triples
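For readers who want the configuration spelled out, here is a minimal sketch of the winning architecture using PyTorch's built-in Transformer. The vocabulary size, dropout rate, and the positional-encoding omission are my own illustrative assumptions, not details from the project:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8000  # assumption: size of the shared token vocabulary
D_MODEL = 128      # embedding size / attention size described above

embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
transformer = nn.Transformer(
    d_model=D_MODEL,        # 128-unit multi-head attention size
    nhead=8,                # 8 attention heads
    num_encoder_layers=4,   # 4 encoding layers
    num_decoder_layers=4,   # 4 decoding layers
    dropout=0.1,            # dropout regularization (rate assumed)
)                           # layer norm is applied inside each layer
lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)  # project back onto the vocabulary

# src: linearized RDF triples; tgt: shifted target text (token ids).
# Positional encodings and the training loop are omitted for brevity.
src = torch.randint(0, VOCAB_SIZE, (30, 2))  # (source_len, batch)
tgt = torch.randint(0, VOCAB_SIZE, (25, 2))  # (target_len, batch)
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(0))
out = transformer(embedding(src), embedding(tgt), tgt_mask=tgt_mask)
logits = lm_head(out)  # (target_len, batch, VOCAB_SIZE)
```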

In general, the transformer does pretty well up to a triple set size of 4. At a triple set size of 6, however, we see a very different story unfold. The transformer starts to lose its focus, producing irrelevant verbalizations like:

Transformer’s generated output, with 6 input triples

However, since the other models perform even more poorly (e.g. the GAT starts to ramble on nonsensically with inputs of triple set size 5, whereas the LSTM starts hallucinating words as early as triple set size 4), the transformer will have to do as our generator model. Thanks to the parallelization advantages of the transformer architecture, it also trains much faster than its recurrent counterparts, which will be crucial for evaluating the benefit of adversarial training and reinforcement learning at later stages of the project.

And what about the discriminator?

Now armed with an appropriate generator model (which has essentially been pre-trained for 10 epochs), I decided to construct a simple transformer model to use as the discriminator network during adversarial training. The idea is for the discriminator to be able to discriminate between real and generated texts. However, there is a slight issue here. If the discriminator only receives the generated output from the generator and compares it to real target instances, it will get confused! What it needs is some context. What do I mean by that? Consider this: on many instances, our pre-trained generator will produce very realistic looking text that just happens not to match the input triple set in terms of the information conveyed. Yet how will our discriminator know that? We cannot simply show the discriminator various strings of text, perfectly correct in syntax and semantics, and ask it to tell which one is real or fake, without giving our model any context regarding the corresponding input triples.

Thus, my approach was to simply concatenate the input triples with their corresponding generated or real output, and feed the result to the discriminator. This way, the discriminator receives both some context (from the first part of the sequence) and the candidate text, thereby essentially performing a sort of sequence classification, i.e. deciding whether a given RDF triple-text sequence is real or fake. I tested the concept by pre-training the discriminator on real triple-text sequences (labelled 1) and fake sequences I constructed by randomly concatenating input triple sets with target text instances (labelled 0). I trained my model (2 layers, 2 heads, 32 neurons, with an embedding dimension of 32) on this artificially constructed dataset for 10 epochs, at which point it was able to achieve a validation accuracy of 95%. Satisfied by these initial results, I merged my generator and discriminator networks into one glorious model, and set up the adversarial training loop.
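As a rough sketch of that pre-training setup (the function names, separator token, and feed-forward width are my own assumptions, not the project's actual code):

```python
import random
import torch.nn as nn

def build_discriminator_data(triple_seqs, texts):
    """Pair each linearized triple set with its real text (label 1) and
    with a randomly drawn, mismatched text (label 0)."""
    data = []
    for triples, text in zip(triple_seqs, texts):
        data.append((triples + " <sep> " + text, 1))        # real pair
        fake_text = random.choice(texts)   # may occasionally pick the true
        data.append((triples + " <sep> " + fake_text, 0))   # fake pair
    random.shuffle(data)
    return data

# A small transformer encoder as the discriminator body: 2 layers,
# 2 heads, embedding dimension 32, matching the figures above.
enc_layer = nn.TransformerEncoderLayer(d_model=32, nhead=2, dim_feedforward=32)
discriminator = nn.TransformerEncoder(enc_layer, num_layers=2)
real_fake_head = nn.Linear(32, 1)  # logit over a pooled sequence representation
```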

Up next

In the next post, I will reveal the final results obtained through the adversarial training approach, and evaluate the utility of reinforcement learning for the given use case. For now, I hope you enjoyed this update on the progress of my GSoC 2020 project. Stay tuned for more!

Written by: Niloy Purkait
