Natural Language Generation (NLG) is the process of generating coherent natural-language text from non-linguistic data. While the community has largely settled on speech and text as the outputs of these models, there has been far less consensus on the inputs: NLG systems have consumed images, numeric data, semantic representations, and Semantic Web (SW) data, among others. Recently, the generation of natural language from SW data, more precisely RDF data, has gained substantial attention and has also supported the creation of NLG benchmarks. However, most models aim at generating coherent sentences in English, whilst other languages have received comparatively little attention from researchers. RDF data usually takes the form of triples, <subject, predicate, object>: the subject denotes a resource, and the predicate denotes traits or aspects of the resource and expresses the relationship between the subject and the object.
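To make the triple format concrete, a set of RDF triples can be modelled as plain (subject, predicate, object) tuples. The following minimal sketch is illustrative only; the entities and the helper function are not part of our pipeline:

```python
# Minimal sketch: RDF triples as (subject, predicate, object) tuples.
# The entities below are illustrative examples, not project data.
triples = [
    ("Donald_Trump", "president_of", "U.S.A"),
    ("U.S.A", "capital", "Washington_D.C."),
]

def describe(triple):
    """Render one triple as a crude English clause (what a verbalizer
    learns to do fluently, here done naively for illustration)."""
    subject, predicate, obj = triple
    return f"{subject} {predicate.replace('_', ' ')} {obj}"

for t in triples:
    print(describe(t))
```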
In this project we aim to create a multilingual Neural verbalizer, ie, generating high-quality natural-language text from sets of RDF triples in multiple languages using one stand-alone, end-to-end trainable model.
MY WORK BRIEFLY
- Implement the entire NLG pipeline, based on Graph Attention Networks, for a multilingual neural RDF verbalizer.
- Evaluate the new architecture against a Transformer baseline, initially on individual languages.
- Experiment with the new architecture on the multilingual problem and log the results.
WHAT IS DONE
- Implemented the Graph Attention Network (GAT)-based architecture with the entire pipeline, from data processing to inference.
- Tested the GAT model on three languages (English, German, and Russian) and collected the results and their BLEU scores.
- Implemented multilingual training with knowledge distillation.
- Tested the GAT model on the multilingual corpora and logged the results.
- Trained the individual English, German, and Russian models along with the English-German-Russian and English-Russian multilingual models.
- Trained OpenNMT-based Transformer models on the same corpora as a baseline for comparison.
- Ran extensive experiments on the multilingual corpus with knowledge distillation enabled.
- Training the English-German multilingual model.
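The knowledge-distillation objective used in such setups can be sketched as the standard soft-target loss: the student is trained on a blend of the usual hard-label cross-entropy and a temperature-softened cross-entropy against the teacher's distribution. This is a generic sketch of that idea; the function names, temperature, and alpha are illustrative, not the exact values in our implementation:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, target_idx,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft teacher targets.

    alpha weighs the soft (teacher) term; the temperature smooths both
    distributions so the student learns the teacher's relative
    preferences over tokens, not just its argmax.
    """
    student = softmax(student_logits)
    hard = -math.log(student[target_idx])                    # usual CE
    s_t = softmax(student_logits, temperature)
    t_t = softmax(teacher_logits, temperature)
    soft = -sum(t * math.log(s) for t, s in zip(t_t, s_t))   # soft CE
    return alpha * (temperature ** 2) * soft + (1 - alpha) * hard
```

A student whose distribution matches the teacher's incurs a lower soft term than one that disagrees, which is what lets a compact student absorb a larger model's behaviour.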
Graph Neural Networks are neural models that operate directly on graphs: they encode each node from its position in the graph, its own properties, and its neighbourhood, learning dense node representations that retain the structural information of the graph. We represent the RDF set as a graph where subjects and objects are nodes and the predicates between them are edges. For example, <Donald Trump, president_of, U.S.A> can be seen as a small part of a knowledge graph where Donald Trump and U.S.A are the vertices (nodes) and president_of is the edge between them. To properly realise how a node affects the description, the model must learn the importance of each node to the others in the graph. Since our problem statement is akin to that of NMT, we aim to use recent advances in Machine Translation to improve our verbalizer's fluency. The most successful NMT architecture, as of writing this proposal, is the Transformer (self-attention networks). The Transformer uses a self-attention mechanism to determine the importance of each word in a sentence to the entire translation. Graph Attention Networks apply this technique to Graph Neural Networks, weighting the different nodes of a graph when computing the representation of any one node. In our problem, we want to generate descriptions for a set of nodes given as RDF data, whilst also factoring in how the various nodes affect the final thought vector and how important certain features are for certain nodes.
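The node-weighting idea can be sketched in a few lines. The toy graph-attention layer below uses scalar node features, a single head, and scalar stand-ins for the weight matrix and attention vector (all names and hyper-parameters are illustrative, not our actual model): each node attends over its neighbourhood (including a self-loop, as in GAT), the scores pass through a LeakyReLU and a softmax, and the neighbours' transformed features are aggregated with those weights.

```python
import math

def gat_layer(features, edges, w, a):
    """One simplified graph-attention layer (scalar features, one head).

    features: dict node -> float (node feature)
    edges:    list of (src, dst) pairs; predicates treated as edges
    w:        scalar stand-in for the shared weight matrix
    a:        scalar stand-in for the attention vector
    """
    # Each node's neighbourhood includes itself (self-loop), as in GAT.
    neigh = {n: {n} for n in features}
    for s, d in edges:
        neigh[s].add(d)
        neigh[d].add(s)

    out = {}
    for i in features:
        # Unnormalised score e_ij = LeakyReLU(a * (w*h_i + w*h_j)).
        def score(j):
            e = a * (w * features[i] + w * features[j])
            return e if e > 0 else 0.2 * e  # LeakyReLU, slope 0.2

        exps = {j: math.exp(score(j)) for j in neigh[i]}
        z = sum(exps.values())
        # alpha_ij = softmax over the neighbourhood;
        # h'_i = sum_j alpha_ij * (w * h_j).
        out[i] = sum((exps[j] / z) * (w * features[j]) for j in neigh[i])
    return out
```

With `a = 0` the attention is uniform and the layer reduces to mean aggregation over the neighbourhood; a learned `a` lets the model weight some neighbours more than others, which is exactly the "importance of one node to another" described above.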
We experimented with the new architecture on both monolingual and multilingual corpora. We first trained and tested separate models for English, German, and Russian, and then evaluated the complete multilingual model and the English-Russian combination in the multilingual experiments.
We also tested the model with a sentencepiece-based, BPE-trained tokenizer and observed substantial improvements in several respects.
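The byte-pair-encoding idea behind such tokenizers can be illustrated with a short sketch. Note this is a simplified character-level BPE learner for illustration only; sentencepiece's actual training algorithm differs (it operates on raw text and supports models besides BPE):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a word-frequency dict.

    Each word starts fully split into characters; at every step the
    most frequent adjacent symbol pair is merged into one symbol.
    The learned merge list defines the subword vocabulary.
    """
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by word frequency.
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for sym, freq in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges
```

Because the merges are learned from corpus statistics rather than hand-built word lists, the same procedure works across languages and scripts, which is what makes subword tokenization attractive for a multilingual verbalizer.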
The training of the models proved extremely challenging because of their complexity. Each model takes about 10–12 hours on Google Colab with a reasonable batch size and model size. To reach state-of-the-art results, we believe we need to move to distributed training techniques, with bigger batch sizes and longer training runs on a professional cloud platform.
The TensorFlow tokenizer we used initially showed some puzzling inconsistencies during training. It worked as expected with the German and Russian models, which produced strong results, as did both the complete multilingual model and the English-Russian model; but it failed to produce good results for the English-only and English-German models. We rechecked the implementation several times and found it to be correct. We therefore think it is worth moving to a more reliable tokenizer such as sentencepiece. An interesting property of this problem is that the English output generated by the English-Russian model and the complete multilingual model is impeccable.
We observed very promising results with our architecture, a significant improvement over the baseline Transformer models. We also argue that our models can be made significantly more compact with knowledge distillation, but we need further experiments to gather substantial evidence. We believe there is scope for improvement and for scaling to many more languages. We need a more consistent tokenizer for better and more consistent results, and we have begun that work by experimenting with sentencepiece. Our initial goal was to train a multilingual model, but we went further and experimented with alphabetically dissimilar languages such as English and Russian, with interesting results. We believe further experiments will reveal more interesting properties.