Facebook AI TransCoder

Devshree Patel
Published in VisionWizard · 6 min read · Jun 30, 2020

A model from Facebook AI for translating source code between programming languages.


The Commonwealth Bank of Australia spent around $750 million and 5 years of work to convert its platform from COBOL to Java. -[1]

In natural language, recent advances in neural machine translation have been widely accepted, even among professional translators, who rely more and more on automated machine translation systems. A similar phenomenon could be observed in programming language translation [1].

Table Of Contents

  1. Introduction
  2. Background
  3. Model
  4. Preprocessing
  5. Evaluation
  6. Results
  7. Examples of Translation
  8. References

1. Introduction

  • A transcompiler is a source-to-source translation system that converts source code from one programming language to another. Traditional transcompilers rely on handwritten rules applied to the source code's abstract syntax tree.
  • The translation process is time-consuming and cumbersome, and it requires extensive knowledge of both the source and target languages.
  • Although neural machine translation systems could be successful here, their application has been limited by the shortage of parallel data available in this domain.

2. Background

  • The main aim is to translate an existing codebase written in an obsolete or deprecated language to a recent one or to integrate code written in a different language to an existing codebase.
  • In [1], the authors present a model that efficiently translates functions between three languages (C++, Java, and Python) using only monolingual source code. They assert that generalization to other languages is also possible.
  • They make use of open-source code from GitHub projects. The methodology used in [1] is unsupervised machine translation, since it is difficult to create a parallel corpus large enough to train a supervised model.
  • For evaluating the model, they prepare and release a test set of 852 parallel functions.

3. Model

  • TransCoder uses a sequence-to-sequence model with attention, composed of an encoder and a decoder with a transformer architecture (a minimal architectural sketch follows this list).
  • A single shared model is used for all languages. Training relies on three principles: initialization, language modeling, and back-translation.
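To make the shared architecture concrete, here is a minimal sketch of an encoder-decoder transformer in PyTorch. The layer sizes, the use of torch.nn.Transformer, and the omission of positional encodings are simplifications for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SharedTranscoder(nn.Module):
    """One encoder-decoder transformer shared by all programming languages."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        # A single embedding table over the shared (BPE) vocabulary.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        # A single output projection, also shared across languages.
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # Positional encodings are omitted here for brevity.
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        tgt_len = tgt_tokens.size(1)
        # Causal mask so the decoder cannot look at future tokens.
        tgt_mask = torch.triu(
            torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1
        )
        out = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.proj(out)

model = SharedTranscoder(vocab_size=32000)
src = torch.randint(0, 32000, (2, 16))   # a batch of source-code token ids
tgt = torch.randint(0, 32000, (2, 16))
logits = model(src, tgt)                 # shape: (2, 16, 32000)
```

The same weights serve every (source, target) language pair; the target language is chosen at generation time rather than by instantiating a separate model.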

3.1 Cross programming language model pretraining

  • Pre-training is a crucial component of unsupervised machine translation that ensures the sequences with a similar meaning are mapped to the same latent representation, regardless of their languages.
  • Cross-lingual word embeddings can be obtained by training monolingual word embeddings (on tokens such as for, while, if, try, etc., which are common across languages) and aligning them in an unsupervised manner.

The pretraining strategy used here is masked language modeling: tokens in the input stream of source code are randomly masked, and TransCoder is trained to predict the masked tokens from their context.

Figure 1: Illustration of the three principles of unsupervised machine translation used in [1]
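A toy illustration of the masked language modeling objective described above: a fraction of source-code tokens is replaced by a [MASK] symbol and kept as prediction targets. The 15% masking rate and the token strings are assumptions for illustration only.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Randomly mask tokens; the model must recover the masked ones from context."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # hide the token from the model
            targets.append(tok)         # ...but keep it as the prediction target
        else:
            masked.append(tok)
            targets.append(None)        # no loss on unmasked positions
    return masked, targets

code = ["if", "x", ">", "0", ":", "return", "x"]
print(mask_tokens(code))
```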

3.2 Denoising Auto Encoding (DAE)

Cross-lingual Language Model (XLM)
XLM uses a well-known pre-processing technique, Byte-Pair Encoding (BPE), which splits the input into the most common sub-words across all languages, thereby increasing the shared vocabulary between languages, together with a dual-language training mechanism based on BERT, in order to learn relations between words in different languages. A detailed explanation can be found in [2].

  • XLM pre-training allows the seq2seq model to generate high-quality representations of the input sequence. However, the decoder cannot yet translate, as it has never been trained to decode a sequence from a source representation.
  • The DAE objective operates like a supervised machine translation algorithm: the model is trained to predict a sequence of tokens given a corrupted version of that sequence, obtained by randomly masking, removing, and shuffling input tokens (a toy corruption sketch follows Figure 2).
  • It also trains the language modeling aspect of the model, i.e., the decoder learns to generate a valid function even when the encoder output is noisy.
Figure 2: XLM Training Model from [2]
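For illustration, here is a toy version of the corruption step used by the DAE objective: tokens are randomly masked, dropped, and locally shuffled, and the model is trained to reconstruct the original sequence. The probabilities and the shuffle window are assumptions, not the paper's exact noise model.

```python
import random

def corrupt(tokens, mask_prob=0.1, drop_prob=0.1, shuffle_window=3, seed=0):
    """Return a noisy copy of `tokens`; the clean sequence is the training target."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < drop_prob:
            continue                                   # randomly drop the token
        noisy.append("[MASK]" if r < drop_prob + mask_prob else tok)
    # Local shuffle: each surviving token moves by less than `shuffle_window` positions.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(noisy))]
    noisy = [tok for _, tok in sorted(zip(keys, noisy))]
    return noisy

code = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
print(corrupt(code))   # a masked, shortened, slightly reordered version of `code`
```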

3.3 Back Translation

In practice, XLM pre-training and denoising auto-encoding alone are enough to generate translations. However, the quality of these translations tends to be low, as the model is never trained to do what it is expected to do at test time. To address this issue, back-translation is used: the model translating from source to target is paired with the model translating from target to source, and each direction generates (noisy) translations that serve as training data for the other.
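Below is a high-level, runnable sketch of one back-translation round between Python and C++. The translate and train_step functions are hypothetical placeholders standing in for the shared model's generation and training routines; they are not the paper's actual API.

```python
def translate(code, src_lang, tgt_lang):
    # Placeholder for model generation (beam search decoding in the real system).
    return f"<{tgt_lang} translation of: {code}>"

def train_step(src, tgt, src_lang, tgt_lang):
    # Placeholder for a supervised update on the (noisy source, clean target) pair.
    print(f"train {src_lang}->{tgt_lang}: {src!r} -> {tgt!r}")

def back_translation_round(python_functions, cpp_functions):
    # Python -> C++ -> Python: the (noisy) C++ translation becomes the input,
    # and the original Python function is the supervision target.
    for py in python_functions:
        noisy_cpp = translate(py, "python", "cpp")
        train_step(src=noisy_cpp, tgt=py, src_lang="cpp", tgt_lang="python")
    # C++ -> Python -> C++: the same idea in the opposite direction.
    for cpp in cpp_functions:
        noisy_py = translate(cpp, "cpp", "python")
        train_step(src=noisy_py, tgt=cpp, src_lang="python", tgt_lang="cpp")

back_translation_round(
    ["def inc(x): return x + 1"],
    ["int inc(int x) { return x + 1; }"],
)
```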

Note: The detailed experimental setup can be found in Section 4 (Experiments) of [1].

4. Preprocessing

  • Preprocessing uses a common tokenizer and a shared vocabulary for all languages, which reduces the overall vocabulary size and maximizes the token overlap between languages, improving the cross-linguality of the model.
  • A universal tokenizer would be suboptimal, as different languages use different patterns and keywords. For instance, the logical operators && and || exist in C++, where they should be tokenized as a single token, but not in Python.
  • Indentation is critical in Python, as it defines the code structure, but has no meaning in languages like C++ or Java. The tokenizers used are the javalang tokenizer for Java, the tokenizer of the standard library for Python, and the clang tokenizer for C++ (a small Python example follows Figure 3).
  • These tokenizers ensure that meaningless modifications in the code (e.g., adding extra new lines or spaces) do not have any impact on the tokenized sequence.
Figure 3: Example of tokenization performed during Preprocessing in [1]
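As a small, runnable illustration of language-aware tokenization, the snippet below tokenizes a short Python function with the standard library tokenizer (the tool the authors report using for Python). Indentation becomes explicit INDENT/DEDENT tokens, so cosmetic whitespace changes do not alter the structure of the token sequence.

```python
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"

# Walk the token stream produced by Python's own tokenizer.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type != tokenize.ENDMARKER:
        print(tokenize.tok_name[tok.type], repr(tok.string))
```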

5. Evaluation

  • Various coding problem solutions from GeeksForGeeks are used to extract a set of parallel functions in C++, JAVA, and Python for creating test and validation sets. These functions not only return the same output but also compute the result with similar algorithms.
  • TransCoder is evaluated with different metrics, such as reference match (the percentage of translations that exactly match the ground truth) and BLEU score (based on the relative overlap between the tokens in the translation and in the reference).
  • A limitation of these metrics is that they do not take into account the syntactic correctness of the generated code.
  • Thus, a new metric, computational accuracy, is used to evaluate whether the hypothesis function generates the same outputs as the reference when given the same inputs (a small sketch follows Figure 4).
Figure 4: Evaluation Metrics from [1]
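A minimal sketch of the computational-accuracy idea: run the reference function and the translated (hypothesis) function on the same inputs and count matching outputs. The GCD functions and test inputs here are illustrative stand-ins, not the paper's test set.

```python
def reference_gcd(a, b):
    # Ground-truth implementation (stands in for the human-written reference).
    while b:
        a, b = b, a % b
    return a

def hypothesis_gcd(a, b):
    # Stands in for the model-generated translation of the reference.
    return a if b == 0 else hypothesis_gcd(b, a % b)

def computational_accuracy(ref, hyp, test_inputs):
    """Fraction of unit tests where the hypothesis matches the reference output."""
    passed = sum(1 for args in test_inputs if ref(*args) == hyp(*args))
    return passed / len(test_inputs)

tests = [(12, 18), (7, 13), (100, 250), (0, 5)]
print(computational_accuracy(reference_gcd, hypothesis_gcd, tests))  # 1.0
```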

6. Results

  • Many failures come from compilation errors when the target language is Java or C++. This suggests that the model could be improved by constraining the decoder to generate compilable code.
  • Runtime errors mainly occur when translating from Java or C++ into Python. Since Python code is interpreted and not compiled, this category also includes syntax errors in Python.
  • The majority of remaining errors are due to the program returning the wrong output on one or several of the unit tests.
  • Infinite loops generally cause timeout errors and mainly occur in the Java-Python pair.
Figure 5: Detailed results for all possible translation pairs from [1]

7. Examples of translation

Figure 6: Example of a correct translation from [1]
Figure 7: Example of an incorrect translation from [1]

8. References

[1] Lachaux, Marie-Anne, et al. “Unsupervised Translation of Programming Languages.” arXiv preprint arXiv:2006.03511 (2020).

[2] Lample, Guillaume, and Alexis Conneau. “Cross-lingual language model pretraining.” arXiv preprint arXiv:1901.07291 (2019).
