A Python to Scala transpiler using neural machine translation (NMT)

Matt Hagy
Feb 20, 2019

The code is available in the GitHub project nmt_python_scala_transpiler.

Summary

This small project uses neural machine translation (NMT) to convert a Python expression into an equivalent Scala expression. In other words, machine learning is used to create a Python-to-Scala transpiler. This is accomplished by adapting the excellent work of Zafarali Ahmed in keras-attention, which was originally developed to convert dates from varied human-readable formats to a machine format. The results show that these methods succeed at converting basic Python expressions into Scala but struggle with more complex nested expressions. Let me know if there are other NMT methods I should be investigating to solve this problem more robustly.

Generating training data

A Scala program (source) is used to generate random programming expressions in a common representation that can be converted to both Python and Scala. These expressions primarily focus on list comprehension and function application. Here are some example expressions.

Example expressions used in this work

In total, 400,000 expressions are generated. A larger number of expressions couldn’t be processed in subsequent training due to memory constraints on a 30 GB VM. In the future, I may try creating a VM with more memory to leverage more training observations.
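To make the idea of a common representation concrete, here is a minimal sketch in Python. The actual generator is a Scala program, so everything below (the `Var`, `Call`, and `Comprehension` classes, the `to_python` and `to_scala` renderers, and the `random_expr` sampler with its probabilities) is a hypothetical illustration rather than the project's code. It only shows how one expression tree can be rendered into a matching Python and Scala pair.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Var:
    """A bare identifier such as `z`."""
    name: str


@dataclass(frozen=True)
class Call:
    """Function application such as `i(c,z)`."""
    func: str
    args: Tuple["Expr", ...]


@dataclass(frozen=True)
class Comprehension:
    """`[body for var in source if cond]`, where `cond` is optional."""
    body: "Expr"
    var: str
    source: "Expr"
    cond: Optional["Expr"] = None


Expr = Union[Var, Call, Comprehension]


def to_python(e: Expr) -> str:
    """Render the expression tree as Python source."""
    if isinstance(e, Var):
        return e.name
    if isinstance(e, Call):
        return f"{e.func}({','.join(to_python(a) for a in e.args)})"
    cond = f" if {to_python(e.cond)}" if e.cond is not None else ""
    return f"[{to_python(e.body)} for {e.var} in {to_python(e.source)}{cond}]"


def to_scala(e: Expr) -> str:
    """Render the same tree as Scala source using filter and map."""
    if isinstance(e, Var):
        return e.name
    if isinstance(e, Call):
        return f"{e.func}({','.join(to_scala(a) for a in e.args)})"
    out = to_scala(e.source)
    if e.cond is not None:
        out += f".filter({e.var} => {to_scala(e.cond)})"
    return out + f".map({e.var} => {to_scala(e.body)})"


NAMES = "abcdefghijklmnopqrstuvwxyz"


def random_expr(rng: random.Random, depth: int = 2) -> Expr:
    """Sample a small random expression (the probabilities are arbitrary)."""
    if depth == 0 or rng.random() < 0.3:
        return Var(rng.choice(NAMES))
    if rng.random() < 0.5:
        n_args = rng.randint(1, 2)
        args = tuple(random_expr(rng, depth - 1) for _ in range(n_args))
        return Call(rng.choice(NAMES), args)
    var = rng.choice(NAMES)
    cond = Var(var) if rng.random() < 0.5 else None
    body = random_expr(rng, depth - 1)
    return Comprehension(body, var, Var(rng.choice(NAMES)), cond)


# Print a few (Python, Scala) training pairs.
rng = random.Random(0)
for _ in range(3):
    expr = random_expr(rng)
    print(to_python(expr), "->", to_scala(expr))
```

Because both strings are rendered from the same tree, every generated pair is aligned by construction, which is what makes it cheap to produce a large corpus.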

Model training

Jupyter Notebook

I use the NMT model developed by Zafarali Ahmed in keras-attention. Only a small amount of custom code was needed to adapt it to the data in this project. The model was trained overnight on a GPU-equipped VM on AWS EC2.
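As a rough idea of what such glue code involves, a character-level sequence model generally needs each expression turned into a fixed-length list of integer ids. The sketch below is a generic illustration under that assumption and is not taken from the notebook or from keras-attention; `build_vocab`, `encode`, and the `<pad>` and `<unk>` tokens are hypothetical names.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Map every character seen in the corpus to an integer id."""
    chars = sorted({ch for text in texts for ch in text})
    return {tok: idx for idx, tok in enumerate([PAD, UNK] + chars)}


def encode(text: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Truncate or pad `text` to `length` ids, using UNK for unseen characters."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in text[:length]]
    return ids + [vocab[PAD]] * (length - len(ids))


python_vocab = build_vocab(["[i(c,z) for z in t if z]"])
print(encode("[z for z in t]", python_vocab, 30))
```

Separate vocabularies would typically be built for the Python inputs and the Scala outputs, since the two sides use different character sets (for example `=>` only appears in Scala).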

To demonstrate how the model improves over time, let’s consider the following example Python expression, which contains both list comprehension and function application.

[i(c,z) for z in t if z]

The equivalent Scala expression used to validate the NMT model is

t.filter(z => z).map(z => i(c,z))
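The correspondence between the two forms can be checked in Python itself by writing the comprehension as an explicit filter followed by a map. In this small sketch, `i`, `c`, and `t` are placeholder definitions invented for illustration, since the training expressions only use them as symbols.

```python
def i(c, z):
    # Hypothetical stand-in for the function `i` in the expression.
    return (c, z)


c = "c"
t = [0, 1, None, 2, "", 3]

comprehension = [i(c, z) for z in t if z]

# The same computation as filter-then-map, mirroring the Scala form
# t.filter(z => z).map(z => i(c,z)).
filter_then_map = list(map(lambda z: i(c, z), filter(lambda z: z, t)))

assert comprehension == filter_then_map
print(comprehension)
```

Note that the pair is equivalent in structure only: Python filters on truthiness, whereas Scala's `filter(z => z)` would only type-check when the elements of `t` are Booleans.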

The model generates the following outputs for this Python expression input at a varying number of training epochs.

5  epochs: c(m).ter(. (>(p)))))
10 epochs: m(filter(. => p(=( (()))
15 epochs: e.filter(e => p).map(t => p(.,a)
20 epochs: i.filter(m => m).map(v => h(z,v))
30 epochs: z.filter(z => z).map(z => c(c,))
40 epochs: f.filter(z => z).map()
50 epochs: t.filter(z => z).map(c => c(c,c))
60 epochs: t.filter(z => z).map(c)
70 epochs: f.filter(z => p).map(c)))
80 epochs: t.filter(z => z).map(z => i(c,z)) [Correct solution]
90 epochs: t.filter(z => z).map(z => i(c,z))

After 90 epochs of training, the model achieves a character-level accuracy of 98.23% on a separate validation set. The model correctly generates the exact Scala expression for 56.5% of Python expressions in the validation set.
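The gap between these two numbers is expected: a single wrong character makes the whole expression incorrect, so exact-match accuracy falls much faster than character-level accuracy. Here is a minimal sketch of how the two metrics could be computed. The exact definition used in the notebook may differ (for example in how padding is counted), and `char_accuracy` and `exact_match` are hypothetical helper names.

```python
from typing import List


def char_accuracy(predicted: List[str], expected: List[str]) -> float:
    """Fraction of character positions that match across all pairs.

    The shorter string of each pair is padded so that missing or extra
    characters count as errors.
    """
    correct = total = 0
    for p, e in zip(predicted, expected):
        n = max(len(p), len(e))
        p, e = p.ljust(n, "\0"), e.ljust(n, "\0")
        correct += sum(a == b for a, b in zip(p, e))
        total += n
    return correct / total if total else 0.0


def exact_match(predicted: List[str], expected: List[str]) -> float:
    """Fraction of pairs where the whole predicted string is correct."""
    pairs = list(zip(predicted, expected))
    return sum(p == e for p, e in pairs) / len(pairs) if pairs else 0.0


expected = ["t.filter(z => z).map(z => i(c,z))"] * 2
predicted = ["t.filter(z => z).map(z => i(c,z))",
             "t.filter(z => z).map(z => c(c,z))"]
print(char_accuracy(predicted, expected), exact_match(predicted, expected))
```

In this toy case one character out of 66 is wrong, yet only half of the expressions are exactly right.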

Here are some examples of applying the trained model on a diverse set of inputs.

Example of the NMT methods applied to validation expressions

We can see that the model still struggles with more complex nested function application expressions, but it seems to have captured the patterns needed to convert simple Python expressions to Scala.

Discussion

Overall, it was straightforward to programmatically generate a large training corpus of comparable Python and Scala expression pairs. From there it was simple to use the NMT model developed by Zafarali Ahmed in keras-attention to develop a machine-learning-based transpiler.

The trained model succeeds in converting simple Python expressions into Scala. In contrast, more complex expressions are not properly converted by the current model. It’s possible that more training would help.

In manually inspecting examples that the current methods fail on, I commonly see nested function application expressions such as z(a(x(y),h(y,p)), y). It’s possible that properly handling such nested expressions would require a different solution. To solve this specific problem, we could parse the expression into subexpressions using temporary variables. E.g.,

tmp0 = x(y)
tmp1 = h(y,p)
tmp2 = a(tmp0,tmp1)
z(tmp2,y)

Each subexpression could be robustly converted and we could stitch the results back together. We could also recognize that such function applications are equivalent in both Python and Scala and don’t require translation. In general, I’m sure there are plenty of improvements like this that could be used to develop a more robust transpiler. In the extreme, we could write a transpiler entirely programmatically, as is the conventional approach, but the present work focuses on leveraging NMT to this end.
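The decomposition into temporary variables can be sketched with Python's standard `ast` module. This is an illustration of the idea and not part of the project; `flatten_calls` is a hypothetical helper, it only lifts calls that appear as positional arguments, and it relies on `ast.unparse`, which requires Python 3.9 or later.

```python
import ast
from typing import List, Tuple


def flatten_calls(source: str) -> Tuple[List[str], str]:
    """Split nested calls into `tmpN = call(...)` lines plus a final expression."""
    assignments: List[str] = []

    def lift(node: ast.expr, top: bool) -> ast.expr:
        if not isinstance(node, ast.Call):
            return node
        # Flatten the innermost calls first so temporaries are numbered in order.
        node.args = [lift(arg, top=False) for arg in node.args]
        if top:
            return node
        name = f"tmp{len(assignments)}"
        assignments.append(f"{name} = {ast.unparse(node)}")
        return ast.Name(id=name, ctx=ast.Load())

    tree = ast.parse(source, mode="eval")
    final = lift(tree.body, top=True)
    return assignments, ast.unparse(final)


assignments, final = flatten_calls("z(a(x(y),h(y,p)), y)")
for line in assignments:
    print(line)
print(final)
```

Each resulting line contains a single, non-nested call, which is the kind of input the current model already handles well. Note that `ast.unparse` normalizes spacing, so arguments are rendered as `h(y, p)` rather than `h(y,p)`.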

I’d be interested in ML methods that can capture recursive patterns, which I believe will be necessary to deal with these highly nested expressions. I’m not familiar with such methods, so please let me know if there’s a good review article that I should check out.

The programming expressions considered in the present work are relatively simple and lack some of the more complex cases that can be encountered in actual code. The training set generation code could be revised to generate a larger and more diverse class of inputs to create a more substantive transpiler. Feel free to modify and extend this work towards the creation of a better transpiler using NMT.

Closing Remarks

Thank you for your time in reading about this project! Let me know if you have any thoughts and if there’s any prior work I should review and cite. All feedback is appreciated.

Also, you can repeat this project and extend it by building off the GitHub project nmt_python_scala_transpiler.

Matt Hagy

Software Engineer and fmr. Data Scientist and Manager. Ph.D. in Computational Statistical Chemistry. (matthagy.com)