# NeuroNuggets: Neural Programming, or ICLR in Review II

In the previous NeuroNugget, I talked about Dmitry Vetrov’s three papers that were accepted at ICLR 2019 as an excuse to discuss Bayesian methods in deep learning. Today, we will look at another topic that featured quite prominently at ICLR: an exciting subfield of deep learning where neural networks are learning to *program*. ICLR 2019 has accepted one paper on *program induction* and three on *program synthesis*, and it turns out that in essence, all of these methods are trained on synthetic datasets, which is one of the main focuses of Neuromation. In fact, one of the ICLR papers in question presents an improved way to generate this data.

So, will software developers be out of a job soon? Let’s find out. We will begin by reviewing the most interesting ideas and works in the field and then proceed to more recent developments.

# Program Induction and Program Synthesis

Artificial intelligence researchers have wanted to teach computers to program for a long time. Back in 1970, two Stanford researchers, Zohar Manna and Richard Waldinger, published a paper called “Towards Automatic Program Synthesis”. In the fashion of the times, they derived a theorem from the program’s formal specification, proved it automatically, and then extracted the program from the proof. Waldinger (in collaboration with Richard Lee) even implemented this approach, writing a program called PROW (for “program writing”) that used a resolution-based theorem prover and could output programs in LISP along with a formal proof of their correctness.

This was only the beginning, and such deductive methods for automated program synthesis became an entire field of study.

But today, we will be talking about a different problem setting sometimes called *Programming by Examples* (PBE), whereby a program is derived from a set of sample input-output pairs. The deductive approaches pioneered by Manna and Waldinger are alive and well in PBE: for instance, the field was largely initiated by Sumit Gulwani’s paper called “Automating String Processing in Spreadsheets Using Input-Output Examples”. In this paper, Gulwani defined a formal language for string manipulation that is, on the one hand, expressive enough to be useful, and on the other hand simple enough to be amenable to PBE.

The basic idea of teaching a machine learning model to program comes in two forms:

- *program induction* aims to train an end-to-end model to capture an algorithm;
- *program synthesis* tries to teach a model the semantics of some (usually pretty restrictive) programming language so that the model is able to generate programs.

Basically, in program induction, the network *is* the program, while in program synthesis the network *writes* the program.

Intuitively, program synthesis seems preferable for real use, at least because it is far easier to test. In program induction, you have to keep testing the black box extensively, and even then you can never be sure it is correct. In program synthesis, you can rely on a wide variety of program verification methods that come from mathematical logic; moreover, for the kind of programs the modern state of the art can handle, verification is almost always trivial (the problems are seldom more complicated than simple arithmetic). On the other hand, program synthesis is much harder: writing out the “vague understanding” that a black box might have sounds like a very hard problem in itself.

Naturally, both problems require large datasets of programs together with their input-output pairs. At present, no such large datasets exist, although some effort has been made to collect real programs by extracting snippets from programming competitions and the like, see (Zavershynskyi et al., 2018). But, thankfully, generating synthetic programs and running them is relatively easy, arguably even easier than generating synthetic data for computer vision. So, all modern approaches use synthetic data (randomly generated programs) to train their “neural computers”.

# Program Induction: from Turing Machines to Passing High School Math

It all started with a paper by Alex Graves et al., with the intriguing title *Neural Turing Machines*. In case you are new to computer science (welcome!), a Turing machine is the most widely used elementary formalization of a computer. It has an infinite memory tape and a head that can move around the tape, reading and writing symbols (see more, e.g., on Wikipedia, and don’t miss the visualizations). The programs are transition rules over the machine’s configurations; a transition can be conditioned on the symbol currently under the head, but that’s it.

Alan Turing found that even this elementary construction is expressive enough to implement basically any possible algorithm (this idea is known as the Church-Turing thesis). By now, Turing machines are the most fundamental construction in computer science: we define an algorithm as a Turing machine, we define complexity classes based on the time Turing machines take to solve problems, and so on.
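To make this concrete, here is a tiny Turing machine simulator in Python (a toy illustration of mine, not from Graves et al.); the transition rules below implement binary increment:

```python
# Minimal Turing machine simulator. Rules map (state, symbol) to
# (new state, symbol to write, head move direction).
from collections import defaultdict

def run_tm(rules, tape, state="start", blank="_", max_steps=1000):
    """Run the machine and return the final tape contents as a string."""
    tape = defaultdict(lambda: blank, enumerate(tape))
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        state, tape[head], move = rules[(state, tape[head])]
        head += 1 if move == "R" else -1
    cells = sorted(tape.items())
    return "".join(sym for _, sym in cells).strip(blank)

# Increment a binary number: walk to the rightmost bit, then add 1 with carry.
rules = {
    ("start", "0"): ("start", "0", "R"),
    ("start", "1"): ("start", "1", "R"),
    ("start", "_"): ("carry", "_", "L"),
    ("carry", "1"): ("carry", "0", "L"),
    ("carry", "0"): ("halt", "1", "R"),
    ("carry", "_"): ("halt", "1", "R"),
}

print(run_tm(rules, "1011"))  # 1011 + 1 = 1100
```

Note how even this trivial program needs two “states of mind” and a careful carry loop; the NTM’s controller has to discover structure like this from input-output examples alone.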

The idea of Graves et al. was to use neural networks to operate with external memory, which, similarly to Turing machines, is formalized as a “tape” or a set of memory locations. The neural network serves as a controller, telling the basic architecture what to read from memory, what to write there, and what to output to the environment.

The memory matrix **M**t at time *t* consists of *N* memory locations, each containing a vector of size *M*. Based on a memory location weighting **w**t (a kind of attention mechanism), the neural controller produces two control vectors:

- **e**t, the *erase* vector, which defines what to delete from memory: **M’**t(*i*) := **M**t-1(*i*)(1 − *w*t(*i*)**e**t);
- **a**t, the *add* vector, which defines what to add to memory: **M**t(*i*) := **M’**t(*i*) + *w*t(*i*)**a**t.

The weighting **w**t is produced by a separate part of the controller (the most interesting part of the architecture actually, but let’s not dwell on that now).
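In code, the erase/add memory update is just a couple of elementwise operations. Here is a minimal numpy sketch (shapes and random values are made up for illustration):

```python
import numpy as np

N, M = 8, 4                                   # N memory locations of size M
rng = np.random.default_rng(0)

memory = rng.standard_normal((N, M))          # M_{t-1}
w = rng.random(N); w /= w.sum()               # attention weighting w_t
erase = rng.random(M)                         # erase vector e_t, values in [0, 1]
add = rng.standard_normal(M)                  # add vector a_t

# Erase step: M'_t(i) = M_{t-1}(i) * (1 - w_t(i) * e_t)
memory = memory * (1.0 - np.outer(w, erase))
# Add step:   M_t(i) = M'_t(i) + w_t(i) * a_t
memory = memory + np.outer(w, add)

print(memory.shape)  # (8, 4)
```

Since both steps are differentiable in **w**t, **e**t, and **a**t, gradients flow through memory accesses, and the whole machine can be trained end to end by backpropagation.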

For training, the network receives an input “tape” (sequence of symbols) and an output tape and has to be able to produce the output from the input:

Graves et al. showed that the NTM architectures, especially with recurrent (LSTM-based) constructions, are able to generalize well, learning the ideas of programs from examples. For instance, an NTM trained to copy a vector from input to output (after a delimiter is seen, so you have to store the sequence in memory before writing it out) was able to generalize from examples of length 20 and then copy sequences of length up to 120. And if you look into what the NTM has learned to do, it indeed looks like a simple algorithm for manipulating arrays: first read the input cell by cell into memory, then read the memory cells one by one, copying them to the output tape. Here is a map of memory use for reading and writing:

The next development by Graves et al. came in 2016, when they developed NTMs into a *differentiable neural computer* (DNC), an architecture featured in *Nature* (Graves et al., 2016). They added temporal links that are able to tell which memory cells have been written to earlier or later and memory usage information that can inform the controller about memory state in a more direct and simple way. The resulting architecture (shown below) is able to learn simple programs even better, although training it is even harder.

Another successful take on program induction is the construction of Neural GPUs (Kaiser and Sutskever, 2015). They note that neural Turing machines are hard to train due to their inherently sequential nature: NTMs try to mimic Turing machines that are regular sequential computers with very limited computation done on each step. But programs with a lot of steps (i.e., any realistic ones) are very hard to train in this way. Neural models are much more suitable for GPU-like computation, a relatively short sequence of highly parallelized and powerful computations. The architecture is based on applying a sequence of convolutional GRUs, a radically simpler idea that can be compared to regular or stacked RNNs rather than involved architectures such as the neural Turing machine. See below:

Despite their simplicity, Neural GPUs were able to train surprisingly well, learning to perform both binary addition and even binary multiplication — a very hard problem! Multiplication is a highly nontrivial superlinear algorithm, and previous models had absolutely zero success with it.
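One step of a convolutional GRU can be sketched in numpy as follows; note that this is a simplified depthwise version of mine (the actual Neural GPU uses full convolution kernels that mix channels, plus other details from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def depthwise_conv(x, k):
    """'Same' 1-D convolution applied independently to each channel."""
    pad = len(k) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.convolve(xp[:, c], k, mode="valid")
                     for c in range(x.shape[1])], axis=1)

def cgru_step(s, ku, kr, kc):
    """One convolutional GRU step; s has shape (width, channels)."""
    u = sigmoid(depthwise_conv(s, ku))          # update gate
    r = sigmoid(depthwise_conv(s, kr))          # reset gate
    cand = np.tanh(depthwise_conv(r * s, kc))   # candidate state
    return u * s + (1.0 - u) * cand

rng = np.random.default_rng(0)
ku, kr, kc = (rng.standard_normal(3) for _ in range(3))
state = rng.standard_normal((16, 8))   # the "mental image" of the input
for _ in range(16):                    # the same CGRU is applied repeatedly
    state = cgru_step(state, ku, kr, kc)
print(state.shape)  # (16, 8)
```

The key design choice is that every step updates all positions in parallel, so in *O(n)* steps information can propagate across the whole width-*n* state, which is exactly what superlinear algorithms like multiplication need.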

There have been more developments in program induction, of course, but let’s skip forward to the ICLR 2019 paper by DeepMind researchers Saxton et al. (2019). In a work called “Analysing Mathematical Reasoning Abilities of Neural Models”, they add another level of indirection to the program induction problem, asking models to answer mathematical questions formulated in natural language. That is, the inputs can look like this:

- Solve 2(x-5)+5=10-x for x.
- Factorize 2x²-x-1.
- Calculate -841880142.544 + 411127.
- Three letters picked without replacement from qqqkkklkqkkk. Give prob of sequence qql.
- Let u(n) = -n**3 - n**2. Let e(c) = -2*c**3 + c. Let l(j) = -118*e(j) + 54*u(j). What is the derivative of l(a)?
- Is 106033 prime?

Sounds pretty complicated, right? Primality testing, for example, requires quite a nontrivial algorithm, and one cannot really expect a recurrent neural network to pick it up from examples. Still, Saxton et al. show that the standard state-of-the-art attention-based architecture, the Transformer (first introduced for machine translation in “Attention Is All You Need” and widely used all over natural language processing since), performs pretty well, giving plausible answers even for very hard questions!

The most interesting test for the Transformer came in the form of a standard math exam administered to 16-year-old British schoolchildren. Saxton et al. excluded problems with plots or tables and came up with 40 problems covering a range similar to the training set. The trained Transformer-based model got *14/40 questions correct*, equivalent to an E grade student! So, while we are not quite there yet, pretty soon some collection of deep learning models may be able to graduate high school, at least in math.

# Neural Program Synthesis

Okay, that was program induction, but what if we want models to actually *write* the program? Usually program synthesis models learn an algorithm from examples in the form of an *abstract syntax tree* (AST). ASTs are a natural internal representation of algorithms in the form of an execution tree that is produced by programming language parsers. For example, here is the Wikipedia example, an AST for Euclid’s algorithm that finds the greatest common divisor:
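You can inspect such a tree directly: Python’s standard `ast` module parses source code into an AST. Here it is for Euclid’s algorithm:

```python
import ast

source = """
def gcd(a, b):
    while b != 0:
        a, b = b, a % b
    return a
"""

tree = ast.parse(source)
# Pretty-print the tree (the indent parameter needs Python 3.9+).
print(ast.dump(tree.body[0], indent=2))
```

The printed tree has a `FunctionDef` at the root with a `While` node inside, mirroring the structure of the source; a neural synthesizer has to build such trees node by node.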

The first attempts to solve the problem with deep learning used improvements of the well-known Flash Fill algorithm (Gulwani, 2011) designed to infer a string transformation algorithm in the form of an AST. Flash Fill was developed by Microsoft researcher Sumit Gulwani and is more popular than you can imagine: it is built into every copy of Microsoft Excel! Flash Fill is the algorithm that fills in Excel cells based on an inferred pattern (see, e.g., here).

Parisotto et al. (2016) presented what they called a Neuro-Symbolic program synthesis method based on Recursive-Reverse-Recursive Neural Networks (R3NNs). The problem, again, is to produce an AST for string manipulation. Here is an example where the input-output pairs on the left have to be turned into the program on the right, which transforms a full name into the last name with a first initial:

The model gradually constructs an AST consistent with input examples by building partial trees and expanding their leaves one by one, until all leaves represent terminals. To do that, the R3NN does two passes along the tree, first computing internal representations (embeddings) for every node from leaves to root and then computing the probabilities of valid expansions during the backward pass. We won’t go into too much detail, but here is the general idea:

Balog et al. (2016) present the so-called DeepCoder model. The idea here is that actually, the program synthesis problem can be solved with search-based techniques: one can simply try to enumerate all possible ASTs, discarding those that are inconsistent with available examples. Naturally, even for restricted domain-specific languages this kind of search is very computationally intensive, let alone for general-purpose programming languages, and can be done only for very short programs. But then, it’s not like end-to-end techniques can find long programs either, and there is a place for machine learning in search-based methods: models can learn what kind of nodes (functions) or tree structures are more likely given the available examples and thus guide the search. Specifically, DeepCoder predicts the probability that a given function will appear in the code, given input-output examples:

Balog et al. report that the resulting guided search-based system speeds up regular search by several orders of magnitude and can even solve some problems from real-life programming competitions (hence DeepCoder, which is reminiscent of TopCoder).
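The guided search itself is simple to sketch: enumerate compositions of DSL functions, try the most promising ones first, and discard anything inconsistent with the examples. Here is a toy version; the four-function DSL and the hand-written “predicted” probabilities are made up for illustration and are much smaller than DeepCoder’s actual DSL and model:

```python
from itertools import product

# A toy DSL of list-to-list functions.
DSL = {
    "reverse": lambda xs: xs[::-1],
    "sort":    lambda xs: sorted(xs),
    "double":  lambda xs: [2 * x for x in xs],
    "drop1":   lambda xs: xs[1:],
}

# Pretend a neural model predicted, from the I/O examples, how likely each
# function is to appear in the program; search tries likely functions first.
predicted = {"sort": 0.7, "double": 0.6, "reverse": 0.2, "drop1": 0.1}

def synthesize(examples, max_len=3):
    """Enumerate function compositions, shortest and most likely first."""
    names = sorted(DSL, key=predicted.get, reverse=True)
    for length in range(1, max_len + 1):
        for prog in product(names, repeat=length):
            def run(xs, prog=prog):
                for name in prog:
                    xs = DSL[name](xs)
                return xs
            if all(run(inp) == out for inp, out in examples):
                return prog
    return None

examples = [([3, 1, 2], [2, 4, 6]), ([5, 4], [8, 10])]
print(synthesize(examples))  # finds a consistent composition, e.g. sort-then-double
```

Without the predicted probabilities the search would be the same brute force in a worse order; the learned ranking is exactly where the “several orders of magnitude” speedup comes from.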

The search-based direction was continued in (Vijayakumar et al., 2018), another publication affiliated with Microsoft Research. There, the authors go back to the string transformation domain and use neural networks to specifically predict how to expand the current node, with an LSTM-based model like this:

ICLR 2019 has three papers related to neural program synthesis. Chen et al. (2019) from UC Berkeley propose the idea of *execution-guided synthesis*: by encoding the current program state, we can execute partial programs and condition the next step of the synthesizer on the intermediate state representation. On the other hand, Si et al. (2019) consider so-called *syntax-guided synthesis*, where the space of possible programs is constrained by a grammar and a logical formula (specification). Formally speaking, this is a generalization of traditional program synthesis because you can write down the dataset of input-output pairs as a logical formula, but in reality this is a completely different problem. Si et al. use reinforcement learning to train the encoder and the policy network together.

There is also a completely different but quite interesting direction of study where synthesized programs are instructions for a robot, and the synthesis is based on some visual cues and/or instructions given in a natural language. This is beyond the scope of this post, but maybe someday we will come back to this field, as synthetic data plays a huge role there as well.

Nevertheless, it is clear that neural programming is still at its very inception. Existing systems can “write” only very simple programs. In my opinion, a necessary step to improving the current models will be to improve the datasets they train on. So let us conclude with how synthetic data is used for neural programming, and how it can be improved further.

# Synthetic Data for Program Induction and Synthesis

Both program induction and program synthesis models usually train on synthetic data. It is readily available and is fairly easy to work with: any model problem currently used for program induction/synthesis has fast “regular” algorithms, so one can simply randomize the input data and produce a never-ending stream of synthetic training examples.
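For instance, training data for the binary addition task can be generated forever by running ordinary integer arithmetic on random inputs. The sequence format below (bit strings joined by '+' in, sum bits out) is my own choice for illustration:

```python
import random

def make_example(n_bits, rng=random):
    """One synthetic training pair: the 'fast regular algorithm' here is
    simply Python's built-in integer addition."""
    a = rng.getrandbits(n_bits)
    b = rng.getrandbits(n_bits)
    x = f"{a:0{n_bits}b}+{b:0{n_bits}b}"   # input sequence
    y = f"{a + b:0{n_bits + 1}b}"          # target sequence (one extra carry bit)
    return x, y

random.seed(0)
dataset = [make_example(8) for _ in range(10_000)]
print(dataset[0])
```

The same recipe works for sorting, copying, string transformations, and so on: whenever a reference implementation exists, the training stream is effectively infinite and label noise is zero.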

In one of the works we considered above, Saxton et al. (2019) had to produce a dataset of natural language mathematical problems. DeepMind researchers summarize their options as follows:

There are two choices for obtaining mathematical questions: either crowd-sourced, or synthetically generated. While crowd-sourcing has the advantage of introducing linguistic diversity, as well as a diversity of problem types, it is difficult to collect and validate such data at scale. In contrast, procedural generation is sufficient for our purposes in most respects: it (1) easily provides a larger number of training examples, with (2) precise controls over difficulty levels, permitting (3) analysis of performance by question type, and (4) better guarantees on question correctness, with (5) potential for more efficient model training by varying the time spent on each module, and (6) ease of testing generalization (since one can precisely vary different axes of difficulty in different question types).

I couldn’t put it better myself.

As for program synthesis, this is finally the place to review one more paper from ICLR 2019. Previously, synthetic data for neural programming had been generated straightforwardly and more or less at random. But UC Berkeley and Google Brain researchers Shin et al. (2019) show that common synthetic data generation algorithms, including the one from the popular tensor2tensor library, have biases that leave important parts of the program space uncovered and thus degrade the final results. Shin et al. present a novel methodology for sampling random programs and show significant improvements for two important domains: Calculator (which computes the results of arithmetic expressions) and Karel (which achieves a given objective with a robot moving in a two-dimensional grid with walls and markers).
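To see how such biases arise, consider naively sampling from a toy expression grammar where each node becomes a leaf with probability 0.6. This toy generator of mine (not Shin et al.’s method) makes deep expressions exponentially rare, so the training set barely covers them:

```python
import random

# Grammar: E -> number | (E + E) | (E * E)

def sample_expr(rng, max_depth=10, depth=0):
    """Return (expression string, tree depth)."""
    if depth >= max_depth or rng.random() < 0.6:   # become a leaf
        return str(rng.randint(0, 9)), depth
    op = rng.choice(["+", "*"])
    left, dl = sample_expr(rng, max_depth, depth + 1)
    right, dr = sample_expr(rng, max_depth, depth + 1)
    return f"({left} {op} {right})", max(dl, dr)

rng = random.Random(0)
depths = [sample_expr(rng)[1] for _ in range(10_000)]
shallow = sum(d <= 2 for d in depths) / len(depths)
print(f"fraction of samples with depth <= 2: {shallow:.2f}")
```

A model trained only on such data never sees deeply nested arithmetic, and its accuracy on deep test expressions suffers accordingly; fixing the sampling distribution is precisely the kind of improvement Shin et al. study.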

We are excited to see new papers that study the generation of synthetic data as an interesting problem in its own right. It’s even more exciting to see them accepted to major venues such as ICLR. But there are many, many more interesting ideas in papers accepted to ICLR 2019, so stay tuned for our next installments!

*Sergey Nikolenko*
*Chief Research Officer, Neuromation*