Synthesizing Programs with Deep Learning

Nishant Sinha
7 min read · Mar 25, 2017

The buzz about machine learning or deep learning or AI is promising and deafening at the same time. Almost like a new beast in town, all set to take over everything. In this article, I'll try to motivate these developments from the point of view of program synthesis — getting computers to write programs for us automatically. We will discuss how Deep Learning and Program Synthesis are intricately related, motivating why Deep Learning is relevant to every programmer, irrespective of their background.

Let us begin on a somewhat familiar, common ground:
The digital revolution — software eating industries.

Software = lots of code written by lots of programmers. Bugs, good/bad coders, product management, lifecycle, etc
Code = a sequence of rules written painstakingly by a human being H in an abstract language for executing on hardware.
Language = set of primitives/templates: if-then-else, while-end, def func, try-catch. A good ‘high-level’ language makes it easy for humans to talk to machines. Here talking = translate high-level goals to low-level template instances, to be hardware-executed. This translation task is a painful one, yet we have a large chunk of humanity very eager to do it, day in and day out.

Unsurprisingly, a lot of people also want to get rid of the pain.
Enter Program Synthesis = computer writes the program for you.
How?

  • Compiler/Translator: The traditional way. Input a high-level program HP; a human H writes rules to convert HP to a low-level program LP (LP is executable on hardware, transitively). The software ecosystem is built over a web of such translators — from macro expanders to heavy-duty compilers and interpreters. We could not have come this far without a thriving industry of program translators, abstractions and high-level libraries. But, even with them, writing code is still hard. Programming should be easier, no?
  • Programming by example: This is awesome — all you do is give examples. An example is a pair (I, O). The computer guesses a program P which takes in I and outputs O. Collect I, O, push a button, and you have P. Done!

Programming by Example (PbE) is cool, perhaps the holy grail. Imagine a world where your only job was to provide a few correct input and output examples for something you wanted to build. You wouldn't have to hard-code the rules of transformation in Python/Java/…; the computer figures out the rules for you. That is, the computer writes the program for you, from the examples you gave. Unfortunately, generic PbE doesn't exist in practice — but in recent years, we've seen specific instances prosper.

But, PbE isn’t magic, right? How does it work? The principle is simple — learn from mistakes and rectify iteratively. Here is a basic workflow (a toy code sketch follows the list):
1. Fix a template over which you will create the program. Template = data-flow structure + parameters, e.g., a decision tree (sequence of nested if-then-else statements, parameters = branch conditions)
2. Guess a set of values for parameters W, to create a program instance P’.
3. Compare output O’ = P’(I) with expected output O on some example pair (I, O). Since you simply guessed P’, O may not equal O’.
4. Compute error E = Diff(O, O’). Figure out why O and O’ are not equal.
5. Use E to tweak the parameters, obtaining W’, which may bring O’ closer to O.
6. Go back to 3.
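
To make this concrete, here is a toy sketch in Python of the guess-check-tweak loop (everything here, from the examples to the one-parameter template and the step size, is made up for illustration and not taken from any library):

    # Toy programming-by-example loop: synthesize a one-parameter program
    # P'(x) = 1 if x > w else 0 from (input, output) examples.
    examples = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]   # (I, O) pairs

    w = 1.0      # step 2: an initial guess for the parameter W
    lr = 0.05    # how strongly we tweak W after each mistake

    for step in range(200):                # step 6: keep iterating
        mistakes = 0
        for I, O in examples:
            O_prime = 1 if I > w else 0    # step 3: run the guessed program P'
            E = O - O_prime                # step 4: error on this example
            if E != 0:
                w -= lr * E                # step 5: tweak W to reduce the error
                mistakes += 1
        if mistakes == 0:                  # all examples agree: good enough
            break

    print(f"synthesized threshold w = {w:.2f}")   # lands somewhere between 0.4 and 0.6

Real PbE/ML systems differ only in scale: richer templates, millions of parameters, and smarter update rules than this naive nudge.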

When do we finish? Ideally when O’ is close enough to O for all examples, but then what is close enough? What if we have millions of examples? So, in practice, we set up (one or more) aggregate similarity metrics m over all examples and stop when m crosses a threshold. Use surrogate metrics if the true metric is hard to formalize or to learn from automatically. If your template memorizes the data too well, without learning the relevant input predicates/features, that is a problem (overfitting). So, split the data into train and test sets — synthesize P’ over the train-set examples, then check how it performs on the test set.
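
As a small sketch of that train/test discipline, assuming scikit-learn and using its bundled iris data purely as a stand-in for "your examples":

    # Hold out a test set so the synthesized program P' is judged on
    # examples it never saw during synthesis.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    p_prime = DecisionTreeClassifier().fit(X_train, y_train)   # synthesize P' on the train set
    m = accuracy_score(y_test, p_prime.predict(X_test))        # aggregate metric m on the test set
    print(f"test accuracy: {m:.2f}")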

Note that this workflow enables us to approximate the true program P for the task (if it exists) with a synthesized P’ and improve P’ iteratively. Also, in contrast to our traditional software engineering goal (write a bug-free program), we don’t aim for the perfect P (perhaps it does not exist, or is too hard to find); instead, a good-enough P’ is OK.

The above workflow may look pretty familiar if you’ve studied Machine Learning (ML) — it corresponds to supervised ML.
Examples may be a table of labeled observations, a set of images, or a set of sentences and their translations. The templates vary from linear models to decision trees to neural networks, e.g., a popular neural network template instance is (Conv -> Pool)* -> FC -> Softmax. Each template type has its own specialized learning algorithm and optimized implementations. Note that the end results, the models, are, in fact, the synthesized programs. To enable automated synthesis, i.e., automatic error computation and parameter correction (steps 4 and 5), we make the metrics differentiable.
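
As an illustration, here is one way the (Conv -> Pool)* -> FC -> Softmax template could be instantiated, sketched in PyTorch (the layer sizes, the 32x32 input and the 10 classes are arbitrary choices of mine):

    # One instantiation of the (Conv -> Pool)* -> FC -> Softmax template.
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # Conv -> Pool
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # Conv -> Pool
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 10),   # FC, assuming 32x32 input images
        nn.Softmax(dim=1),           # Softmax over the 10 classes
    )

The parameters W of steps 2-5 are simply the weights inside these Conv and Linear layers.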

Machine learning (ML) and PbE aren’t really that different.

ML models are, in fact, probabilistic programs, which map inputs to a set of real numbers. These numbers are samples from the probability distribution of a random variable that, say, takes category names as values. For example, if you wanted to classify an image I as a cat, a dog, or neither, the model/program outputs three real numbers, each of which is the conditional probability of I being in the respective class, P(class | I).

An sklearn.linear_model.LogisticRegression instance synthesizes a probabilistic program for separating your data: for each observation input X, the output is a (category -> weight) mapping; you may pick the category with the highest weight to obtain a discrete answer.
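
A minimal sketch of this in code, with synthetic data standing in for real observations:

    # A LogisticRegression model viewed as a small probabilistic program:
    # for each input it returns a weight per category, not a single answer.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    prog = LogisticRegression().fit(X, y)   # synthesize the program from examples

    probs = prog.predict_proba(X[:1])       # e.g. something like [[0.15, 0.85]]: P(class | I)
    answer = probs.argmax(axis=1)           # pick the highest-weight category for a discrete answer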

Note the change of viewpoint here — traditionally, we assume a software component returns a single right answer. With probabilistic programs, there are multiple answers, and the program only tells you the relative weight of each answer. Viewing ML algorithms as program synthesis algorithms and ML models as programs also helps them fit into the existing software ecosystem, e.g., an ML model/program can be a sub-module of another one, and we can create a hybrid ecosystem of rule-based and example-based programs.

Traditional ML algos aren’t pure PbE though — they involve

(A) rule-based feature engineering, followed by

(B) an automated PbE/model synthesis process.

Stage (A) transforms example inputs I to a more stage (B)-friendly format — enabling (B) to synthesize better models. The human programmer/ML-guy (recall H from earlier) now gets stressed about two things.

In stage (A), (s)he has to figure out all the magical rules to get the inputs in a ‘good enough’ feature format. In stage (B), (s)he has to pick the right ‘hyper-parameters’, which enable the PbE algorithm to deliver the best results.
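
Both worries show up in one scikit-learn snippet; this is a hedged sketch in which the feature rules, the model and the hyper-parameter grid are all arbitrary choices that H would normally agonize over:

    # Stage (A): hand-chosen feature rules. Stage (B): automated model synthesis.
    # The hyper-parameter grid is the part H still has to guess or search over.
    from sklearn.datasets import make_classification
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=300, n_features=6, random_state=0)

    pipe = Pipeline([
        ("features", PolynomialFeatures(degree=2)),    # stage (A): rule-based feature engineering
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),  # stage (B): PbE / model synthesis
    ])

    search = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=3)   # hyper-parameter guessing
    search.fit(X, y)
    print(search.best_params_, search.best_score_)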

In the last decade, we have seen a remarkable improvement in a particular PbE method, now well known as Deep Learning. Here the template is a multi-layer neural network (NN) — the parameters in each layer are learned using the back-propagation algorithm. Hyper-parameters include the learning algorithm's parameters, the choice of layers, the similarity metrics, and so on. Thanks to deep learning, we are now able to synthesize reasonable programs for the so-called natural inputs: speech, images and text (SIT). Moreover, we are rapidly making these programs better, so that we can rely more and more on them in practice.
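
Back-propagation itself is just steps 3-5 of the earlier workflow made differentiable; a rough PyTorch sketch, with random tensors standing in for a real batch of examples:

    # The loss (our differentiable metric) is pushed backwards through the
    # layers to compute parameter corrections.
    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))   # a small multi-layer NN
    loss_fn = nn.CrossEntropyLoss()                                       # the similarity metric
    opt = torch.optim.SGD(net.parameters(), lr=0.1)                       # learning-rate hyper-parameter

    X = torch.randn(128, 20)            # stand-in inputs I
    y = torch.randint(0, 3, (128,))     # stand-in expected outputs O

    for step in range(100):
        opt.zero_grad()
        loss = loss_fn(net(X), y)       # steps 3-4: run P', compute the error E
        loss.backward()                 # back-propagate E through the layers
        opt.step()                      # step 5: tweak the parameters W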

It is useful to reflect a bit on the kinds of programs we may write over SIT data types: recognizing/detecting/segmenting objects in images and videos, classifying/labeling/translating/summarizing texts, question answering, audio to text, language/speaker/voice-activity identification, to barely scratch the surface. We also want to work across modes — caption images, generate image/audio from text and finally have a common representation for ‘meaning’ across modes. Clearly, there is a vast spectrum of such programs, with plenty of use cases. Recall that, for decades, across multiple hype cycles, we have yearned without success to get these programs ‘right’.

Deep Learning, like Machine Learning, also involves the two stages (feature engineering, then model synthesis), with the promise that the programmer H will spend minimal effort on stage (A). Given the variety of good SIT programs we can write now, I believe the promise is well kept so far. Plus, the global presence of great human knowledge distillers and the availability of tools, resources and multi-modal data have added to the DL buzz. Programs over the SIT domain were hard to write using traditional rule-based systems or ML systems (it is hard to concoct rules for good stage (A) features manually). With DL, it is a win-win: we write minimal rules to obtain these ‘richer’ programs.

Viewing DL as Program Synthesis also helps when we study different neural network architectures. Feed-forward networks are acyclic programs (or circuits). Recurrent NNs are programs with loops (or sequential circuits). The primitive building blocks of networks are growing in variety — initially we had only linear, convolution and recurrent blocks. Now we’ve discovered more — dilated blocks, residual blocks, attention blocks, (batch/layer/..) normalization blocks, gated blocks, and so on. These are akin to the instructions of a hardware instruction set architecture (ISA) — the basic units of assembly-language programs. Depending on how rich our instruction set is, we can write more concise and better-performing programs. Thus, as we discover new primitive building blocks, we will be able to translate our high-level tasks into programs more easily.
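
To make the 'richer instruction set' idea concrete, here is roughly what one of the newer primitives, a residual block, looks like as a reusable unit (a simplified PyTorch sketch; real formulations add normalization, strides and projection shortcuts):

    # A residual block as a reusable "instruction": output = F(x) + x.
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(self.body(x) + x)   # skip connection: add the input back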

Finally, note that the dream of pure PbE isn’t realized yet. DL architectures consist of these primitive (Lego!) blocks — for a given task, picking the right grammar over these blocks is an art. As of now, the programmer H must spend a lot of effort in data engineering and figuring out the right grammar/architecture using trial-and-error. Pruning or automating the tedious trial-and-error is perhaps the next frontier towards pure PbE.

Wrapup

  • ML/DL are instances of Programming by Example. Interestingly, we can now write more complex programs, not by writing rules, but by collecting lots of examples and doing program architecture/data engineering.
  • The ML/DL hype isn’t out of the blue. Several AI winters have preceded it. So have many past efforts to synthesize useful programs from examples — we have to see how far this trend permeates the rest of the software ecosystem.
  • Some parts of this article may be over-simplified but I hope not wrong. Please feel free to comment.

Nishant Sinha

Researcher, Consultant, Educator | Deep Learning, Reasoning | OffNote Labs, ex-IBM Research, Carnegie Mellon | nishant at offnote.co