Deep Coding with Deep Learning
2020 was a challenging year for every business, but particularly for those that relied on travel and tourism, and the industry serving professional conferences was left with difficult questions of how to proceed. Conferences targeting computer scientists have long been a sort of outlier in this domain, as in most other industries gatherings are built around primary agendas like building sales and procurement pipelines in some segment. For computer science research conferences these agendas are almost secondary, as the conferences have evolved to be more of an alternative to publishing through formal academic journals, serving to distribute research internationally and establish connections for collaborations across academia and industry. This made sense partly because of the accelerated pace of progress in the domain, as well as the sheer scale of manpower devoting attention to the field. There are certain challenges facing society that computer scientists are in a unique position to address, the pandemic amongst them.
When the 2020 border closings and lockdowns hit, ICLR was the first top tier AI research conference to be faced with the difficult questions of how to proceed. The form of online gathering that they engineered, mostly hurriedly patched together in the space of a few short weeks preceding that event, was cohesive enough that it became a legitimate template, one that has been only tweaked in iterations of this and most other AI research conferences in the 2+ years since. Featuring staples like a time zone neutral presentation calendar, prerecorded poster presentations, video chats, Zoom socials, and a ragtag hashtag social media stream, it would be immediately recognizable to anyone who has attended a research conference in the time since.
I make this statement cautiously, for there is still uncertainty surrounding the long term trajectory, but it does appear that material progress is now being made against the pandemic. Vaccine formulas are iterated nearly as fast as the virus evolves, and vaccinated population density is encouragingly up since some early politicized hesitancy. Thus ICLR organizers find themselves once again at the forefront, asking questions of how research conferences should now proceed. Do they return to having a new country host every year? Alternate annually between virtual and in person? Establish some hybrid form of gathering? This author suggests that the only sensible way these questions can be answered is to first attempt to clearly articulate one more critical question. What are the primary agendas of AI research conferences? Is it just to pad resumes? Get people hired? Advance the state of the art for the collective good? It may be a little of each, but perhaps most importantly it amounts to helping each other out.
The highlight of my conference turned out to be the closing day’s workshop, for which I settled on the Deep Learning for Code meetup out of a long list of candidates. These workshops generally follow the form of a dedicated research focus with a handful of invited speakers and accepted paper submissions shared in poster sessions, often with speakers closing out the day in a collective panel discussion. It was this workshop’s panel discussion that serves as the inspiration for this essay. The panel featured invited speakers Miltiadis Allamanis, Jacob Andreas, Graham Neubig, David Choi, Yujia Li, Jerry Tworek, and Xinyun Chen, and I will share excerpts from it along with speaking points from a few of the dedicated presentations by these same speakers.
In the short history of applying deep learning to code generation, there has been a recent paradigm shift, with progress enabled by the massive scaling of large language models. Consider that when one of the original generative pre-trained transformer architectures, GPT-1, was trained in 2018 (with 116M parameters on a 1B token training dataset), it did not demonstrate any capabilities in this domain, nor did its more capable 2019 sibling GPT-2 (with 1.5B parameters trained on 10B tokens). However, when the third generation GPT-3 was scaled up even further to 175B parameters trained on 499B tokens, it came as a surprise to the community that simple prompt requests could return code for targets like Keras, SQL, or even basic Python, with precise, dialect aligned, functional output that in some cases performed just as requested by the natural language specification.
The GPT-3 language model wasn’t intentionally trained to generate code, and it wasn’t even exposed to a significant corpus of code during training. However, with such a sizable collection of natural language it organically saw all kinds of discussions and documentation adjacent to the domain. From these it extracted coding capabilities in the same self-supervised manner, predicting each next token based on preceding context, by which it impressively learned all those properties of natural language as originally intended.
With the discovery of this emergent capability, it was only natural that those same researchers would attempt to encourage further such capabilities, and it was thus that a similar massively scaled generative pre-trained GPT architecture was trained on a more significant coding language corpus, this time including resources like the full open source catalogue archived in the proprietary repo known as GitHub, resulting in what is now available via the commercialized Codex API, and which will most likely be followed by more open sourced alternates to come.
Although certainly versatile and potentially useful, Codex has not exactly solved the problems inherent in applying deep learning to code, just as GPT-3 is not what could be considered an artificial general intelligence. Codex’s natural language prompt interface allows for providing descriptions of what a function or code snippet should accomplish, like one might detail through coding comments, and the model generates formal language attempting to match such a specification. The ginormous caveat is that validation of such generated output remains squarely on the shoulders of the user. Thus for now Codex is probably best reserved as a creativity aid or an alternative to Stack Overflow for overcoming implementation hurdles in micro code segments. There may be possible uses in an educational setting, but the workshop panelists appeared to agree that, with the tool in its current form, developing a fundamental understanding might be better served by traditional learning practices. After all, we don’t give students a calculator until after they have learned how to add and multiply.
What kind of capabilities would it take for researchers to consider the application of coding solved by an artificial intelligence? First of all, the incorporation of some model into a software developer’s workflow will need to clear the simple hurdle that time spent coding with that tool is less than time spent coding without it. Put differently, how much time does it save us? There are a lot of different ways that humans may interact with or integrate such models, and it remains an open question what a mainstream form may eventually evolve to look like.
For now the generation capacities of models are reliant on prompt descriptions closely aligned with the level of detail inherent in the code. Suppose one wanted to climb up a few layers of abstraction in their prompt specification, like, say, the difference between telling a cook to put a pan on the stove top, turn on the heat, pick up a spatula, and so on, versus merely suggesting that he cook an egg. Such hierarchical reasoning is not currently achieved by transformers. They require full detail in their prompts. The modality of natural language has a certain universal connectivity that may yet enable mapping routes to climb such hierarchies. Jacob Andreas’ presentation offered further musings on what this might entail, such as allowing the model to build a library of subroutines indexed within natural language’s latent space.
There is a reason coding is conducted with specialized grammar and rules. If we really expect to generate specific desired programs from instructions detailed with fundamentally ambiguous natural language, our models will need some form of capacity to recognize what aspects of a prompt have insufficient detail, and potentially query back to the user for clarification. Such an interactive form of interface may be another solution to this challenge, perhaps enabling the difference between cooking an egg and building an operating system.
Machine generated code would be vastly more valuable if it could be automatically validated as accurate and bug free. After all, in mainstream practice most of a developer’s time isn’t spent generating code, it is spent finding and fixing bugs within that code. Such troubleshooting requires first identifying the location of the bug’s source and second repairing the code to its intended form. Current models are more capable at repairing a known bug than at localizing its source, which may be especially challenging considering that bugs may be introduced at the interfaces between local code and imported libraries, for instance. Localizing the sources of bugs would require a kind of “understanding” beyond what transformer based architectures are achieving in practice, and Miltiadis Allamanis’ presentation proposed a form of hypergraph attention architecture to promote the emergence of such a localization capacity.
In the meantime, with large language models built on generative pre-trained transformer architectures still the most powerful tools at our disposal, a new form of inference has been found to extend capabilities beyond what immediately presents in the few shot learning setting. By the simple act of repeatedly sampling prompted inferences in bulk, the set of generated outputs can then be ranked to home in on the strongest solutions out of a potentially diverse mix of candidates. Paraphrasing an infamous movie line, 100 samples isn’t cool. You know what’s cool? A million samples.
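The appeal of sampling in bulk can be made concrete with a little arithmetic. The sketch below is my own illustration rather than anything shown at the workshop, and it leans on a simplifying assumption that each sample is independently correct with some fixed probability `p`. The function name and the example numbers are hypothetical.

```python
def chance_of_any_correct(p: float, k: int) -> float:
    """Probability that at least one of k samples is correct.

    Assumes (for illustration only) that samples are independent and
    each is correct with the same probability p.
    """
    return 1.0 - (1.0 - p) ** k


# A hypothetical model that is right only once in a thousand attempts.
p = 0.001
for k in (1, 100, 10_000, 1_000_000):
    print(f"{k:>9} samples -> {chance_of_any_correct(p, k):.4f}")
```

Even a weak per-sample success rate compounds quickly as the sample count grows, which is why the hard part shifts from generating a correct answer to finding it among the candidates.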
In fact, with a large enough scale of sampling, intermediately parametrized models (like GPT-2) may include in their output solutions approaching the capabilities of even the largest models (like GPT-3). The hard part remains sorting the wheat from the chaff, so to speak, as identifying a final output of generated code requires a ranking basis. Jerry Tworek described in his presentation a few approaches that could potentially generalize across applications. In traditional natural language applications, generated token paths are often sorted by the mean of their log probabilities, which amounts to simply selecting what the model deems the most likely path of words, and this approach may be extended to selecting a likely path of tokenized coding language strings. A more sophisticated convention could be to train a verifier model that evaluates whether sampled generative output could be considered realistic, which in some cases can be established without intensive resources.
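As a minimal sketch of the mean log probability ranking described above, consider the following. The `Sample` container and the log probability values are hypothetical stand-ins of my own, assuming only that the sampler reports a log probability for each generated token. This is not the code used by any of the presenters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    code: str
    # Hypothetical: per-token log probabilities reported by the sampler.
    token_logprobs: List[float]


def mean_logprob(sample: Sample) -> float:
    """Average log probability per token, so long outputs are not penalized for length alone."""
    if not sample.token_logprobs:
        return float("-inf")
    return sum(sample.token_logprobs) / len(sample.token_logprobs)


def rank_samples(samples: List[Sample]) -> List[Sample]:
    """Sort candidates from most to least likely according to the model."""
    return sorted(samples, key=mean_logprob, reverse=True)


# Made-up numbers for illustration only.
samples = [
    Sample("def add(a, b): return a + b", [-0.1, -0.2, -0.1, -0.3]),
    Sample("def add(a, b): return a - b", [-0.1, -0.2, -0.9, -1.4]),
    Sample("def add(a, b):\n    return sum([a, b])", [-0.3, -0.4, -0.2, -0.5, -0.6, -0.4]),
]
best = rank_samples(samples)[0]
print(best.code)
```

Note that this ranks by what the model finds likely, not by what is correct, which is part of why a separate verifier may be attractive.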
Another approach to ranking bulk samples was applied in the AlphaCode model targeting competitive coding solutions, which was described in a presentation by Yujia Li. This research found that in many cases generated code solutions produced output that was basically redundant with other sampled solutions, and a form of majority vote could be applied by grouping sampled solutions on redundancy and selecting a returned candidate from the largest aggregated set. However, the most interesting finding from this presentation wasn’t just the manner of selecting samples. It was the demonstration that with bulk sampled inference, the traditional expectation of performance falloff once a validation metric shows signs of overfit did not necessarily hold. When training was extended deeper into the overfit regime, the resulting bulk sampled inference continued to generate coding solutions with improved performance, perhaps demonstrating that some new form of validation is needed for training models intended for inference with bulk sampling.
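The grouping idea can be sketched in a few lines. What follows is a toy illustration of my own, not AlphaCode’s implementation: it assumes each candidate defines a function with the hypothetical name `solve`, runs every candidate on a few probe inputs, groups candidates whose outputs match, and returns one member of the largest group.

```python
from collections import defaultdict


def behavior_signature(source, probe_inputs, entry_point="solve"):
    """Run a candidate on the probe inputs and summarize its outputs as a tuple."""
    namespace = {}
    try:
        # Toy setting only: a real system would sandbox untrusted generated code.
        exec(source, namespace)
        fn = namespace[entry_point]
    except Exception:
        return None  # candidate does not even define the entry point
    outputs = []
    for args in probe_inputs:
        try:
            outputs.append(repr(fn(*args)))
        except Exception as exc:
            outputs.append("error:" + type(exc).__name__)
    return tuple(outputs)


def pick_by_majority(candidates, probe_inputs):
    """Group candidates by identical behavior and return one from the largest group."""
    clusters = defaultdict(list)
    for source in candidates:
        signature = behavior_signature(source, probe_inputs)
        if signature is not None:
            clusters[signature].append(source)
    if not clusters:
        return None
    return max(clusters.values(), key=len)[0]


# Hypothetical sampled solutions to "add two numbers".
candidates = [
    "def solve(a, b): return a + b",
    "def solve(a, b): return b + a",
    "def solve(a, b): return a * b",
    "def solve(a, b): return sum((a, b))",
]
probe_inputs = [(1, 2), (3, 4), (0, 5)]
chosen = pick_by_majority(candidates, probe_inputs)
print(chosen)
```

The vote needs no known correct answers for the probe inputs, only agreement between samples, which is what makes it usable when no reference output is available.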
One objective that the coding competition application may not always fully capture is not just generating code of a described functionality, but doing so with minimum run time and resource utilization, which can be referred to as superoptimization. The workshop’s best paper award went to “Learning to Superoptimize Real-World Programs” by Alexander Shypula, Pengcheng Yin, Jeremy Lacomis, Claire Le Goues, Edward Schwartz, and Graham Neubig, in which optimization was established by developing a new training set demonstrating representative optimization tactics in application, which was then used to fine tune a transformer architecture. This approach appeared to outperform the more common heuristic proxy of minimized program length, which aligns with intuition for the algorithmic modality, since program length says little about run time. After all, an infinite loop can be generated with just a few characters.
In the end, a true solution to the full application of code generation and software engineering will require a form of learned generalization that can be abbreviated to the acronym AGI. Artificial general intelligence. Given the huge advantages that will result to the first to exceed this threshold, it has become accepted by many important players that it serves the public good to conduct research in an open manner with disclosed findings shared across silos in industry and academia. This is what artificial intelligence research conferences like ICLR are for after all.