ICLR2017 — deep thought vs exaflops
Many of the papers now judged most original and significant rely on massive compute resources, usually beyond the financial reach of academia. So where does that leave academic research?
Last week, I travelled to the South of France (such a hardship) to attend deep learning conference ICLR (International Conference on Learning Representations*).
Lots to take away, but one thing struck me in particular — many of the papers deemed most significant relied on massive compute resources that are usually unavailable to academics. I wondered — what avenues remain open for compute-limited academic contribution? I try to answer that question below. (I also wondered whether this will soon be a moot question, once GAFA et al have finished recruiting all the academics!)
Top marks awarded to corporate-affiliated research
Submissions to the conference were sorted — based on quality, clarity, originality and significance — into oral presentations, conference posters, workshop posters, and rejections. Three oral papers were further awarded ‘best paper’ status.
- Out of 451 paper submissions, 15 were selected for oral presentation, another c.230 as conference- or workshop-track posters (they’re all listed here).
- Of the 15 oral papers, only 3 had exclusively academic authors. One of these was awarded a ‘best paper’ prize.
- In contrast, Google Brain and DeepMind researchers co-authored 6 oral papers, followed by Facebook (3), Intel (2), Twitter (2), and Uber (1).
I’m going to go out on a limb here and assert that researchers in corporations are no more likely to write well than those in universities. So, corporate success must come down to the ability to ask (and answer) a wider range of original and significant questions. We’re used to the idea that recent breakthroughs in machine intelligence have relied on large datasets; we’re seeing ever more clearly that there is also some reliance on “big compute”. Taking Google Brain as an example, what university can afford to conduct experiments like these?
- Neural Architecture Search with Reinforcement Learning (oral) used 800 GPUs in its experiments.
- Capacity and Trainability in Recurrent Neural Networks (conference poster) used “CPU-millennia worth of computation” (otherwise known as, “a ridiculous amount”).
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (conference poster) used a GPU cluster to train models with >100bn parameters.
- Massive Exploration of Neural Machine Translation Architectures (not an ICLR submission, a more recent submission to ACL 2017) reports empirical results corresponding to over 250,000 GPU hours.
It seems there are limits, though, to the expense and time that even Google Brain is willing to go to in order to strengthen results. The following papers were co-authored by Google Brain researchers, but took a more pragmatic view of the cost–result trade-off:
- Hyperband: Bandit-Based Configuration Evaluation for Hyperparameter Optimization (conference poster) “we chose our comparison set to be as informative as possible given the high cost of running these experiments (the total cost was over 10k in EC2 credits and the CNN experiments took over 10k GPU hours).”
- Revisiting Distributed Synchronous SGD (rejected): “We agree with the reviewer that our results could be strengthened by averaging over multiple runs. Unfortunately, doing so is rather expensive — 10 runs of the Inception experiments could cost ~150,000 GPU hours”.
This inequality of resources (not unique to this field, of course) makes it harder for up-and-coming researchers to make a mark without seeking corporate sponsorship; and harder still for those academics who prefer not to align with the commercial world to make a contribution at all.
Then there’s all the talk of democratising AI, which is belied by this inconvenient reliance on hardware and/or data. (That is in no way meant to denigrate any of the wonderful open-source packages and their hard-working authors and contributors.)
So, are there other ways?
Innovating on a budget
Ever more complex architectures, ensembles of models, and big hyper-parameter searches are being fed into big compute (critically labelled ‘just brute force’ or ‘building a taller ladder to get to the moon’). But that doesn’t mean every original and significant question must rely on proprietary “big” data and big GPU farms…
As ever, inspiration comes from the human brain, which does not need lots of data and repetition to learn, and can readily build on concepts and make connections between domains (“generalization” and “transfer learning”). Researchers in the field of neural program induction see that a key part of this capability amounts to learning to write programs, e.g.:
- “Recursion divides the problem into smaller pieces and drastically reduces the domain of each neural network component” (Making Neural Programming Architectures Generalize via Recursion, oral, best paper award).
- the “ability to distil knowledge into subcomponents that can be shared across tasks” (Lifelong Perceptual Programming By Example).
- “by composing low-level programs to express high-level programs” (Neural Program Lattices).
Right now, research is focused on learning ‘text-book’ algorithms — such as ‘grade-school’ addition and sorting routines. This seemed totally counterintuitive to me — learning programs from data when we already have optimal solutions. Why not just present those as units of prior knowledge? But I see that this may be missing the point — the simplest, perfectly generalizable learned programs may show the way to composing much more complex ones, and towards AGI. Luckily, one can easily generate training data for these simple tasks too!
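That last point is worth dwelling on: for text-book algorithms, labelled data really is free. Here’s a minimal sketch (function names are my own, not from any of the cited papers) of generating supervised examples for a grade-school addition task — real program-induction work often also supervises on execution traces, but the input/output case shows how cheap the data is:

```python
import random

def make_addition_example(max_digits=3):
    """Generate one (inputs, target) pair for a grade-school addition task.

    Inputs and target are digit strings, as a sequence model would see them.
    """
    a = random.randint(0, 10**max_digits - 1)
    b = random.randint(0, 10**max_digits - 1)
    return (str(a), str(b)), str(a + b)

def make_dataset(n, seed=0):
    """Generate n labelled examples; millions cost essentially nothing."""
    random.seed(seed)
    return [make_addition_example() for _ in range(n)]

data = make_dataset(1000)
```

No annotators, no scraping, no licensing — which is precisely why these tasks are attractive test-beds for compute-limited research.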
Much more on this topic can be found via the NIPS2016 Neural Abstract Machines and Program Induction Workshop website.
“Computers will learn to program themselves. It will happen” (Alex Graves, Google Deepmind, at ICLR2017)
Other methods of incorporating prior knowledge into models seem intuitively likely to yield results. These could be hybrid approaches that ally deep learning with symbolic AI (like Marta Garnelo, @mpshanahan & @KaiLashArul’s Towards Deep Symbolic Reinforcement Learning paper from last year) or which learn structure to help with reasoning tasks (Learning Graphical State Transitions). Surely some such approaches will have the potential to reduce compute and data requirements?
There are other pressing research questions around how we implement machine learning systems that we can rely upon. A fascinating invited talk from Benjamin Recht (@beenwrekt) put learning theory centre-stage, arguing that getting a better theoretical understanding of deep learning is vital to improve trust, scalability, and predictability.
“Stability and robustness are critical for guaranteeing safe, reliable performance of machine learning” (Benjamin Recht, Berkeley, at ICLR2017)
Perhaps the most discussed paper at ICLR this year was a prize-winning look at the generalization properties of deep neural architectures (Understanding Deep Learning Requires Rethinking Generalization). This evaluated the question through experimentation; the fact that nobody seemed to entirely agree on interpretation of the results suggests that we’re in need of some strictly analytical insight too!
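The core manipulation in those experiments is strikingly simple: corrupt some fraction of the training labels with uniform noise, retrain, and observe that standard architectures still drive training error to zero — evidence that their effective capacity suffices to memorize the data. A minimal sketch of the label-corruption step (the helper name and toy labels are my own; the paper applies this to CIFAR-10 and ImageNet labels):

```python
import numpy as np

def randomize_labels(labels, fraction=1.0, num_classes=10, seed=0):
    """Replace a given fraction of labels with uniformly random classes.

    fraction=0.0 returns the labels unchanged; fraction=1.0 discards all
    label signal, which is the paper's fully-randomized setting.
    """
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    mask = rng.random(len(labels)) < fraction  # which labels to corrupt
    labels[mask] = rng.integers(0, num_classes, size=mask.sum())
    return labels

y = np.arange(100) % 10          # toy stand-in for a 10-class label vector
y_rand = randomize_labels(y)     # fully randomized labels
```

The experiment itself is cheap to state, but note that reproducing the paper’s results still means training full-size networks to convergence on the corrupted data — the analytical follow-up, not the compute, is where academics can contribute.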
The same could be said for questions of algorithmic accountability, bias, and privacy too.
So, there you go, a few areas of pivotal research which needn’t require vast data or vast compute — the induction of simple programs, hybrid models, and analytical theories of generalization in learning. Do you agree? Are there other areas? I’d love to hear your views.
(*) For those new to machine learning, the conference title itself may need some explanation. Here’s what the conference website says about learning representations:
“The performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. The rapidly developing field of representation learning is concerned with questions surrounding how we can best learn meaningful and useful representations of data. We take a broad view of the field”