LightOn’s Summer Series #1 — Faith No Moore: Silicon Will Not Scale Indefinitely

Aug 9, 2019

Welcome to LightOn’s Summer Series! Throughout the end of summer and the beginning of fall, we take you on a tour of our unique technologies and what motivated their development. At LightOn we are developing novel optics-based computing hardware leveraging natural physical processes to perform high-dimensional computations at unprecedented speed and power efficiency. Our hardware accelerators are uniquely suited to tackle the most demanding machine learning applications. In this series, we will venture to the edge of existing hardware capabilities — and beyond! — and we will showcase how light can be used to unlock new possibilities in machine learning.

For our first installment, we will focus on the challenges raised by the ongoing machine learning revolution. In particular, we will examine the limitations in existing silicon-based hardware.

Software, data, hardware: the triumvirate of machine learning

**Figure 1.** Five years of progress in generative adversarial networks, from their inception (left) to BigGAN (right). The first four samples are from [1–4], trained on the Toronto Face Dataset and CelebA, with the 2017 sample reaching 1024x1024 resolution; the last is from [5], trained on ImageNet.

Machines that learn and software that adapts to our needs are growing ubiquitous, bringing transformative changes to countless industries and services. Once constrained to toy problems, machine learning (ML) applications have now spread to the real world, from powering smart assistants [6, 7], to improving medical care [8], or even helping researchers uncover new insights in materials sciences [9]. This ongoing revolution is being driven by three concurrent technological factors:

Smarter algorithms: refinements of existing statistical methods and brand-new frameworks have empowered machines with an unprecedented ability to draw complex insights from a wealth of data. On top of providing us with the tools to distill our complex world, novel architectures like the transformer [10] or generative adversarial networks [1] (Figure 1) enable machines to create new content.
Abundant data: ever-growing sources of high-quality data have become available to feed these algorithms. Privacy issues notwithstanding, this is a boon for a field in which additional high-quality data directly translates into better performance.
Cheap compute: the staggering increase in the amount of compute available has transformed our devices into relentless data-crunchers. Thanks to dedicated chips, even mid-range smartphones now have ML-processing capabilities, allowing such algorithms to be applied in an ever-expanding variety of contexts.

These factors are deeply intertwined. Sometimes, they compensate for one another: smarter solutions may be able to learn complex representations with fewer samples on a tight compute budget. Yet, often, they amplify each other: larger datasets combined with more advanced algorithms require more expensive computations. To fulfil its promises, the machine learning revolution needs all three of these drivers.

So long, and thanks for all the transistors!

**Figure 2.** Evolution of single-thread performance against a 1978 baseline. What we are facing is a beast with many names (the end of Moore’s Law, the end of Dennard scaling, Landauer’s limit, and so on), but its wrath already has measurable consequences.

However, one of these workhorses is currently facing trouble. Deep learning’s most fashionable successes have required an exponential increase in compute: the resources needed for state-of-the-art (SOTA) algorithms are estimated to double every 3.5 months [11]. Moore’s Law, by contrast, is usually read as a doubling of transistor counts roughly every two years, and for it to remain relevant, chip foundries and manufacturers have to keep finding ways to pack more transistors into processor dies. As they struggle with increasing power densities (the breakdown of Dennard scaling) and with finer, more expensive lithography, manufacturers turn instead to enhanced parallelism by multiplying cores.
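To get a feel for the mismatch between these two exponentials, here is a minimal Python sketch. The 3.5-month doubling period is the estimate from [11]; the 24-month period is an assumption standing in for a common reading of Moore’s Law, and the function and constant names are purely illustrative.

```python
# Assumed doubling periods, in months. The 3.5-month figure is the estimate
# quoted from [11]; the 24-month figure is an assumed reading of Moore's Law.
AI_COMPUTE_DOUBLING_MONTHS = 3.5
MOORE_DOUBLING_MONTHS = 24.0


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    for years in (1, 2, 5):
        months = 12 * years
        demand = growth_factor(months, AI_COMPUTE_DOUBLING_MONTHS)
        supply = growth_factor(months, MOORE_DOUBLING_MONTHS)
        print(
            f"{years} yr: compute demand x{demand:,.0f}, "
            f"transistor count x{supply:.1f}, gap x{demand / supply:,.0f}"
        )
```

Under these assumptions, two years of SOTA progress call for on the order of a hundred times more compute, while transistor counts merely double over the same period.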

For deep learning applications, this is a boon: the most fundamental operation behind neural networks is matrix multiplication, which parallelizes well across many cores. Accordingly, the trend of adding cores has accelerated, first with the wide adoption of GPUs, and now with new hardware accelerators. Carefully crafted chips incorporate the core principles of modern machine learning at the transistor level, enabling more than 100x speedups compared to a general-purpose processor.
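Why matrix multiplication lends itself so well to many cores can be sketched in a few lines of Python: each output row depends on a single input row, so rows can be handed to independent workers. This is only an illustration, not how GPUs or accelerators are actually programmed; the function names and the toy layer are assumptions, and Python threads are used here to show the independence of the work items rather than to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def row_times_matrix(row, matrix):
    """One output row of a matrix product: depends on a single input row."""
    return [sum(a * b for a, b in zip(row, col)) for col in zip(*matrix)]


def matmul(x, w):
    """Sequential reference: computes the product x @ w row by row."""
    return [row_times_matrix(row, w) for row in x]


def parallel_matmul(x, w, workers=4):
    """Same product, with output rows handed to independent workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: row_times_matrix(row, w), x))


# A toy dense layer: 3 input samples with 2 features each, mapped to 2 outputs.
inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, so outputs equal inputs
```

Because no worker needs the result of another, the sequential and parallel versions return the same matrix whatever the order in which rows are processed.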

Not only has this allowed the industry to compensate for diminishing returns on newer generations of processors, but it has also shown that task-specific hardware is more relevant than ever. Now, dozens of startups and large companies alike are designing chips tailored to artificial intelligence applications.

Yet, these approaches are insufficient:

Hard physics barriers: silicon-based electronics are bound by stringent physical laws. On the one hand, thermal noise and quantum effects at the smallest scales stand in the way of ever-finer engravings, and Landauer’s limit sets a floor on the energy dissipated by each irreversible bit operation. On the other hand, as additional cores bring more thermal strain, the laws of thermodynamics become hard constraints. Cooling components packed so closely together, while they guzzle kilowatts, is a non-trivial exercise.
Communication bandwidth: moving billions of ones and zeroes between disk, memory, and dedicated chips is proving to be a challenging bottleneck. In practice, this issue means CPUs are still the preferred device for some memory-hungry applications [21].

Accordingly, exponential scaling of electronics is not a free lunch anymore [12, 13] (Figure 2).
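For a sense of scale, Landauer’s limit mentioned above can be evaluated directly: erasing one bit dissipates at least k_B T ln 2 of energy. The short Python sketch below computes this bound; the 300 K operating temperature is an assumption chosen to represent room temperature, and the names are illustrative.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant, exact in the 2019 SI


def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum energy dissipated when erasing one bit: k_B * T * ln(2)."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


# Assumed room temperature of 300 K; the result is about 2.9e-21 joules.
room_temperature_limit = landauer_limit_joules(300.0)
print(f"Landauer limit at 300 K: {room_temperature_limit:.2e} J per bit")
```

The bound scales linearly with temperature, which is one reason heat and energy per operation end up being two sides of the same constraint.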

A game of tug-of-war: bite-sized models vs no-limit architecture search

Towards lean machine learning

**Figure 3.** Pruning a neural network consists of removing neurons and connections that are of little importance to the end result [17]. Through this procedure, model size can often be reduced by nearly a factor of 10 without compromising performance.

The machine learning community has not remained indifferent to these concerning hardware trends. A number of countermeasures have been devised to permit the exponential scaling of compute to continue — if only for a bit.

Perhaps the most effective practice has been to switch to lower precision arithmetic. The standard number format for high-performance computing (HPC) has long been floating-point with 32 bits — or 64 bits for double precision, and even 128 bits for demanding simulations. However, machine learning is but statistics on noisy data. Half-precision (16 bits) — or even less — will do just as well [14]. And indeed, a significant part of the recent progress in hardware benchmarks is due to this switch: eye-catching performance metrics are often based on half-precision computations.
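The trade-off can be seen on a handful of numbers. The sketch below uses only Python’s standard `struct` module, whose `e` and `f` format codes correspond to IEEE 754 half and single precision, to round a few values to each format and compare the rounding error and the storage cost. The sample weights are made up for illustration.

```python
import struct


def round_to(fmt: str, value: float) -> float:
    """Round a 64-bit Python float to the format `fmt` and convert it back.

    "<e" is IEEE 754 half precision (16 bits), "<f" is single precision (32 bits).
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# Made-up weights, used only to illustrate rounding behaviour.
weights = [0.1234567, -0.7654321, 0.0009765, 3.1415926]

for w in weights:
    single = round_to("<f", w)
    half = round_to("<e", w)
    print(f"{w:+.7f}  fp32 error {abs(single - w):.1e}  fp16 error {abs(half - w):.1e}")

bytes_fp32 = struct.calcsize("<f") * len(weights)  # 4 bytes per value
bytes_fp16 = struct.calcsize("<e") * len(weights)  # 2 bytes per value
print(f"storage: {bytes_fp32} bytes in fp32, {bytes_fp16} bytes in fp16")
```

Half precision halves the memory footprint and the bandwidth needed per value, at the price of a larger rounding error that noisy statistical models can often tolerate.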

This movement toward lower precision is only getting started: at the cutting edge of research, practitioners are looking into making neural networks rely on integers [15], or even on quantized coefficients, down to ternary or binary numbers [16].
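As a rough illustration of what binary and ternary coefficients look like, the Python sketch below maps a few made-up weights to {-1, +1} and to {-1, 0, +1}. It only shows the quantization step itself, under the assumption of a simple sign rule and a hypothetical fixed threshold; it is not the training procedure of [15] or [16].

```python
def binarize(weights):
    """Map each weight to +1 or -1 according to its sign (zero maps to +1)."""
    return [1 if w >= 0 else -1 for w in weights]


def ternarize(weights, threshold):
    """Map each weight to -1, 0, or +1; small magnitudes are zeroed out.

    `threshold` is a hypothetical hyperparameter chosen for illustration.
    """
    return [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]


# Made-up weights, used only to illustrate the mapping.
weights = [0.8, -0.05, 0.3, -0.6, 0.01]
binary = binarize(weights)                   # [1, -1, 1, -1, 1]
ternary = ternarize(weights, threshold=0.1)  # [1, 0, 1, -1, 0]
```

With such coefficients, each weight fits in one or two bits, and multiplications reduce to sign flips or skipped terms.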

What’s more, pruning techniques have grown trendy — an extension of a lasting inclination towards sparse data structures. Indeed, while over-parametrized architectures may be required to find a good set of weights at training time, this requirement can be lifted at inference [17]. Swathes of neurons and/or connections can be outright removed, thereby enabling large speed-ups and model compression at little or no performance cost. These approaches are key to enabling ML-computing on the edge, in a larger variety of devices.
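One simple way to remove connections is magnitude pruning: weights with the smallest absolute values are set to zero. The Python sketch below is a minimal version of that idea, assuming a flat list of weights and a target sparsity; it is not the full pipeline of [17], which also involves retraining and further compression steps. The function name and sample weights are illustrative.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_pruned = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:n_pruned])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


# Made-up weights, used only to illustrate the procedure.
weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.03, 0.05, -0.3, 0.6, -0.04]
pruned = prune_by_magnitude(weights, sparsity=0.5)
# pruned == [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, -0.3, 0.6, 0.0]
```

The zeros can then be stored in a sparse format and skipped at inference time, which is where the speed-ups and compression come from.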

More compute is all you need… if you can afford it

**Figure 4.** The Transformer [10] is a staple of modern natural language processing, relying on attention-powered encoders and decoders to process complex inputs.

Yet, at the same time, other trends are making compute needs skyrocket. The situation is akin to Jevons paradox: any gain in efficiency is promptly offset by these new practices.

Neural architecture search (NAS) has been booming in recent years, achieving SOTA performance in computer vision and natural language processing. The practice is also controversial: it can tie up thousands of GPUs/TPUs for countless hours, and the performance gains obtained are often marginal. Worse, individual models, such as XLNet [18], are ballooning in complexity, with one-time training costs in the $250,000 range. Thus, even well-funded universities struggle to find resources to match private labs.

In response to this trend, transfer learning has grown more common, allowing practitioners to leverage the progress of pre-trained SOTA models for specific tasks. However, the stakes go beyond raw computing power and money within the community. Global environmental concerns are also coming to bear, as compute-hungry approaches have non-negligible carbon footprints. For instance, the energy consumption of data centers has already surpassed that of air traffic [20]. As a result, there is a sense that practitioners should disclose the compute requirements of their practices more clearly, and maybe even include an estimate of their environmental impact [19].

The new paradigms

In order for the machine learning revolution to continue unabated, new computing paradigms are required. These new paradigms should both scale to tremendous amounts of high-dimensional data, and do so on a tight energy budget. In our next installment, we will explore a general background for such a paradigm: photonics/optical computing. And through this series, we will show how, at LightOn, this is not just a distant prospect: it’s already here and at scale.

Our upcoming installments for this summer include:

1 — Faith No Moore: Silicon Will Not Scale Indefinitely (this post)
2 — Optical Computing: a New Hope
3 — How I Learned to Stop Worrying and Love Random Projections
4 — Random Projections at the Speed of Light: Full Ahead Mr. Sulu, Maximum Warp

Stay updated on our advancements by subscribing to our newsletter. Liked what you read and eager for more? You can check out our website, as well as our publications. Seeing is believing: you can request an invitation to LightOn Cloud, and take one of our Optical Processing Units (OPUs) for a spin. Want to be part of the photonics revolution? We are hiring!

References

[1]: Ian Goodfellow et al. Generative Adversarial Networks. NeurIPS, 2014.

[2]: Alec Radford et al. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, 2016.

[3]: Ming-Yu Liu et al. Coupled Generative Adversarial Networks. NeurIPS, 2016.

[4]: Tero Karras et al. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR, 2018.

[5]: Andrew Brock et al. Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR, 2019.

[6]: Yuxuan Wang et al. Tacotron: Towards End-to-End Speech Synthesis. Interspeech, 2017.

[7]: Eric Battenberg et al. Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis. Preprint, 2019.

[8]: Nenad Tomašev et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572:116–119, 2019.

[9]: Keith T. Butler et al. Machine learning for molecular and materials science. Nature 559:547–555, 2018.

[10]: Ashish Vaswani et al. Attention Is All You Need. NeurIPS, 2017.

[11]: Dario Amodei et al. AI and Compute. OpenAI Blog, 2018.

[12]: Chuck Moore. Data Processing in Exascale-Class Computer Systems. The Salishan Conference on High Speed Computing, 2011.

[13]: Venkatramani Balaji. Machine Learning and the Post-Dennard Era of Climate Simulation. 42nd ORAP Forum, AI for HPC and HPC for AI, 2018.

[14]: Suyog Gupta et al. Deep Learning with Limited Numerical Precision. ICML, 2015.

[15]: Dipankar Das et al. Mixed Precision Training of Convolutional Neural Networks using Integer Operations. ICLR, 2018.

[16]: Matthieu Courbariaux et al. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. NeurIPS, 2016.

[17]: Song Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR, 2016.

[18]: Zhilin Yang et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Preprint, 2019.

[19]: Emma Strubell et al. Energy and Policy Considerations for Deep Learning in NLP. ACL, 2019.

[20]: Nicola Jones. How to stop data centres from gobbling up the world’s electricity. Nature, 2018.

[21]: Yu Emma Wang et al. Benchmarking TPU, GPU, and CPU Platforms for Deep Learning. Preprint, 2019.

The author

Julien Launay, Machine Learning R&D engineer at LightOn AI Research.