In 2011, Marc Andreessen famously wrote that software was eating the world. Nine years later, Large Scale AI is enabling “real, high-growth, high-margin, highly defensible businesses”. To sustain that growth, Large Scale AI has started eating chips, as witnessed in the challenges and solutions presented at Hot Chips 32. Here is a (too) short summary of the presentations that provided the most context on this evolving trend.
Hot Chips 32 was the first virtual Hot Chips Symposium, covering a wide variety of topics: from quantum computers to server and mobile processors, GPUs and FPGAs. A significant portion of the program was dedicated to Machine Learning chips and systems, with a tutorial on Machine Learning Scale Out, and the afternoon of Day 2 on AI training and inference chips.
LightOn participated in the meeting with a poster titled “Light-in-the-loop: using a photonics co-processor for scalable training of neural networks” where we showcased early results of how our photonic AI chip can be used to train neural networks, including modern deep learning architectures such as graph convolutional neural networks. We showed the first practical demonstration of training neural networks with light — previous attempts had only been simulated, and never actually executed at scale. Stay tuned: we have much more coming along these lines.
Larger Models require Machine Learning Scale Out
The tutorial day started with presentations by NVIDIA, Google and Cerebras on Machine Learning Scale Out systems, followed by three presentations on scale out training experiences for large language models and recommender systems.
As Machine Learning models grow larger — up to a whopping 700GB for GPT-3 — it has become increasingly hard to train them efficiently. Indeed, recent deep learning models are so large that they no longer fit in the memory of a single compute instance. To address this challenge, two approaches are emerging: a distributed one and a monolithic one.
The distributed approach spreads model training across large clusters such as the DGX A100 Pod or the TPUv3 Pod. Vast amounts of data have to be exchanged within the pods while remaining in sync. This communication bottleneck imposes a strict limit on model size, as exemplified by GShard, Google’s own GPT-like attempt, where researchers reported failing to train the 1-trillion-parameter model.
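To make the trade-off concrete, here is a minimal NumPy sketch of tensor model parallelism, one common way to split a model that no longer fits on a single device. The function and shard names are ours for illustration (not from any vendor’s stack), and the sizes are toy values: each simulated “device” holds a column slice of the weight matrix, computes a partial output locally, and a communication step (a concatenation here, an all-gather on real hardware) reassembles the full result. It is that communication step that becomes the bottleneck at scale.

```python
import numpy as np

def sharded_forward(x, weight_shards):
    """Tensor-parallel matmul: each shard is one device's slice of the weights."""
    partial_outputs = [x @ w for w in weight_shards]  # local compute on each device
    return np.concatenate(partial_outputs, axis=-1)   # communication step (all-gather)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                # a batch of activations
full_weight = rng.standard_normal((8, 6))      # the "too large" weight matrix
shards = np.split(full_weight, 3, axis=1)      # split column-wise across 3 "devices"

# The sharded computation matches the monolithic matmul exactly.
assert np.allclose(sharded_forward(x, shards), x @ full_weight)
```

The arithmetic parallelizes perfectly; what does not is the reassembly, which is why pod-scale interconnects end up dictating the largest trainable model.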
Cerebras is taking a monolithic approach, designing a single giant chip capable of holding much larger models. While executing such a massive chip is a challenge in its own right (dynamically allocating resources, dealing efficiently with thermals, and so on), this solution opens exciting new possibilities while simplifying the engineering work required, by reducing the number of nodes needed to train a single model.
“Giant neural networks are awesome.” — From the conclusion of the GShard presentation during Tutorial day.
Much like the rise of deep neural networks a few years ago, we are witnessing a new era for machine learning and AI: very large-scale models with potentially trillions of parameters or more. Models such as OpenAI’s GPT-3 or Google’s GShard and T5 are potentially as transformational for industry as DNNs for classification were in the early 2010s. The chip industry is taking notice.
What is LightOn’s take on training these models and the ones to come: distributed or monolithic?
“When you come to a fork in the road, take it.” — Yogi Berra
LightOn’s approach is to go both ways. Our Aurora 1.5 OPU already performs random projections with one million dimensions at both input and output. We’ll need a few of those to train these new emergent large models. Details will come later.
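For intuition about what such a co-processor computes, here is a small NumPy simulation of a nonlinear random projection of the form y = |Rx|², with R a fixed complex Gaussian matrix. This is a sketch under our own assumptions — the dimensions are toy values, and in actual optical hardware R is realized physically by light scattering rather than stored in memory, which is what makes million-dimensional projections tractable.

```python
import numpy as np

# Toy simulation of an OPU-style random projection: y = |R x|^2,
# where R is a fixed complex Gaussian matrix. Sizes are illustrative;
# the optical hardware reaches ~1,000,000 dimensions without storing R.
rng = np.random.default_rng(42)
n_in, n_out = 64, 256
R = rng.standard_normal((n_out, n_in)) + 1j * rng.standard_normal((n_out, n_in))

def opu_like_projection(x):
    """Nonlinear random projection: intensity of the scattered field."""
    return np.abs(R @ x) ** 2  # the camera measures intensities, hence the |.|^2

x = rng.standard_normal(n_in)
y = opu_like_projection(x)
assert y.shape == (n_out,) and np.all(y >= 0)
```

Because the projection matrix is fixed and physical, compute and memory costs on the host stay modest even as the projection dimension grows very large.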
Build it quickly…
The afternoon of Day 2 featured the presentation of Google’s TPUv2 and TPUv3. The talk was fast-paced and entertaining, and it was particularly interesting to see the key goals of the TPUv2 project spelled out: the first was to build it quickly.
“A development day saved is a deployment day earned.” — From the TPUv2 and v3 presentation.
As a result, after the deployment of TPUv2, the TPU team had quite a bit of low-hanging fruit to pick for a performance increase in TPUv3.
LightOn also believes in fast prototyping and short time-to-market cycles. Being fabless and relying on commercial off-the-shelf components, we were able to deploy the world’s first optical co-processor on the cloud in less than a year.
Giant neural networks are awesome, but scaling out is a challenging engineering problem. At LightOn, we believe that current and future OPUs can help address these challenges. In the meantime, come to LightOn Cloud to see how to make Machine Learning at scale a reality with our products!
LightOn is a hardware company that develops new optical processors that considerably speed up Machine Learning computation. LightOn’s processors open new horizons in computing and engineering fields that are facing computational limits. Interested in speeding your computations up? Try out our solution on LightOn Cloud! 🌈
Igor Carron, CEO and co-founder, LightOn, Julien Launay, Machine Learning R&D Engineer at LightOn AI Research, Iacopo Poli, Lead Machine Learning Engineer at LightOn AI Research, and Victoire Louis, Community Builder and Marketing Manager at LightOn.