Matrix Multiplication-Free Language Models Maintain Top-Tier Performance at Billion-Parameter Scales

Synced · Published in SyncedReview · 3 min read · Jun 9, 2024


Matrix multiplication (MatMul) is a fundamental operation in most neural networks, and GPUs are heavily optimized to perform it. Despite this central role, MatMul operations remain a major source of computational expense, often consuming the bulk of execution time and memory bandwidth during both training and inference.

In the new paper Scalable MatMul-free Language Modeling, a research team from the University of California, Santa Cruz; Soochow University; the University of California, Davis; and LuxiTech introduces the first scalable MatMul-free language model (MatMul-free LM). Their findings demonstrate that MatMul operations can be eliminated entirely from large language models (LLMs) while maintaining strong performance, even at billion-parameter scales.

The MatMul-free LM achieves this by employing additive operations in dense layers and element-wise Hadamard products for self-attention-like functions. Specifically, ternary weights are used to eliminate MatMul in dense layers, similar to binary neural networks (BNNs). To remove…
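To see why ternary weights remove the need for multiplication, consider a minimal sketch (not the paper's implementation): when every weight is in {-1, 0, +1}, each output element of a dense layer is just a signed sum of selected inputs, so the layer can be computed with additions and subtractions alone. The function name `ternary_dense` below is illustrative.

```python
import numpy as np

def ternary_dense(x, w_ternary):
    """Compute W @ x without multiplications, assuming
    w_ternary has entries only in {-1, 0, +1}: each output
    is the sum of inputs where w=+1 minus those where w=-1."""
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(8).astype(np.float32)
W = rng.integers(-1, 2, size=(4, 8)).astype(np.float32)

# The addition-only result matches an ordinary MatMul.
assert np.allclose(ternary_dense(x, W), W @ x)
```

The element-wise Hadamard products used for the attention-like components are similarly multiplication-light: they scale linearly with the hidden dimension rather than quadratically as a dense MatMul does.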
