Supercharging Large Language Models: DEJAVU Speeds Up Inference by 2× Over FasterTransformer

Synced | SyncedReview | Nov 1, 2023


Large language models (LLMs), such as GPT-3, PaLM, and OPT, have dazzled the AI world with their exceptional performance and ability to learn in-context. However, their significant drawback is their high cost at inference time. Existing approaches to reduce this cost through sparsity techniques either necessitate expensive retraining, compromise the LLM’s in-context learning capability, or fail to provide the desired speedup on contemporary hardware.

To address these challenges, in a new paper Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, a research team from Rice University, Zhejiang University, Stanford University, University of California, San Diego, ETH Zurich, Adobe Research, Meta AI (FAIR) and Carnegie Mellon University presents DEJAVU, a system that employs a cost-effective algorithm to predict contextual sparsity dynamically for each layer, combined with an asynchronous and hardware-aware implementation to accelerate LLM inference.
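To give a concrete feel for the idea, below is a minimal PyTorch sketch of contextual sparsity in a feed-forward block. It is an illustration under assumed names and dimensions, not the authors' implementation: a small low-rank predictor scores each FFN neuron for the current token, and only the top-scoring rows and columns of the weight matrices participate in the matrix multiplications. DEJAVU applies the same principle to attention heads as well, and hides the predictor's cost by running it asynchronously, overlapped with the preceding layer.

```python
# Illustrative sketch only (assumed names and sizes, not the DEJAVU code):
# a cheap low-rank predictor scores the FFN neurons for the current token,
# and only the predicted-active rows/columns of the weights are computed.
import torch
import torch.nn as nn

class SparselyActivatedFFN(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, rank=128, top_k=512):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff, bias=False)    # full FFN weights
        self.w_out = nn.Linear(d_ff, d_model, bias=False)
        # near-free predictor: two small matmuls instead of the full d_ff ones
        self.predictor = nn.Sequential(
            nn.Linear(d_model, rank, bias=False),
            nn.Linear(rank, d_ff, bias=False),
        )
        self.top_k = top_k  # number of neurons kept for this token

    def forward(self, x):                        # x: (d_model,) one decoding token
        scores = self.predictor(x)               # (d_ff,) predicted neuron importance
        idx = scores.topk(self.top_k).indices    # likely-active neurons for this input
        w_in = self.w_in.weight[idx]             # (top_k, d_model) selected rows
        w_out = self.w_out.weight[:, idx]        # (d_model, top_k) selected columns
        h = torch.relu(w_in @ x)                 # compute only the selected neurons
        return w_out @ h                         # dense output from sparse compute

ffn = SparselyActivatedFFN()
y = ffn(torch.randn(1024))                       # toy single-token forward pass
```

The predictor costs only two small matrix multiplications, which is why, when top_k is much smaller than d_ff, the skipped work can translate into real wall-clock savings rather than just theoretical FLOP reductions.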

The research team sets out to define the ideal sparsity for LLMs, which should meet three crucial criteria: (i) no need for model retraining, (ii) preservation of the model's quality and in-context learning ability, and (iii) a genuine wall-clock speedup on modern hardware.
