Supercharging Large Language Models: DEJAVU Speeds Up Inference by 2× Over FasterTransformer

Synced | SyncedReview | Nov 1, 2023


Large language models (LLMs), such as GPT-3, PaLM, and OPT, have dazzled the AI world with their exceptional performance and ability to learn in-context. However, their significant drawback is their high cost at inference time. Existing approaches to reduce this cost through sparsity techniques either necessitate expensive retraining, compromise the LLM’s in-context learning capability, or fail to provide the desired speedup on contemporary hardware.

To address these challenges, in a new paper Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, a research team from Rice University, Zhejiang University, Stanford University, University of California, San Diego, ETH Zurich, Adobe Research, Meta AI (FAIR) and Carnegie Mellon University presents DEJAVU, a system that employs a cost-effective algorithm to predict contextual sparsity dynamically for each layer, combined with an asynchronous and hardware-aware implementation to accelerate LLM inference.
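To give a concrete feel for the idea, below is a minimal PyTorch sketch of contextual sparsity in a feed-forward block. It is an illustration under assumed names and dimensions, not the authors' implementation: a small low-rank predictor scores each FFN neuron for the current token, and only the top-scoring rows and columns of the weight matrices participate in the matrix multiplications. DEJAVU applies the same principle to attention heads as well, and hides the predictor's cost by running it asynchronously, overlapped with the preceding layer.

```python
# Illustrative sketch only (assumed names and sizes, not the DEJAVU code):
# a cheap low-rank predictor scores the FFN neurons for the current token,
# and only the predicted-active rows/columns of the weights are computed.
import torch
import torch.nn as nn

class SparselyActivatedFFN(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, rank=128, top_k=512):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff, bias=False)    # full FFN weights
        self.w_out = nn.Linear(d_ff, d_model, bias=False)
        # near-free predictor: two small matmuls instead of the full d_ff ones
        self.predictor = nn.Sequential(
            nn.Linear(d_model, rank, bias=False),
            nn.Linear(rank, d_ff, bias=False),
        )
        self.top_k = top_k  # number of neurons kept for this token

    def forward(self, x):                        # x: (d_model,) one decoding token
        scores = self.predictor(x)               # (d_ff,) predicted neuron importance
        idx = scores.topk(self.top_k).indices    # likely-active neurons for this input
        w_in = self.w_in.weight[idx]             # (top_k, d_model) selected rows
        w_out = self.w_out.weight[:, idx]        # (d_model, top_k) selected columns
        h = torch.relu(w_in @ x)                 # compute only the selected neurons
        return w_out @ h                         # dense output from sparse compute

ffn = SparselyActivatedFFN()
y = ffn(torch.randn(1024))                       # toy single-token forward pass
```

The predictor costs only two small matrix multiplications, which is why, when top_k is much smaller than d_ff, the skipped work can translate into real wall-clock savings rather than just theoretical FLOP reductions.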

The research team sets out to define the ideal sparsity for LLMs, which should meet three crucial criteria: (i) no need for model retraining, (ii) preservation of the model's quality and in-context learning ability, and (iii) a genuine wall-clock speedup on modern hardware.
