YOCO, A New Foundation Model to Eliminate Transformers?
You Only Cache Once
Over the last few years, AI coverage has kept flogging the same dead horse of technological advances that appeared more than five years ago, particularly the Transformer architecture that fuels not only ChatGPT and Gemini but also Sora and Stable Diffusion.
Despite their undeniable prowess, Transformers take an extremely expensive, and frankly unorthodox, approach to language modeling. This forces the companies behind Large Language Models (LLMs) to spend billions, with a b, running these models.
Now, Microsoft has proposed a new type of architecture, YOCO. But this is not your ordinary ‘architecture that kills ChatGPT’ claim, the kind you know is false before you even start reading.
This architecture takes another, more interesting approach that, fascinatingly, leads to up to two orders of magnitude (roughly 100x) reductions in memory requirements and latency while remaining competitive, performance-wise, with current models, which sounds like a miracle.
But how?
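Part of the intuition is in the name: "You Only Cache Once." A standard Transformer decoder keeps a separate key/value (KV) cache for every layer, so cache memory grows with layer count; a design that caches keys and values once and lets later layers reuse them shrinks that footprint by roughly the number of layers. The sketch below is a back-of-the-envelope comparison under assumed model dimensions, not figures from Microsoft's paper:

```python
# Rough KV-cache memory comparison: per-layer caching (standard
# Transformer) vs. a single shared cache (the "cache once" idea).
# All sizes below are illustrative assumptions.

def per_layer_cache_bytes(n_tokens: int, d_model: int, n_layers: int,
                          bytes_per_elem: int = 2) -> int:
    """Cache size when every layer stores its own keys and values.
    Factor of 2 accounts for storing both K and V."""
    return 2 * n_layers * n_tokens * d_model * bytes_per_elem

def cache_once_bytes(n_tokens: int, d_model: int,
                     bytes_per_elem: int = 2) -> int:
    """Cache size when keys/values are produced and cached once,
    then reused by all subsequent layers."""
    return 2 * n_tokens * d_model * bytes_per_elem

# Hypothetical example: 32-layer model, hidden size 4096,
# a 1M-token context, fp16 (2 bytes per element).
standard = per_layer_cache_bytes(1_000_000, 4096, 32)
shared = cache_once_bytes(1_000_000, 4096)

print(f"per-layer cache: {standard / 1e9:.1f} GB")   # ~524.3 GB
print(f"cache-once:      {shared / 1e9:.1f} GB")     # ~16.4 GB
print(f"reduction:       {standard // shared}x")     # 32x, the layer count
```

In other words, just sharing one KV cache buys a reduction proportional to depth; the further savings YOCO claims come from how its layers are built, which is what the rest of this article digs into.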
You are probably sick of AI newsletters talking about how this or that **just** happened. These newsletters abound because superficial commentary on things that already took place is easy, but the value provided is limited and the…