YOCO, A New Foundation Model to Eliminate Transformers?

You Only Cache Once

Ignacio de Gregorio
13 min read · Jun 2, 2024
Generated by the author using GPT-4o

Over the last few years, AI has kept beating the same dead horse: technological advances that appeared more than five years ago, particularly the Transformer architecture that fuels not only ChatGPT and Gemini, but also Sora and Stable Diffusion.

Despite their undeniable prowess, Transformers' language modeling approach is extremely expensive and unorthodox. This forces the companies behind Large Language Models (LLMs) to spend billions, with a b, running these models.

Now, Microsoft has proposed a new type of architecture, YOCO. But this is not your ordinary ‘architecture that kills ChatGPT’ claim, the kind you know isn’t true before you even start reading.

This architecture takes another, more interesting approach that, fascinatingly, leads to reductions of up to two orders of magnitude (100x) in memory requirements and latency while remaining competitive, performance-wise, with current models, which sounds like a miracle.

But how?
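
Roughly speaking, the name hints at the trick: instead of every layer storing its own key/value (KV) cache during inference, the model keeps one shared global cache. Below is a back-of-the-envelope sketch of what that does to KV-cache memory. Everything in it, the layer count, head count, window size, and the 50/50 layer split, is an illustrative assumption of mine, not a number from the paper.

```python
# Back-of-the-envelope KV-cache arithmetic. All hyperparameters below are
# illustrative assumptions, not the paper's configuration.

BYTES_PER_VALUE = 2        # fp16/bf16
SEQ_LEN = 1_000_000        # a 1M-token context
N_LAYERS = 64
N_KV_HEADS = 8
HEAD_DIM = 128
WINDOW = 4_096             # assumed sliding-window length for the first half of the layers

def kv_bytes(n_layers: int, seq_len: int) -> int:
    """Memory for the K and V tensors kept by `n_layers` layers over `seq_len` tokens."""
    return 2 * n_layers * N_KV_HEADS * HEAD_DIM * seq_len * BYTES_PER_VALUE

# Standard decoder-only Transformer: every layer stores its own KV cache.
transformer_cache = kv_bytes(N_LAYERS, SEQ_LEN)

# 'Cache once' layout (my reading of YOCO): the first half of the layers only
# keeps a short sliding-window cache, and the second half reuses one shared
# global KV cache instead of storing its own.
yoco_style_cache = kv_bytes(N_LAYERS // 2, WINDOW) + kv_bytes(1, SEQ_LEN)

gib = 1024 ** 3
print(f"Transformer KV cache:  {transformer_cache / gib:,.1f} GiB")
print(f"Cache-once KV cache:   {yoco_style_cache / gib:,.1f} GiB")
print(f"Rough reduction:       ~{transformer_cache / yoco_style_cache:.0f}x")
```

The exact ratio depends entirely on these assumed numbers, but the pattern is the point: the saving grows with model depth and context length, which is why the figures look so dramatic at very long contexts.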

You are probably sick of AI newsletters talking about how this or that **just** happened. These newsletters abound because superficially recounting events that have already taken place is easy, but the value provided is limited and the


Ignacio de Gregorio

I break down frontier AI systems in easy-to-understand language for you. Sign up to my newsletter here: https://thetechoasis.beehiiv.com/subscribe