Transfusion: One Architecture, All Modalities

Towards True Multimodality

Ignacio de Gregorio
7 min read · Sep 17, 2024

Meta has done it again.

They have presented Transfusion, a new architecture that fulfills a long-held dream: uniting the two dominant paradigms, autoregressive models and diffusion transformers, in a single model that reaches state-of-the-art performance in both at comparable model size. That is something neither OpenAI, Anthropic, nor Google can claim; all of them resort to inefficient, patched-together solutions.

But why is this so important, and what does it mean for the AI industry?

Get news like this before anyone else by subscribing to my newsletter, the place where analysts, strategists, and executives get answers to AI’s most pressing questions.

Combining State-of-the-Art Models

Today, there are two prominent types of models in AI:

  • Autoregressive LLM Transformers. Models like GPT-4o or o1 (both available on the ChatGPT platform) generate the output to a user input one token (word/subword) at a time, with each new token conditioned on everything generated so far.
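The token-by-token loop described above can be sketched in a few lines. This is a toy illustration only: `toy_model` is a made-up stand-in for a real language model, and the integer "tokens" and stopping rule are assumptions for the example, not anything from Transfusion or GPT-4o.

```python
def toy_model(tokens):
    # Hypothetical next-token rule standing in for a neural network:
    # just emit the last token plus one.
    return tokens[-1] + 1

def generate(prompt, max_new_tokens=10, eos=5):
    # Autoregressive decoding: each step conditions on ALL tokens
    # produced so far, appends one new token, and repeats.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = toy_model(tokens)
        tokens.append(nxt)
        if nxt == eos:  # stop when the end-of-sequence token appears
            break
    return tokens

print(generate([1, 2]))  # -> [1, 2, 3, 4, 5]
```

The key property is the loop itself: generation is inherently sequential, which is why autoregressive models pay a latency cost per output token.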
