Transfusion: One Architecture, All Modalities

Towards True Multimodality

Ignacio de Gregorio
7 min read · Sep 17, 2024

Meta has done it again.

They have presented Transfusion, a new architecture that fulfills a long-held dream: uniting the two dominant paradigms, autoregressive models and diffusion transformers, in a single model that reaches state-of-the-art performance in both at comparable model size. That is something neither OpenAI, Anthropic, nor Google can claim; all of them resort to inefficient, patched-together solutions.

But why is this so important, and what does it mean for the AI industry?

Get news like this before anyone else by subscribing to my newsletter, the place where analysts, strategists, and executives get answers to AI’s most pressing questions.

Combining State-of-the-Art Models

Today, there are two prominent types of models in AI:

  • Autoregressive LLM Transformers. Models like GPT-4o or o1 (both available on the ChatGPT platform) generate the output to a user input one token (word/subword) at a time, with each new token conditioned on everything generated so far.
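The token-by-token loop described above can be sketched in a few lines. This is a toy illustration only: `toy_model` is a made-up stand-in for a real language model, and the integer "tokens" and stopping rule are assumptions for the example, not anything from Transfusion or GPT-4o.

```python
def toy_model(tokens):
    # Hypothetical next-token rule standing in for a neural network:
    # just emit the last token plus one.
    return tokens[-1] + 1

def generate(prompt, max_new_tokens=10, eos=5):
    # Autoregressive decoding: each step conditions on ALL tokens
    # produced so far, appends one new token, and repeats.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = toy_model(tokens)
        tokens.append(nxt)
        if nxt == eos:  # stop when the end-of-sequence token appears
            break
    return tokens

print(generate([1, 2]))  # -> [1, 2, 3, 4, 5]
```

The key property is the loop itself: generation is inherently sequential, which is why autoregressive models pay a latency cost per output token.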
