DeepMind’s JetFormer: Unified Multimodal Models Without Modelling Constraints
Recent advancements in training large multimodal models have been driven by efforts to eliminate modeling constraints and unify architectures across domains. Despite these strides, many existing models still rely on separately trained components such as modality-specific encoders and decoders.
In the new paper JetFormer: An Autoregressive Generative Model of Raw Images and Text, a Google DeepMind research team introduces JetFormer, an autoregressive, decoder-only Transformer designed to model raw data directly. The model maximizes the likelihood of raw data without depending on any pre-trained components, and it can both understand and generate text and images seamlessly.
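The objective described above is the standard autoregressive factorization of the data likelihood: the joint probability of a sequence is the product of per-step conditionals. A minimal sketch of that computation, with a stand-in conditional model (the function name and toy model here are illustrative, not JetFormer's actual code):

```python
import math

def autoregressive_log_likelihood(tokens, cond_log_prob):
    """Chain-rule factorization an autoregressive model maximizes:
    log p(x) = sum_t log p(x_t | x_<t).

    `cond_log_prob(prefix, token)` stands in for the model's
    per-step conditional log-probability.
    """
    return sum(cond_log_prob(tokens[:t], tokens[t]) for t in range(len(tokens)))

# Toy conditional: a uniform distribution over a vocabulary of 10 symbols,
# ignoring the prefix entirely (a real model would condition on it).
uniform = lambda prefix, token: math.log(1 / 10)

ll = autoregressive_log_likelihood([3, 1, 4], uniform)
# For a uniform model, log p(x) = T * log(1/V) = 3 * log(0.1)
```

Training a model like JetFormer amounts to maximizing this quantity over raw data, with the per-step conditionals produced by the decoder-only Transformer instead of the toy uniform model here.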
The team summarizes the key innovations in JetFormer as follows:
- Leveraging Normalizing Flows for Image Representation: The pivotal insight behind JetFormer is its use of a powerful normalizing flow — termed a “jet” — to encode images…
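To give a feel for what a normalizing flow is, here is a minimal affine coupling layer, the classic building block of such flows: an invertible map whose log-determinant Jacobian is cheap to compute, so likelihoods can be evaluated exactly. All names and the tiny linear "network" are illustrative assumptions, not JetFormer's actual "jet" architecture:

```python
import numpy as np

class AffineCoupling:
    """Toy affine coupling layer: splits the input in half, transforms the
    second half with a scale and shift predicted from the first half.
    The map is exactly invertible and its log|det Jacobian| is just the
    sum of the log-scales. Illustrative sketch only."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.half = dim // 2
        # Tiny linear maps standing in for the neural nets that predict
        # the log-scale and shift in a real coupling layer.
        self.w_s = rng.normal(scale=0.1, size=(self.half, dim - self.half))
        self.w_t = rng.normal(scale=0.1, size=(self.half, dim - self.half))

    def forward(self, x):
        x1, x2 = x[:self.half], x[self.half:]
        log_s = np.tanh(x1 @ self.w_s)   # bounded log-scale for stability
        t = x1 @ self.w_t
        z2 = x2 * np.exp(log_s) + t      # transform the second half only
        z = np.concatenate([x1, z2])
        log_det = log_s.sum()            # log |det Jacobian| of the map
        return z, log_det

    def inverse(self, z):
        z1, z2 = z[:self.half], z[self.half:]
        log_s = np.tanh(z1 @ self.w_s)   # recompute from the untouched half
        t = z1 @ self.w_t
        x2 = (z2 - t) * np.exp(-log_s)   # exact inversion
        return np.concatenate([z1, x2])
```

Because the transform is invertible and its Jacobian determinant is tractable, no information is discarded when encoding an image, which is what lets a model train end-to-end on raw pixels by pure likelihood maximization rather than relying on a lossy, pre-trained encoder.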