DeepMind’s Zipper: Fusing Unimodal Generative Models into Multimodal Powerhouses

Synced | SyncedReview | Jun 1, 2024 | 3 min read

From text and proteins to audio, images, and state sequences, decoder-only generative models have proven their ability to generate new sequences across various modalities. However, integrating multiple generative foundation models, especially those trained on different modalities, into a cohesive and superior system presents significant challenges.

In a new paper, Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities, a Google DeepMind research team introduces Zipper, a multi-tower decoder architecture that flexibly composes multimodal generative models from independently pre-trained unimodal decoders, which can then be reused and repurposed in new modality combinations.

Like vocabulary expansion techniques, Zipper can perform generative tasks across all of its modalities. Unlike them, it is more flexible and composable: the unimodal backbones are pre-trained independently of the multimodal alignment fine-tuning stage, and unimodal performance can be preserved by freezing the corresponding backbone.
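To make the idea concrete, below is a minimal sketch (not the authors' code) of fusing two independently pre-trained unimodal decoder towers with cross-attention layers, while freezing one backbone to preserve its unimodal performance. The module names, layer sizes, and the use of plain PyTorch transformer blocks are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class DecoderTower(nn.Module):
    """Stand-in for an independently pre-trained unimodal decoder (e.g. text or speech)."""
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
            for _ in range(n_layers)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward_layers(self, tokens, causal_mask):
        # Run all decoder layers, keeping each layer's hidden states
        # so another tower can cross-attend to them.
        h = self.embed(tokens)
        states = []
        for layer in self.layers:
            h = layer(h, src_mask=causal_mask)
            states.append(h)
        return h, states


class ZipperFusion(nn.Module):
    """Fuses a frozen tower A with a trainable tower B via new cross-attention layers."""
    def __init__(self, tower_a, tower_b, d_model=256, n_heads=4):
        super().__init__()
        self.tower_a, self.tower_b = tower_a, tower_b
        # Freeze tower A: only tower B and the cross-attention layers are trained
        # during multimodal alignment fine-tuning.
        for p in self.tower_a.parameters():
            p.requires_grad = False
        self.cross_attn = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in self.tower_b.layers
        ])

    def forward(self, tokens_a, tokens_b):
        mask_a = nn.Transformer.generate_square_subsequent_mask(tokens_a.size(1))
        mask_b = nn.Transformer.generate_square_subsequent_mask(tokens_b.size(1))
        with torch.no_grad():
            _, states_a = self.tower_a.forward_layers(tokens_a, mask_a)
        # Tower B attends to tower A's per-layer hidden states after each of its layers.
        h = self.tower_b.embed(tokens_b)
        for layer, attn, ctx in zip(self.tower_b.layers, self.cross_attn, states_a):
            h = layer(h, src_mask=mask_b)
            fused, _ = attn(query=h, key=ctx, value=ctx)
            h = h + fused
        return self.tower_b.lm_head(h)  # next-token logits for modality B


# Usage: predict modality-B tokens (e.g. speech) conditioned on modality A (e.g. text).
text_tower, speech_tower = DecoderTower(32000), DecoderTower(1024)
zipper = ZipperFusion(text_tower, speech_tower)
logits = zipper(torch.randint(0, 32000, (2, 16)), torch.randint(0, 1024, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 1024])
```

In this sketch, swapping in a different frozen tower A (or unfreezing it) changes only the fusion module, which is the composability the paper emphasizes: backbones can be mixed and matched without retraining them from scratch.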
