OpenAI’s unCLIP Text-to-Image System Leverages Contrastive and Diffusion Models to Achieve SOTA Performance

Synced | SyncedReview | Apr 13, 2022

Contrastive vision-language models such as OpenAI’s CLIP (Contrastive Language–Image Pre-training, 2021) have garnered much attention in the computer vision research community thanks to their impressive zero-shot capabilities and the robust image representations they learn, which capture both semantics and style. While CLIP models have achieved strong performance on a wide range of vision and language tasks without directly optimizing for a given benchmark, recently introduced diffusion models have likewise shown their potential to push the state of the art on image and video generation tasks.
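To make the zero-shot idea concrete, the sketch below uses the public openai/clip package to score an image against a few candidate captions via the shared embedding space; the checkpoint name, image path, and candidate labels are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of CLIP zero-shot classification with the openai/clip package.
# The checkpoint, image path, and label set below are assumptions for illustration.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pretrained CLIP

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    # Image and text are embedded into the same space, so their similarity
    # scores act as zero-shot classification logits.
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # highest probability on the caption that best matches the image
```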

In the new paper Hierarchical Text-Conditional Image Generation with CLIP Latents, an OpenAI research team combines the advantages of both contrastive and diffusion models for text-conditional image generation. Their proposed unCLIP (so named because it generates images by inverting the CLIP image encoder) improves image diversity with minimal loss in photorealism and caption similarity, producing images of quality comparable to that of the state-of-the-art text-to-image system GLIDE.
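OpenAI has not released unCLIP's code, but the two-stage design described in the paper, a prior that maps a caption to a CLIP image embedding followed by a diffusion decoder that inverts the CLIP image encoder to produce pixels, can be sketched roughly as follows; the module and method names here are hypothetical stand-ins for the trained models.

```python
# Rough sketch of unCLIP's two-stage pipeline as described in the paper.
# `clip_text_encoder`, `prior`, and `decoder` are hypothetical objects standing
# in for the trained components; this is not OpenAI's implementation.
import torch

def generate_image(caption: str, clip_text_encoder, prior, decoder) -> torch.Tensor:
    """Text -> CLIP image embedding -> image."""
    # 1. Encode the caption with the frozen CLIP text encoder.
    text_emb = clip_text_encoder(caption)       # shape: (d,)

    # 2. Prior: sample a CLIP *image* embedding conditioned on the caption
    #    (the paper studies autoregressive and diffusion priors).
    image_emb = prior.sample(text_emb)          # shape: (d,)

    # 3. Decoder: a diffusion model conditioned on the image embedding (and
    #    optionally the caption) produces pixels, effectively inverting the
    #    CLIP image encoder; the full system adds upsampling stages.
    return decoder.sample(image_emb, caption)   # shape: (3, H, W)
```

Because the decoder conditions on a compact CLIP image embedding rather than the caption alone, repeatedly sampling it for the same embedding yields varied images that preserve the caption's semantics and style, which is where the reported diversity gains come from.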
