Imagen: Google's Answer to DALLE-2
Recently, DALLE-2 was released, showing a great improvement over last year’s DALLE. In particular, it generates more photorealistic images from text, at four times the resolution of its predecessor.
As you might see, however, DALLE-2’s images still lack realism. That is the weakness Google Brain’s team went after, and the problem Imagen set out to solve.
An overview
Before diving into the diffusion model, which is the core of this kind of algorithm, let’s look at how the input text is handled.
Google Brain used a huge language model, similar to GPT-3, to understand and extract information from the text. Instead of training a text model together with the image generation model, as earlier approaches did, they took a large pre-trained model and froze it, so that its weights did not change while the image generation model was trained.
According to their study, this led to much better results, and the model seemed to “understand” text better.
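To make the idea of a frozen text model concrete, here is a minimal toy sketch in plain Python. It is not Imagen's code: the class names FrozenTextEncoder and ToyImageGenerator, the tiny sizes, and the linear "generator" are all assumptions made up for illustration. The only point it shows is that the training loop reads the encoder's output but updates the generator's weights alone.

```python
import copy
import random

DIM = 4      # size of the toy text encoding (assumed value)
PIXELS = 3   # size of the toy "image" (assumed value)


class FrozenTextEncoder:
    """Hypothetical stand-in for a large pre-trained text model."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Pretend these weights came from pre-training; training never touches them.
        self.weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

    def encode(self, text):
        tokens = text.lower().split()
        feats = [0.0] * DIM
        for tok in tokens:
            tok_rng = random.Random(tok)  # deterministic toy features per token
            for j in range(DIM):
                feats[j] += tok_rng.uniform(-1, 1) / len(tokens)  # mean-pool
        return [sum(w * f for w, f in zip(row, feats)) for row in self.weights]


class ToyImageGenerator:
    """Trainable linear map from a text encoding to a few 'pixels'."""

    def __init__(self, seed=1):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(PIXELS)]

    def forward(self, enc):
        return [sum(w * e for w, e in zip(row, enc)) for row in self.weights]


def train(encoder, generator, data, steps=200, lr=0.05):
    """Plain SGD on mean squared error; only the generator's weights are updated."""
    losses = []
    for _ in range(steps):
        total = 0.0
        for caption, target in data:
            enc = encoder.encode(caption)  # the encoder is read, never modified
            pred = generator.forward(enc)
            errs = [p - t for p, t in zip(pred, target)]
            total += sum(e * e for e in errs) / PIXELS
            for i in range(PIXELS):
                for j in range(DIM):
                    generator.weights[i][j] -= lr * 2 * errs[i] * enc[j] / PIXELS
        losses.append(total / len(data))
    return losses


# Made-up captions paired with made-up 3-value "images".
data = [("a red apple", [1.0, 0.0, 0.0]), ("a blue sky", [0.0, 0.0, 1.0])]
encoder = FrozenTextEncoder()
generator = ToyImageGenerator()
encoder_before = copy.deepcopy(encoder.weights)
losses = train(encoder, generator, data)
print("encoder unchanged:", encoder.weights == encoder_before)
print("first and last loss:", losses[0], losses[-1])
```

In a real deep learning framework the same effect is usually achieved by excluding the text model's parameters from the optimizer or disabling gradients for them, so that only the image generation model learns.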
Once text encodings are obtained, the diffusion model comes into play.