A Picture is Worth a Thousand Words: This Microsoft Model can Generate Images from Short Texts
Microsoft Research created a generative model that can combine text and image analysis.
I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Please give it a try by subscribing below:
Humans build knowledge in images. Every time we are presented with an idea or an experience, our brain immediately formulates visual representations of it. Similarly, our brain is constantly context switching between sensory signals such as sound or texture and their visual representations. This ability to think in visual representations has not yet carried over to artificial intelligence (AI) algorithms. Today, most AI models are highly specialized in a single form of data representation, such as images, text or sound. Eventually, we will start seeing forms of AI that can efficiently translate between different data formats in order to optimize the creation of knowledge. Recently, AI researchers from Microsoft published a paper proposing a method for generating images based on short texts.
Our ability to generate visual representations from vocal or textual descriptions is one of the magic elements of human cognition. If you are asked to draw an image of a basketball game, you are probably going to start with an outline of three or four players positioned at the center of the canvas. Even if it wasn’t directly specified, you might add details such as the crowd, the referee or a player in a specific shooting position. All of those details enrich the basic textual description in order to complete our visual version of a basketball game. Wouldn’t it be great if AI models could do the same? Text-to-Image (TTI) is one of the emerging disciplines of deep learning that focuses on generating images from basic textual representations. While the TTI space is in very early stages, we are already seeing some tangible progress, with some models that have proven proficient in very specific scenarios. However, there are very specific challenges in TTI models that still need to be addressed.
Generating Images from Text: Challenges and Considerations
There are several relevant challenges that have traditionally blocked the evolution of TTI models, but most of them can be categorized in one of the following groups:
1) The Dependency Challenge: Obviously, TTI models are highly dependent on both textual and visual analysis techniques which, although they have made a lot of progress in recent years, have a lot of work to do in order to achieve mainstream adoption. From that perspective, the capabilities of TTI models are typically hindered by the specifics of the underlying text analysis and image generation models.
2) Concept-Object Relationship: An incredibly hard problem to solve in TTI models is the relationship between a concept extracted from a textual description and its corresponding visual objects. Practically speaking, there could be an infinite number of objects that match a specific textual description. Figuring out the right match between concepts and objects remains the pivotal challenge in TTI models.
3) Object-Object Relationship: Any image expresses relationships between objects in a visual format. To reflect a given narrative, a TTI model wouldn’t only have to generate the correct objects but also the relationship between them. Generating more complex scenes containing multiple objects with semantically meaningful relationships across those objects remains a significant challenge in text-to-image generation technology.
Object-Driven Attentive GAN
To address some of the traditional challenges of TTI models, Microsoft Research relied on the increasingly popular generative adversarial network (GAN) technique. GANs typically consist of two machine learning models — a generator that generates images from text descriptions, and a discriminator that uses text descriptions to judge the authenticity of generated images. The generator attempts to get fake pictures past the discriminator; the discriminator, on the other hand, never wants to be fooled. Working together, the discriminator pushes the generator toward perfection. Microsoft innovates on traditional GAN models by including a bottom-up attention mechanism. The Obj-GAN model develops an object-driven attentive generator plus an object-wise discriminator, thus enabling GANs to synthesize high-quality images of complicated scenes.
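The adversarial interplay described above can be sketched in a few lines. This is a minimal conceptual sketch, not the Obj-GAN implementation: `generator` and `discriminator` are hypothetical stand-ins for real neural networks, and no actual gradient updates are performed.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(text_embedding, noise):
    # Hypothetical stand-in: maps a text embedding plus noise to an "image".
    return np.tanh(text_embedding[:, None] + noise)

def discriminator(image, text_embedding):
    # Hypothetical stand-in: scores how well an image matches the text, in (0, 1).
    score = np.mean(image * text_embedding[:, None])
    return 1.0 / (1.0 + np.exp(-score))  # sigmoid

text = rng.normal(size=8)  # pretend sentence embedding
real_image = np.tanh(text[:, None] + rng.normal(size=(8, 8)))

for step in range(3):
    fake_image = generator(text, rng.normal(size=(8, 8)))
    d_real = discriminator(real_image, text)
    d_fake = discriminator(fake_image, text)
    # The discriminator wants d_real -> 1 and d_fake -> 0;
    # the generator wants d_fake -> 1, i.e. to fool the discriminator.
    d_loss = -np.log(d_real) - np.log(1.0 - d_fake)
    g_loss = -np.log(d_fake)
    # (A real implementation would backpropagate these losses
    # through the two networks here.)
```

The key point is the opposing objectives: the discriminator's loss falls when it separates real from fake, while the generator's loss falls when its fakes are scored as real.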
The core architecture of Obj-GAN performs TTI synthesis in two steps:
1) Generating a Semantic Layout: This phase includes the generation of elements such as class labels, bounding boxes and the shapes of salient objects. This functionality is accomplished by two main components: a Box Generator and a Shape Generator.
2) Generating the Final Images: This functionality is accomplished by an attentive multi-stage image generator and also a discriminator.
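The two-step data flow can be sketched as follows. This is a hedged illustration of the pipeline's interfaces only: every function is a placeholder returning random tensors of plausible shapes, and all names (`encode`, `box_generator`, `shape_generator`, `image_generator`) are hypothetical, not Obj-GAN's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tokens, dim=16):
    # Placeholder text encoder: token list -> one word vector per token.
    return rng.normal(size=(len(tokens), dim))

def box_generator(word_vectors, num_objects=3):
    # Step 1a: predict a class label and a normalized (x, y, w, h)
    # bounding box for each salient object.
    labels = [f"obj_{i}" for i in range(num_objects)]     # placeholder labels
    boxes = rng.uniform(0.0, 0.5, size=(num_objects, 4))  # normalized coords
    return labels, boxes

def shape_generator(boxes, mask_size=16):
    # Step 1b: predict a binary shape mask inside each bounding box.
    return rng.uniform(size=(len(boxes), mask_size, mask_size)) > 0.5

def image_generator(word_vectors, labels, boxes, shapes, resolution=64):
    # Step 2: synthesize the final image conditioned on the semantic layout.
    return rng.uniform(size=(resolution, resolution, 3))

tokens = "a red bus parked on the street".split()
words = encode(tokens)
labels, boxes = box_generator(words)
shapes = shape_generator(boxes)
image = image_generator(words, labels, boxes, shapes)
```

Note how each stage consumes the previous stage's output: boxes feed the shape generator, and the full semantic layout feeds the image generator.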
The following figure provides a high-level architecture of the Obj-GAN model. The model receives as input a sentence split into a set of tokens, which are then encoded as word vectors. After that, the input gets processed through three main stages: box generation, shape generation and image generation.
The first step of the Obj-GAN model takes the sentence as input and generates a semantic layout, a sequence of objects specified by their bounding boxes. The model’s box generator is responsible for generating a sequence of bounding boxes, which are then used by the shape generator. Given a set of bounding boxes as input, the shape generator predicts the shape of each object in its corresponding box. The shapes produced by the shape generator are then used by the image generator GAN model.
Obj-GAN includes an attentive multistage image generator neural network based on two main generators. The base generator first generates a low-resolution image conditioned on the global sentence vector and the pre-generated semantic layout. The second generator then refines details in different regions by paying attention to the most relevant words and pre-generated class labels, and produces a higher-resolution image.
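The "paying attention to the most relevant words" step can be illustrated with standard dot-product attention. This is a generic sketch under the assumption that each image region has a feature vector comparable against the word vectors; it is not Obj-GAN's exact attention formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)

num_regions, num_words, dim = 4, 6, 16
region_features = rng.normal(size=(num_regions, dim))  # from the low-res image
word_vectors = rng.normal(size=(num_words, dim))       # encoded sentence

# Each image region attends over all words; the weights indicate which
# words matter most when refining that region at higher resolution.
scores = region_features @ word_vectors.T   # (regions, words)
attn = softmax(scores, axis=1)              # rows sum to 1
word_context = attn @ word_vectors          # per-region word context vector
```

The resulting per-region context vectors are what a refinement generator would condition on to add word-relevant detail to each region.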
By now you might be wondering where the adversarial component of the architecture comes into play. That is the role of the discriminators, which act as adversaries to train the generators. The Obj-GAN model includes two main types of discriminator:
· Patch-Wise Discriminators: These discriminators are used to train the box and shape generators. The first evaluates whether the generated bounding boxes correspond to a given sentence, while the second evaluates the correspondence between bounding boxes and shapes.
· Object-Wise Discriminator: This discriminator takes a set of bounding boxes and object labels as input and tries to determine whether the generated image regions correspond to the original description.
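A minimal sketch of the object-wise idea: crop each box out of the image and score the region against its label. Everything here is hypothetical — a real model would score crops with a CNN rather than comparing mean intensities — so treat this only as an illustration of the per-object scoring interface.

```python
import numpy as np

rng = np.random.default_rng(2)

def crop(image, box):
    # Crop a region given a normalized (x, y, w, h) bounding box.
    h_img, w_img = image.shape[:2]
    x, y, w, h = box
    x0, y0 = int(x * w_img), int(y * h_img)
    x1, y1 = x0 + max(1, int(w * w_img)), y0 + max(1, int(h * h_img))
    return image[y0:y1, x0:x1]

def object_wise_discriminator(image, boxes, label_embeddings):
    # Score each cropped object region against its label embedding.
    # Placeholder scoring: a sigmoid over a toy region/label statistic.
    scores = []
    for box, emb in zip(boxes, label_embeddings):
        region = crop(image, box)
        score = 1.0 / (1.0 + np.exp(-(region.mean() * emb.mean())))
        scores.append(score)
    return np.array(scores)

image = rng.uniform(size=(64, 64, 3))                        # fake "image"
boxes = np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.4, 0.4]])
label_embeddings = rng.normal(size=(2, 8))                   # one per object
scores = object_wise_discriminator(image, boxes, label_embeddings)
```

Scoring each object region separately, rather than only the whole image, is what lets the discriminator penalize a scene whose individual objects do not match their labels.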
The use of adversarial generator-discriminator duos for box, shape and image generation gives Obj-GAN an edge over other traditional TTI methods. Microsoft evaluated Obj-GAN against state-of-the-art TTI models and the results were remarkable. Just take a look at the difference in the quality of the generated images and their correspondence to the original sentences.
The ability to create visual representations of a given narrative will be an important focus of the next generation of textual and image analysis deep learning models. Ideas such as Obj-GAN certainly bring relevant innovation into this area of the deep learning space.