To understand a scene, each piece of visual information has to be associated with an entity while taking spatial context into account. Several related challenges have emerged around understanding the actions in an image or a video: keypoint detection, action recognition, video captioning, visual question answering, and so on. A better comprehension of the environment helps in many fields. For example, an autonomous car needs to delineate the roadsides with high precision in order to drive itself, and in robotics, production machines need to know the exact shape of two different pieces in order to grab, turn, and assemble them.
In this blog post, we'll discuss how segmentation is useful and, in particular, how image segmentation can be done using Generative Adversarial Networks instead of widely used techniques like Mask R-CNN or U-Net.
Generative Adversarial Network (GAN)
GANs have shown great results in many generative tasks that replicate rich real-world content such as images, human language, and music. The idea is inspired by game theory: two models, a generator and a critic, compete with each other and make each other stronger in the process.
GAN consists of two models:
- A discriminator D estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell fake samples from real ones.
- A generator G outputs synthetic samples given a noise variable input z (z brings in potential output diversity). It is trained to capture the real data distribution so that its generated samples can look as real as possible, or in other words, can trick the discriminator into assigning them a high probability.
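The two roles above can be sketched in a few lines of PyTorch. The tiny MLPs, dimensions, and random "real" data below are purely illustrative stand-ins, not the models used later in the post; the point is how the two binary cross-entropy losses pull D and G in opposite directions.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 16  # illustrative sizes

# Toy generator G: noise z -> synthetic sample.
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
# Toy discriminator D: sample -> real/fake logit.
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

bce = nn.BCEWithLogitsLoss()

real = torch.randn(4, data_dim)   # stand-in for a batch of real data
z = torch.randn(4, latent_dim)    # noise input bringing output diversity
fake = G(z)

# Discriminator loss: push scores for real samples toward 1, fakes toward 0.
d_loss = bce(D(real), torch.ones(4, 1)) + bce(D(fake.detach()), torch.zeros(4, 1))
# Generator loss: fool D into scoring the fakes as real.
g_loss = bce(D(fake), torch.ones(4, 1))
```

Note the `detach()` in the discriminator loss: when updating D we do not want gradients flowing back into G, while the generator loss deliberately backpropagates through D's judgment.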
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image. It can be applied to a wide range of applications, such as collection style transfer, object transfiguration, season transfer, and photo enhancement.
The task here is to translate an input image into an accurate segmentation map.
pix2pix is one of the most famous and widely used GAN architectures for image-to-image translation tasks.
pix2pix uses conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that would traditionally require very different loss formulations. This approach is effective at segmentation, and we no longer have to hand-engineer the mapping function.
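The "learned loss" idea can be made concrete. In the pix2pix paper the generator's objective combines an adversarial term (the part supplied by the learned discriminator) with an L1 term weighted by lambda = 100 that anchors outputs to the ground truth. The sketch below uses dummy tensors; the patch-grid logit shape is just an example.

```python
import torch
import torch.nn.functional as F

def pix2pix_g_loss(d_logits_on_fake, fake, target, lam=100.0):
    """Generator objective: adversarial term + lambda * L1 reconstruction."""
    # Adversarial term: the generator wants D to label its output "real" (1).
    adv = F.binary_cross_entropy_with_logits(
        d_logits_on_fake, torch.ones_like(d_logits_on_fake))
    # L1 term: keep the output pixel-wise close to the ground-truth target.
    return adv + lam * F.l1_loss(fake, target)

# Dummy inputs standing in for discriminator logits and image tensors.
loss = pix2pix_g_loss(torch.zeros(1, 1, 30, 30),
                      torch.rand(1, 3, 256, 256),
                      torch.rand(1, 3, 256, 256))
```

The L1 term alone would produce blurry averages; the adversarial term alone can drift from the target. Combining them is what lets the same objective work across very different translation tasks.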
Pix2Pix - Image-to-Image Translation Neural Network
The pix2pix architecture was presented in 2016 by researchers from Berkeley in their work "Image-to-Image Translation with…
The generator's job is to take an input image and perform the transform that produces the target image. It uses an encoder-decoder architecture.
An example input could be a black-and-white image, with the output being a colorized version of it. The structure of the generator is called an "encoder-decoder": the encoder progressively downsamples the input image, and the decoder upsamples it back to the target resolution.
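A minimal sketch of such a generator is below. This is a deliberately shrunken two-level version for illustration (the pix2pix generator stacks many more levels); it does show the skip connection between mirrored encoder and decoder layers that pix2pix uses to pass low-level detail straight across.

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Illustrative 2-level encoder-decoder with one skip connection."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        # Encoder: each step halves the spatial resolution.
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2))
        # Decoder: each step doubles the spatial resolution.
        self.up1 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(128, out_ch, 4, 2, 1)  # 128 = 64 + 64 skip

    def forward(self, x):
        d1 = self.down1(x)                              # H/2
        d2 = self.down2(d1)                             # H/4 bottleneck
        u1 = self.up1(d2)                               # back to H/2
        # Skip connection: concatenate encoder features with decoder features.
        return torch.tanh(self.up2(torch.cat([u1, d1], dim=1)))

gen = TinyEncoderDecoder()
out = gen(torch.randn(1, 3, 64, 64))  # output has the input's resolution
```

The `tanh` at the end maps outputs to [-1, 1], the usual normalization range for GAN-generated images.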
The discriminator's job is to take two images, an input image and an unknown image (either the target or the generator's output), and decide whether the unknown image was produced by the generator or not. pix2pix uses a 70×70 PatchGAN discriminator, which classifies each 70×70 receptive-field patch of the image pair as real or fake rather than emitting a single score for the whole image.
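Following the layer recipe from the pix2pix paper (4×4 convolutions, channel widths 64-128-256-512, stride 2 then stride 1, final 1-channel conv), a sketch of that discriminator looks like this. Class and function names are my own; batch norm and weight init are omitted for brevity.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
        nn.LeakyReLU(0.2),
    )

class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN: outputs a grid of real/fake logits, one per patch."""
    def __init__(self, in_ch=6):  # 6 = input image + unknown image, concatenated
        super().__init__()
        self.net = nn.Sequential(
            conv_block(in_ch, 64, 2),
            conv_block(64, 128, 2),
            conv_block(128, 256, 2),
            conv_block(256, 512, 1),
            nn.Conv2d(512, 1, 4, 1, 1),  # 1 logit per 70x70 receptive field
        )

    def forward(self, x, y):
        # Condition on the input image by concatenating it channel-wise.
        return self.net(torch.cat([x, y], dim=1))

disc = PatchDiscriminator()
logits = disc(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
```

For a 256×256 image pair this produces a 30×30 grid of logits; averaging the loss over the grid is what gives the PatchGAN its texture-level judgment.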
The following hyperparameters were used during training.
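As a reference point, the original pix2pix paper trains both networks with Adam at learning rate 2e-4 and beta1 = 0.5, with L1 weight lambda = 100. A single alternating training step under that recipe might look like the sketch below; the one-layer `G` and `D` are placeholders so the snippet stays self-contained, not the real architectures.

```python
import torch

# Placeholder models; in practice these would be the U-Net-style
# generator and the 70x70 PatchGAN discriminator.
G = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 3, padding=1))
D = torch.nn.Sequential(torch.nn.Conv2d(6, 1, 3, padding=1))

# Optimizer settings from the pix2pix paper: Adam, lr 2e-4, beta1 0.5.
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = torch.nn.BCEWithLogitsLoss()

x = torch.randn(1, 3, 32, 32)   # input image
y = torch.randn(1, 3, 32, 32)   # ground-truth target (e.g. segmentation map)

# Discriminator step: real pair (x, y) -> 1, fake pair (x, G(x)) -> 0.
fake = G(x)
d_real = bce(D(torch.cat([x, y], 1)), torch.ones(1, 1, 32, 32))
d_fake = bce(D(torch.cat([x, fake.detach()], 1)), torch.zeros(1, 1, 32, 32))
opt_d.zero_grad()
(d_real + d_fake).backward()
opt_d.step()

# Generator step: adversarial term + lambda * L1 reconstruction (lambda=100).
g_adv = bce(D(torch.cat([x, fake], 1)), torch.ones(1, 1, 32, 32))
g_loss = g_adv + 100.0 * torch.nn.functional.l1_loss(fake, y)
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```

The lower-than-usual beta1 stabilizes adversarial training, a choice pix2pix inherits from the DCGAN line of work.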
With more training and better hyperparameter tuning, we could get results very close to the ground truth.