Gender Swap GAN Filter for Deep Art Effects

Akvelon, Inc.

Ilya Polishchuk
Deep Art
6 min read · Sep 2, 2020


Thanks to Gleb Vasyagin and Anton Nesterenko for their help writing this article.

Introduction

In April 2020, Akvelon released Deep Art Effects, an application that allows you to edit photos by applying AI-powered filters and effects. Deep Art is accessible via web, iOS, and Android applications:

Web application: https://deep-art.k8s.akvelon.net/

The Deep Art Effects application launched with Akvelon’s custom-trained Gender Swap GAN filter, which transforms the face of a subject in a photo to look like the opposite gender by altering their features to appear more masculine or feminine. Here, we describe the methodology and approach we used to develop this filter.

We tried a few novel techniques to train the Gender Swap GAN filter, all of them built on generative adversarial networks that generate images of people as the opposite gender.

A generative adversarial network, or GAN, is a type of neural network composed of two parts: a generator and a discriminator. During training, the two components “compete” with each other. The generator’s task is to create an image that suits the given parameters, while the discriminator’s task is to decide whether an image was generated by the network or not. The discriminator is rewarded for correctly determining whether a photo was synthetically generated, and the generator is rewarded for successfully “fooling” the discriminator.
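
As a rough illustration of this competition, here is a minimal adversarial training loop in PyTorch. The one-layer networks, the 100-dimensional noise vector, and the flattened 784-pixel images are placeholders for illustration only, not the architecture behind our filter:

    import torch
    import torch.nn as nn

    # Placeholder networks: noise -> flattened image, image -> real/fake probability.
    generator = nn.Sequential(nn.Linear(100, 784), nn.Tanh())
    discriminator = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())

    bce = nn.BCELoss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    def train_step(real_images):
        batch = real_images.size(0)
        fake_images = generator(torch.randn(batch, 100))

        # The discriminator is rewarded for telling real photos from generated ones.
        opt_d.zero_grad()
        loss_d = bce(discriminator(real_images), torch.ones(batch, 1)) \
               + bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
        loss_d.backward()
        opt_d.step()

        # The generator is rewarded for "fooling" the discriminator.
        opt_g.zero_grad()
        loss_g = bce(discriminator(fake_images), torch.ones(batch, 1))
        loss_g.backward()
        opt_g.step()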

Our first task was to prepare a large dataset containing numerous photos of people’s faces. Although there are plenty of ready-made photo datasets on the internet, they are not always suitable for a particular task like this, or they may have poor image quality.

For quality control of the target images, we used the RetinaFace neural net, which allowed us to crop faces from high-quality images of human subjects for further editing. We also passed this data to the InsightFace gender detection model to verify each subject’s gender. This was necessary to ensure that the neural net would correctly distinguish the style of the target gender.
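
As an illustration, a similar detect-crop-verify pipeline can be sketched with the open-source insightface Python package, whose FaceAnalysis wrapper bundles a RetinaFace-style detector with gender prediction. This is not our internal tooling, and the attribute names vary between package versions:

    import cv2
    from insightface.app import FaceAnalysis

    # FaceAnalysis bundles detection (RetinaFace-style) with attribute models.
    app = FaceAnalysis()
    app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 selects the first GPU

    img = cv2.imread("photo.jpg")
    for i, face in enumerate(app.get(img)):
        x1, y1, x2, y2 = face.bbox.astype(int)
        crop = img[max(y1, 0):y2, max(x1, 0):x2]
        # Recent insightface releases expose gender as `sex` ("M"/"F");
        # older ones used a numeric `gender` attribute.
        label = "male" if getattr(face, "sex", "M") == "M" else "female"
        cv2.imwrite(f"{label}_{i}.jpg", crop)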

With all of our data prepared, the question arose: which particular GAN architecture should we use for a successful solution to this task?

We tried two different approaches to the training process: supervised and unsupervised.

This picture shows the main difference between the two types of training. In the first case, supervised models are trained on images that are paired with each other by context or even just visually. In the second case, unsupervised models learn to transfer style between two unpaired images.

Unsupervised Training Approach

In this approach, we used images of men and women randomly selected from our dataset. By feeding these pairs of images into the network, it learned how to transfer the style of one image to the other, how to describe images of people of one gender in latent space, and how to find the correlation between them. For this approach, we decided to use a CycleGAN modification called UNIT as the neural net architecture.
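
UNIT’s key assumption is a shared latent space between the two domains: an image of either gender can be encoded into a common representation and decoded by the other domain’s generator. A minimal sketch of that translation step, with enc_a and dec_b as placeholder encoder/decoder networks:

    import torch

    def unit_translate(enc_a, dec_b, image_a):
        # Encode a domain-A image into the shared latent space; UNIT's VAE part
        # samples around the encoded mean with unit-variance Gaussian noise.
        mu = enc_a(image_a)
        z = mu + torch.randn_like(mu)
        # Decode with domain B's generator to obtain the translated image.
        return dec_b(z)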

CycleGAN is a type of unsupervised style transfer network: a modified GAN architecture with the ability to transfer the style of one picture to another.

CycleGAN consists of two generators and two discriminators. The first generator transfers the style of pictures labeled A to pictures labeled B; the second generator transfers it vice versa, from B to A. Each discriminator decides whether a generated image really matches its label, A or B. CycleGAN gets its name from an additional loss term that applies the two generators in sequence, B2A(A2B(x)) and A2B(B2A(x)), and compares the reconstructed images with the images originally given to the network.
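
That cycle-consistency term can be sketched in PyTorch as follows; a2b and b2a stand for the two generators, and the weight of 10 matches the λ used in the CycleGAN paper:

    import torch.nn as nn

    l1 = nn.L1Loss()

    def cycle_consistency_loss(real_a, real_b, a2b, b2a, weight=10.0):
        # B2A(A2B(x)) should reconstruct the original A image...
        rec_a = b2a(a2b(real_a))
        # ...and A2B(B2A(y)) should reconstruct the original B image.
        rec_b = a2b(b2a(real_b))
        return weight * (l1(rec_a, real_a) + l1(rec_b, real_b))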

For example, here are some style transfers demonstrated in the CycleGAN paper:

  1. Transferring the “style” of a zebra to a horse picture.
  2. Painting oranges to look like apples and vice versa.
  3. Making real photos look like painted ones.

Despite some successful advances with unsupervised training during our research, we decided to also explore a supervised approach, explained next.

Supervised Approach

The supervised approach required training data composed of pairs of photos: each pair contains one photo of a person depicted as male and one photo of the same person depicted as female. Preparing such a dataset is obviously challenging, as such pairs of photos are not readily available. The main task of this type of neural network was to learn a function describing the correlation between the two photos in a pair, so that it could then change the gender of a person in an arbitrary photo on the fly. This approach requires drastically better, more carefully prepared data. We used pix2pixHD as the neural net architecture for this approach.

Pix2pix is a conditional GAN framework for image-to-image translation. It consists of a generator G and a discriminator D. For our task, the objective of the generator G is to translate semantic label maps to realistic-looking images, while the discriminator D aims to distinguish real images from the translated images. The framework operates in a supervised setting. In other words, the training dataset is given as a set of pairs of corresponding images.
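
A sketch of the pix2pix objective in PyTorch: the discriminator judges (input, output) pairs concatenated along the channel dimension, and the λ of 100 on the L1 term follows the pix2pix paper. Here g and d are placeholder networks, with the discriminator assumed to output logits:

    import torch
    import torch.nn as nn

    bce = nn.BCEWithLogitsLoss()  # assumes the discriminator outputs logits
    l1 = nn.L1Loss()

    def pix2pix_losses(g, d, cond, real_image, lam=100.0):
        fake_image = g(cond)
        # The discriminator sees (input, output) pairs, not images alone.
        d_real = d(torch.cat([cond, real_image], dim=1))
        d_fake = d(torch.cat([cond, fake_image.detach()], dim=1))
        loss_d = bce(d_real, torch.ones_like(d_real)) \
               + bce(d_fake, torch.zeros_like(d_fake))

        # Adversarial term plus an L1 term pulling the output toward the paired target.
        d_fake_g = d(torch.cat([cond, fake_image], dim=1))
        loss_g = bce(d_fake_g, torch.ones_like(d_fake_g)) \
               + lam * l1(fake_image, real_image)
        return loss_g, loss_d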

Direct application of the pix2pix framework to generate high-resolution, high-quality images is not possible due to its unstable training process, so in our second approach we tried a different GAN architecture called pix2pixHD, by NVIDIA.

The main difference between pix2pix and pix2pixHD is that the pix2pixHD generator is divided into two parts, a global generator network and a local enhancer network, which together allow for the generation of high-resolution images.
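
A rough sketch of that coarse-to-fine split, with g_global and g_local standing in for the global generator and local enhancer networks described in the pix2pixHD paper (the exact wiring between the two is simplified here):

    import torch.nn.functional as F

    def coarse_to_fine(g_global, g_local, x_full):
        # The global generator works at half resolution to capture overall structure.
        x_half = F.interpolate(x_full, scale_factor=0.5,
                               mode="bilinear", align_corners=False)
        global_feats = g_global(x_half)
        # The local enhancer adds fine, full-resolution detail on top of it.
        return g_local(x_full, global_feats)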

Results

The trade-off between these two methods is clear: the supervised method requires more carefully prepared, sometimes handpicked, data depicting each person as both genders, but it gives better-tuned results; the unsupervised method doesn’t require paired images, but it suffers from weaker training capabilities.

During our research and development, we came to the conclusion that for gender-swapping it is more effective to use the supervised method with more carefully prepared data.

Here are a few samples of gender swaps that are made with Deep Art Effects.

References

[1]: Ming-Yu Liu, Thomas Breuel, Jan Kautz. “Unsupervised Image-to-Image Translation Networks”. arXiv, 2018.

[2]: Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros. “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”. ICCV 2017.

[3]: Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros. “Image-to-Image Translation with Conditional Adversarial Networks”. CVPR 2017.

[4]: Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, Bryan Catanzaro. “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs”. CVPR 2018.

Pictured above is the team from Akvelon’s Ivanovo office that developed the ML part of Deep Art.

This work was conducted at Akvelon, Inc. Akvelon is a custom software development company specializing in the development, testing, consulting and support of web, cloud, desktop and mobile solutions.
