Multifaceted face generator
Harvard AC215: Advanced Practical Data Science
Authors: Dongyun Kim, Vasco Meerman
This article was produced as part of the final project for Harvard’s AC215 Fall 2021 course.
Background
Artificial Intelligence has changed the world in many ways, from autonomous cars in computer vision to language translation in natural language processing. However, despite its potential across many fields, AI remains limited to those who know how to code and have backgrounds in statistics and mathematics. Generative Adversarial Networks (GANs) in particular have introduced compelling concepts such as latent spaces and feature extraction, and skilled or well-funded artists with in-house engineering teams, such as Refik Anadol, already use GANs as a tool to support or create their artworks. Meanwhile, ordinary artists have so far had little opportunity to take advantage of these novel technologies.
In this project, we want to open the gates of AI to those who have been marginalized by the technology gap and left in its blind spots. The application is a new kind of AI-powered creative tool for artists, creators, and the general public. No AI background is required.
Project statement
A painting of a real person is called a portrait. The most important quality of a portrait is recognizability: when one looks at the painting, one should be able to recognize the person who modeled for it, and this is not just about similarity in appearance. In a good portrait, the person's character, knowledge, and achievements are directly or indirectly revealed. However, portraits are not always realistic. Because a portrait is commissioned, the client's requests are actively reflected, and the artist's own interpretation of and perspective on the subject inevitably shape the result.
Today, although the medium of portraiture has changed from analog to digital, the essence of the portrait remains the same: it reveals one's identity. The countless selfies posted on social networking services (SNS) are intentionally manipulated and reproduced in various ways, and selfies have begun to replace the role of portraits. People spend a great deal of time retouching their selfies to look better, and, following this trend, many SNS applications support built-in retouching functions. However, such modifications can add decoration but cannot create natural changes such as posture or facial expression.
Adjusting an image with accessible editing functions, such as changing colors or blurring boundaries, is easy, but modifying the essence of an image (facial expression, emotion, etc.) is not. The small and large features of a face harmonize to form a person, and when even one characteristic disappears, the change makes it difficult to recognize the person.
Modifying an image while maintaining a person's facial features is challenging, but if we can extract and capture these features, it may be possible to generate an image that shows a different emotional expression yet is still recognizable as the same person.
Explanation of base model
StyleGAN2 (https://github.com/NVlabs/stylegan2)
From the paper's abstract: "An alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied, and high-quality dataset of human faces."
The main feature of StyleGAN is that it can capture features of different scales from a training image set, learn those features, and build a latent space from them. The latent vector has 512 dimensions, which lets the model store a large amount of feature information in latent space. NVIDIA has published several pre-trained models, including ones trained on the Flickr-Faces-HQ dataset (FFHQ) as well as on cats, churches, horses, etc.
This outstanding feature-extraction performance aligns well with our project's motivation, but the StyleGAN model only provides the decoder side: we can obtain randomly generated images, but not the latent vectors needed to manipulate the features of an input image.
pixel2style2pixel (https://github.com/eladrich/pixel2style2pixel)
From the paper's abstract: "We present a generic image-to-image translation framework, pixel2style2pixel (pSp). Our pSp framework is based on a novel encoder network that directly generates a series of style vectors which are fed into a pretrained StyleGAN generator, forming the extended W+ latent space. We first show that our encoder can directly embed real images into W+, with no additional optimization. Next, we propose utilizing our encoder to directly solve image-to-image translation tasks, defining them as encoding problems from some input domain into the latent domain. By deviating from the standard “invert first, edit later” methodology used with previous StyleGAN encoders, our approach can handle a variety of tasks even when the input image is not represented in the StyleGAN domain. We show that solving translation tasks through StyleGAN significantly simplifies the training process, as no adversary is required, has better support for solving tasks without pixel-to-pixel correspondence, and inherently supports multi-modal synthesis via the resampling of styles. Finally, we demonstrate the potential of our framework on a variety of facial image-to-image translation tasks, even when compared to state-of-the-art solutions designed specifically for a single task, and further show that it can be extended beyond the human facial domain."
The problem mentioned above can be resolved by pixel2style2pixel, which introduces a StyleGAN encoder: using this model, we can obtain latent vectors for any input image. However, to reconstruct an image that closely resembles the original, we need both a well-pretrained StyleGAN model and a pSp model trained against it.
Baseline models trained
The basic workflow of the project is as follows.
- Users upload their selfies.
- Find the latent vectors of the user’s image.
- The user interface supports latent vector manipulation.
- The manipulated vector is converted into the final image.
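The workflow above can be sketched as an encode-edit-decode pipeline. In the real project, the encoder is pSp and the generator is StyleGAN2; the function names below and the toy 512-dimensional list latent are illustrative stand-ins, not the project's actual code.

```python
# Sketch of the selfie -> latent -> edited image workflow.
# The real project uses a pSp encoder and a StyleGAN2 generator;
# the names and the toy 512-d list latent below are illustrative only.

LATENT_DIM = 512

def encode_to_latent(selfie_pixels):
    """Stand-in for the pSp encoder: map an image to a latent vector."""
    # Derive a deterministic toy vector from the input pixels.
    return [float(sum(selfie_pixels) % 7)] * LATENT_DIM

def apply_slider_edits(latent, direction, amount):
    """Move the latent along a learned feature direction (e.g. 'smile')."""
    return [w + amount * d for w, d in zip(latent, direction)]

def generate_image(latent):
    """Stand-in for the StyleGAN2 generator: latent -> image."""
    return f"image(sum={sum(latent):.1f})"

# End-to-end: upload -> encode -> edit -> generate.
selfie = [10, 20, 30]                 # fake pixel data
w = encode_to_latent(selfie)
smile_direction = [1.0] * LATENT_DIM  # toy feature direction
w_smiling = apply_slider_edits(w, smile_direction, 0.5)
print(generate_image(w_smiling))
```

The key design point is that editing happens entirely in latent space: the image is only regenerated after the vector arithmetic is done.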
pixel2style2pixel
Pixel2style2pixel and StyleGAN both provide networks pre-trained on the FFHQ dataset (Flickr-Faces-HQ, 1024×1024). Since the domain of the pre-trained dataset matches this project's, no additional training is required. If we narrowed our target domain, for example to a specific ethnicity or gender, we could train further to obtain more reliable result images.
StyleGAN2
After obtaining the latent vector of an input image, we can manipulate the vector directly. Since the latent vector represents the features of the image, moving it in a certain direction changes the corresponding features in the generated image.
The images below show examples of latent-vector manipulation: we move the input latent vector in the negative and positive directions of each feature. The identified features are age, eye distance, eye-eyebrow distance, eye ratio, eyes open, gender, lip ratio, mouth open, nose-mouth distance, nose ratio, nose tip, pitch, roll, smile, and yaw.
Main functions
- Upload selfie
Users tend to be more engaged when playing with their own images than with stock images, which creates a stronger bond between users and the application. When a user uploads a facial image, it is converted into a latent vector that lies in the latent space of a StyleGAN model trained on the FFHQ dataset.
- Navigate latent space with features through sliders
We currently support 15 features. With simple vector arithmetic, we add or subtract feature vectors from the input image's latent vector and feed the result into the StyleGAN generator, producing an image with the corresponding features changed. For ease of use, all feature sliders are normalized to the range -1 to 1.
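The slider arithmetic can be illustrated with a minimal sketch: a slider value in [-1, 1] is mapped to a scaled step along a feature direction. The `MAX_STEP` scale and the toy 4-dimensional vectors are illustrative assumptions, not the project's actual values (real latents are 512-dimensional).

```python
# Toy illustration of slider-driven latent editing.
# Real latents are 512-d StyleGAN vectors; the 4-d vectors and the
# MAX_STEP scale below are illustrative assumptions.

MAX_STEP = 3.0  # how far a slider at +/-1 moves along a feature direction

def edit_latent(w, feature_direction, slider):
    """slider in [-1, 1]; returns w + slider * MAX_STEP * direction."""
    if not -1.0 <= slider <= 1.0:
        raise ValueError("slider must be in [-1, 1]")
    return [wi + slider * MAX_STEP * di
            for wi, di in zip(w, feature_direction)]

w = [0.2, -0.5, 1.0, 0.0]      # input image latent (toy)
smile = [0.0, 1.0, 0.0, 0.0]   # 'smile' feature direction (toy)

more_smile = edit_latent(w, smile, 0.5)   # slider halfway right
less_smile = edit_latent(w, smile, -1.0)  # slider fully left
print(more_smile)  # [0.2, 1.0, 1.0, 0.0]
print(less_smile)  # [0.2, -3.5, 1.0, 0.0]
```

Normalizing every slider to the same [-1, 1] range keeps the UI uniform even though each underlying feature direction may have a different natural scale in latent space.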
Web application
Scenario
Dongyun wants to see what his future son or daughter might look like, even though he doesn't have a girlfriend. He is a big fan of Brie Larson, so he imagines what their child would look like if he married her. He decides to use the 'Multifaceted face generator' application to find out. When he uploads his image and hers, the model automatically extracts latent vectors from both images and lets him combine the two latent vectors in a chosen ratio.
Because the resulting latent vector can also be moved in a feature direction, he moves it in the negative age direction to see a younger version of the mixed face, and in the positive direction to see what the child might look like in old age.
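Combining two latent vectors "in a certain ratio" amounts to linear interpolation between them. A minimal sketch, assuming plain list latents (the function name `blend_latents` and the toy 3-dimensional vectors are hypothetical):

```python
# Linear interpolation between two latent vectors.
# ratio = 0.0 returns the first parent's latent, 1.0 the second's.

def blend_latents(w_a, w_b, ratio):
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return [(1.0 - ratio) * a + ratio * b for a, b in zip(w_a, w_b)]

dad = [0.0, 2.0, -1.0]  # toy latent for the first parent
mom = [1.0, 0.0, 1.0]   # toy latent for the second parent

child = blend_latents(dad, mom, 0.5)  # even mix of both parents
print(child)  # [0.5, 1.0, 0.0]
```

The blended vector lives in the same latent space as any encoded selfie, so it can be further edited with the same feature sliders (age, smile, etc.) before being fed to the generator.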
Technical challenges
Pixel2style2pixel is implemented only in PyTorch, but PyTorch requires specific GPU allocation, which caused problems when deploying on GCP. We therefore needed a workaround, such as converting the PyTorch model to TensorFlow. We found that ONNX can convert a Torch model to TensorFlow, but the resulting inference time was too slow. We also tried a smaller pSp model based on MobileNet, but the quality of images generated from the latent vectors it produced was too poor, likely because the pre-trained MobileNet model was not fully trained or simply not as capable as the original model.
Acknowledgment
We would like to thank our instructor Pavlos Protopapas and the Harvard Applied Computation 215 teaching staff for their guidance and support.