Multifaceted face generator

Dongyun Kim
Institute for Applied Computational Science
Dec 15, 2021

Harvard AC215: Advanced Practical Data Science
Authors: Dongyun Kim, Vasco Meerman

This article was produced as part of the final project for Harvard’s AC215 Fall 2021 course.

Background

Artificial Intelligence has changed the world in many ways, from autonomous cars in computer vision to machine translation in natural language processing. Yet despite its potential across many fields, AI remains limited to those who know how to code and have a background in statistics and mathematics. In particular, Generative Adversarial Networks (GANs) have introduced compelling concepts such as latent spaces and feature extraction, and artists with engineering skills or in-house engineering teams, such as Refik Anadol, already use GANs as a tool to support or create their artworks, while most artists have had no opportunity to take advantage of these new technologies.

Figure 1. Refik Anadol, Quantum Memories, 2021.

In this project, we want to open the gates of AI to those who are marginalized by the technology gap and left in its blind spots. The application is a new kind of AI-powered creative tool for artists, creators, and the public. No AI background is required.

Project statement

A painting of a real person is called a portrait. The most important quality of a portrait is recognizability: when looking at the painting, one should be able to recognize the person who modeled for it, and this goes beyond mere similarity in appearance. A good portrait reveals the person's character, knowledge, and achievements, directly or indirectly. However, portraits are not always realistic. Because a portrait is commissioned, the client's requests are actively reflected, and the artist's own interpretation of and perspective on the sitter are inevitably folded into the work.

Figure 2. Vincent van Gogh, Self-Portrait, 1889.

Today, although the medium of portraiture has shifted from analog to digital, the essence of the portrait remains the same: it reveals one's identity. The countless selfies posted on social networking services (SNS) are intentionally manipulated and reproduced in various ways, and selfies have begun to replace the role of portraits. People spend a great deal of time retouching their selfies to make them look better, and, following this trend, many SNS applications ship with built-in retouching functions. However, such editing can add decoration but cannot create natural changes in posture or facial expression.

Figure 3. Instagram hashtag search result, #selfie.

Adjusting an image with accessible editing functions is easy, such as changing colors or blurring boundaries, but modifying the essence of an image (facial expression, emotion, etc.) is not. The small and large features of a face harmonize to form a person, and when even one characteristic disappears, the change can make the person difficult to recognize.

Modifying an image while maintaining the facial features of a person is challenging, but if we can extract and capture these features, it might be possible to generate an image that shows different emotional expressions but can be recognized as the same person.

Explanation of base models

StyleGAN2 (https://github.com/NVlabs/stylegan2)
StyleGAN is an alternative generator architecture for generative adversarial networks, borrowing from the style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, the authors propose two new, automated methods that are applicable to any generator architecture, and they introduce a new, highly varied, and high-quality dataset of human faces.

The main feature of StyleGAN is that it captures features of different scales from a training image set, learns them, and builds a latent space based on those features. The latent vector has 512 dimensions, which lets the model store a large amount of feature information in the latent space. NVIDIA has published several pre-trained models, including ones trained on the Flickr-Faces-HQ dataset (FFHQ) as well as cats, churches, horses, etc.
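As a rough illustration of what a 512-dimensional latent space means in practice, the sketch below samples random latent codes. The `generator` call is a hypothetical stand-in for NVIDIA's pre-trained FFHQ generator; the real repository exposes its own loading and inference utilities.

```python
import numpy as np

LATENT_DIM = 512  # dimensionality of StyleGAN's input latent space


def sample_latents(batch_size: int = 1, seed: int = 0) -> np.ndarray:
    """Draw latent codes z ~ N(0, I); each row is one 512-dim vector."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((batch_size, LATENT_DIM)).astype(np.float32)


# z = sample_latents(4)
# images = generator(z)  # hypothetical call to a pre-trained FFHQ generator,
#                        # mapping (4, 512) latents to four 1024x1024 face images
```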

Figure 4. StyleGAN architecture.

This outstanding feature-extraction performance aligns well with our project motivation, but the official StyleGAN model only provides the generator (decoder), so we can obtain randomly generated images but not the latent vectors of a given input image, which are necessary to manipulate its features.

pixel2style2pixel (https://github.com/eladrich/pixel2style2pixel)

We present a generic image-to-image translation framework, pixel2style2pixel (pSp). Our pSp framework is based on a novel encoder network that directly generates a series of style vectors which are fed into a pretrained StyleGAN generator, forming the extended W+ latent space. We first show that our encoder can directly embed real images into W+, with no additional optimization. Next, we propose utilizing our encoder to directly solve image-to-image translation tasks, defining them as encoding problems from some input domain into the latent domain. By deviating from the standard “invert first, edit later” methodology used with previous StyleGAN encoders, our approach can handle a variety of tasks even when the input image is not represented in the StyleGAN domain. We show that solving translation tasks through StyleGAN significantly simplifies the training process, as no adversary is required, has better support for solving tasks without pixel-to-pixel correspondence, and inherently supports multi-modal synthesis via the resampling of styles. Finally, we demonstrate the potential of our framework on a variety of facial image-to-image translation tasks, even when compared to state-of-the-art solutions designed specifically for a single task, and further show that it can be extended beyond the human facial domain.

The problem mentioned above can be resolved by pixel2style2pixel. It introduces a StyleGAN encoder, so by using this model we are able to obtain the latent vectors of any input image. However, to reconstruct an image close to the original, a well-pre-trained StyleGAN model and a pSp model trained against that StyleGAN model are both necessary.
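Conceptually, the inversion step looks like the sketch below. `psp_encode` and `stylegan_generate` are hypothetical placeholders for the pre-trained pSp encoder and StyleGAN generator; the actual repository provides its own inference scripts. For a 1024×1024 FFHQ generator, the extended W+ latent consists of 18 style vectors of 512 dimensions each.

```python
import numpy as np


def psp_encode(aligned_face: np.ndarray) -> np.ndarray:
    """Placeholder for the pre-trained pSp encoder: image -> W+ latents of shape (18, 512)."""
    raise NotImplementedError("stand-in for the pre-trained pSp encoder")


def stylegan_generate(latents: np.ndarray) -> np.ndarray:
    """Placeholder for the pre-trained StyleGAN generator: W+ latents -> image."""
    raise NotImplementedError("stand-in for the pre-trained StyleGAN generator")


# latents = psp_encode(selfie)                 # shape (18, 512), one row per style input
# reconstruction = stylegan_generate(latents)  # should closely resemble the input selfie
```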

Figure 5. Pixel2style2pixel architecture.

Baseline models trained

The basic workflow of the project is as follows.

  • Users upload their selfies.
  • Find the latent vectors of the user’s image.
  • The user interface supports latent vector manipulation.
  • Manipulated vector is converted to the final image.
Figure 6. Image reconstruction diagram.
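Put together, the four steps above amount to an encode-edit-decode pipeline. The sketch below is only schematic: `encode` and `generate` stand for the pre-trained pSp encoder and StyleGAN generator, and `direction` for a learned feature direction.

```python
import numpy as np
from typing import Callable


def edit_selfie(selfie: np.ndarray,
                direction: np.ndarray,
                strength: float,
                encode: Callable[[np.ndarray], np.ndarray],
                generate: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    """Encode a selfie, shift its latents along one feature direction, and decode."""
    latents = encode(selfie)                 # step 2: image -> latent vectors
    edited = latents + strength * direction  # step 3: slider-driven manipulation
    return generate(edited)                  # step 4: latents -> final image
```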

pixel2style2pixel

Pixel2style2pixel and StyleGAN both provide networks pre-trained on the FFHQ dataset (Flickr-Faces-HQ, 1024×1024). The domain of this pre-trained dataset matches that of our project, so no additional training is required. If we narrow the target domain, for example to a specific race or gender, additional training could produce more reliable result images.

Figure 7. Original image (Left) and reconstruction image by pixel2style2pixel (Right).

StyleGAN2

After obtaining the latent vectors of the input image, we can work with them directly. Since the latent vectors are representations of the image's features, moving them in a certain direction produces converted images in which those features change graphically.

The images below are examples of manipulating the latent vectors: we move the input latent vectors in the negative and positive directions of each feature. The identified features are Age, Eye distance, Eye eyebrow distance, Eye ratio, Eyes open, Gender, Lip ratio, Mouth open, Nose mouth distance, Nose ratio, Nose tip, Pitch, Roll, Smile, and Yaw.

Figure 8. Manipulation of latent vectors in the latent space.
Figure 9. The image moved in the negative direction of the 'age' feature (Left), the original image (Center), and the image moved in the positive direction of the 'age' feature (Right).
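The edit behind Figure 9 is plain vector arithmetic in latent space. In the sketch below, `w` and `age_direction` are placeholders for the encoded selfie and a learned 'age' direction, and the step size of 3.0 is an illustrative value rather than the project's actual setting.

```python
import numpy as np

w = np.zeros((18, 512), dtype=np.float32)              # placeholder: encoded selfie (W+ latents)
age_direction = np.zeros((18, 512), dtype=np.float32)  # placeholder: learned 'age' direction

younger = w - 3.0 * age_direction  # negative direction (left image in Figure 9)
older = w + 3.0 * age_direction    # positive direction (right image in Figure 9)
```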

Main functions

  • Upload selfie

When users play with their own images, they tend to be more involved in the application than when using provided images, which creates a strong bond between users and the application. When a user uploads a facial image, it is converted into a latent vector in the latent space of a StyleGAN model trained on the FFHQ dataset.

  • Navigate latent space with features through sliders

We currently support 15 features. With simple vector arithmetic, we can add these feature vectors to, or subtract them from, the input image's latent vector and feed the result into the StyleGAN generator, producing an image with the changed features. For ease of use, we normalized all features to the range -1 to 1.
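Below is a minimal sketch of how normalized slider values could be applied, assuming one learned direction per feature and a per-feature maximum step. Both the `directions` dictionary and the `max_strength` scale are illustrative assumptions, not the project's actual calibration.

```python
import numpy as np
from typing import Dict


def apply_sliders(latents: np.ndarray,
                  sliders: Dict[str, float],
                  directions: Dict[str, np.ndarray],
                  max_strength: float = 3.0) -> np.ndarray:
    """Add each feature direction, scaled by its slider value in [-1, 1]."""
    edited = latents.copy()
    for name, value in sliders.items():
        edited += value * max_strength * directions[name]
    return edited


# edited = apply_sliders(w, {"smile": 0.6, "age": -0.3}, directions)
# image = stylegan_generate(edited)  # hypothetical generator call
```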

Web application

  • Main page
  • Feature manipulation
  • Style manipulation

Scenario

Dongyun wants to see what his future son or daughter might look like, even though he doesn't have a girlfriend. He is a big fan of Brie Larson, so he imagines what their child would look like if he married her and they had a baby. He decides to use the 'Multifaceted face generator' application to predict it. When he uploads his image and hers, the model automatically extracts latent vectors from both images and lets him combine the two latent vectors in a chosen ratio.

Figure 10. Mixing two images' latent vectors through vector arithmetic.
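The mixing in Figure 10 can be thought of as a convex combination of the two encoded faces. The sketch below assumes both images have already been inverted to W+ latents of the same shape.

```python
import numpy as np


def mix_latents(w_a: np.ndarray, w_b: np.ndarray, ratio: float = 0.5) -> np.ndarray:
    """Blend two W+ latent codes: ratio=0 returns w_a, ratio=1 returns w_b."""
    return (1.0 - ratio) * w_a + ratio * w_b


# child_guess = mix_latents(latents_a, latents_b, ratio=0.5)  # hypothetical usage
```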

Because the resulting image's latents can also be moved in a chosen direction, he moves the vector in the negative 'age' direction to observe a younger version of the mixed image. He also moves it in the positive direction to see what the child might look like when old.

Figure 11. ‘age’ manipulation of mixed images.

Technical challenges

Pixel2style2pixel is implemented only in PyTorch, but PyTorch requires specific GPU allocation, which can cause problems when deploying on GCP. We therefore needed a way around this, such as converting the PyTorch model to TensorFlow. We found that ONNX helps convert a Torch model to TensorFlow, but the inference time was too slow. We also tried a smaller pSp model, pSp MobileNet, but in that case the quality of the image generated from the latent vector obtained from the input image was too poor, either because the pre-trained MobileNet model was not fully trained or because its capacity simply does not match the original model.
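For reference, the PyTorch-to-ONNX step looks roughly like the sketch below; the input resolution and opset version are assumptions, and converting the resulting ONNX graph to TensorFlow (for example with onnx-tf) is a separate step.

```python
import torch


def export_to_onnx(model: torch.nn.Module, path: str = "psp.onnx") -> None:
    """Export a PyTorch model to ONNX as a first step toward a TensorFlow graph."""
    model.eval()
    dummy = torch.randn(1, 3, 256, 256)  # assumed pSp input resolution
    torch.onnx.export(model, dummy, path, opset_version=11)
```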

Acknowledgment

We would like to thank our instructor Pavlos Protopapas and the Harvard Applied Computation 215 teaching staff for their guidance and support.
