Improving PULSE Diversity in the Iterative Setting

Sasha Sheng
Published in The Startup
Jul 14, 2020

By Sasha Sheng and Diogo Almeida

Disclaimer: Opinions belong to the authors of this blog post only and do not reflect the opinions of our employers. Bias in machine learning certainly is a larger structural issue that cannot be solved by purely technical means. This blog post only focuses on the narrow computer vision task of adapting PULSE to generate more diverse faces.

Motivation

PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models unintentionally caused some heated debate in the machine learning community recently. The conversation that ensued on Reddit inspired us to embark on a weekend-long hack project. Specifically, the second point in the below screenshot got our creative juices flowing, and we set out to run several experiments to increase racial diversity in PULSE's super-resolution generations. We were pleasantly surprised that several experiments led to more diversity in age and gender, but unfortunately not race. This is further evidence that reducing the bias of machine learning models is a hard problem, one that cannot be fully addressed with algorithmic changes alone.

Reddit comment in response to “How would you fix PULSE?”
The tweet that caused a lot of discussion showing a low-resolution photo of Barack Obama upsampled to what looks like a white person

Quick Review

Previous super-resolution techniques start with a downsampled image and try to add real detail. PULSE constrains images to a GAN’s outputs and searches for an image that matches the downsampled image.

PULSE introduces a new approach to super-resolution by way of latent space search under high-dimensional Gaussian priors (see Figure 4 from the original paper, above). Their method searches for points in the latent space of the generative model that map to realistic outputs. They sample an initial latent code from a normal distribution and minimize the L2 distance between the downscaled version of the generated super-resolution image and the given low-resolution reference image.
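In sketch form, the core loop looks like the following. Note the heavy assumptions: we stand in a tiny linear "generator" and an averaging downsampler for StyleGAN and bicubic downscaling, just to make the search loop concrete and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions for illustration): a linear "generator" and a
# 2x average-pooling downsampler. In real PULSE these are StyleGAN and
# bicubic downscaling of the generated image.
DIM_Z, DIM_HR = 16, 8
G = rng.normal(size=(DIM_HR, DIM_Z))             # "generator": latent -> high-res
DS = np.kron(np.eye(DIM_HR // 2), [[0.5, 0.5]])  # downsampler: average pixel pairs
lr_ref = rng.normal(size=DIM_HR // 2)            # the given low-res reference

def project_sphere(z):
    # PULSE constrains latents to the hypersphere of radius sqrt(dim),
    # where most of the Gaussian prior's mass lives
    return z * np.sqrt(len(z)) / np.linalg.norm(z)

z = project_sphere(rng.normal(size=DIM_Z))       # sample initial latent ~ N(0, I)
init_loss = np.sum((DS @ (G @ z) - lr_ref) ** 2)
for _ in range(500):
    residual = DS @ (G @ z) - lr_ref             # downscaled output vs. reference
    grad = G.T @ DS.T @ residual                 # gradient of 0.5 * ||residual||^2
    z = project_sphere(z - 0.05 * grad)          # gradient step, then re-project
loss = np.sum((DS @ (G @ z) - lr_ref) ** 2)      # L2 loss after the search
```

The real method uses gradient descent through StyleGAN in exactly this pattern: update the latent to shrink the downscaling loss, then project back onto the hypersphere.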

PULSE generates images that are much more natural-looking, due to the fact that the search space is constrained with the natural image prior.

Multi-output super resolution is valuable

Super resolution is an ill-posed task, since one low resolution image could have multiple high resolution images that can be downscaled to match. While the PULSE paper focuses on showcasing quality and realness of the generated super resolution image from low-res images, we focus on exploring diversity given their low-res counterpart.

We think that people might not be satisfied with just one generation for super resolution. What if they wanted different kinds of faces? It would be desirable to have multiple plausible up-samplings. Our method generates faces that are more diverse and have more variance in attributes, which lets users select a super-resolved image that fits their unique requirements.

Sneak Peek of our Results

Left: Vanilla PULSE; Right: Our Method: Iterated PULSE. Original Image: Oprah (see Methods & Experiments section)

We develop an iterative version of PULSE, where each iteration generates a different super-resolved image by taking into account the latent codes of previously generated images. This method shows attributes that are qualitatively correlated with different axes of diversity (such as age and gender). However, we’ve found our method does not lead to more racial diversity. As such, our method should not be construed as addressing the biases in the original PULSE model. You can check out our code here if you want to try it for yourself.

Left: Vanilla PULSE; Right: Iterative PULSE (the best setting). The results on the right are more diverse. See the intro of the Methods & Experiments section for the original Oprah image.

Follow along into the Experiments section if you are interested in how we did it! :)

Methods & Experiments

Motivated by the heated debate around race, we specifically picked images of racial minorities. Here we show two examples: Oprah and an unidentified woman from Reddit, along with high-resolution versions generated by vanilla PULSE. From the two images, we see that even though the downscaled versions are similar, the high-resolution images are quite different.

In this section, we describe the several versions of our iterative PULSE method that we worked through before arriving at the final version, which produces both high-quality and more diverse outputs.

Top Left: the aligned original image; Top Right: Downscaled image; Bottom Left: Generated image from PULSE; Bottom Right: Downscaled Generated Image; The left Image is Oprah and the right image is an unknown model from Reddit;

Iterative Negation Initialization

Let’s say we want to generate 2 up-sampled images, conditioned on a low-res image, that are different from each other. We can generate the 1st image by simply running PULSE. What can we do to make sure the 2nd generated image is as different as possible from the 1st image?

We started the exploration with the simplest thing we could think of: initialize the latent code of image 2 with the negation of the final latents of image 1. Image 2's search would then start in the opposite region of the hypersphere, making it extremely unlikely that it would end up close to the final latents of image 1. This sort of worked: the images were different, but terrible! (See the 2nd and 4th images below, for example.)

What if only image 2 is bad and the rest turn out okay though? We generalized this intuition: initialize each next latent code as the negation of the average of all previous final latents (controlling for the norm). As you can see below, the results are somewhat more diverse, but not good (and even somewhat freaky). The quality seems to oscillate, possibly bouncing between regions of good and bad images.

Generations from iterative negation initialization. (Generation order: Left to right, top to bottom); Original Image: Oprah
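The generalized negation initialization can be sketched in a few lines (`negated_average_init` is our illustrative name, not a function from the PULSE codebase):

```python
import numpy as np

def negated_average_init(prev_latents):
    # Start the next search opposite the mean of all previously generated
    # final latents, rescaled back onto the hypersphere of radius
    # sqrt(dim) so the norm matches PULSE's latent constraint.
    mean = np.mean(prev_latents, axis=0)
    z = -mean
    return z * np.sqrt(len(z)) / np.linalg.norm(z)
```

With a single previous latent this reduces to the plain negation described above; with several, it points away from their centroid.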

Farthest Sampled Initialization

To avoid the oscillation problem mentioned above, we sampled 10k potential initial latent values and chose the one farthest away from the previous latents.

The results showed significant improvements in quality compared with iterative negation initialization, though only a slight improvement in diversity compared to vanilla PULSE.

Farthest Sampled Initialization. (Generation order: Left to right, top to bottom); Original Image: Oprah
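A minimal sketch of this sampling step (function name and defaults are ours; the post uses 10k candidate samples):

```python
import numpy as np

def farthest_sampled_init(prev_latents, dim=512, n_candidates=10_000, seed=0):
    # Sample many candidate initial latents on the hypersphere and keep
    # the one whose nearest previously-generated latent is farthest away.
    rng = np.random.default_rng(seed)
    cands = rng.normal(size=(n_candidates, dim))
    cands *= np.sqrt(dim) / np.linalg.norm(cands, axis=1, keepdims=True)
    prev = np.asarray(prev_latents)
    # distance from every candidate to every previous latent
    dists = np.linalg.norm(cands[:, None, :] - prev[None, :, :], axis=-1)
    return cands[dists.min(axis=1).argmax()]
```

Unlike negation, this never jumps to the exact antipode, which may explain the reduced oscillation in quality.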

Iterative Perpendicular Initialization

Negation perhaps was too strong and caused unwanted oscillations. So we tried initializing in directions perpendicular to all previously generated latent values, which seemed to solve both the quality and the oscillation issues. We did this by sampling latent values from the normal distribution and removing their projections onto each of the previously generated latent values (controlling for the norm, so that the latents fall onto the same hypersphere). Because the latents are in 512-dimensional space, there can be a maximum of 512 mutually perpendicular directions, and this puts a limit on how many images this method can generate. Luckily, this number is far higher than the number we've ever wanted to generate.

The results showed a larger improvement in diversity than Farthest Sampled Initialization (the best so far, with minimal quality degradation). Surprisingly, we see that the second image looks like a kid!

Iterative perpendicular initialization. (Generation order: Left to right, top to bottom); Original Image: Oprah
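The projection-removal step above can be sketched as follows (our illustrative name; note that removing projections one-by-one fully orthogonalizes only because, in this iterative scheme, the previous latents are themselves mutually perpendicular by construction):

```python
import numpy as np

def perpendicular_init(prev_latents, dim=512, seed=0):
    # Sample from the normal distribution, then remove the projection
    # onto each previously generated latent so the new initialization is
    # perpendicular to all of them.
    rng = np.random.default_rng(seed)
    z = rng.normal(size=dim)
    for v in prev_latents:
        z -= (z @ v) / (v @ v) * v
    # control the norm so the latent falls onto the same hypersphere
    return z * np.sqrt(dim) / np.linalg.norm(z)
```

This is essentially one step of Gram-Schmidt against the set of previous latents, followed by a rescale onto the hypersphere.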

Perpendicular Projection Optimizer

Iterative Perpendicular Initialization was a success. Perpendicularity (it's a real word, apparently) is the key, and we want to dial that knob to the maximum. Just initializing the latent values to be perpendicular might not be enough, since the search could still drift in any direction. What if we constrain the latent value at each optimization step to be perpendicular to the previously optimized latents? We performed the perpendicular projection (same as the one we used in iterative perpendicular initialization) during each optimization step, after the original update but before PULSE's projection onto the hypersphere.

The results are interesting. Not only do we see features that are visually correlated with age, but also gender! For example, a beard, a feature typically associated with men, is present in image 7 and perhaps image 18, while images 13, 16, 19, and 24 look youthful.

Perpendicular Projection Optimizer, (Generation order: Left to right, top to bottom); Original Image: Oprah
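A single optimizer step under this constraint might look like the following sketch (names are ours; the real implementation differentiates through StyleGAN rather than taking an explicit gradient argument):

```python
import numpy as np

def constrained_step(z, grad, prev_latents, lr=0.1):
    # One optimization step of the Perpendicular Projection Optimizer:
    # 1) the original gradient update,
    # 2) remove components along previously generated (mutually
    #    perpendicular) latents,
    # 3) PULSE's projection back onto the hypersphere of radius sqrt(dim).
    z = z - lr * grad
    for v in prev_latents:
        z -= (z @ v) / (v @ v) * v
    return z * np.sqrt(len(z)) / np.linalg.norm(z)
```

Because the projection runs every step, the whole search trajectory, not just the initialization, stays perpendicular to the earlier results.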

ψ from StyleGAN to the rescue

We were pretty happy with the diversity improvements from the last experiment, but felt we could work on the realism / quality.

ψ determines how far the generated face's latents are from the mean latents over all faces. As ψ → 0, all faces converge to the average face, and interpolating towards it reduces artifacts. Therefore, we set the StyleGAN ψ to 0.7 (as opposed to PULSE's 1) to improve the realness.

Perpendicular Projection Optimizer without (Left) and with (Right) ψ=0.7, (Generation order: Left to right, top to bottom); Original Image: unknown model shown in the beginning of the Methods and Experiments section. We used this image here because it is less in distribution (due to facial alignment), thus magnifying the effect of increasing the realism with ψ
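The underlying operation is StyleGAN's truncation trick, which is just linear interpolation towards the mean latent (`w_avg` here stands in for StyleGAN's average of mapped latents):

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    # StyleGAN truncation trick: interpolate the latent towards the mean
    # face latent. psi=1 leaves w unchanged; psi=0 collapses every face
    # to the average face; values in between trade diversity for realism.
    return w_avg + psi * (w - w_avg)
```

Setting psi=0.7 keeps most of each latent's identity while pulling it just far enough towards the mean to suppress artifacts.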

Discriminator Loss improves realness (only slightly)

How else can we improve realness? We experimented with taking advantage of the discriminator from StyleGAN, adding a loss term that maximizes the realness predicted by the pre-trained discriminator. So what do the results look like?

To be honest, we stared at the two pairs (with and without discriminator loss) across various base images for quite some time. Qualitatively, we think they have similar diversity. (If we wanted to publish this, we'd probably run an Amazon Mechanical Turk study and measure diversity and quality with numbers.) With the help of a magnifying glass and some eye drops, we concluded that the discriminator loss improved the realness of the images slightly. (For example, the bottom-left generation from the previous image showed artifacts that do not exist in the generations below.)

Perpendicular Projection Optimizer with ψ=0.7 without (Left) and with (Right) disc loss. (Generation order: Left to right, top to bottom); Original Image: Oprah
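A sketch of the combined objective is below. The original post does not specify the exact form of the discriminator term, so we assume the standard non-saturating generator loss, `softplus(-logit)`, which is small when the discriminator assigns a high realness logit; the weight is likewise an illustrative guess.

```python
import numpy as np

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def total_loss(lr_residual, disc_logit, disc_weight=0.01):
    # PULSE's downscaling L2 loss plus a realness term: the added term
    # shrinks as the pre-trained discriminator's realness logit grows
    # (assumed non-saturating GAN generator loss).
    return np.sum(lr_residual ** 2) + disc_weight * softplus(-disc_logit)
```

During the latent search, `lr_residual` would be the difference between the downscaled generation and the low-res reference, and `disc_logit` the discriminator's output on the full-resolution generation.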

Conclusion

Our final result, incorporating all of the best-performing techniques, improved diversity in age and gender, while failing at the original motivation of generating more racially diverse faces. It's hard to say for certain why this is the case, but our results do imply that for this pre-trained StyleGAN-FFHQ model, not all forms of diversity are equally easy to generate (or partitioned equally in latent space). One hypothesis is that racial diversity may be mostly limited to a single direction (or very few directions) in latent space, and constraining results to be perpendicular actually prevented us from exploiting that dimension (i.e., from generating more faces with features correlated with those of people of color).

Top Left: the aligned original image; Top Right: downscaled image;
Bottom Left: image generated by PULSE; Bottom Right: downscaled generated image.
Iterated PULSE produced results of poor racial diversity on this image of Obama. Vanilla PULSE on the left; Iterative PULSE on the right (best setting applied: perpendicular projection optimizer with ψ=0.7 and discriminator loss).

Other than the shortcoming described above, we're pleasantly surprised with the results. We've shown perpendicularity to be an excellent prior for generating more varied images from a GAN, played with a few knobs for increasing the realism of those varied images, and together these do much better than vanilla PULSE in the iterated setting.

Given all that, we aren't done investigating this problem, and we do have some additional hypotheses on how we could potentially improve racial diversity in the generated outputs — perhaps topics for another blog post.

Try it out yourselves and tell us how it worked for you!

Special thanks to Devi Parikh, Joelle Pineau, and Ryan Lowe for reviewing.
