FaceApp; or, How I Learned to Stop Worrying and Love the Machines

Jieren Chen
17 min read · Aug 17, 2017

--

Mirror mirror on the wall. Who has the fleekest eyebrows of them all?

What. A. Mitzvah. A current event that combines two of my favorite things: artificial intelligence and allegations of racism.

Let me first point out that all versions of Cate Blanchett are quite gorgeous, which is a non-trivial statement. Facial transformations generally sputter out in the uncanny valley, but FaceApp’s tech leaps out of said valley and straight into our hearts.

The basis behind this technology is the paper Invertible Conditional GANs for Image Editing. I’m going to give a layman’s explanation of this paper, skedaddling through the chain of citations with an attention to rigor that could only be described as “reckless.” Buckle up.

Let’s start with the problem.

When we’re talking racial transformation tech, what exactly is it that we’re trying to build? Well, the interface is quite simple. We want to take in an image and a characteristic, race in this case, in order to create a new image that “looks like” the original color image, except with the characteristic applied.

# (Cate Blanchett, Asian) in => (Asian Cate Blanchett) out
def race_transform(image: Image, characteristic: Race): Image
FaceApp, revealing your inner Korean since 2017.

If my use of “looks like” sounds hand-wavey, that’s because it totally is. There really isn’t a notion of “correct” for this problem. It just needs to “look right.” This is the disturbing thing about machine learning: the programs it produces have no clear sense of correctness the way classical programs do.

This painting, The Treachery of Images by surrealist René Magritte, is a great analogy for how to think about correctness and machine learning.

When I first encountered this painting, referenced in an internet meme, I thought it was some hipster nonsense. “This is not a pipe? Go fuck yourself, you bohemian-ass beatnik.” But… this is actually not a pipe. It’s a representation of a pipe. I can’t take the image and smoke a bunch of [tobacco use only] with it. It’s simply a collection of pixels that my brain recognizes as a pipe, even when it’s not.

“Pipe” itself is a concept built by years and years of experiences with these smoking devices. When I experience a specific type of perturbation on my optic nerve, it triggers this concept of “pipe.” It doesn’t matter whether it’s a real pipe or a shitty pixelated *.jpeg of a pipe, the concept is triggered regardless. Such is the treachery of images.

It is these representations that machine learning attempts to create. The classical programming approach to “pipe” is to analyze it and break it down into characteristics. A pipe must have a stem. The radius of the bowl of the pipe must be at least 3.5 times the radius of the stem. This “build from first principles” approach works in some cases, but falls flat on its ass with tasks like image interpretation.

First principles ain’t gonna do shit for understanding this picture.

The “machine learning” approach, on the other hand, takes the tabula rasa of your brain and throws experiences at it until it builds up a usable representation of “pipe.” There’s no explanation for “pipe.” It just is.

You could say that machine learning is existentialist. It produces programs that do not have an absolute, objective sense of correctness, only usefulness with respect to a specific goal.

Let’s take a step back. What is machine learning?

In a very simple sense, yes, machine learning systems are just piles of linear algebra. It’s a mathematical structure similar to a function, with inputs and outputs, but also a set of internal parameters that we can “stir”, or tune, to our choosing. For example, if I want to train a hot dog recognition function, I have the input as an image, the output as the classification [hot dog, not hot dog], and a set of parameters. As a “data scientist,” it’s my job to define these parameters and to “stir” them until I get the answers I want.
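To make that concrete, here’s a minimal Python sketch of a “pile” with tunable parameters: a toy linear hot dog classifier. The 8x8 image shape and the single weight vector are made up for illustration, and nothing has been trained yet.

import numpy as np

# A toy "pile": a linear classifier over flattened image pixels.
# The weights and bias are the internal parameters we "stir."
rng = np.random.default_rng(0)
weights = rng.normal(size=64)  # parameters for a flattened 8x8 image
bias = 0.0                     # one more parameter

def is_hot_dog(image: np.ndarray) -> bool:
    # Classify a flattened 8x8 grayscale image: [hot dog / not hot dog].
    score = image @ weights + bias
    return bool(score > 0)  # positive score => "hot dog"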

The most common type of pile these days, especially with image processing, is a neural network. There’s too much to say about neural networks and it’s not particularly relevant to the higher level structures that we’re talking about, but if you’re interested, I encourage you to read these two blogs: 1, 2. For the moment, just think of neural networks as this “pile” that we “stir” to get a function that generally returns the “right” answer.

The takeaway here is that in machine learning, we’re trying to find a function, a representation of the world mapping inputs to outputs, that will help us accomplish a particular task. The function could be as simple as XOR, or it could be as complex as converting the image of a white person to the image of an Asian person.

It’s incredibly inefficient to randomly stir the pile until it “looks right.” As we mutate our parameters, we want a sense of what the “right direction” is. Thus, neural networks are attached to a loss function, which calculates the “error,” or loss, of the network as a whole. To get the best answer, we try to minimize the loss calculated by this loss function. This is what we mean by usefulness with respect to a specific goal.
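To make “loss” concrete, here’s one of the simplest loss functions, mean squared error. The networks in this article use fancier losses, but the principle is identical: a single number where lower means better.

import numpy as np

def mean_squared_error(predictions: np.ndarray, targets: np.ndarray) -> float:
    # Average squared difference between the network's answers
    # and the "right" answers. Lower is better.
    return float(np.mean((predictions - targets) ** 2))

# Predictions close to the targets give a small loss:
mean_squared_error(np.array([0.9, 0.1]), np.array([1.0, 0.0]))  # 0.01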

Most machine learning problems are supervised, which means we use a training data set that is labelled with the “right” answers to tune the system. For the hot dog example, we would have pictures of hot dogs and then pictures of … not hot dogs. Another example would be classifying images of handwritten digits, the MNIST problem, where you would have pictures of the digits and labels with the correct numeric classification. For all supervised training datasets, each training input has a unique “right” answer attached to it.

For training, we start with a randomly initialized set of parameters for our neural network and then start feeding these examples in to calculate a baseline loss, which is some difference between the “right” answer and the answer our neural network spat out. From this loss, we adjust our parameters to move towards a better loss.

If we have two parameters, our loss function would look something like this, with the vertical dimension being loss.

Side note: most useful neural networks will have thousands to millions of parameters, which means the loss function lives in million-dimensional space. Unfortunately, I left my million-dimensional pen at home, so we’ll have to make do with two.

And suppose we’re initialized at this point.

We don’t know what the loss function looks like globally, but we do know the “contours” of the slope around us. We want to move towards lower levels of loss by adjusting our parameters in discrete steps.

Every step we take and every move we make, we should be getting closer to an optimum.

This class of algorithms is called gradient descent. Gradient descent is often described as rolling a stone down a mountain to get to the lowest valley. Much like the metaphor suggests, it’s a very finicky process. For one, we can hit local optima, points with “contours” that indicate they are optimal in a certain neighborhood but are not optimal globally.

Ceci n’est pas une global optimum.

And much like The Police, we can overshoot optima if the steps we take are… just a little too abrupt and creepy.

These are all configurations that need to be tuned in our algorithm, but this strategy generally leads to great results for supervised machine learning problems. For every iteration of our algorithm, we’re using the contours of our loss function to inch closer and closer to a minimum of the loss function. At this minimum, our neural network should be returning answers that perform well with respect to our loss function.
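Here’s the whole idea as a runnable Python sketch: a bowl-shaped loss over two parameters, its gradient, and a loop of discrete downhill steps. The particular loss and learning rate are made up; crank the learning rate past 1.0 and you can watch the overshooting problem first-hand.

import numpy as np

# A bowl-shaped loss over two parameters, with its minimum at (3, -1).
def loss(p: np.ndarray) -> float:
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2

# The "contours" around us: the gradient points uphill, so we step
# against it.
def gradient(p: np.ndarray) -> np.ndarray:
    return np.array([2 * (p[0] - 3.0), 2 * (p[1] + 1.0)])

params = np.array([-4.0, 5.0])  # random-ish starting point
learning_rate = 0.1             # step size; too big and we overshoot

for step in range(100):
    params -= learning_rate * gradient(params)  # walk downhill

print(params)  # converges toward [3, -1]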

Unfortunately, for our problem of racial transformation, we don’t have a typical training data set labelled with the right answers. Ask yourself: what does “Latina Cate Blanchett” look like? Unlike labels such as [hot dog/not hot dog] or [0…9], there is never a clear, unique answer. “Asian Brad Pitt” could either look like this or like this or an infinite number of other possibilities. If we don’t have any sense of what the right answer is, how can we create a label specifying that it is the only right answer? All the outputs we’re creating here are hypothetical transformations.

This is an unsupervised machine learning problem. Unsupervised machine learning is often described as a problem where we try to suss out hidden structure from the data. It’s also helpful to think of it as any problem where it’s impossible or impractical to specify a “right” answer for a particular input. This precludes the possibility of simple loss functions that merely compare the “right” answer to what the neural network returns.

That said, we don’t want to build a bunch of garbage, so we still need some kind of quality control on the fake data produced by our unsupervised system. With unsupervised problems, we need to invent our own loss function. For a problem like clustering, we could use something like the sum of squared distances of every point from the center of its cluster. For anomaly detection, we could use something like the variance of non-anomalous data. In many respects, we have to just make this up as we go along, but there are some cute strategies we can use to make our lives easier for tough problems like racial transformation.
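As an example, here’s roughly what that clustering loss might look like: a loss we invented ourselves, with no labelled “right” answers anywhere in sight.

import numpy as np

def clustering_loss(points: np.ndarray, assignments: np.ndarray,
                    centers: np.ndarray) -> float:
    # Sum of squared distances of every point from the center of its
    # assigned cluster. Tighter clusters => lower loss.
    diffs = points - centers[assignments]
    return float(np.sum(diffs ** 2))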

Enter Generative Adversarial Networks (GANs for short).

This was a landmark paper from 2014 that reframes the unsupervised problem of generating fake data into a semi-supervised problem. The original version from this paper only generates an image and doesn’t take into account any categories like race. It has a very simple interface.

# Just... make a fake image
def gan(): Image

With GANs, instead of training a single neural network, we train two neural networks. This is where the supervision comes in. There’s a generator, which creates fake data, and a discriminator, which distinguishes between fake data and real data.

# The generator takes in a noise vector, a list of random numbers, and
# spits out an image. The noise is there to "keep things interesting."
# Without any inputs into our system, we want to generate random faces
# instead of generating the same face every single time. More on this
# noise vector later.
def generator(noise: Vector[Float]): Image

# The discriminator takes in an image and spits out whether or not the
# image is fake.
def discriminator(image: Image): Boolean

The discriminator is trained like a typical supervised neural network, using as its training dataset fake images generated from the generator and real images from our unlabelled dataset. Its loss function is whether or not it guessed correctly, let’s say percentage of images it got right.

The generator is trained using the output of the discriminator. Its loss function is the inverse of the discriminator’s performance on fake images; it is essentially “punished” whenever the discriminator correctly identifies its images as fake.

This creates a competitive process between the two networks. We are pitting them against each other so that as the discriminator gets better and better, the generator is forced to get better and better to lower its loss.

This is very hand-wavey, so let’s step through the full process (a code sketch follows the steps).

Step 1: We randomly initialize the parameters of both the generator and discriminator.
Step 2: We create a dataset of 100 fake images using the generator and feed these 100 fake images, along with 100 real images, into the discriminator as a labelled training dataset.
Step 3: Using the results of the discriminator run and the real answers, whether or not the images are actually fake, we calculate a discriminator loss based on the differences. Using this discriminator loss, we can update the parameters of our discriminator.
Step 4: We generate another dataset of 100 fake images from the generator. We feed these fake images into the discriminator and get back some guesses on whether they are fake. Based on how many the discriminator marks as fake, which is our generator loss, we update the parameters of the generator.
Step 5: Repeat steps 2-4 until the GAN is sufficiently trained.
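Here are steps 1-5 condensed into a PyTorch sketch. The tiny fully connected networks, the 64-float “images,” and the real_images() placeholder are stand-ins for illustration; a real face GAN like the ones in the cited papers uses convolutional networks and an actual photo dataset.

import torch
from torch import nn

NOISE_DIM, IMAGE_DIM, BATCH = 16, 64, 100

# Step 1: randomly initialized generator and discriminator.
generator = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(),
                          nn.Linear(128, IMAGE_DIM), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(IMAGE_DIM, 128), nn.ReLU(),
                              nn.Linear(128, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def real_images() -> torch.Tensor:
    return torch.rand(BATCH, IMAGE_DIM) * 2 - 1  # placeholder dataset

for iteration in range(10_000):  # Step 5: repeat until sufficient
    # Steps 2-3: 100 fakes + 100 reals in, discriminator loss out.
    fakes = generator(torch.randn(BATCH, NOISE_DIM)).detach()
    d_loss = (bce(discriminator(real_images()), torch.ones(BATCH, 1)) +
              bce(discriminator(fakes), torch.zeros(BATCH, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Step 4: the generator is "punished" when its fakes get caught.
    fakes = generator(torch.randn(BATCH, NOISE_DIM))
    g_loss = bce(discriminator(fakes), torch.ones(BATCH, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()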

The results of this training process are quite remarkable. The generator starts out producing pure noise, but after running through several iterations we start to see substantial improvements in the fake data being generated.

Samples from a GAN’s first 220 iterations. Some Shroud of Turin shit right there.

Run to sufficiency, we start to get some “reasonable” looking faces.

A conditional GAN extends the vanilla generative adversarial network by adding a categorical input. In our case, we’re using race.

def conditional_gan(category: Race): Image
Hollywood magic at work.

We want to train a system so that we can produce Asian-looking faces given an “Asian” input category. This is a non-trivial extension and requires a change to both the generator and the discriminator.

# For the generator, given a racial category and noise, we want to
# create a fake image that not only looks like a face, but a face from
# that particular racial category.
def conditional_generator(race: Race, noise: Vector[Float]): Image

# For the discriminator, given an image and a race category, we want
# to know whether or not that image is fake.
def conditional_discriminator(race: Race, image: Image): Boolean

To train, we first want to make sure our training data is categorized by race. From there, we run through the same training process as a standard GAN, but with the categories added (a code sketch follows the steps).

Step 1: We initialize generator and discriminator parameters randomly.
Step 2: We create fake images from the generator. This time we add in categories. We want to randomly select the categories, as per this paper, instead of using the categories from the training dataset. We feed the fake images with their categories into the discriminator, along with the real images and their categories.
Step 3: After feeding both the real and fake images into the discriminator, we compare the results of that discriminator run with the actual labels for whether the images were fake or not. This discriminator loss calculation is fed back and used to update the parameters of our discriminator. Note that the labels are whether or not the image is fake, NOT the racial category.
Step 4: We generate another dataset of 100 fake images from the generator. We feed those fake images, along with their categories, into the discriminator. Based on how many it marks as fake, we update the parameters of our conditional generator.
Step 5: Repeat steps 2-4 until the conditional GAN is sufficiently trained.
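The conditional version is a surprisingly small change to the previous sketch: concatenate a one-hot race vector onto the generator’s noise and onto the discriminator’s image input, then run the same training loop. NUM_RACES and the network sizes are, as before, placeholders.

import torch
from torch import nn

NOISE_DIM, IMAGE_DIM, NUM_RACES = 16, 64, 4

cond_generator = nn.Sequential(
    nn.Linear(NOISE_DIM + NUM_RACES, 128), nn.ReLU(),
    nn.Linear(128, IMAGE_DIM), nn.Tanh())
cond_discriminator = nn.Sequential(
    nn.Linear(IMAGE_DIM + NUM_RACES, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid())

def conditional_generate(race: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    # The category rides along with the noise into the generator.
    return cond_generator(torch.cat([noise, race], dim=1))

def conditional_discriminate(race: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    # The discriminator judges "fake or not" given the category.
    return cond_discriminator(torch.cat([image, race], dim=1))

# e.g. one fake face from (randomly selected) category 2:
race = torch.nn.functional.one_hot(torch.tensor([2]), NUM_RACES).float()
fake = conditional_generate(race, torch.randn(1, NOISE_DIM))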

With a sufficiently trained conditional GAN, we can now generate faces that look like a particular race. We’ve taught our neural network what faces look like and even what faces from a particular race look like. No small feat. However, we’re still missing one critical component.

Let’s take a look at our desired interface for racial transformation. We can feed categories into our system, but how do we feed in our input image?

Remember how the generator takes in a chunk of noise?

What if we didn’t think about this as noise? What if we instead thought of it as a compressed version of the image, as if we ran it through *.zip compression, got this gobbledygook binary file that looks like garbage, and then decompressed it to get back a lossy version of the original. The binary *.zip file in the middle looks like noise, but it has plenty of meaning in the context of our problem.

With our generator, we want the noise to be the underlying structure of the face, with category information stripped out. It is essentially the platonic ideal of Cate Blanchett’s face. Cate Blanchett, with all the whiteness stripped out, if that makes any sense at all.

Mapping the noise and a race classification to a fake image is a deterministic process. What we need to do is reverse the process for a real image so that we can encode a real image into a noise vector. We can feed the results of the encoder along with the desired classification back into the conditional GAN to generate our desired image.

Something something, another Hollywood Joke.

We need to train another neural network, one that encodes an image into a noise vector. We also want it to extract out the category. We never end up using the category that our encoder spits out, but extracting it keeps our noise vector as pure as possible, unpolluted by categorical information.

# Encode an image into a noise vector and a racial classification
def noise_encode(image: Image): (Vector[Float], Race)

Training this encoder is an interesting process. Let’s start by thinking about what our optimal encoder would look like. I propose that it should be something like this:

Original Cate Blanchett is white Cate Blanchett. Our goal here is to be able to generate the exact same face from our encoded noise. Any differences between the original image and the generated image can be treated as errors. Thus, our loss function for the encoder will be a combination of category loss, whether it was able to encode the category correctly, and noise loss, the difference between the original image and the image generated by the conditional GAN using the encoded noise and the original category.

Training this encoder is a bit tricky. Before we begin, we first need a conditional GAN trained to sufficiency. From there (a code sketch follows the steps):

Step 1: We initialize our encoder.
Step 2: We run a batch of images through the encoder, getting back a set of noise vectors and a set of categories.
Step 3a: In order to calculate category loss, we simply look at the differences between encoded categories and actual categories.
Step 3b: In order to calculate noise loss, we run the encoded noise with the original categories through our trained conditional GAN. We then compare the original pictures with the generated pictures and calculate the loss.
Step 4: We take the overall loss on the encoder pass and use that to update the parameters on the encoder.
Step 5: Repeat steps 2-4 until the encoder is sufficiently trained.
Step 6: With a sufficiently trained encoder and conditional GAN, we can simply encode the original image into a noise vector (discarding the category output), plug that noise vector into the conditional GAN with the desired racial category, and get back our "magical" results.
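Putting it all together, here’s a sketch of the encoder, its training step, and the final race_transform function that is the whole point of this article. It reuses conditional_generate and the constants from the previous sketch and assumes the conditional GAN has already been trained to sufficiency.

import torch
from torch import nn

# Step 1: the encoder maps an image to (noise vector, race logits).
encoder = nn.Sequential(nn.Linear(IMAGE_DIM, 128), nn.ReLU(),
                        nn.Linear(128, NOISE_DIM + NUM_RACES))
enc_opt = torch.optim.Adam(encoder.parameters(), lr=2e-4)
category_loss = nn.CrossEntropyLoss()
noise_loss = nn.MSELoss()

def noise_encode(images: torch.Tensor):
    out = encoder(images)
    return out[:, :NOISE_DIM], out[:, NOISE_DIM:]  # (noise, race logits)

def train_step(images: torch.Tensor, races: torch.Tensor) -> None:
    # Steps 2-4: encode, then score category loss plus noise loss.
    noise, race_logits = noise_encode(images)
    one_hot = torch.nn.functional.one_hot(races, NUM_RACES).float()
    reconstructed = conditional_generate(one_hot, noise)
    loss = category_loss(race_logits, races) + noise_loss(reconstructed, images)
    enc_opt.zero_grad()
    loss.backward()
    enc_opt.step()  # only the encoder's parameters get updated

def race_transform(image: torch.Tensor, race: int) -> torch.Tensor:
    # Step 6: encode, discard the predicted category, and regenerate
    # with the race we actually want.
    noise, _ = noise_encode(image.unsqueeze(0))
    target = torch.nn.functional.one_hot(torch.tensor([race]), NUM_RACES).float()
    return conditional_generate(target, noise).squeeze(0)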

Hopefully, you now have a rough understanding of a cutting-edge deep learning technique. What we often find is that deep learning techniques are quite simple conceptually. Even the most sophisticated techniques are constructed from basic building blocks. The really difficult work is in tuning these neural networks. Different loss functions, different network architectures, and different optimization methods can each have a huge impact on the overall result.

Deep learning is still in its infancy as an engineering discipline. I liken it to Japanese blacksmiths making katanas in the 16th century. There is a great deal of craft and understanding what works, but the understanding of why it works is lacking. Yes, we all know to fold that steel 16 times, but we don’t know that it works because we’re removing impurities and creating uniform atomic lattice structures. Progress at this point relies on trial and error, but data scientists are figuring out some unifying ideas like skip connections.

What is truly exciting is the potential convergence of deep learning and computational neuroscience. Deep learning is figuring out how to build simple brains from scratch, while computational neuroscience is looking at the most complex brain (I would assume) and trying to figure out how it works. As both disciplines advance, we’ll see an increasing amount of cross-pollination between them. Is there a mathematical principle behind Hebbian plasticity? What is attention? What is it about the neural architecture of our cerebral cortex that makes it so good at representing complex ideas? These are questions we may be able to answer in the next decade.

As for general AI, I don’t think we’re anywhere close. That said, progress in science is very non-linear and a single discovery can completely change the landscape. Our blocker right now is that we fundamentally don’t understand how decision making works.

Specifically, we don’t understand how slow decision making works. While our neural networks can learn very sophisticated representations of the world, they don’t understand how to use representations to reason about other representations. Fancy AIs that pwn n00bs can react to board positions, but they’re not actually thinking. Of course, this raises the question: is thinking just a sophisticated form of reacting involving deeper networks?

To me, this is the scariest part about AI. Not the potential for species extinction, but the dawning realization that maybe our own brains aren’t that special after all. Maybe we’re not so special. Sing it with me now…

This post took much longer than expected because I had to stop and make air quotes every paragraph or so.

Hope you've enjoyed this brief foray into computer science. We'll be back to our regularly scheduled programming of yelling at The Man next post.

Papers

Goodfellow et al. — Generative Adversarial Networks (2014). [link]
Mirza et al. — Conditional Generative Adversarial Nets (2014). [link]
Gauthier — Conditional generative adversarial nets for convolutional face generation (2015). [link]
He et al. — Deep Residual Learning for Image Recognition (2015). [link]
Radford et al. — Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (2015). [link]
Perarnau et al. — Invertible Conditional GANs for image editing (2016). [link]
Arjovsky et al. — Wasserstein GAN (2017). [link]
Antipov et al. — Face Aging with Conditional Generative Adversarial Networks (2017). [link]

Blogs

Ujjwal Karn — A Quick Introduction to Neural Networks (2016). [link]
Ujjwal Karn — An Intuitive Explanation of Convolutional Neural Networks (2016). [link]
Andrej Karpathy — The state of Computer Vision and AI: we are really, really far away (2012). [link]
