Recently, hyper-realistic fake photos took social media by storm. You’ve seen people posting photos of themselves, either “old-ified” or “infantized”. And those photos are so realistic, you can’t help but wonder what image-processing magic is behind them.
After some digging, I found out the “culprit”, again, is deep learning. But this time, it is quite different from what I expected.
Deep learning applications, from Convolutional Neural Networks to LSTM Recurrent Neural Networks, mostly follow a straightforward, supervised-learning recipe. For example, training a CNN to distinguish cats from dogs requires feeding the network a whole bunch of images, some labeled as dogs and some labeled as cats. The supervision comes in when the network makes a guess: the output layer compares the prediction against the true label, and the resulting error gradients are propagated backwards through the layers, nudging the weight on each connection up or down. This process is called backpropagation, and it is paired with SGD, Adam, or other gradient-based algorithms to optimize the network. Eventually, under this supervised training process, the network learns proper weights on its edges, neurons, and so on. And the next time you give it an image of a cat or a dog, it will likely make a very accurate prediction.
This kind of supervised learning is quite intuitive, right? Backpropagation passes back the gradients of the loss between y_label and y_pred with respect to each edge, and the weights get properly adjusted. The update rule, put simply, is:

w ← w − η · ∂L/∂w

where L is the loss and η is the learning rate.
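To make the update rule concrete, here is a minimal sketch of gradient descent on a toy one-parameter linear model with a squared loss. The function name and toy data are illustrative, not from any particular library:

```python
# A minimal sketch of the update rule w <- w - lr * dL/dw,
# fitting a 1-D linear model y = w*x to toy data with squared loss.
def train(xs, ys, lr=0.1, steps=100):
    w = 0.0
    for _ in range(steps):
        # gradient of the mean squared error L = mean((w*x - y)^2)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # the update rule from the text
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x, so w should approach 2
w = train(xs, ys)
```

A real CNN does the same thing, just with millions of weights and gradients computed by backpropagation instead of by hand.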
But does training CNNs in this supervised fashion tell us anything about the images themselves? In one sense, yes: the convolutional filters learn to extract patterns, edges, and contextual or compositional information from the data; and if these elements are manipulated properly, you can even impose a painting style on a target image, such as making a random photo look like a Picasso.
However, fundamentally, we don’t have a statistical model that explains how these images are distributed, i.e., what kind of pixel distribution produces a cat image or a dog image?
And if we can get a probability density function that effectively represents cats and dogs, we can sample from this distribution to “forge” cat and dog images that are super realistic, but not real. Similarly, if we can find a distribution for human faces, we can literally generate fake human photos that are realistic but don’t belong to any real human being.
This is where generative flow models come in: their goal is to estimate the data distribution accurately. However, estimating the distribution of image data is a challenging task, mainly because of the complexity of image data: the sheer number of pixels, the different color channels, and, more importantly, the semantic relationships embedded in each image that distinguish it from another. Mathematically, our goal is to find a good representation of the data, expressed as a probability density function.
Like GANs and VAEs, flow-based models have an encoder-decoder-like structure. The main difference is that in flow-based models the encoded output z explicitly represents the log-likelihood of the original data X. In other words, the “Flow” consists of carefully chosen functions (bijective, i.e., invertible), such that the estimation of the overall data distribution is broken down into smaller, tractable distributions:

z = f(x), with z following a simple base distribution P(z) (1)

P(x) = P(z) · |det(∂f(x)/∂x)| (2)

log P(x) = log P(z) + log |det(∂f(x)/∂x)| (3)
As shown above, our target data distribution P(x) can be written in terms of a smaller estimation P(z), together with the Jacobian of a carefully chosen image transformation f(x). The log-likelihood is also formulated nicely at (3). (One key requirement is that the function f(x) must be bijective, i.e., invertible, and must also produce triangular Jacobians, so that the determinants are easy to calculate.)
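The change-of-variables formula in (3) can be computed directly for a toy flow. The sketch below uses a single elementwise affine map f(x) = a·x + b with a standard-normal base distribution; the function name and parameters are illustrative, not part of any library:

```python
import math

# A toy illustration of formula (3):
# log p(x) = log p_z(f(x)) + log |det df/dx|,
# for an elementwise affine flow f(x) = a*x + b with a
# standard-normal base distribution p_z.
def log_prob_x(x, a, b):
    z = [a_i * x_i + b_i for a_i, x_i, b_i in zip(a, x, b)]
    # base log-density: a product of 1-D standard normals
    log_pz = sum(-0.5 * (z_i ** 2 + math.log(2 * math.pi)) for z_i in z)
    # the Jacobian of an elementwise map is diagonal, so its
    # log-determinant is just the sum of log |a_i|
    log_det = sum(math.log(abs(a_i)) for a_i in a)
    return log_pz + log_det
```

In a real flow, f is a deep stack of invertible layers, and the log-determinant term is accumulated layer by layer; the structure of the computation is the same.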
But how is each small P(z) really trained?
According to the RealNVP paper, this is done with something called coupling layers. Each coupling layer leaves part of the input unchanged, and puts the other part through an invertible affine transformation (a scale and a shift, conditioned on the unchanged part); which elements go into which part is determined by binary masks, such as checkerboard or channel-wise patterns. Such transformations are easily reversible, and evaluating them takes little computation as well.
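Here is a toy sketch of such a coupling layer on a flat vector. The paper splits image pixels with checkerboard or channel-wise masks and uses deep networks for the scale and shift; this sketch splits the vector in half and lets s and t be simple hand-written functions, purely for illustration:

```python
import math

# A toy RealNVP-style affine coupling layer on a flat vector,
# split in half. s and t stand in for the learned networks.
def coupling_forward(x, s, t):
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    scale, shift = s(x1), t(x1)
    # x1 passes through untouched; x2 is scaled and shifted, conditioned on x1
    y2 = [x2_i * math.exp(scale) + shift for x2_i in x2]
    # the Jacobian is triangular, so log|det| is just the sum of the log-scales
    log_det = len(x2) * scale
    return x1 + y2, log_det

def coupling_inverse(y, s, t):
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    scale, shift = s(y1), t(y1)
    # inverting the layer needs no inversion of s or t themselves
    x2 = [(y2_i - shift) * math.exp(-scale) for y2_i in y2]
    return y1 + x2

# round-trip check with toy s and t
s = lambda h: 0.3 * sum(h)
t = lambda h: sum(h) - 1.0
x = [1.0, -1.0, 0.5, 2.0]
y, log_det = coupling_forward(x, s, t)
x_back = coupling_inverse(y, s, t)
```

Note how the inverse is exact and cheap: since x1 comes through unchanged, s(x1) and t(x1) can be recomputed from the output, so s and t can be arbitrarily complex networks without ever needing to be inverted.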
So, the data X goes into a coupling layer: some of its elements are kept unchanged, and the others are affinely transformed. The result then feeds into the next coupling layer, which does the same thing with a different split (in RealNVP’s multi-scale architecture, part of the dimensions are also factored out along the way, much like an encoder shrinking its representation). At some point, the resulting vector z is produced. One then uses (3) explicitly as the loss for the network weights and backpropagates the error. This process, similar to traditional supervised learning, actively optimizes the weights of the network so that z yields a higher log-likelihood for the data X. Once convergence (or a tolerated approximation of it) is achieved in this log-likelihood maximization, we have a data distribution that we can evaluate and sample from.
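The sampling direction is then straightforward: draw z from the simple base distribution and push it through the inverse flow to get a data sample x = f⁻¹(z). In the sketch below, a toy elementwise affine map stands in for a trained stack of coupling layers; the names are illustrative:

```python
import random

# Sampling from a trained flow: z ~ base distribution, then x = f^{-1}(z).
# Here f(x) = a*x + b elementwise, standing in for a trained flow.
def f_inverse(z, a, b):
    return [(z_i - b_i) / a_i for z_i, a_i, b_i in zip(z, a, b)]

def sample(a, b, rng):
    z = [rng.gauss(0.0, 1.0) for _ in a]  # z ~ N(0, I)
    return f_inverse(z, a, b)

rng = random.Random(0)
x_fake = sample([2.0, 2.0], [0.1, -0.1], rng)
```

This is exactly what makes flows attractive for generation: the same invertible network gives you both exact likelihoods (forward direction) and exact samples (inverse direction).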
Now, why does data distribution estimation matter? Well, it is a statistically principled way to describe data. And when we do it well, we can forge data that is practically indistinguishable from reality. These generative flow models may not be the best image classifiers, but they sure give us interesting photos and selfies to look at. Think about it: if we use generative flow models to model every aspect of our world, will machines eventually generate hyper-realistic scenery, sounds, smells, and other sensory data that we can’t distinguish from reality?