This essay will survey classes of modern generative tools available to the machine learning practitioner by way of a review of each of the original papers announcing their operation. We’ll cover some key generative models in the order that they were published, starting with Auto Encoding Variational Bayes (2013), Generative Adversarial Networks (2014), and ending in the recently announced Glow: Generative Flow with Invertible 1x1 Convolutions (2018). It should be noted that the are a few other precursors or variations upon these tools (such as autoregression and earlier iterations of flow based tools) which we will leave out of discussions in the interest of time. The hope is that in the process of review and summary of key findings of these respective papers I personally will gain a better understanding of the workings of these somewhat complex algorithms, and hopefully a reader looking to come along for the ride may expect the same. The survey of the papers will largely be conducted as a restatement and summarizing of many of the key findings of each paper, but with the addition of some explanatory prose for those cases where I felt the paper may be less accessible to the layperson. For some cases of vocabulary that may be less intuitive I expect may also link to some supplemental wikipedia reading as a backup resource — although this channel should always be taken with a grain of salt, I’ve found this is often a suitable way to quickly gain an idea of what is meant by some novel terminology, even in scientific fields. As I’ve adopted as a common practice in these writings, I expect will also include some supplemental images and music to add color to what might otherwise be at times a somewhat dry presentation. And without further ado.
Part 1 — Auto-Encoding Variational Bayes
- Based on review of Auto-Encoding Variational Bayes by Diederik Kingma, Max Welling (2013)
I think the easiest way to start this description will be to first capture an illustration of the type of capabilities that are enabled by the variational autoencoder tool before diving into discussions on inner workings of the algorithm, and in so doing I’ll use the demonstration from a more recent work which captures some improvement of image resolution vs the demonstrations captured in the original paper’s appendix.
What is being demonstrated here is that that through the act of training a variational auto encoder with an appropriate dataset, the algorithm can estimate latent variables, in this case the example of a latent variable is the range of smiles, that can then be used to progress a source image along that vector. I’ll call attention to a few points here to highlight what is being demonstrated. Note that the ranges of smiles are not only merely reshaping the mouth, the whole face is being transformed such as smile lines appearing under the eyes or or the shadowing around the cheeks. Also note that it is also possible for these progressions to be extended beyond the range of characteristics found in the training set, crossing the boundaries of training to project further at least to some degree (albeit likely becoming increasing surreal in the process). Extending on a point I tried to capture in my previous post “A Toddler Learns to Speak”, keep in mind that this demonstration doesn’t need to be constrained to facial expressions but could be extended to eventually e.g. manipulations along the axis of most adjectives in the English language for supervised training methods or perhaps even features beyond the range of our vocabulary with unsupervised training methods are incorporated. Also note that these demonstrations on images I expect will eventually be capable of extensions to other analogous applications in alternate modalities such as audio, language, etc. These observations go a little beyond what is addressed in the paper, so let’s back up and walk through the algorithmic considerations addressed by the writeup. In so doing I’ll mostly try to follow the order of discussions presented in the paper.
Some qualifications to this method addressed early in the paper are that these methods are applicable for an i.i.d. dataset and continuous latent variables. The i.i.d. qualifier is an acronym for independent and identically distributed variables which I am interpreting as a suggestion for how best to prepare a training set from which we can extract our latent variables. If our dataset has unusual distributions of characteristics then the evaluation will suffer — for example if we were looking to extract a latent vector for hair length we would want our dataset to include some range of randomly distributed lengths from bald to waste length whose inclusion is independent of other latent characteristics — if all of our bald examples included beards but none of our long hair examples did, well the algorithm would have trouble distinguishing the hair length and facial hair characteristics. For the idea of a continuous latent variable consider again the smiling face: the range of smiles has no discontinuities, whereas if we were to try and incorporate the inclusion of facial jewelry for instance, well there is no scale of whether the face does or does not include earrings, this is a binary construct. Although a qualifier here is that the paper later in section 3 gives the example of using the variational auto encoder to evaluate some Bernoulli distribution which is a binary construct by definition, so I think in this case it is not the variable that is continuous but the variable distribution (Bernoulli) parameters that are continuous, so this is still allowed too — although is not as prevalent in the demonstrations.
Once we’ve confirmed our dataset meets these characteristics, we want to form an objective function upon which we can perform our optimization. The purpose of this objective function will be such that we can derive a maximum likelihood for update to our global parameters (e.g. for an image the global parameters would be the RGB values of our pixels) based on some variation in our latent vector space (e.g. the latent vector such as the extent of a smile vector). Note that the paper calls out that we will have a stochastic objective function, meaning there will be some variables with a randomization of output subject to some distribution — in fact the application of Bayesian inferences via conditional probabilities for dealing with the stochastic nature of our objective is a key ingredient.
I’m going to try to dive a little into the equation parameters that form the basis of the training operation, but am mostly going to avoid the full derivation in interest of an attempt at a textual treatment with only a few examples of key equations— the full derivation is included in the paper and I wouldn’t be adding much by simply restating what was already published with precision, but the hope is that an attempt to describe the operation of this equation I can support the comprehensibility of the formal presentation from the paper while only including a selection of key equations, thus a reader looking to get the full value here should consider reviewing this work in adjacent or parallel to each of these papers which are freely available on arXiv.
The key variables that form the basis of our evaluation follow. X is our training dataset of N samples xi of variable x. The basis of our latent property is z, realized as a continuous random variable (e.g. our example’s “smile vector”) from which we can represent discrete points as zi. We have two models that are jointly trained in the act of optimization, the generative model parameters θ and a separate set of variational model parameters φ which we’ll call the recognition model. We’ll say the probability of observing some distribution of generative model parameters θ is described by pθ, and qφ will be the comparable probability associated with the recognition model parameters φ. The probabilities pθ and qφ will be expressed based on some state found in our training or latent sets such as some probability like pθ(z) or some conditional probability such as pθ(x|zi). With these ingredients we can start to build our objective function.
The objective function that we’ll try to optimize is derived by first evaluating a marginal likelihood for a set of N generative model probabilities as sum(logpθ(xi)), which can be re-expressed as a sum of two figures, the first of which being a Kullback–Leibler divergence (KL divergence, abbreviated to DKL in equation below) between the conditional probabilities qφ(z|xi)|| pθ(z|xi) and the second being the expected value for a set of parameters which can used to describe the lower bound for this marginal likelihood, defined below. In deriving the objective function, we’ll specifically focus on the second portion of this derivation (the lower bound).
We’ll be optimizing (minimizing) this lower bound L with respect to both the generative model parameters θ and variational model parameters φ, of which the φ set is the more challenging of the two as a usual (naive) Monte Carlo gradient estimator for this type of problem exhibits high variance and is thus undifferentiable with respect to φ (which I am inferring is due to the use of nonlinear activations in the network). The trick to the method is to construct a differentiable estimator by “reparameterizing” the conditional probability qφ(z|x), where so long as z ~ qφ(z|x), with z as a continuous random variable, we can approximate z as some deterministic variable derived from some auxiliary parameter ϵ we’ll introduce for this purpose, as z ~ gφ(ϵ, x), with this function g parameterized by the variational model parameters φ but as a function of our auxiliary variable ϵ (a stochastic variable normalized such that an associated p(ϵ(l))~ϵ(l) ) and our training variable x. Using this trick we plug our estimates of qφ(z|x) into the formula for lower bound, which gives us the ability to calculate gradients for our objective function, and thus the ability to jointly train our generative and variational model parameters θ and φ for update of our global parameters x based on a latent property z.
Of course the potential to perform this reparameterization trick is not universal, and it turns out there are some restrictions of the type of distributions of qφ(z|x) that allow this method. The paper gives three examples — those series of distributions with a tractable inverse CDF (cumulative distribution function), those distributions analogous to Gaussian, or some associated composition distributions. (An external resource is sited for solutions of added computational complexity for distributions outside of these sets).
The details of the training operation are further spelled out in the paper, with considerations such as parameter initialization, sampling considerations for the dataset, and the like. I’ll leave these other considerations for the formal write-up, as the goal here is not to duplicate all of the findings of the paper but instead to highlight a few of the key concepts as well as to project a little from what is covered. I think a key takeaway I have that is perhaps not fully covered by this writeup is with respect to the limitations of the technique for inferring variational distributions against some latent vector. The fact that we are merely estimating the lower bound as opposed to deriving a full representation of marginal likelihood I think is what prevents us from making this a reversible operation, and I am thus inferring that the generation of each one of the progressions along the smile vector that was given by our demonstration may require a separate trained encoding operation.
Part 2 — Generative Adversarial Networks
- Based on review of Generative Adversarial Networks by Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014)
The second generative method we’ll visit is one that is I suspect a little more widely known, the GAN architecture aka Generative Adversarial Network. Whereas the application of Variational Autoencoders we explored was based on the transformation of some prior image along the axis of one of it’s latent variables, the GAN approach has a wider band of potential output, for although it’s generations are still grounded in properties realized through evaluating the representations provided in a set of training data, the developed creations are potentially fully novel in realization, with a developed form consistent in characteristics if not matching any specific pixels. To illustrate consider a demonstration this blog has pulled from before, from a 2018 paper published by researchers at NVIDIA.
The images here are novel depictions of imaginary faces derived by training a GAN on a dataset featuring a collection of celebrity images. While the realism and resolution of these images are certainly impressive (partly realized via algorithmic feature innovations extending beyond those considered in Goodfellow’s paper), it is probably worth an asterisk that the challenge of training facial depictions carries a somewhat lower bar compared to other potential applications. The tendencies for training and target depictions to feature front-facing people near eye-level means the algorithm need not fully represent a 3D world model as might be necessary for rotations or different perspectives of view. Although we humans have very well-honed ability to recognize deviations in facial features, in reality the range of eye/nose/mouth relations seen in the population is somewhat limited, albeit the increased resolution of this still impressive demonstration necessitates some added model complexity such as facial hair, smile lines, or lighting.
From an algorithmic perspective, the true innovation of the GAN architecture is the development of a generative training approach which does not require stochastic optimization. When we look back at the variational auto encoder considered the previous section, we found a requirement to incorporate some Markov Chain Monte Carlo technique into the analysis (Kingma’s paper was a little vague on this point but this became pretty clear through subsequent literature review such as this GAN paper under discussion). Consider a stochastic optimization as one applied where the objective function’s fitness landscape is not constant in layout but has some parametric probabilistic distributions of arrangements along the various axes making it unsuitable for gradient based methods. Somehow in the development of this technique, the dual training of two adversarial models tames what would otherwise be an intractable problem of solving a problem stochastic in nature. Part of the novelty of the approach is the introduction of a noise variable z of distribution p(z), which we’ll use to inject some randomness into the operation of the two models’ interplay.
So let’s use this as an opportunity to transition to the question of exactly what are the two adversarial models of the technique, and I suspect a lay reader may find this paragraph a touch more intuitive than the one preceding. In a GAN network we first have a generative model which is designed to create novel representations of characteristics found in the training set — one that we’ll consider as a feedforward multilayer perceptron (although as the paper notes there is also potential to incorporate an undirected network such as a Boltzmann machine in this place). On it’s own, this network probably wouldn’t amount to much with contemporary training methods, and the genius of the approach is the incorporation of a second model, such that through the adversarial interplay of the two models we resolve some of the stochastic intractability of extracting the training set’s characteristic features. This second model is one we’ll call a discriminator, and it has a very simple objective function: given some input (such as for our example some image data intended to represent a celebrity face as might be produced by our generator), it will output some value between 0–1, where 1 would represent a confirmation with certainty that the input originated from the training corpus, a 0 representing with certainty the counter assertion, and a value of 1/2 representing D’s complete uncertainty between the origination of the input between he training data x or the generated data g.
The weightings of our two models are continuously updated through the application of our algorithmic ‘for’ loop for some number of training iterations, which I’ll briefly describe here. It is worth note that the paper recommends updating the discriminator model some integer k (a hyperparameter) times in between each single update to the generator model — apparently the generator model is easier to overfit. So for k steps we start to update our discriminator model by drawing a mini batch of m number of samples zi from our noise distribution p(z) which we’ll use to seed our generator model. We’ll then sample a comparable mini batch of m training data points xi (whose structure is equivalent to the output of our generator model). The weightings of the discriminator model are then updated via backpropagation by applying a stochastic gradient derivation with the following objective function:
Here the gradient of θ represents the slopes along the weightings of our discriminator’s neural network D, with D(x) representing the application of our discriminator to the actual training data and D(G(z)) representing the application of our discriminator to the output of our generative model G seeded with some value from our noise distribution z. So once we’ve applied our updates to the weightings of the discriminator k times, we then turn to update the generator model weightings, evaluated with some new sampling of m points zi from our noise distribution p(z).
This cycle of updating the discriminator actor k times followed by a single update to the generator is repeated through the number of training iterations (another hyperparameter) and upon completion we’ll have a generator model G which can provide output data given some seeding z. Left to run in this loop indefinitely, we would eventually reach a state where the the two models reach an equilibrium in their training, which in the paper is shown to correspond to the state where D cannot distinguish between either set and thus D(x)=D(G)=1/2. Note that the update to generator model weightings in this second objective function did not draw from any sample images from the training data, the only interface the generator model has with the actual training data is secondhand — through the incorporation of the discriminator output into the objective function which itself had been trained on the actual data from our training set. I suspect this type of arms length access between the training of the generator and the actual training data via a trained discriminator may help prevent overfit of the generator, and I’ll offer in closing that am left to wonder if this type of tiered training, with the inclusion of an injection of randomness via some noise function, could be generalized to a universal gradient based stochastic optimization function.
Part 3 — Glow: Generative Flow with Invertible 1×1 Convolutions
- Based on review of Glow: Generative Flow with Invertible 1×1 Convolutions by Diederik P. Kingma, Prafulla Dhariwal (2018)
The final generative tool we’ll address is one published by some of the team at OpenAI, known as Glow. The Glow paper was only published earlier this month, so it is fairly new tech and extends some of the capabilities as realized by those discussed prior. It’s kind of fitting that we get to close out with this paper, as it does a neat job of putting these generative tools into perspective, including some discussions about what is enabled by each paradigm. It’s also probably worth pointing out that this paper shares an author (Diederik Kingma) with the first paper we reviewed. The paper first offers that some of the key challenges facing the field of machine learning, such as the ability to learn from a small number of data points or the ability to generalize in between changes in task or context, can partly be overcome with generative models. Through the training of a generative model we can extract unlabeled features of a corpus and develop a realistic world model for the associated domain — whether that domain be limited to images, text, music, or physics for instance — or I speculate eventually some meta-domains capturing multiple of these modalities. The idea being that we gain capability not only in the original act of generation, but downstream tasks built off that world model will benefit as well, with such models generally realized in the form of a very high dimensional joint probability distribution.
The paper notes that there is a fundamental distinction between the GAN approach to generation and those likelihood based methods such as the variational auto encoder or the Flow technique which was a precursor to Glow (with the paper also touching on autoregressive models with are not included in this discussion). One limitation of the GAN technique is lack of allowance for representation of data in a latent variable space such as may allow for interpretations between data points and associated modifications. Variable auto encoders have some advantage here as they may infer estimates to latent correlation with data, albeit limited to the lower bound for the log probability of some correlation. The strength of the Glow technique is thus the ability to not only estimate the lower bound for the probability associated with some datapoint’s correlation to a latent variable, but also to optimize the full representation of that likelihood which enables reversibility in modifications. The paper also notes benefits of the Glow technique with respect to ability to parallelize training / synthesis along with memory savings in computing gradients. Consider the really neat demonstration available on the OpenAI blog, which allows a user to upload their own images for Glow based variable mixing or manipulations.
I found the technical narrative of this technique somewhat more complex than those preceding. I’ll try to address the high points of the architecture here but do encourage any reader to supplement this essay with the more formal treatment addressed by the paper either adjacent or parallel to this discussion. The paper starts by describing the key ideas for the Flow generative method, which was originally introduced in a 2014 paper under the name “NICE” and apparently since rebranded. Once we get through the Flow operation we’ll incorporate some additional concepts that extend the Flow technique to the reversible Glow method.
For operation of the Flow method, a precursor to the Glow technique, we start by constructing an objective function of log likelihoods of our model output probability pθ(xi) summed for each randomly selected xi from a high dimensional i.i.d. dataset with an unknown distribution, with model parameters θ. This log likelihood objective can be applied to a series of discrete points or extended for cases of continuous data (for which treatment I’ll defer to the paper), and optimization of this model can be addressed through stochastastic gradient methods applied to minibatches. The log likelihood for each xi of logpθ(xi) can be estimated by the sum of the log likelihood of our distribution pθ applied to a latent variable z as pθ(z) plus the log-determinant of our transformation matrix, where z is the latent variable distribution with some tractable density. This transformation matrix is the result of applying a series of invertible transition functions gθ such that z = fθ(x) = gθ^(-1)(x). For these transformation functions f or it’s inverse g, we can think of them as a series of K invertible intermediate transformations, of which for instance the application of invertible function f to transform from x to z could be thought of as first the application of f1 to transform x to h1, then f2 to transform from h1 to h2, and etc until we apply the final transform fK to output z (or vice versa with a series of K transforms gi to transform from z back to x). The log determinant matrix for change in log density for z with respect to x can be restated as the sum over K of the log determinant for change in hi with respect to h(i-1). One of the key ideas introduced here is that this log determinant becomes tractable when we constrain our allowed transformations such that their Jacobian dhi/dh(i-1) must be a triangular matrix (i.e. a square matrix with all zeroes either above or below the diagonal). By introducing this constraint to our transformations, we can evaluate our log determinant with only the values from the diagonals of the Jacobian matrix dhi/dh(i-1). It turns out that many square matrices can be described as a product of some corresponding upper and lower triangular matrices.
The paper then goes on to propose some extensions of this Flow technique, utilizing a series of algorithmic steps including the incorporation of 1x1 convolutions and in the process simplifying the architecture. In each step of flow, first, as an alternative to batch normalization in training deep networks, it proposes what they call an actnorm layer which normalizes activations to zero-mean and unit variance, and is more suitable than batch normalization for small number of minibatches per processing unit such as we might come across when parallelizing the operation, with the associated scale and bias for this normalization initialized by pretaining on the data such that during the following algorithmic progression they are then independent of data samples. The next step of flow achieves a permutation between variables, ensuring that the attributes have the ability to influence each other between dimensions, via application of a learned series of 1x1 convolutions (the paper uses some experimental results to demonstrate that this 1x1 convolution application is a more efficient type of permutation than some alternatives such as a reversing operation or shuffling) — in fact it is this novel approach to permutation via 1x1 convolutions that is the main contribution of this paper. The third part of each flow step is the affine coupling layer as an affine transformation, which is defined in the paper’s Table 1. The split operation here I believe is referring to a split of x into corresponding lower and upper triangular matrices L and U. I was a little confused at first as to why ya was treated differently than yb in the affine transformation definition from table 1 but I think the reason has to do with the way that we derive our triangular matrices in the split, L is derived so as to have diagonal values of all ones and U is derived so as to have all zeros in the diagonal. The s variable that comes up in the definitions for each of these three steps is a vector derived from the 1x1 convolution weight matrix’s (W) LU split. It turns out that this LU split affords a material improvement to the computational cost of deriving the determinant of our weight matrix W.
I found the corresponding OpenAI blog post helpful in understanding the implications and potential of this technique, as it did cover some additional material not directly addressed by the academic paper, so I’ll attempt to quickly gloss a few of those aspects here. The new potential to infer latent vectors in a reversible fashion means that we will have efficient inference and synthesis along with the ability to parallelize both aspects coupled with memory savings for the computation of gradients. Especially interesting was the idea of a useful latent space for downstream tasks. I think the buried lead here is that this type of linear transformation matrix will enable us to learn the optimal perturbation for a given task. Consider here an arbitrary hypothetical example for the modality of images: internet advertisers will have the ability to efficiently feed custom generated images from some seed input to individual consumers based on their personal demographic profile or browsing history. Now consider this capacity extended to other modalities like text or audio. I expect this is the type of capability that will be enabled with this tool. Hang on to your seat it’s going to be a wild ride.
Albums that were referenced here or otherwise inspired this post:
Preservation — Preservation Hall Jazz Band (CD)
(As an Amazon Associate I earn from qualifying purchases.)