Let’s Read Science! “StackGAN: Text to Photo-Realistic Image Synthesis”

Nick Barone
29 min read · Sep 8, 2017

--

An amazing part of the modern world is that when you encounter a new development in the world of science, with only a bit of digging, you can generally find the direct source material — the science paper itself.

And with that same power, you can just dive in; researching the words, terms and ideas that are unfamiliar. It’s a remarkable way to learn.

So - let’s do that, because shoving dense, barely comprehensible information into our heads is fun!

This episode: “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”

Fun Fact: 7 people at 4 universities on 2 continents worked together on this paper!

Short version: Some smart people at a bunch of universities got a computer to turn phrases (“this bird is black with green…”) into pictures, using something (apparently) called “stacked generative adversarial networks”.

“Stage I” and “Stage II” get explained early on.

I don’t know what most of that means. Let’s get started!

Abstract

“…In this paper, we propose stacked Generative Adversarial Networks (StackGAN) to generate photo-realistic images conditioned on text descriptions. The Stage-I GAN sketches the primitive shape and basic colors of the object based on the given text description, yielding Stage-I low resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high resolution images with photo-realistic details.”

This gives a pretty solid idea as to the basic concept; a concept that is pretty neat and remarkably similar to how many artists (say, painters) work: take your idea, turn it into a sketch, and then start adding the details.

If you’re having trouble understanding the paper’s intro paragraph, realize you don’t have to understand all of it at once. For example, think of it more like this:

“In this paper, we propose THIS THING to generate photo-realistic images conditioned on text descriptions. The THING ONE sketches the primitive shape and basic colors of the object based on the given text description, yielding [redacted] low resolution images. The THING TWO takes THING ONE results and text descriptions as inputs, and generates high resolution images with photo-realistic details.”

And now the rest of this post will be figuring out what these “THINGS” actually are.

(reads some more)

The rest of the abstract tells us a lot about the current state-of-the-art: small size, maybe not so “plausible”; and what training and verification datasets they used to do all this (“CUB and Oxford-102”). I think we can get away with not understanding what those data sets are; if we do need to know, we’ll turn to Google when that becomes apparent.

If you don’t already know something about machine learning, you might not know what datasets are. They’re roughly (but badly) analogous to “the textbook and the test”; AFAIK, most datasets contain both. (I don’t expect “CUB” to be the textbook and “Oxford-102” to be the test; you’d use part of CUB as the textbook and part of it as the test.) The two parts are the same sort of stuff, in this case probably images with descriptions. Part of each data set gets fed to the computer so the software can “learn”, and the rest is used as test questions to see whether the computer actually did learn.
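To make that concrete, here’s a tiny sketch of the textbook/test split in Python. The file names, descriptions, and the 80/20 ratio are all made up for illustration; the real datasets have their own official splits.

```python
import random

# Hypothetical dataset: (image, description) pairs; names and size are made up.
dataset = [(f"bird_{i:04d}.jpg", f"a description of bird {i}") for i in range(1000)]

random.seed(0)
random.shuffle(dataset)

# Most of the data is the "textbook" (training set); the rest is held back as the "test".
split = int(0.8 * len(dataset))
train_set, test_set = dataset[:split], dataset[split:]

print(len(train_set), "training pairs,", len(test_set), "held-out test pairs")
```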

Introduction

“The main challenge of this problem is that the space of plausible images given text descriptions is multi-modal. There are a large number of images that correctly fit the given text description.”

Reed et al. demonstrated that GAN can effectively generate images conditioned on text descriptions… However, their synthesized images in many cases lack details and vivid object parts, e.g., beaks and eyes of birds. Moreover, they were unable to synthesize higher resolution images (e.g., 128 × 128) without providing additional spatial annotations of objects.

It sounds like the main issue with getting computers to paint is pretty similar to the one every person encounters (note: I am also not a painter, so I could be horribly wrong) — in the beginning, when you’ve got that blank canvas there in front of you… where do you start? Where do you actually want to put everything? And once you’ve started, you’ve actually got to have the skill to paint the thing you’re trying to paint.

Fun fact: Picasso also did realist paintings. They’re really incredible, and I recommend you go check them out. It’s weird to realize, but when you look at modern art paintings, there is an intense amount of skill that goes into painting a square of color exactly the way you wanted.

(reads some more)

“To tackle the challenges, we decompose the problem of text to photo-realistic image synthesis into two more manageable sub-problems”

“A low resolution image is generated using our Stage-I GAN… [then] Stage-II GAN [generates] realistic high resolution images conditioned on the low resolution image and the corresponding text description.”

Since Stage-I GAN generates rough shape and layout for both the object and background, Stage-II GAN only needs to focus on drawing details and rectifying defects in low resolution images. This task is much easier than directly drawing a high resolution image from scratch. By conditioning on the text again, Stage-II GAN learns to capture text information that are omitted by Stage-I GAN and draws more details for the object.

I cheated a little bit — I’d started to read this paper before I started trying to blog about reading this paper, and that’s why I’ve mentioned the “sketch-then-paint” method for art. (I also noticed that way down at the end they have a lot of example images)

There’s also this bit embedded in there: “Stage-I… from a random noise vector”. Turns out, there’s a lot of places where random is actually pretty excellent (say, load balancing) and hard to beat. If you (like many of us) got your start on machine learning with Google’s Inceptionism post, I’m betting this is the same initial trick that they used: feed the software a bunch of “snow”. Seems reasonable, like staring at the stucco until you start seeing faces (and then in this case, painting those faces).

So they’ve basically got one part of the program making a rough sketch — choosing where to put the different stuff that’s supposed to be there, the rough colors to use, etc. — and then a second part to come fill in all the details. In fact, they name this division of labor as the main contribution they’re making.

And again, it has some real similarities to the normal artist’s process — although there’s no indication as to whether they were inspired by this process, or came up with something so similar independently:

Example from the paper
http://www.stars-portraits.com/en/tutorial/drawing-eye.html

(reads the rest of the section)

Related Work

(reads the first paragraph)

deconvolutional neural networks… deterministic neural networks as function approximators… Variational Autoencoders (VAE)… DRAW model Autoregressive models (e.g., PixelRNN)… Generative Adversarial Networks (GAN)

I have only an inkling as to what each of these things is. However, at this point, I don’t yet turn to Google — it seems unlikely that I’d need to understand each of these to understand what’s going on in this paper, so let’s keep going and see if we can either narrow the list down, or directly learn about the ones we’re going to care about.

(reads the next paragraph)

Built upon these generative models, conditional image generation has also been studied.

This fills in a bit of the confusion as to what the last set of things are, although it seems pretty obvious now — as named, they’re all approaches to making a computer generate an image. “Generative” meant exactly what it would seem to mean.

(reads about half the next paragraph)

Besides using a single GAN for generating images, there is also work that utilized a series of GAN for image generation. Wang et al…. proposed S2-GAN

So, neat: these aren’t the only people thinking about how to “layer” the effects of generative models (which we just figured out). Although they do describe the difference between StackGAN (theirs) and S2-GAN (Wang’s), it’s really not clear to me at this time what that difference is.

(reads the rest of the paragraph)

Denton et al. [3] built a series of GAN within a Laplacian pyramid framework.

Well now, that’s a bunch of big words — I used to know what a Laplacian is (it’s a mathematical transform — turns a set of numbers into a different set of numbers so you can learn something about the original set of numbers), and I’ve seen pyramid frameworks before, although this one sounds different:

At each level of the pyramid, a residual image was generated conditioned on the image of the previous stage and then added back to the input image to produce the input for the next stage

Annnd I’m still guessing as to what exactly “conditioned” means; but at least the rest is decently clear; the pyramid framework is just collecting the outputs of a recursive function. We also don’t know what “residual image” means, but again; not worried.
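To make the “residual added back” idea a bit more concrete, here’s a toy numpy sketch of one pyramid level (my own toy version, not Denton et al.’s actual setup): shrink the image, blow it back up, and keep whatever detail got lost as the residual.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # stand-in for a real image

# Go down one pyramid level: average each 2x2 block into one pixel.
low_res = image.reshape(4, 2, 4, 2).mean(axis=(1, 3))

# Come back up: nearest-neighbor upsample the low-res image to full size.
upsampled = np.repeat(np.repeat(low_res, 2, axis=0), 2, axis=1)

# The "residual image" is whatever detail the low-res version lost.
residual = image - upsampled

# Adding the residual back onto the upsampled image recovers the original exactly.
print(np.allclose(upsampled + residual, image))  # True
```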

Okay — before we move on to the next section, let’s list the things we don’t know that we just saw, and see what we can learn from a bit of time on Google Search. I’m literally just going to take each name from above and search for it; then poke around from there.

  • Deconvolutional Neural Networks

It becomes quickly apparent that first we need to understand “convolution”. This is actually one of the coolest things I’ve encountered in math — it’s a neat way to “combine” two mathematical functions:

https://en.wikipedia.org/wiki/Convolution

There’s a deeper explanation (and of course math) but these gifs really do it justice. The use I know of for convolution is system response; let’s say you’ve got a (simple) circuit. You can turn the entire circuit into an equation that describes what happens when you power it up. Convolution then lets you figure out what happens when you power it up differently: are you ramping up the power? flicking it on? is it AC power? Is it a signal?

Okay — but how does this work when applied to neural nets? It took a bit of digging, but here’s what I found: Convolutional neural nets have a bunch of input “neurons”, and those neurons don’t actually receive a pixel from the image directly — they receive a convolution of the image with some 2-dimensional function. These do a good job explaining:

https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
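And here’s a tiny sketch of the basic operation itself (mine, not from the links above): slide a small filter over an image and sum up the overlap at each spot. I’m using scipy’s convolve2d, which flips the filter first, since that’s what makes it a true convolution; the “vertical edge” filter is just something I made up.

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny "image": a dark left half and a bright right half.
image = np.array([
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
], dtype=float)

# A small filter that responds to vertical edges.
edge_filter = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

# Each output value is the overlap sum of the (flipped) filter with a 3x3 patch.
response = convolve2d(image, edge_filter, mode="valid")
print(response)  # the largest-magnitude responses sit right on the edge down the middle
```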

Ohkay — last step. Deconvolutional neural networks. I didn’t find anything as clear as for convolutional networks, but I think I pieced together an understanding from “deconvolution is the inverse of convolution, like division is the inverse of multiplication”, this Quora question, and this slide deck. It looks like a deconvolutional neural net redraws the initial image in “terms” that the neural net “thinks in”:

http://cs.nyu.edu/~fergus/drafts/utexas2.pdf

If I’m reading this right, the network is convolving the input image with the 1st layer filters — that gets you what’s up top, where you can see where each convolution had the most overlap with the image. Then you, I guess, run the whole operation in reverse, and get out the reconstruction, from the “understanding” the neural net established via its filters.

Well that was a rabbit hole. Next!

  • Deterministic neural networks as function approximators

Unsurprisingly, simply googling the entire phrase didn’t get much. It did turn up this paper, but I’m having trouble with just the abstract, so let’s try something else. This is a two-part phrase, and we can look at each:

“Function approximators” is pretty clear (and is backed up by what I did find with the whole phrase): you write a simpler function that is close to, but not exactly the same as, some other more complicated function. It’s like how checking the temperature by sticking your hand outside is an approximation for getting a thermometer and checking that. So it seems decently clear that this is a neural network that approximates some function — but what function, and what’s a deterministic neural net?

Googling around a bit indicated what the answer might be, and then some more googling confirmed it: there are “stochastic” and “deterministic” neural nets. Stochastic straight-up means “random”; a stochastic neural net incorporates randomness into its workings. “Deterministic” means that given an initial situation and what happens, you can determine the new situation; in other words, non-random. This isn’t to say that stochastic neural nets produce random outputs; not at all. It’s like deciding what’s for dinner by rolling a die versus seeing what’s on sale: Both get you fed, but one is “powered” by randomness and the other by cause/effect.
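A toy illustration of that die-roll versus what’s-on-sale distinction, with two made-up one-layer “networks”:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.5, -1.0, 2.0])

def deterministic_net(x):
    # Same input in, same output out, every single time.
    return np.tanh(x @ weights)

def stochastic_net(x):
    # Same input in, but a dash of internal randomness changes the output each call.
    noise = rng.normal(scale=0.1, size=x.shape)
    return np.tanh((x + noise) @ weights)

x = np.array([1.0, 2.0, 3.0])
print(deterministic_net(x), deterministic_net(x))  # identical
print(stochastic_net(x), stochastic_net(x))        # slightly different each time
```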

  • Variational Autoencoders (VAE)

I have high hopes for this being “easy” — this is a pretty weird combination of words, we’re likely to get good search results.

…So part of that was true — there are many good search results, but I’m not really able to follow pretty much any of the explanations. This is the best understanding I have as to what VAEs are:

https://jaan.io/what-is-variational-autoencoder-vae-tutorial/

The encoder and decoder are neural nets of some kind. Send data in, get a representation of that data in terms of Z (“in Z-space”). Take some representation of data in terms of Z, run it back out through the decoder, and get back something similar to the original data you sent in. Train the whole system to minimize the difference between what you send in and what you get out.

Think of it like the written versus the spoken word. You say something (data in); someone else writes down what you said (representing the sounds in the “z-space” of letters); someone else can now repeat the sounds you made using the letters (decoding back out of z-space).
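Here’s a rough numpy sketch of that loop, with made-up toy “networks” standing in for the real encoder and decoder (a proper VAE also samples z from a distribution instead of computing it directly, and actually trains everything, which I’m skipping):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": squash 6 numbers of data down to 2 numbers of z-space.
W_enc = rng.normal(size=(6, 2))
# Toy "decoder": expand 2 numbers of z-space back out to 6 numbers of data.
W_dec = rng.normal(size=(2, 6))

data_in = rng.normal(size=(1, 6))      # something you "said"
z = data_in @ W_enc                    # its written-down z-space representation
data_out = z @ W_dec                   # someone reading it back out loud

# Training a real autoencoder means tweaking W_enc and W_dec to shrink this gap.
reconstruction_error = np.mean((data_in - data_out) ** 2)
print(z, reconstruction_error)
```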

I’m still not sure how this can be used to generate images, but let’s move on.

  • DRAW model Autoregressive models (e.g., PixelRNN)

I’ve made the least headway understanding this one (might make a good post after this!). Top result tends to be this paper: https://arxiv.org/pdf/1601.06759.pdf — which seems “clear”, if you already know the terms they’re using (I don’t). From what I can put together from what I can understand looking over the paper — and what Google teaches me about “autoregressive models” — it seems like they’re using neural networks to predict what a pixel should look like given… the other pixels around it? some other input variable? I’m going to draw a conceptual box around it and apply the label “totally not a markov chain”.

(A Markov chain is where you predict what comes next based on observations of what usually comes next — like how “u” usually follows “q”, or “get off” precedes “my lawn”. AFAIK, it’s what powers the next-word recommendation on your phone.)
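And since that one is actually easy to show, here’s a minimal next-word Markov chain over a made-up corpus, just to show the “what usually comes next” bookkeeping:

```python
from collections import defaultdict, Counter

corpus = "get off my lawn . get off my porch . get on my lawn".split()

# Count what word usually follows each word.
next_counts = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    next_counts[word][nxt] += 1

def predict_next(word):
    # Pick the most common follower, roughly what a phone keyboard suggests first.
    return next_counts[word].most_common(1)[0][0]

print(predict_next("get"))   # "off" (seen twice, vs "on" once)
print(predict_next("my"))    # "lawn"
```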

  • Generative Adversarial Networks (GAN)

This seems like the important one to understand, since it’s what’s referenced in the title of this paper (“StackGAN”). Thankfully, it also seems to be the simplest to understand. Like the autoencoder, there are two neural nets, but in this situation they’re called the “generator” and “discriminator”. One makes things, one tries to discern whether those things are real. I found a good analogy while learning this one:

The analogy that is often used here is that the generator is like a forger trying to produce some counterfeit material, and the discriminator is like the police trying to detect the forged items. — John Glover

Basically — create an arms race between these two different neural networks, and watch them both get better. Think about how the “are you a robot” detection methods have changed over the years: Captcha, ReCaptcha, “I am not a robot” checkboxes, etc. Programs have gotten better at pretending to be people; people have gotten better at detecting programs. This is “adversarial learning”.

So, “Generative” since you’re making things (generating them), “adversarial” because it’s two neural nets trying to surpass each other, and “network” because (and I’m guessing here) the combination of two networks is itself a network.
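Here’s a toy numpy sketch of the forger/police setup, with heavily simplified, untrained stand-ins for the real networks (the shapes, weights, and 1-D “data” are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_real(n):
    # "Real" data: 1-D samples the forger should learn to imitate.
    return rng.normal(loc=4.0, scale=0.5, size=(n, 1))

def generator(z, w=2.0, b=1.0):
    # The forger: turns random noise z into a counterfeit sample.
    return w * z + b

def discriminator(x, v=1.0, c=-4.0):
    # The police: outputs a probability that x is real.
    return 1.0 / (1.0 + np.exp(-(v * x + c)))

z = rng.normal(size=(5, 1))                   # random-noise "inspiration"
fakes = generator(z)
print(discriminator(sample_real(5)).ravel())  # hovers around 0.5 for real samples
print(discriminator(fakes).ravel())           # mostly low: these fakes are easy to spot
# Training alternates: nudge D so reals score high and fakes low, then nudge G to fool D.
```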

Ohkay — before we move on from “Related Work” to “Stacked Generative Adversarial Networks” (aka, the meat of the paper), let’s summarize what we’ve learned about so far. A good practice here is “explain it like I’m five”; basically, can you explain it without using special-purpose words?

We have some people (the researchers) trying to get a computer to make a picture basically on demand — you can say, “I want <this kind> of picture”, and the computer goes and draws it. This is (as people learned) hard, because you have to do two things:

  • Lay out the picture — is the bird in the middle? which way is it looking? is it flying, or on a branch?
  • Create realistic details — does this actually look like a bird, or a branch, or the sky?

It seems reasonable to rephrase this as: This is hard because first, the computer needs to fill in all the conceptual details that you didn’t supply, and then also fill in all the realism details a conceptual description can’t supply.

Seems clear enough. Let’s keep going.

Stacked Generative Adversarial Networks

Finally, we’re here! Big diagram, some math, looks exciting. Let’s dive in!

(stares at Figure 2, reads the first paragraph and two bullet points)

First thing is a rough description of the StackGAN stages. They should look familiar; this is a more formal description of what we read in the Abstract, and I’ve been keeping this structure in mind when explaining the parts we’ve already read.

(reads the first paragraph in section 3.1)

The generator G is optimized to reproduce the true data distribution pdata

The “true” here feels significant, but not in a way I can identify. We’ll keep an eye out for this term as we go, and see if other contexts shed enough light on it to understand.

Now let’s figure out how to read this math:

min_G max_D V(D, G) = E_{x~pdata}[ log(D(x)) ] + E_{z~pz}[ log(1 - D(G(z))) ]

(the GAN objective from page 3, bottom left of the paper)

It’s not the first thing to notice, but notice that although the right-hand side of the equation is written on two lines, it’s really just one line. It’s just two (big) terms added together; minmaxV(D,G) = Thing1 + Thing2.

The next thing I notice is that D and G appear in two ways: as terms (in V(D,G)), and as functions (in D(X) and D(G(Z))). This helps fill in what’s going on on the left side; minmaxV is a function that takes functions (as opposed to variables or numbers or some such).

Reading the next paragraph, we get that the pdata is the training data (real images), while pz is the random noise we’ll use as inspiration.

Finally, this equation is named as the “objective function” of a “two-player game”. If you’re at all familiar with the Prisoner’s Dilemma, it sounds like that kind of thing: There’s a situation with two players (people, things that do things, etc), and we can represent how well each player does using a number (their score in the “game”). We can then try to describe the entire situation as an equation; each player can then use that equation to describe their score, with the possible actions by the players represented as the variables. (Or, that’s my mediocre understanding of game theory math.)

Now — I’m still a bit confused by minmaxV; how do the subscripts of G and D apply here? Is the entire function called minmaxV, or are there three terms? (min, max, and V)? Well, let’s consider: G is pretty clearly “generator” and D is pretty clearly “discriminator”. Going back to the text descriptions:

The generator G is optimized to reproduce the true data…

D is optimized to distinguish real images and synthetic images…

Ah! When you want to reproduce something, you want the difference between the reproduction and the original to be minimal. But when you want to distinguish between things, you want to zoom in on all the differences; maximizing them.

Combining this with our guess as to what an “objective function” actually is, maybe what this means is that G wants to minimize V, while D wants to maximize it…? Seems like a reasonable interpretation, but it’s hard to know for sure without more of a background in the relevant sciences.

Up next — E x~pdata. I don’t know what the E means, but the next bit means “an X from pdata”; in this case, an image from our set of training images (pdata).

Up next — log(D(x)) For those of you that don’t know (or remember) what a logarithm is, here’s a graph:

https://en.wikipedia.org/wiki/Common_logarithm

Take a close look at the bottom scale there. Logarithms are sort of a measure of the “bigness” of a number; 1–10 is small (say, 0..1), 10–100 is medium (1..2), 100–1000 is big (2..3). But it makes a pretty huge difference whether we’re thinking about log(x) with x between 1 and 10, or with x above 10, or with x less than 1. An x between zero and 1 turns into a negative number, and given the 1-D(G(Z)) expression (you can’t take the log of a negative number), it seems like we should expect D(x) to be a number between 0 and 1.

But, this equation still isn’t clear; I’m still not sure what E means. Staring at it some more, I notice that the two terms aren’t heavily related; the top term uses x, and the bottom uses z. The values of the two terms don’t really affect each other, except through D. It seems like E would mean some sort of accumulation; the sum of the results of the log(D(x)) expression across the possible values of x, or maybe a percentage (the percent of the time that the discriminator is correct on the training images?)

I’m honestly not sure, and am having a devil of a time puzzling it out. Let’s acknowledge our confusion (“I’m confused and that’s OK”), and move on. Although the details are elusive, it does seem like this equation is relating the success of the discriminator on the training images to the success of the generator at fooling the discriminator. I think we can move forward with this understanding.

Fake Edit: While writing the next section, I looked up mathematical symbols on Wikipedia, and what do you know! Our weird E[...] is present, and it’s something like the “expected value”: a fancy kind of average. Like E[6-sided die] is 3.5 (think about it). Still not sure how to interpret (let alone explain) the log part of these equations, but that’s another piece of the puzzle.
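To make the “fancy kind of average” reading concrete, here’s a quick check of E[6-sided die], plus the way a term like E x~pdata[log(D(x))] gets estimated in practice: average it over a batch. The discriminator here is a made-up stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# E[6-sided die]: average outcome over a huge number of rolls -> about 3.5.
rolls = rng.integers(1, 7, size=100_000)
print(rolls.mean())  # ~3.5

# E_{x~pdata}[log(D(x))], estimated the usual way: average log(D(x)) over a batch
# of training images. Here D is a made-up function that spits out probabilities.
def D(x):
    return 1.0 / (1.0 + np.exp(-x))       # stand-in discriminator, outputs in (0, 1)

batch = rng.normal(loc=1.0, size=64)       # stand-in for a batch of real images
term_one = np.mean(np.log(D(batch)))       # first term of V(D, G)
print(term_one)                            # negative, since it's a log of numbers below 1
```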

(reads the second paragraph in section 3.1; right before section 3.2 starts)

Conditional GAN is an extension… [with] additional conditioning variables c, yielding G(z, c) and D(x, c). This formulation allows G to generate images conditioned on variables c.

Basically — we’re not just considering the training images and the noise “inspiration”, we’re also considering some additional information, represented as c. In our situation, c is the phrase used to describe the image. For D, it’s used to check whether the phrase matches the image; for G, it’s used as part of the inspiration for making the image. Seems sensible.
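As far as I can tell, “conditioned on c” mostly just means “c becomes another input”; here’s a hedged sketch of what that might look like, with all the sizes and names invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.normal(size=(1, 100))          # noise "inspiration"
c = rng.normal(size=(1, 128))          # conditioning variables, e.g. a text embedding
x = rng.normal(size=(1, 64 * 64 * 3))  # a (flattened) image

# G(z, c): the generator sees the noise AND the description.
generator_input = np.concatenate([z, c], axis=1)       # shape (1, 228)

# D(x, c): the discriminator judges the image AND whether it matches the description.
discriminator_input = np.concatenate([x, c], axis=1)   # shape (1, 12416)

print(generator_input.shape, discriminator_input.shape)
```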

(reads the first paragraph of section 3.2)

Stage-I GAN

Stage-I GAN is designed to… focus on drawing rough shape and correct colors

This matches the conclusions we drew earlier. The details get crazy, tho, as you can see in the next bunch of paragraphs, using crazy letters and phrases like “discontinuity in the latent data manifold”. The hell does that mean?

Let’s figure it out! I’m going to jump around the next bunch of paragraphs, puzzling this out — I recommend you skim up to “3.3 Stage-II GAN” so you have some idea what I’m talking about, but you don’t have to understand any of it yet.

(reads the next paragraph, then stares at the rest of section 3.2 for a bit)

So this bit:

conditioning text description t is first encoded by an encoder, yielding a text embedding 𝜑t

Remember the bit about variational autoencoders, and z-space? Basically, the computer doesn’t understand words, but it can represent the words in terms of something it does “understand”. That’s the “text embedding”, and it’s used as the c from just a minute ago. We just had to turn the information from words to something we can stick in the D and G functions.

latent space conditioned on text is usually high dimensional

discontinuity in the latent data manifold

So I googled around a little bit, and I’ve already got some ideas, and between those I have something of an explanation: Basically, the “latent space” is the picture on which c is some dot (a point), only you need waaaay more than X and Y (or even X, Y, Z, or those three plus time) to draw that picture. So it’s called a “space”. (From poking around, there’s some particular meaning to the “latent” in “latent space”, but we’ll leave that alone.) The “manifold” (as wikipedia kiiinda but not really clearly explains) is just a surface in that space (like the ocean surface, or crumpled sheets on a bed). It’s the possible “places” that the text can become, just like how the possible places you can draw on a bed sheet are the surface of the sheet. (You can’t draw above the surface; there’s nothing there to draw on. You’ve got to draw on the surface itself.) A discontinuity is a gap, like a tear in the bed sheets. There’s a lot more detail behind the concept of “discontinuity” (and “continuity”), but you can think of it like a tear in the sheets.

So speaking of discontinuities — I accidentally a couple month break from writing this. This next half might feel a bit disconnected from the first; that break would be why.

(actually reads section 3.2, up to “Model Architecture”)

This sentence seems important:

“We randomly sample latent variables from an independent Gaussian distribution… where the… diagonal covariance matrix [is a] function of the text embedding”

Looking up these terms: Latent variables are “hidden reasons for that thing to happen” — ‘hidden’ like how the central air AC is hidden, but you can tell it’s there and on because there’s cold air coming from the vents, even if you can’t see it from inside. Covariance matrix is “how related are this and that reason for the thing happening” (so: weather today and thermostat settings), and a Gaussian distribution is kinda “what do you usually get when you roll some dice a whole bunch”. So basically, my guess is: pick some somewhat random hidden reasons for why the words of the picture’s description show up when they do. And then they tell us why all that matters, which is basically: it’s a way to get more training out of the same data set. I think.

So then there’s this term: “Kullback–Leibler divergence” — and I’ve definitely never heard of that before. Let’s look it up!

This Quora question seemed the most helpful; and what I’m getting from that (and other sources) is that KL divergence is “how bad it is if I choose the wrong way to summarize this information”; in the case of StackGAN, it’s then (I think?) encouraging the neural net to pick a good way to internally summarize the incoming information.
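Putting those two ideas together, here’s a hedged numpy sketch of the conditioning-augmentation trick as I understand it: get a mean and a (diagonal) standard deviation from the text embedding, sample a slightly different conditioning vector every time, and use a KL term to keep that Gaussian from wandering too far from a standard one. The two little “layers” here are made-up stand-ins for whatever the real network learns.

```python
import numpy as np

rng = np.random.default_rng(0)

phi_t = rng.normal(size=128)            # the text embedding, phi_t
W_mu, W_sigma = rng.normal(size=(2, 128, 16)) * 0.1

# Made-up stand-ins for the learned layers: text embedding -> mean and std dev.
mu = phi_t @ W_mu
sigma = np.exp(phi_t @ W_sigma)         # keep the standard deviations positive

# Sample a conditioning vector: same text, slightly different c every time.
def sample_c():
    return mu + sigma * rng.normal(size=16)

print(sample_c()[:3])
print(sample_c()[:3])                   # differs -> "more training out of the same data"

# KL divergence from N(mu, sigma^2) to a standard Gaussian N(0, I):
# a penalty that keeps the summaries from drifting into weird corners of latent space.
kl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)
print(kl)
```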

Then there’s paragraphs of math that’s definitely going over my head, but I think we can get something out of it.

real image I0… text description t… z is a noise vector…

I can’t really translate the vague impression I’m getting from reading this into words, but maybe I can help you arrive at the same impression: Ultimately, the equation is an attempt to write down how this bunch of things relates. I don’t think this is something we can properly grok in this moment; what I do for this kind of thing is try to spot patterns of symbols. I might not understand any of this now, but maybe the next paper I read (or the one after that, or the one after that…) something will “click” and the patterns I’ve identified will start to gain real meaning.

This is also why some of the best advice I’ve ever gotten was “never stop taking math classes”.

(reads the “Model Architecture” sub-section)

On THAT note — the second paragraph in this section brings up the word “tensor”. I don’t actually know what that is, but since “tensorflow” is kiiiiiind of a Big Thing(tm), seems like a good thing to go learn about.

In mathematics, tensors are geometric objects that describe linear relations between geometric vectors, scalars, and other tensors…
…tensor can be represented as an organized multidimensional array of numerical values

(wikipedia)

HOLY SHIT. Like, that is both obvious, and with so many very not obvious implications! Matrices (if you didn’t already know) are kinda super duper awesome mathematical omnitools.

So — if you have one number, that’s a scalar. Like, how many bananas there are. If you have two (or three, or four…) numbers, that’s a vector; say, how many blocks north (the first number) and east (the second number) between here and say, the subway. The subway is then also some number of floors underground (the third number), and the train you’re trying to catch is some number of minutes away (the fourth number).

Matrices are then a group of vectors, like vectors are a group of scalars. The most common place I’ve seen them is in video game graphics mathematics; you have a matrix filled with information about the position, rotation, and scale of things. Position is a vector; rotation is a vector, scale (did you double the length of the thing?) is a vector; you can (and usually do) put them together into a matrix.

The really super neat thing about vectors and matrices (and thus tensors, being a collection of matrices) is that, by intelligently picking where you put each number (say, how many blocks north and east the subway is), you can then use some other math to learn more about those numbers (say, how far it is if there weren’t buildings in the way). So then people come up with rules for where in the vector/matrix you should put each kind of number, so that when you use the tools, what you learn actually corresponds to reality in some way.

So anyway. Tensors being groups of matrices is kinda simple, but with vast possibilities, and that’s super neat. (Matrices are kinda super awesome)
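In numpy terms, the scalar/vector/matrix/tensor ladder is literally just “how many indices do you need” (the shapes here are mine, just for illustration):

```python
import numpy as np

bananas = np.array(3)                   # scalar: a single number, 0 indices
to_subway = np.array([2, 5, -1, 4])     # vector: blocks north, east, floors down, minutes
transform = np.eye(4)                   # matrix: e.g. a 4x4 graphics transform
images = np.zeros((64, 256, 256, 3))    # tensor: a batch of 64 RGB images, 256x256

for thing in (bananas, to_subway, transform, images):
    print(thing.ndim, "indices, shape", thing.shape)
```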

The rest of this section (up to 3.3) looks to me like “how we connected the different parts of our neural net” (the hip bone connects to the thigh bone…).

Stage-II GAN

(reads up to “Model Architecture”)

This first paragraph is just repeating what we already know: Stage I sketches things out, Stage II fills in the details. Between that and the differences between those first equations in this block (#5 & #6) and the equations in the prior section, it looks like Stage II swaps out the random noise (the “muse”) that Stage I used, taking the rough sketch produced by Stage I instead — but otherwise the math is the same. The researchers are, however, careful to point out that the layers of Stage I and Stage II are separate, as they need different information from the picture’s description.

Indeed, if we look at the math equations (which are used to try to write down how all the different pieces of information relate), there’s only two differences. “Z”, the noise vector, is replaced with “S-zero”, the output of Stage-I; and “u-zero” is replaced with “u”; but I can’t find a description of what that term actually is :/

(reads “Model Architecture”)

And again, the rest of the section is about which components they used, and how those components connected. At this point I recommend looking back and forth between both “Model Architecture” sections, and “Figure 2”. One thing to notice is how many times the text phrase is injected into the network.

Implementation details

This looks extremely informative if you have any idea what it’s like to implement a neural net, which I actually do not have (yet!). Something to remember to come back to after doing some tutorials to actually implement these things.

Experiments

Okay, they’re using a couple datasets (one of birds, one of flowers); they’re comparing two different competing make-a-picture AIs to StackGAN; and they’re also looking at each chunk of the Stack-GAN network independently, as well as the whole. This is pretty neat, as it means they poked at it a bunch of different ways. (Poking things is very important to Science(tm))

Datasets and evaluation metrics

Okay, this section seems to be where “gotchas” can come up. It looks like they modified many of the images to emphasize the bird (and did not have to modify the flower images); this implies the technique might not work as well if the training images are too “zoomed out”. However, they also, apparently, randomly fuck with the images (crop and flip) and with the text, which strikes me as a method to prevent over-reliance on clean data; but I wouldn’t actually know. (Although, the method they used to “generate the corresponding text” seems important, and there isn’t much detail there.)
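The random crop-and-flip business is standard data augmentation; here’s a minimal sketch of what that might look like (the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop_size=64):
    # Random crop: grab a crop_size x crop_size window from a random spot.
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]

    # Random horizontal flip, half the time.
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return crop

image = rng.random((76, 76, 3))       # stand-in for a training image
print(augment(image).shape)           # (64, 64, 3), and slightly different every call
```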

Evaluation Metrics

Sounds like humans still win at this step, although there’s a contender in “inception score”; I don’t know what that is, but there’s a reference. Let’s see what we can learn from the reference! Title is… “Improved techniques for training GANs“. Google says… it’s this paper. Skimming the abstract… no mention of “inception score”; looking inside the paper… Hmm. A bit more informative, but the core equation is the one already in our paper:

I = exp( E_x[ D_KL( p(y|x) || p(y) ) ] )

(the “Inception Score”, apparently)

Oh! That D-k-l term is back! Apparently, that’s the Kullback–Leibler divergence; so, neat! We figured something out from the prior equation, filling a gap with something in the future. (This is why you shouldn’t be worried about not understanding everything to start with.)

What’s neat is that, apparently, the inception score doesn’t measure actual success — you still need humans for that, but it does measure whether the image is meaningful enough to possibly be successful. Which sounds to me like it’s a way for the computer to know if something has meaning, even if it has no idea what that meaning actually is. Which is pretty cool!
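For the curious, the formula seems to boil down to something like this, sketched in numpy with made-up class probabilities standing in for a real Inception network’s outputs:

```python
import numpy as np

# p(y|x): for each generated image x, a (made-up) predicted class distribution.
p_y_given_x = np.array([
    [0.90, 0.05, 0.05],   # confidently "class 0" -> meaningful-looking image
    [0.05, 0.90, 0.05],   # confidently "class 1"
    [0.34, 0.33, 0.33],   # the classifier has no idea -> mush
])

# p(y): the marginal class distribution, averaged over all generated images.
p_y = p_y_given_x.mean(axis=0)

# Inception score: exp of the average KL divergence between p(y|x) and p(y).
# Higher when each image is individually confident AND the set as a whole is varied.
kl_per_image = np.sum(p_y_given_x * (np.log(p_y_given_x) - np.log(p_y)), axis=1)
inception_score = np.exp(kl_per_image.mean())
print(inception_score)
```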

Quantitative and qualitative results

(looks over the images on the next couple pages)

TL;DR — It worked! And did better than the competing algorithms we selected (which I guess makes those the control groups).

But, these two pages are mostly images, so let’s take a moment to look over those and see what there is to notice.

So, interestingly — StackGAN seems to have some issue with the bird’s legs and the tree branches. I wonder if this is because, unlike the beaks, the legs aren’t described — ah, nope, the legs totally are. No clue why, then.

I also notice that the bird images are more “painting-esque” than the flowers; the flowers look suuuuuper realistic. I wonder if this is because flowers are more fractal in nature, filled with and made up of repeating patterns? Which makes me wonder if the birds would do better with a high-res image set, since then it could pick out the individual, repeated feather patterns…

I next notice that the Stage-I images and the GAN-INT-CLS images are pretty similar. Which makes me wonder if it’s possible to swap out the existing Stage-I system for the GAN-INT-CLS system…

The “nearest neighbor” figure is pretty neat; gives a good idea where the AI got its training from. Also, it’s pretty cool that you can just say “give me the images most similar to this one”, but AFAIK that is where Google Image Search got started in the first place, so makes sense.

Back to the text!

(reads section 4.2)

Stage-II GAN is able to correct the defects of Stage-I results by processing the text description again. For example, while the Stage-I image in the 5th column has a blue crown rather than the reddish brown crown

Okay, that’s cool.

Importantly, the StackGAN does not achieve good results by simply memorizing training samples… nearest neighbors from the training set can be retrieved [and inspected]… generated images have some similar characteristics with the retrieved training images but are essentially different

Oh! So there was a deeper purpose behind the nearest-neighbor check. That makes sense.

Component Analysis

(reads first two paragraphs)

Okay, here’s a particularly neat bit:

By decreasing the output resolution from 256×256 to 128×128, the inception score decreases from 3.70 to 3.5

all images are scaled to 299 by 299… therefore, if our StackGAN just increases the image size without adding more information, the inception score will stay the same… [demonstrating] that our 256 by 256 StackGAN does add more details.

So my read is basically: How do we know that the Inception Score isn’t just considering the image size? Well- take the image size out of the situation by scaling everything the same. Which is pretty neat, although IMHO they should have tried a few different scaling algorithms (cuz it’s actually not quite a simple thing to do), and see what happens then.

For the 128×128 StackGAN, if the text is only input at the Stage-I GAN (denoted as “no Text twice”), the inception score decreases from 3.35 to 3.13

Oh, neat! So, one of the core parts of Science(tm) is boiling things down to their essentials. A good way to do this is to remove stuff and see what happens; if nothing changes, the thing you removed wasn’t important. If something does change (which it did here), that thing is important.

(reads “Conditioning augmentation”)

While all inputs of StackGAN are fixed, conditioning augmentation samples different conditioning variables from the Gaussian distribution controlled by the text embedding.

This sentence is causing me some confusion. It helps a bit to realize the “samples” in the sentence is a verb, not a plural noun; but I’m still lost. So, I went looking back through the paper for “conditioning augmentation” and “conditioning variables”, and found a hit back in section 3.1; apparently, each time they sample from the Gaussian, they sample randomly.

I also need a refresher on “latent manifold”; looking back through the paper, I think that’s how the information from the text (“this bird is…”) gets encoded in a way the AI can access. So what this means, I think, is a) there’s more than one way for the system to interpret the text and b) there’s another source of randomness besides the noise vector; I got this by looking back at Figure 7, which demonstrates what’s going on here.

(reads “Sentence embedding interpolation.”)

Okay, so now they’re freezing the noise vector, but changing the sentence; basically seeing how the system responds to a different request, but with the same “inspiration”. Then, they change the sentence a bit at a time into another sentence, seeing what images pop out; and what pops out is a gradually changing image. This is pretty neat!

“Smooth latent data manifold” means, I think, that there’s no discontinuity (in the math sense) in the system’s understanding of sentences; this basically means a) “you can get there from here” and b) “you can get halfway between here and there”, or 1/4, or 1/8, etc…
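The interpolation itself is just linear blending between the two sentence embeddings, something like this (the embeddings here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

emb_a = rng.normal(size=128)    # embedding of "this bird is black with green..."
emb_b = rng.normal(size=128)    # embedding of some other sentence

# Walk from one sentence's embedding to the other's in small steps;
# if the latent manifold is smooth, each step should yield a sensible image.
for t in np.linspace(0.0, 1.0, 5):
    blended = (1 - t) * emb_a + t * emb_b
    print(f"t={t:.2f}", blended[:3])
```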

We do observe that StackGAN prefers to generate a simple image without parts and details if the text descriptions are too simple.

Oh, neat! This reminds me of that thing where if you have a name for a color, you can better see that color, and if you don’t have a name for it, it’s harder to see. Read more on that here, among other places.

Conclusions

Oh my god we made it! The end! Okay, let’s see what the conclusions are…

(reads paragraph)

…yep, pretty simple to understand now. Let’s paraphrase:

“We made a system that first makes a sketch from the words, and then fills in the details and corrects any mistakes in the sketch. We tried measuring how well it worked a bunch of different ways, and it worked better than existing methods.”

Success!

Wait… (scrolls past “References”)

Supplementary Materials

Oh! More pictures to look at!

(looks at pictures for awhile)

I notice that some of the bird pictures are worse than others, but it’s hard for me to determine any kind of pattern. I mostly want to say that the quality of the Stage-I image has a heavy effect on the Stage-II image, but that’s hard to quantify.

I notice the flowers are pretty much all excellent. I wonder if that’s because this method works better with flowers than birds, or because of differences in the data sets…?

Failure Cases

NEAT! It’s real important to go over how things don’t work. Otherwise, how do you know when not to use something? Or what to improve about it?

The main reason for failure cases is that Stage-I GAN fails to generate plausible rough shapes or colors of the objects.

Hmm. Makes me wonder if there’s a way to add a check against this? Maybe something like the Inception Score as a filter between Stage-I and Stage-II?

So! We did it! We read the whole bloody thing and put in some serious work to (or, at least attempting to) understand every part. Go us!

So the original idea of all this was to show people how I approach something I know very little about, and that you can, in fact, totally jump in the deep end and come out with understanding. Mostly because there’s so much information already present on the Internet, you can just go look up what you don’t understand; sure, it might lead down a rabbit hole and take forever, but you’ll eventually succeed at figuring it out.

I think this mostly worked (from my perspective), and while I’d like to do it again, going for the literal “read along” doesn’t seem like the best choice. Suggestions welcome!

  • Ranger Science

PS — I set up a Patreon! Seriously, don’t put in much if you put in anything, but it’s probably the best way to show that I should definitely do this again… although I’m definitely going to go and do this again anyway.
