A journey through multiple dimensions and transformations in SPACE
This is an expanded transcript of a guest lecture I gave at the School of Ma in Berlin, in July 2016, during Gene Kogan’s “Machine Learning for Artists” course. As such, this talk aims to give a very high-level, very particular perspective on ‘how Deep Learning works’, and is aimed at artists and creative folk who may or may not be well-versed in maths, computer science or machine learning (though it does assume the most basic knowledge of what an artificial neural network is). Many thanks to Mike Tyka and Jamie Ryan Kiros for taking the time to provide valuable feedback on this post. In addition to my general research in this field, some of the ideas below took shape whilst I was resident at Google’s Artists and Machine Intelligence program in Spring 2016. I’m finishing up two more (much shorter) posts regarding the specific work I made there, which will hopefully be online shortly.
Since this post ended up being quite long (~1 hour read!), I’ve listed the sections and approximate durations below.
1. Introduction. 1327 words, 5 minutes
2. Space: The Final Frontier. 1095 words, 4 minutes
3. Pixel space. 1305 words, 5 minutes
4. Manifolds. 1715 words, 6 minutes
5. Features and Representations. Observable vs Latent. 1406 words, 5 minutes
6. Introduction to Latent Space with Eigenfaces. 2700 words, 10 minutes
7. Face synthesis with (Variational) Auto-Encoders. 1412 words, 5 minutes
8. Geometric operations on faces in DCGAN space. 1006 words, 4 minutes
9. Latent Anything. 208 words, 1 minute
10. Word Embeddings. 1304 words, 5 minutes
11. Visual-Semantic Embeddings. 583 words, 2 minutes
12. Thought Vectors. 132 words, 1 minute
13. Neural Story Teller. 1204 words, 5 minutes
14. Meta-conclusion (Deepdream, Neural Style Transfer). 703 words, 3 minutes
15. Conclusion. 746 words, 2 minutes
In this talk I’m going to briefly cover quite a few machine learning techniques, old and new — such as PCA, eigenfaces, auto-encoders, DCGAN, word embeddings (e.g. word2vec), thought vectors, visual-semantic alignment (i.e. caption generation for images, or image generation from text), neural story-teller, deepdream, neural style transfer etc. But I’m not going to talk about how they work on an implementation, nuts and bolts level. Instead I’m going to provide a high-level framework for conceptualising all of these operations, which will hopefully provide some insight with which you’ll be able to visualise these otherwise complicated, mathematical and rather opaque procedures in a more intuitive way. The core of this approach is to think of all of these operations as journeys through multiple dimensions and transformations in space.
To actually implement these algorithms, it is of course essential to understand the lower level nuts and bolts, the maths, the architectures, training algorithms, hyper-parameters etc. But given a decent implementation and API, I don’t think it is necessary to understand the lower level nuts and bolts just to use the algorithms. Having a high level understanding of the concepts, such as the one I hope to outline today, may be enough. Furthermore, I don’t think innovation or interesting new work is necessarily going to come from playing around at just the low, nuts and bolts level. It’s more likely to come from thinking about interesting, fresh and ‘simple’ high-level conceptual approaches — and then trying to figure out how to implement them.
E.g. Some of the things I’ll be talking about:
1.1 A crash course in Deep Learning
A popular simplified description of Machine Learning is to think of it as an algorithm which learns to map an input to an output based on example data. E.g. an algorithm learns to map pixel values (i.e. an image) to a ‘cat’ label or a ‘dog’ label; or a particular gesture produces a desired sound; a chunk of text produces more text in the same style; a specific state of a Go board produces the ‘optimal’ move from that state; signals from an autonomous car’s sensors produce the ‘optimal’ response in steering and braking etc.
I want to talk about a slightly different approach to thinking about these techniques, a higher level approach. An approach which by no means did I invent, but I would like to emphasise. But first I will talk a little about Deep Learning with the more ‘traditional’ narrative.
1.2 Blocks of functions
In Machine Learning, when we say ‘the algorithm is learning’, what we mean is the algorithm is learning a function. And Deep Learning is called ‘Deep’, because it has many layers, and every layer in a deep network has its own function. So Deep Learning isn’t about just learning one function, it’s about learning many many functions. And the deep network in its entirety is one very complicated, highly non-linear, composite function that nests the layer functions inside each other, sometimes with feedback or other complicated bi-directional connections.
There are so many ‘nodes’ in these networks nowadays that no one really tries to visualise them as multi-million-node-networks. Yes they are ‘neural networks’, but it’s often easier to think of them as just sequences of functions: ‘high-dimensional’ functions that can take lots of numbers (i.e. vectors, matrices, tensors, more on this later) in one go. E.g. 1,000,000 numbers go into the first function (e.g. pixels of an image), and get transformed into a new 2,000,000 numbers on the second layer. Those get fed into another function and get transformed into another 2,000,000 numbers on the 3rd layer. Those get fed into another function and get transformed into 500,000 numbers etc. All the way to the last layer, the output, where the last function takes say 5,000 numbers and outputs 1,000 numbers, which specify the probability that the image is a cat or a dog or a submarine or 997 other categories. What do the intermediate functions or numbers represent? It totally depends, and no one can say anything for certain, but it is a very hot area of research, and it’s the main topic of my talk today.
So it’s often easier to think of a deep network as a kind of flowchart (technically called a graph). Each ‘layer’ of a deep neural network, each potentially with hundreds or thousands of neurons, becomes a single ‘block’, a high-dimensional function, operating on lots and lots of numbers in one go (i.e. vectors, matrices, tensors). And data just flows through this graph.
At every layer a function takes the big list of numbers from the previous layer and transforms them into a new list of numbers, a new representation, and sends it on along the graph.
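To make the ‘blocks of functions’ picture concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the layer sizes are tiny stand-ins for the millions of numbers mentioned above, the weights are random rather than learned, and the names `make_layer` and `network` are made up for this example.

```python
import random

def make_layer(n_in, n_out, rng):
    """Return a function that maps a list of n_in numbers to a list of n_out numbers."""
    # Random weights stand in for the values a learning algorithm would normally find.
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

    def layer(values):
        # Each output is a weighted sum of all inputs, passed through a simple
        # non-linearity (negative sums are clipped to zero).
        return [max(0.0, sum(w * v for w, v in zip(row, values))) for row in weights]

    return layer

rng = random.Random(0)
sizes = [16, 32, 32, 8, 3]  # toy sizes, far smaller than a real network
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

def network(values):
    """The whole network is just the layer functions applied one after another."""
    for layer in layers:
        values = layer(values)
    return values

pixels = [rng.random() for _ in range(sizes[0])]
output = network(pixels)
print(len(pixels), "numbers in,", len(output), "numbers out")
```

Each call to `layer` takes one list of numbers and hands back a new list, possibly of a different length, which is all that ‘a new representation’ means at this level.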
The learning algorithm’s job is to figure out what these functions are. These functions are parameterised with potentially millions, if not billions of parameters — the weights of the network. The learning algorithm tries to learn those functions — i.e. the millions or billions of parameters — based on the constraints that we give it:
- the training data — sure training data is ‘out there’, and all we have to do is collect it. But the data that we feed the learning algorithm will completely shape the outcome. It’s our responsibility to make sure the data is suitable — whether we have sufficient data or not, whether it’s biased, whether we have to pre-process or augment it somehow etc.
- architecture and hyper-parameters — the architecture of the network (i.e. the ‘graph’ of functions), and other settings will totally affect how the learning algorithm behaves and whether it will be ‘successful’ or not.
- I mention ‘successful’ learning, but what does it even mean to have a ‘successful’ learning? We obviously want our model to ‘learn’ from our training data, and make predictions that are very ‘close’ or ‘similar’ to it. But what exactly does ‘close’ or ‘similar’ data mean in our specific context? It’s the objective (aka cost aka loss) function that we define which is effectively us telling the algorithm what we value. It’s what we want the learning algorithm to try and learn, what the learning algorithm’s motivation is. The decision to use L1 norm, mean-squared, cross-entropy or many others or a custom variant or something completely bespoke, and what exactly we feed into the cost function is our decision and will affect what and how the network tries to learn.
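As a rough illustration of how different objective functions put a number on ‘close’, here is a small Python sketch of three of the options named above. The function names and the example predictions are made up for this illustration, and the L1 variant is averaged over the elements (mean absolute error).

```python
import math

def l1_loss(predicted, target):
    """Mean absolute difference (the L1 norm of the error, averaged)."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def mean_squared_error(predicted, target):
    """Mean of the squared differences; punishes large errors more heavily."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def cross_entropy(predicted_probs, target_probs, eps=1e-12):
    """Cross-entropy between two probability lists; eps avoids log(0)."""
    return -sum(t * math.log(p + eps) for p, t in zip(predicted_probs, target_probs))

# Hypothetical example: the true label is the first of three categories.
target = [1.0, 0.0, 0.0]
good_guess = [0.8, 0.1, 0.1]
bad_guess = [0.1, 0.8, 0.1]

for name, loss in [("L1", l1_loss), ("MSE", mean_squared_error), ("cross-entropy", cross_entropy)]:
    print(name, round(loss(good_guess, target), 3), round(loss(bad_guess, target), 3))
```

All three agree that the good guess is closer than the bad one, but they disagree on how much worse the bad guess is, and that difference is exactly what shapes what the network tries to learn.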
1.4 Journeys through space and multiple dimensions
So there’s a lot of painstaking tweaking required, to delicately fine-tune all of these constraints so that the algorithm can learn the functions as desired.
I’m not going to talk about any of that process. That’s what most of the papers and books and tutorials talk about.
I’m going to talk about a much higher level view, about journeys through multiple dimensions and transformations in space, leaping from manifolds to manifolds. Because I think it’s way more fun, and perhaps equally important.
The key point is that any piece of complex data, whether it’s an image, a word, sentence, thought, sound, gesture, molecular structure, gene, state of a Go board, signals from an autonomous car’s sensors etc. can be thought of as a high dimensional vector, a single point in a high dimensional space. And then we set a bunch of algorithms loose to learn many new representations of that data, to learn transformations into different spaces with different dimensions. And our key role in this, before we set the constraints and boundaries in which we allow the ‘learning’ to happen, should be to decide what those spaces are: which spaces and dimensions we want to be able to transform between.
2. Space: The Final Frontier
So I’d like to start with space.
When we think of space, we might think of something like this:
Or perhaps this:
Or even this:
All of these ‘spaces’ are the same kind of space: three-dimensional Euclidean space. That’s a fancy name for our ‘normal’ kind of space, with three ‘spatial’ dimensions related to distance (e.g. length, width, height).
In Maths the word ‘space’ is much broader and there’s tons of other types of spaces, likewise in physics and other sciences. In computational sciences, we might find pixel space, search space, solution space, state space, latent space, semantic space, embedding space etc. I’m going to talk about a few of these that come up in Machine Learning, and especially focus on what it means to transform between these different spaces, and different dimensions.
2.1 A Broader Definition of Space
In very non-technical terms, ‘space’ simply defines all of the possible options that a particular thing can be.
E.g. if you want to go to dinner with your friends, and you’d like Italian food, and you don’t want to cycle for more than half an hour: then your solution space is simply the list of Italian restaurants within half an hour cycle. If you want it to cost less than a certain amount, then you need to go through each of those Italian restaurants and check the prices. So your search space becomes the list of Italian restaurants within half an hour cycle. And your solution space becomes the list of Italian restaurants within half an hour cycle, which fit your budget.
This is basically just terminology. And it might sound very obvious, but it builds up to something very powerful.
One thing which is very important to understand about these kinds of spaces, is that while our real world ‘spatial’ space is three dimensional (i.e. 3D), these spaces can be any number of dimensions (i.e. n-D). E.g. when I talk about ‘a list of restaurants’, that’s a one dimensional (1D) space — because a list is one-dimensional, you can move along a list in only one direction. A paper map is 2D, because we can move in two directions (to be more accurate, the space can be defined by two ‘orthogonal’ directions, i.e. 2 directions which don’t overlap in any way, e.g. left-right vs up-down). Similarly, our world is three-dimensional (3D), because we can move in three orthogonal directions. However we will be dealing with spaces with many more dimensions, tens, hundreds, thousands, even millions or billions of dimensions.
2.2 Summary of 3D space
First I’d like to summarise a few things about 3D space.
This is 3D space. By convention we have arbitrarily called the axes x,y and z, but they could have been called anything. If we have three numbers — any three numbers — we can plot a point in 3D space. We treat those three numbers as measurements along each axis. E.g. to plot the point (2, 4, 7) we would move 2 units along the axis we call x, then move 4 units along the axis we call y, then move 7 units along the axis we call z. And we plot a little point there, at that location.
We can also do the reverse. If we have a point (or location) in 3D space, we can measure its distance along the x-axis, its distance along the y-axis, and its distance along the z-axis. And those are the point’s coordinates. Those three numbers represent that point. (For the more technically inclined, we can also call these projections on each axis, which is equivalent to taking the dot product between the point vector and a unit vector along each axis.)
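For the technically inclined, the projection idea can be sketched in a few lines of Python. The helper `dot` and the axis names are just illustrative choices for this example.

```python
def dot(a, b):
    """Dot product: multiply matching entries and add up the results."""
    return sum(x * y for x, y in zip(a, b))

# Unit vectors pointing along each axis.
x_axis, y_axis, z_axis = (1, 0, 0), (0, 1, 0), (0, 0, 1)

point = (2, 4, 7)

# Projecting the point onto each axis recovers its coordinates.
coordinates = tuple(dot(point, axis) for axis in (x_axis, y_axis, z_axis))
print(coordinates)
```

Nothing in `dot` depends on there being exactly three numbers, so the same function works unchanged for longer lists.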
So a point in 3D can be represented as a list of 3 numbers, its coordinates, e.g. (2, 4, 7). This is also called a vector. A vector is often defined as something ‘with a magnitude and a direction’, e.g. ‘velocity’ or ‘position’ (relative to an origin). In some programming languages (e.g. C++) a ‘vector’ is a ‘list’, or an ‘array’ of something. Both these definitions are different ways of saying the same thing (NB. artists / designers might also think of vector images or graphics, like Adobe Illustrator files. They’re actually called vector images because they store the coordinates — i.e. vectors — of the points, lines, shapes etc. as opposed to storing pixels).
Whenever I say vector, simply think of it as a list of numbers.
The length of this list can vary. A vector’s length (i.e. how many numbers are in the list) determines its number of dimensions. E.g. (2, 4, 7) is a 3D vector. So is (3.6, -6.3, 7.49).
(1.3, -51.8) is a 2D vector.
(5.24, 7.61, 255.4, -69.674, 5.6, 2) is a 6D vector.
And most importantly, an n-dimensional vector is simply a point (i.e. a location) in n-dimensional space. A 3D vector is a point in 3D space, a 100D vector is a point in 100D space, a 2000D vector is a point in 2000D space.
An n-dimensional vector is simply a list of n numbers, which is a point (i.e. location) in n-dimensional space.
2.4 Mental Visualisations of High Dimensional Spaces
As I mention 100D space, 1000D space, you may be wondering what that looks like, or even means. Mathematically, a lot of the concepts and operations which we can do in two or three dimensions, we can easily do in any number of dimensions. We don’t always call the axes x,y,z. We just call them whatever we want to call them, whatever the features of the space are. For a 4D space, we could call them monkeys, donkeys, dolphins and aubergines if that’s what defines our problem.
But first I should give some advice on how to think about or mentally visualise high dimensional spaces. It’s not easy. Geoff Hinton, one of the superstars of Deep Learning, has a tip for imagining high dimensional spaces, e.g. 100D. He suggests you first imagine your space in 2D or 3D, and then shout ‘100’ really really loud, over and over again. That’s it, no one can mentally visualise high dimensions. They only make sense mathematically. (NB. I’ll talk about dimension reduction for visualisation like t-SNE in a bit).
So in the upcoming sections I’m going to demonstrate the concepts in 2D or 3D, and you have to imagine that it’ll kind of look or behave similar in higher dimensions (though our 2D/3D intuition can sometimes misguide us in very high dimensions!).
Mentally visualise in 2D or 3D. Think n-D
3. Pixel Space
For example I’ll start with pixel space. Pixel space is what we would call a space in which the dimensions (i.e. features) of the space are pixels. If this makes no sense don’t worry, examples should hopefully make it clearer.
Let’s take a 32 x 32 pixel black & white image (i.e. each pixel can be either black or white, but not grayscale).
There are 32 * 32 = 1024 total pixels in an entire image.
The total number of possible images can be found by:
- 2 * 2 * 2 * … (multiplied as many pixels in the image, i.e. 1024 times) =>
- 2¹⁰²⁴ =>
- ~1.8 * 10³⁰⁸
That’s a lot of possible images that can be generated with only 32 x 32 BW pixels. (compulsory comparison: that’s way larger than 10⁸⁰, the estimated number of atoms in the universe)
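The count is easy to check, since Python integers can be arbitrarily large. This is just a quick sanity check of the arithmetic, with variable names chosen for this example.

```python
pixels = 32 * 32                  # pixels in one image
possible_images = 2 ** pixels     # each pixel is either black or white

print(pixels)
print(len(str(possible_images)), "decimal digits")
print(str(possible_images)[:4])   # leading digits of the count
print(possible_images > 10 ** 80) # compare with the atoms-in-the-universe estimate
```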
3.1 Every Icon
This is an artwork called “Every Icon” by John F Simon Jr, from 1997. You can see it in action here (it has nothing to do with Machine learning). It’s systematically going through every possible combination of pixels in a 32 x 32 grid, all 10³⁰⁸ of them. Most of these combinations of pixels will be some kind of noise, but it will eventually stumble upon an image of me, an image of you, an image of you and me riding a unicorn, an image of you and me riding a unicorn with our credit card details overlaid on top etc. Basically every single image that can be conceived in a 32 x 32 BW image. Assuming it generates 100 images per second, it will take about 1.4 years to go through every possible combination of just the first row (2³² combinations). It will take another 6 billion years or so to get to the end of the second row. How long will it take to reach the end of the bottom row (i.e. generate every single possible 32 x 32 BW image)? If I’m not mistaken that’s on the order of 10²⁹⁸ years. That’s a very very long time (about 10²⁸⁸ times longer than the age of the universe).
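These durations can be worked out with a few lines of Python. The sketch assumes 100 images per second as stated, a year of 365.25 days, and an age of the universe of roughly 13.8 billion years (that last figure is an assumption added for this example). The function name is made up.

```python
import math

IMAGES_PER_SECOND = 100            # rate assumed in the text
SECONDS_PER_YEAR = 31_557_600      # 365.25 days
AGE_OF_UNIVERSE_YEARS = 13.8e9     # assumption: a commonly quoted estimate

def years_to_exhaust(rows, width=32):
    """Years needed to show every combination of the first `rows` rows."""
    combinations = 2 ** (rows * width)
    return combinations / (IMAGES_PER_SECOND * SECONDS_PER_YEAR)

first_row = years_to_exhaust(1)
two_rows = years_to_exhaust(2)
all_rows = years_to_exhaust(32)

print(f"first row: {first_row:.2f} years")
print(f"first two rows: {two_rows:.3g} years")
print(f"all 32 rows: about 10^{math.floor(math.log10(all_rows))} years")
print(f"that is about 10^{math.floor(math.log10(all_rows / AGE_OF_UNIVERSE_YEARS))} "
      "times the age of the universe")
```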
But here’s the amazing thing:
We can think of each possible 32 x 32 BW image (and there are 10³⁰⁸ of them) as a single point (or location) in a 1024 dimensional space. A tiny little single speck of a point.
And this artwork is systematically exploring every location in that 1024D space.
We can unwrap the pixel values of a 32 x 32 image into a single list of 1024 numbers. (NB. It doesn’t matter how we do this. We could unwrap the image row by row, or column by column, or whatever. As long as we’re consistent when unwrapping from 32 x 32 -> 1024, and back from 1024 -> 32 x 32).
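Here is a minimal sketch of unwrapping and wrapping in Python, going row by row. The image is random noise generated just for the example, and `unwrap` and `wrap` are illustrative names.

```python
import random

SIZE = 32
rng = random.Random(1)

# A random 32 x 32 black & white image, stored as 32 rows of 0/1 pixel values.
image = [[rng.randint(0, 1) for _ in range(SIZE)] for _ in range(SIZE)]

def unwrap(img):
    """Flatten the grid row by row into one long list of pixel values."""
    return [pixel for row in img for pixel in row]

def wrap(vector, size=SIZE):
    """Cut the long list back into rows of `size` pixels each."""
    return [vector[r * size:(r + 1) * size] for r in range(size)]

vector = unwrap(image)
print(len(vector))            # how many numbers describe the image
print(wrap(vector) == image)  # wrapping undoes unwrapping
```

Unwrapping column by column would work just as well, as long as `wrap` reverses exactly the same ordering.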
After we do this, we have a 1024D vector — a list of 1024 numbers. And remember that any point in 3D space, represents three numbers: x,y,z (analogous to width, length, height etc), it’s a 3D vector. Similarly any point in 1024D space, represents 1024 numbers, it’s a 1024D vector.
So this ‘Every Icon’ artwork, by systematically exploring every combination of 32 x 32 pixels, i.e. every combination of 1024 numbers, is systematically travelling around a 1024D space.
3.2 1024D Pixel Space
This is actually a 1024D pixel space.
We will call each axis (i.e. each dimension, i.e. each feature of this space), pixel1, pixel2, pixel3… all the way up to pixel1024, so each axis corresponds to a pixel in a 32 x 32 image. (NB. Which axis corresponds to which pixel doesn’t really matter, as long as we are consistent. It depends on the way we ‘unwrapped’ the image in the previous section).
If we have any 32 x 32 BW image, we can simply go through each pixel of the image, read the pixel value and ‘move’ that amount along the axis which corresponds to that pixel. I.e. Read value of pixel 1 and ‘move’ that much along axis 1, read value of pixel 2 and ‘move’ that much along axis 2, read value of pixel 3 and ‘move’ that much along axis 3, etc. Once we’ve been through all 1024 pixels, we will be at some location in 1024D space.
That single point (or location) represents our image.
Likewise we can go in reverse, just like we did in 3D. If we have a point in 1024D space, we can measure its ‘distance’ along each axis (for the technically inclined, that’s a dot product between the vector and each axis), and we get its 1024 coordinates. These are the 1024 pixel values. We just need to ‘wrap’ those 1024 values back into a 32 x 32 grid, and hey presto we have an image.
i.e. coordinates == pixel values ( == representation)
This might all sound rather abstract and boring. But it is a really fundamental and important concept. It’s simultaneously very simple, but also complex. So if you’re finding it simple and obvious, I apologise for going on about it for so long. And if you’re not finding it simple, but rather complicated, that’s also understandable, as it is quite an abstract cognitive leap, especially if you’re wrestling with how to mentally visualise 1024D. Don’t.
Mentally visualise in 2D or 3D. Think 1024D.
And this is why it’s interesting to look at it this way: The geometric, or mathematical operations that we can do in 2D and 3D, we can do in any D, and the maths and the code are almost identical (it’s worth noting that operations in higher dimensions take longer and require more memory of course, but otherwise it’s mostly the same concepts and code).
3.3 Operations in Pixel Space
Imagine that we have two points in this 1024D pixel space, which correspond to two images. One point is a 32 x 32 BW image of my face, and another point is a 32 x 32 BW image of your face. What happens if we measure the distance between the two points in 1024D? Or what happens if we calculate the midpoint of those two points? We will have a new location in 1024D space, which will correspond to a new image. What will that image be of? What if we have 10 points (i.e. 10 images), and we find the midpoint of all of those points? We will have a new location, i.e. a new image. What will that image be of?
Now these operations I just mentioned will be done on an element by element level. And because we’re operating in pixel space, these operations will be done pixel by pixel. I.e. the midpoint between two points (i.e. two images) will be the mathematical equivalent of mixing the images pixel by pixel, i.e. the pixel average. Which is the same as just mixing images in Photoshop (via opacity). So we won’t get terribly exciting or interesting results.
However, even though the results aren’t particularly ground breaking, it still makes sense to think of the operations as ‘(p1 + p2) / 2’ or ‘|p1 - p2|’ etc. (where p1 and p2 are points in this 1024D pixel space, i.e. images).
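To show that the same code really does work in any number of dimensions, here is a small Python sketch of ‘(p1 + p2) / 2’ and ‘|p1 - p2|’. The function names are illustrative, and the distance used is the ordinary straight-line (Euclidean) one.

```python
import math

def midpoint(p1, p2):
    """Element-by-element average: (p1 + p2) / 2."""
    return [(a + b) / 2 for a, b in zip(p1, p2)]

def distance(p1, p2):
    """Straight-line distance |p1 - p2|."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

# The same two functions in 2D...
print(midpoint([0, 0], [3, 4]), distance([0, 0], [3, 4]))

# ...and in 1024D pixel space, with an all-black and an all-white image.
black = [0] * 1024
white = [1] * 1024
grey = midpoint(black, white)
print(grey[:4], distance(black, white))
```

The midpoint of the black and white images is a uniform mid-grey, which is the same 50% opacity mix an image editor would give.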
More importantly, we can do these types of operations in other high dimensional spaces which aren’t pixel space. And that’s when it gets really interesting, and that’s what the rest of this talk will be about.
But before that, what happens if we draw a circle in this 1024D pixel space, and sample a bunch of points along that circle? Each of these sampled points will in effect be an image. Will we find anything special about these particular images? Any interesting relationships between them? Or what if we don’t draw a circle, but some other shape? In fact what does it even mean to ‘draw a circle’ in 1024D?
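One possible reading of ‘drawing a circle’ in 1024D, sketched in Python: pick a centre, pick two directions at right angles to each other, and sweep an angle around the flat plane those two directions span. Everything here is an assumption made for illustration (the centre, the random directions, the radius and the function names), and the sampled points have fractional values, so they are not strictly black & white images.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale a vector so that its length is exactly 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def circle_points(centre, u, v, radius, count):
    """Sample `count` points on centre + radius * (cos(t) * u + sin(t) * v)."""
    points = []
    for k in range(count):
        t = 2 * math.pi * k / count
        points.append([c + radius * (math.cos(t) * a + math.sin(t) * b)
                       for c, a, b in zip(centre, u, v)])
    return points

rng = random.Random(2)
N = 1024
centre = [0.5] * N  # hypothetical centre: the uniform mid-grey image

# Two random directions, made perpendicular by removing from the second
# its component along the first (one Gram-Schmidt step).
u = normalise([rng.gauss(0, 1) for _ in range(N)])
raw = [rng.gauss(0, 1) for _ in range(N)]
along_u = dot(raw, u)
v = normalise([r - along_u * a for r, a in zip(raw, u)])

points = circle_points(centre, u, v, radius=3.0, count=8)
print(len(points), "points, each a list of", len(points[0]), "numbers")
```

Every sampled point sits at the same distance from the centre, exactly as on a circle drawn on paper; the only difference is that the ‘paper’ is a flat 2D sheet floating inside 1024D.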
4. Manifolds
This is a good excuse to take a quick diversion into manifolds. I’m not going to spend too much time on this but I’d like to introduce the concept as it’s vital to understand for the rest of this talk. (A great read on this subject, and a bit more mathy, is the ever insightful Chris Olah’s post).
‘Manifold’ basically means ‘shape’ or ‘surface’. There are some constraints as to what kind of a shape constitutes a manifold, but for the purposes of this discussion, we can call a manifold the generalisation of a shape or surface in any number of dimensions.
4.1 In 3D
Usually we talk of a lower dimensional manifold embedded in a higher dimensional space. This may sound complicated, but it’s actually quite simple. The surface of a piece of paper is two dimensional. I can take a piece of paper and crumple it up. The piece of paper itself occupies three-dimensional space, but its surface is still two dimensional. It’s a two dimensional manifold embedded in three dimensional space.
(Note that we can potentially have infinitely many different 2D manifolds embedded in 3D space, the same way that we could find infinitely many different ways of crumpling up a piece of paper).
Likewise the Earth is a 3D object. It’s a squashed sphere slightly bulging at the equator, flying around in a 3D universe (ignoring the whole fabric of 4D space-time continuum for now). It’s also all curved and wrinkled up even more in 3D with mountains, valleys, hills etc. But the surface of the Earth can actually be thought of as 2D. It’s also a two dimensional manifold embedded in three dimensional space.
When you snowboard down a mountain, you’ll be moving through the universe in 3D, but your trajectory can be thought of as 2D in the space of the mountain’s surface (assuming you don’t get any air). One could ‘unwrap’ the surface of the mountain and make it ‘flat’, and plot your trajectory in 2D. We can also talk of a transformation, that transforms (or maps) your trajectory between these two spaces.
This is perhaps easier explained with the crumpled paper example. If there is a drawing on our crumpled piece of paper, e.g. a drawing of a house, that drawing would have a path (i.e. representation) in 2D. This is the original drawing on the flat piece of paper. It would also have a path (i.e. representation) in 3D, in the space in which the crumpled paper lives (i.e. is embedded). There is a transformation (i.e. a function) which maps between these two spaces. E.g. the door of the house in the picture has a location (i.e. a point, coordinates, representation) on the surface of the paper in 2D. It also has a location (i.e. a point, coordinates, a representation) in 3D, in the universe in which the crumpled piece of paper lives.
A Transformation is a function which maps from one space to another.
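As a toy example of such a transformation, here is a Python sketch that maps between coordinates on a sheet of paper (2D) and coordinates in the room (3D). A crumpled sheet is hard to write down, so the sketch assumes the paper is simply rolled into a tube; the radius, the door position and the function names are all made up for illustration.

```python
import math

RADIUS = 1.0  # hypothetical radius of the rolled-up paper tube

def paper_to_world(u, v):
    """Map a 2D position on the paper to a 3D position on the rolled tube."""
    angle = u / RADIUS  # distance along the paper becomes an angle around the tube
    return (RADIUS * math.cos(angle), RADIUS * math.sin(angle), v)

def world_to_paper(x, y, z):
    """Inverse transformation: recover the 2D paper position from the 3D one."""
    angle = math.atan2(y, x) % (2 * math.pi)
    return (angle * RADIUS, z)

door_on_paper = (1.5, 0.25)  # where the door is drawn on the flat sheet
door_in_world = paper_to_world(*door_on_paper)
door_back_on_paper = world_to_paper(*door_in_world)

print(door_in_world)
print(door_back_on_paper)
```

The door has two coordinates on the paper and three in the room, and the two functions translate between those representations of the same point.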
And this is the basis of everything in my talk today.
4.3 Higher dimensions
The same concept translates to higher dimensions.
We can talk about a 2D manifold embedded in 5D space, or a 20D manifold embedded in 1024D space. What exactly does this mean? It means that in a 1024D space, there exists a 20D surface, like a 20 dimensional crumpled piece of paper or mountain range.
And remember that in pixel space, each point (i.e. location), is an image. So perhaps there is a manifold (i.e. crumpled piece of paper) in our 1024D pixel space, such that every point on this manifold is an image of a face. Or perhaps there is a different manifold such that every point on that manifold is an image of a cat. Or another manifold which corresponds to images of dogs. Or perhaps yet another manifold which corresponds to images of animals. This manifold would of course include the previous two manifolds. I.e. the manifold of all possible animal images could be thought of as the Himalayas; the manifold of all possible cat images is the Ladakh Range (a submanifold of animals / Himalayas); and all possible dog images the Mahalangur range (a different submanifold of animals / Himalayas). Then perhaps the manifold of all possible images of my dog Ruby is Mt Everest (a submanifold of dogs), and this pic would be one single particular point somewhere on the surface of Mt Everest.
There are 1.8 x 10³⁰⁸ possible points (i.e. 32 x 32 BW images) in our 1024D BW pixel space. That’s a lot of points (i.e. images). And a lot of these images are not really what we would call humanly-recognisable images, most of them are just noise. So working in this insanely large search space can be inefficient. There’s a lot of dead space, like the vast empty space in the universe. If we can narrow down the set of points we want to work with to a subset of the space, or better still work in a lower-dimensional space, a manifold, we can make our computations more efficient, and potentially more interesting and meaningful. And in the rest of the talk I hope to demonstrate what I mean by ‘meaningful’.
4.4 Deep Learning (and manifolds)
Instead of searching the whole universe for images of dogs, if we know (and are certain) that images of animals are constrained to the surface of the Himalayas manifold, if we could find the surface of the Himalayas manifold, our life would be much easier. In fact if we could unfold the surface of the Himalayas manifold into a ‘flat’ 2D surface, and operate in that 2D space, we might find interesting new insights and do things that we couldn’t do in the full 3D space of the universe.
That single point on Mt Everest, which corresponds to a particular image of Ruby, has high dimensional coordinates in the space of the universe. It also has lower dimensional coordinates in the space of the surface of the Earth, also in the space of the surface of the Himalayas, and in the space of the surface of the Mahalangur range, and Mt Everest. Each of these are manifolds, with transformations that map coordinates between them.
When we are working with machine learning, we are effectively always transforming between these different spaces, which are often (but not always) of different dimensions. However we are not necessarily explicitly trying to formulate the mathematical structures of the manifolds themselves. But it is worth noting that there is a field of Machine Learning called Manifold learning, which does explicitly try to learn manifold structures. It’s based on a hypothesis that a set of natural data points in high dimensional space (e.g. a bunch of images of faces as points in 1024D) can be represented by a lower dimensional manifold embedded in that space (e.g. crumpled piece of paper, or mountain range).
Manifold Learning is not the topic of my talk. However, I think understanding the concept of manifolds and transformations through space can be crucial in understanding and visualising what is happening (or not happening) when we work with machine learning. And my whole talk is basically going to be examples of this.
This might sound insane. And it is. For complex problems (i.e. very high dimensional, with thousands, millions, even billions of input features) the system doesn’t try to learn this mapping from input to desired output in one go. It’s actually a lot more interesting than that. The reason Deep Learning is called Deep is because there are lots of layers in our neural network. And each layer is a transformation into a new space, often with new dimensions. Each lower dimensional layer can be thought of as a manifold embedded in the adjacent higher dimensional layer’s space. Piping data through a deep neural network is like jumping from one manifold to another.
5. Features, Representations and Spaces. Observable vs Latent
5.1 Observable features and representation
Imagine I’d like to list the physical features of a person. E.g. I’d like to make a database of all the actors and actresses in a film, so the costume makers know how much material to buy and what to make and the casting director knows who to cast for what role etc. Some immediately observable features which come to mind are:
- height, weight, hair colour, hair length, eye colour, skin colour
A tailor might also measure:
- waist, chest, neck, arm, inside leg
A nurse might also measure:
- heart rate, heart tension
I’ve listed 13 features. There’s probably loads more, but I’ll stick with these for now. These are all observable features. They have a real world meaning or value that is easy to measure or observe. Heart rate or tension might not be immediately obvious, but they are clearly definable and measurable.
We can say that people have a representation in this physical features space (an arbitrary name I just came up with). This physical features space is a 13 dimensional space, one dimension per feature. And my representation in this space is simply a 13D vector, a list of values for each of my (height, weight, hair colour, hair length,…, heart rate, heart tension). You also have a representation in this physical features space, which is a 13D vector of your (height, weight, hair colour, hair length, …, heart rate, heart tension). We could plot ourselves as points in this 13D physical features space. Each feature (i.e. height, weight etc) would be an axis in this 13D space.
5.2 Latent features
There are many other physical features which are not directly measurable. E.g. gladiator-y-ness (how much of a Hollywood gladiator someone looks like).
This isn’t how strong someone is (i.e. what they can benchpress). Or how tall they are or how much they weigh. But how gladiator-y they look. We humans can look at someone and say person X is more gladiator-y-er than person Y, but less gladiator-y-er than person Z. This is a real (but subjective) feature which exists, but it isn’t directly measurable. It’s a latent feature (i.e. hidden).
The interesting thing about latent features, is that they’re usually some kind of a combination of some or all of the observable features. E.g. gladiator-y-ness is somehow related to height, weight, waist, chest, neck, arm etc. (It’s probably not related to hair colour, hair length, eye colour, skin colour etc — or maybe it is! I don’t know. For now I’ll assume not, but I could be wrong).
This means that we can use latent features to reduce the number of features in a meaningful way. Imagine that we have a million people in our actor/actress database, and we’d like to cast a hundred actors and actresses for a gladiator army. We don’t have time to go through all million profiles. If we had a gladiator-y-ness value for each person in our database, we could simply sort by that value and pick the top hundred. We wouldn’t need to look at a million people’s height, weight, chest etc. For this particular task, we can replace 13 observable features by one latent feature.
So how would we come up with a gladiator-y-ness value for a person? How do we calculate this latent feature? We could try to naively formulate it by hand. E.g. make up something like:
gladiator-y-ness = 1.7 * height + 3.4 * weight - 1.4 * waist + 1.8 * chest …
(I completely made this up. please ignore numbers and logic).
Chances are if we try to formulate it manually like this it will be quite rubbish. This is exactly what Machine Learning (ML) is for. ML will learn how to calculate gladiator-y-ness based on examples that we give it. I.e. we design an ML system and provide it with examples of people who we think look like gladiators, and other examples of people who don’t. The ML system will learn a function that maps (height, weight, waist, chest, neck, arm etc.) to a single gladiator-y-ness value. (It might even learn that hair length actually does affect gladiator-y-ness). It will learn to transform (i.e. calculate) from physical features space to gladiator-y-ness space (more on this in a bit).
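To make ‘learning from examples’ a little more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the data is synthetic, the examples come with made-up numeric scores rather than yes/no labels, and the ‘learning’ is plain least-squares fitting with NumPy, which is about the simplest stand-in possible and not what a deep network does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 people x 4 features (height, weight, waist, chest).
features = rng.normal(size=(200, 4))

# Pretend there is a hidden 'true' recipe for gladiator-y-ness that we do not know...
hidden_weights = np.array([0.5, 1.2, -0.8, 1.5])
# ...and that humans gave each person a slightly noisy score based on it.
scores = features @ hidden_weights + rng.normal(scale=0.01, size=200)

# 'Learning': find the weights that best map the features to the example scores.
learned_weights, *_ = np.linalg.lstsq(features, scores, rcond=None)
print(np.round(learned_weights, 2))
```

The fitted weights should land very close to the hidden ones, even though the fitting step only ever saw examples and never the recipe itself.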
Technical digression: The simple function that I manually wrote above (1.7 * height + 3.4 * weight etc.) is a linear function. I.e. with that formulation gladiator-y-ness is a linear combination of (height, weight, waist, chest etc). Because each component (i.e. observable feature: height, weight, waist, chest etc) is in the function once and multiplied by a constant number (e.g. [1.7, 3.4, -1.4, 1.8]). It would have been a non-linear function if it had been otherwise. E.g. gladiator-y-ness = 1.7 * height² + log(weight) / sqrt(waist).
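One way to see the difference between the two formulas is to check the defining property of a linear function: scoring the sum of two inputs gives the sum of their scores. This is a small sketch in plain Python. The weights are the made-up ones from above and the measurements are invented for illustration.

```python
import math

# Made-up weights from the formula above (height, weight, waist, chest).
WEIGHTS = [1.7, 3.4, -1.4, 1.8]

def linear_score(features):
    """Linear combination: each feature appears once, multiplied by a constant."""
    return sum(w * f for w, f in zip(WEIGHTS, features))

def nonlinear_score(features):
    """The non-linear example from the text: squares, logs and square roots."""
    height, weight, waist, _chest = features
    return 1.7 * height ** 2 + math.log(weight) / math.sqrt(waist)

a = [1.80, 80.0, 0.85, 1.05]   # invented measurements (metres, kg, metres, metres)
b = [1.65, 60.0, 0.70, 0.90]
a_plus_b = [x + y for x, y in zip(a, b)]

# A linear function satisfies f(a + b) == f(a) + f(b); a non-linear one generally does not.
linear_gap = abs(linear_score(a_plus_b) - (linear_score(a) + linear_score(b)))
nonlinear_gap = abs(nonlinear_score(a_plus_b) - (nonlinear_score(a) + nonlinear_score(b)))
print(linear_gap, nonlinear_gap)
```

The first gap is zero (up to floating point rounding), the second is not.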
Machine / Deep Learning will learn a non-linear function (if used appropriately), thus it can learn much more expressive and accurate functions. Or if used incorrectly, it can over-fit like crazy and not learn anything useful at all, in fact it could ‘learn’ something completely wrong which only satisfies the training data points, but goes crazy when given anything else (i.e. unable to ‘generalise’).
5.3 Representations, Spaces and Transformations
The physical features of a person I mention above: (height, weight, hair colour, hair length, eye colour etc.) are observable features, and constitute a representation of a person’s physical appearance. I have a representation in a 13D physical features space, this is a 13D vector of physical features (i.e. a list of values for my height, weight, hair colour, hair length etc).
The gladiator-y-ness of a person is a latent feature. My gladiator-y-ness value is a representation of me in a 1D gladiator-y-ness space, which is a latent space. This representation won’t provide much information about anything other than what it was designed to do, which is to provide information on gladiator-y-ness.
This could be combined with other latent features. E.g. angel-y-ness, sinister-y-ness, geeky-ness etc. All of these could be combined into a new 4D latent space (gladiator-y-ness, angel-y-ness, sinister-y-ness, geeky-ness) which I shall call a qualities space. Then I would have a new 4D latent representation in this new qualities space. This representation would be a 4D vector of latent features (i.e. a list of values for my gladiator-y-ness, angel-y-ness, sinister-y-ness, geeky-ness).
We can speak of transformations from observable representation (e.g. the 13D vector of physical features) to latent representation (e.g. 4D vector of qualities features), and back. These transformations simply calculate the values of the features of one space, given the values of the features of the other. I.e. given my 13D physical features representation (height, weight etc), if we know the transformations (i.e. equations) we can calculate my qualities (gladiator-y-ness, angel-y-ness etc). Note that there isn’t always a one-to-one mapping. Given somebody’s physical features, we may be able to calculate their qualities, but given their qualities, we may not be able to calculate their physical features. This is not uncommon when going from lower dimensions to higher dimensions. It’s a bit like shadows: calculating the 2D shadow of a 3D object always gives exactly one answer, but calculating the shape of a 3D object given just its 2D shadow does not, because infinitely many different 3D shapes cast the same shadow.
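The shadow analogy can be written down in a few lines. This is only a toy sketch: the ‘shadow’ here is assumed to be cast straight down onto the ground, and the two points are invented.

```python
def shadow(point_3d):
    """Project a 3D point straight down onto the ground plane (drop the height)."""
    x, y, _z = point_3d
    return (x, y)

bird = (2.0, 3.0, 10.0)   # high above the ground
mouse = (2.0, 3.0, 0.1)   # almost on the ground, directly below the bird

# Going from 3D to 2D always has exactly one answer...
print(shadow(bird), shadow(mouse))

# ...but two different 3D points share the same shadow,
# so going from 2D back to 3D has no unique answer.
same_shadow = shadow(bird) == shadow(mouse)
```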
To give another example, the pixel data — i.e. vector of pixel values — of an image are observable features. The pixel data is a representation of that image in pixel space. Images can also have latent features, and representations in latent spaces. In fact any type of data will have observable features, i.e. a representation in an observable space (which is what we measure in the real world). And it will also have latent features, a representation in a latent space (which can be a more ‘meaningful’ space, and is what this talk is about).
Going from observable representation to latent representation can be thought of as encoding. And going from latent representation to observable representation can be thought of as decoding.
6. Introduction to Latent Space with Eigenfaces
These are eigenfaces, a very simple demonstration of this. Going into detail as to how they’re generated is beyond the scope of my talk right now, but I’ll spend just a couple of minutes to give a very rough overview for the technically curious.
- There’s a large dataset of face images.
- These eigenface images are the eigenvectors of that dataset (strictly speaking, of its covariance matrix), found via Principal Component Analysis (PCA).
What does that mean?
6.1 Principal Component Analysis
(This bit can be skipped if a. you’re already comfortable with PCA or b. you don’t really care. If this section isn’t clear, don’t worry about it as afterwards I’ll summarise the conceptual significance, which is the bit that really matters).
Imagine we have a bunch of 3D data, and we plot them in 3D space, we get something that resembles a point cloud.
Now this point cloud might be perfectly spherical, but it’s more likely to be kind of elongated and blobby like in this image. The point cloud in this image is quite elliptical and is oriented in a particular direction. The directions in which the data point cloud is elongated are called the principal components (or eigenvectors). And Principal Component Analysis (PCA) is a method of finding these directions in which the data is most ‘elongated’. Then we can define these directions as new axes, and project our data into that new axis system. I.e. transform it.
But here’s an important detail: PCA finds the directions of elongation (eigenvectors), and how big each ‘elongation’ is (eigenvalue) along that direction. We can then choose to omit any directions (eigenvectors) where the elongation (eigenvalue) isn’t that significant, i.e. if the data is quite flat in that particular direction.
E.g. Imagine we plot a bunch of data in 3D space, and it turns out to be (almost) flat like a piece of cardboard, but tilted at an angle. If we can calculate that angle, we can transform our coordinate system, and reduce it to 2D. That’s exactly what we can do with PCA: reduce dimensions by transforming the data to a new axis system, one which potentially represents the data more optimally.
And as always, this works in any number of dimensions. If we have 100D data in 100D space, it might also have elongations. In fact, because it’s in high dimensions, it will probably have many elongations in many different directions and dimensions. PCA will find all of these elongations. In fact PCA will return the same number of elongations as there are original dimensions. I.e. For a 100D space, PCA will return a new set of 100 directions (axes). But these 100 axes will be rotated to fit our data more optimally. Most importantly, because we know how much elongation there is on each direction (the eigenvalue corresponding to the eigenvector) we can sort the axes by elongation amount. The first axis will have the most elongation (pointing in the direction of most variance), second axis will have the second most elongation (pointing in the direction of second most variance), etc. and the last (100th) axis will have the least amount of elongation. This means that we can choose an arbitrary cutoff point (for amount of variance), and just ignore the axes (dimensions) beyond that cut off point. The same way that we can transform 3D data that is ‘flat’ into 2D (by finding the most ‘important’ set of 2D axes), we can transform 100D data into say 20D data — by finding the most ‘important’ set of 20D axes.
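For those who prefer code to words, here is a minimal sketch of the ‘tilted piece of cardboard’ example using NumPy. The data is made up, and this is one common way of computing PCA (eigenvectors of the covariance matrix of the centred data), not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 500 points on a flat 2D sheet (wide in one direction, narrower in the other)...
sheet = rng.normal(size=(500, 2)) * np.array([5.0, 2.0])
# ...placed in 3D along two tilted (orthonormal) directions, with a little thickness.
tilt = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tilt[0] /= np.linalg.norm(tilt[0])
points = sheet @ tilt + rng.normal(scale=0.05, size=(500, 3))

# PCA: centre the data, then take eigenvectors / eigenvalues of the covariance matrix.
mean = points.mean(axis=0)
centred = points - mean
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centred, rowvar=False))

# eigh returns ascending order; sort so the most 'elongated' direction comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Keep only the top 2 directions: 3D -> 2D, then back to 3D to see how little was lost.
axes = eigenvectors[:, :2]
reduced = centred @ axes             # shape (500, 2)
restored = reduced @ axes.T + mean   # shape (500, 3)
max_error = np.abs(restored - points).max()
print(eigenvalues, max_error)
```

The third eigenvalue comes out tiny compared to the first two, which is the numerical way of saying ‘the cloud is almost flat in that direction’, so dropping that axis loses very little.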
If this wasn’t very clear, it doesn’t matter. Understanding what exactly PCA does isn’t the purpose of my talk. This bit was only for those who were interested and might have already seen this before. The important thing is to understand the implications of this which I will explain next.
6.2 Back to Eigenfaces
Let’s take our dataset of face images (which we assume to be 32 x 32 pixels, so that it ties in with our previous discussion). Remember that every single 32 x 32 BW image is a single point in 1024D pixel space. If we plot all of our face images, we get a point cloud in 1024D. We can run PCA on this 1024D dataset and choose an arbitrary number of dimensions (i.e. axes) to reduce it to.
E.g. If we were to choose the top 24 dimensions, we might get something like this
Each one of these ‘face’ images, is an eigenvector of this dataset, i.e. the ‘directions’ in which our dataset point cloud is most elongated in 1024D. These are the new axes which represent our data set more optimally, in a more compact manner.
What does it even mean for ‘an image to be an axis’? Well, remember that in our 1024D space each point is an image. So each of these images here, is also a point in 1024D space. It’s a vector. And eigenface image 1 is our new axis 1, eigenface image 2 is our new axis 2, eigenface image 3 is our new axis 3… eigenface image 24 is our new axis 24 etc.
And this is conceptually really significant. Because first we discussed a 1024D pixel space. In that space, each axis (i.e. feature) corresponds to a pixel in a 32 x 32 grid — i.e. the features of the space are pixels.
Now (after PCA / Eigenfaces) we have a new coordinate system (i.e. new axes, which are somehow rotated in space) to fit our particular dataset better. These new axes constitute a 24D latent space. I call it a latent space because its features (i.e. axes) are not directly observable. And these latent features are how much an input image resembles the eigenfaces. I.e. these are what the axes of this new latent space represent.
6.3 Pixel Space to Latent Space
I’ll give an example to try and make this a bit clearer.
This image of Hedy Lamarr has a pixel representation, a 1024D vector of pixel values. How do we transform it from 1024D pixel space to 24D latent space? How do we find its representation in this latent space? i.e. the 24D vector of latent features? How do we encode it?
With this particular latent space (i.e. the one we constructed via eigenfaces and PCA), it’s very simple.
We take the image (cropped and resized to 32 x 32, so it’s a 1024D vector) and dot product it with the first eigenface (which is also a 1024D vector), that will give us a number, how much the image ‘resembles’ the first eigenface. That’s the value of our first latent feature. i.e. if we were to plot this 24D representation as a point in 24D latent space, that’s the distance we would go along the first of the 24 axes. We then dot product the image with the second eigenface, that number will give us the second latent feature, i.e. the distance to go along the second axis. Etc. All the way to the 24th eigenface and the last latent feature (i.e. axis).
If you’re not familiar with dot products etc and this bit wasn’t clear, it doesn’t matter. The most crucial thing here is:
We have managed to represent any 32 x 32 black & white image of a face, with just 24 numbers.
It might turn out that this image of Lamarr is 24% 1st eigenface, 12% 2nd eigenface, -31% 3rd eigenface, …, 17% 24th eigenface etc. Or in a more compact syntax: [0.24, 0.12, -0.31, … 0.17]. That’s only 24 numbers! We would call each of these 24 numbers, the latent features of this image (in this particular latent space), and the vector (i.e. list) of 24 numbers is a representation of this image in (this particular) latent space.
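The encoding step described above can be sketched in a few lines of NumPy. Note the assumptions: the ‘eigenfaces’ here are random orthonormal vectors standing in for real ones, the ‘image’ is random noise standing in for a photo, and the mean-face subtraction that eigenface pipelines usually do first is left out so that the code mirrors the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 24 real eigenfaces: 24 random orthonormal 1024D vectors (one per row).
q, _ = np.linalg.qr(rng.normal(size=(1024, 24)))
eigenfaces = q.T                      # shape (24, 1024)

# Stand-in for a 32 x 32 greyscale photo, flattened to a 1024D vector.
image = rng.random((32, 32)).reshape(1024)

# Encode: one dot product per eigenface gives one latent feature each.
latent = np.array([np.dot(image, face) for face in eigenfaces])

# The same thing written as a single matrix-vector product.
latent_fast = eigenfaces @ image
print(latent.shape)
```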
6.4 Latent Space to Pixel Space
If we have a representation of an image in this 24D latent space, i.e. a vector of 24 latent features, how can we reconstruct the original image? I.e. transform from 24D latent space back to 1024D pixel space? I.e. decode it?
Remember that the latent features in this space are simply how much an image resembles each eigenface. So we simply multiply each of the 24x eigenfaces with the value of the corresponding latent feature, and add them up. I.e. for each pixel, we do:
eigenface1 * latent_feature1 + eigenface2 * latent_feature2 + … + eigenface24 * latent_feature24.
That’s it. The resulting 1024D vector is the pixel representation.
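As a sketch of that decoding step in NumPy, again with stand-ins (random orthonormal vectors instead of real eigenfaces, and a random 24D latent vector):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins: 24 orthonormal 'eigenfaces' (rows) and a 24D latent vector.
q, _ = np.linalg.qr(rng.normal(size=(1024, 24)))
eigenfaces = q.T                       # shape (24, 1024)
latent = rng.normal(size=24)

# Decode: scale each eigenface by its latent feature and add them all up.
pixels = np.zeros(1024)
for face, value in zip(eigenfaces, latent):
    pixels += value * face

# Equivalent one-liner, then reshape to view the result as a 32 x 32 image.
pixels_fast = latent @ eigenfaces
reconstructed_image = pixels.reshape(32, 32)

# Because the stand-in eigenfaces are orthonormal, re-encoding returns the same 24 numbers.
round_trip = eigenfaces @ pixels
```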
This is a huge compression of information.
If I want to send you a picture of a face, I don’t need to send you all of the pixels of the image, i.e. a pixel representation, i.e. a vector of 1024 pixel features. I can just transform my image from 1024D pixel space into 24D latent space. I encode it. And then I can send you just the 24 numbers, the latent representation, a vector of 24 latent features. Of course you need a way of decoding those 24 numbers, transforming from latent space back to pixel space. If you already have these eigenfaces handy, then you can easily transform back to pixel space as I described before.
But if you don’t have the eigenfaces, then I’d need to send them to you first, and that would be very inefficient for just one picture.
However, if I want to send you a million 32 x 32 face images, things look very different, even if you don’t have the eigenfaces yet. Sending pixel representations for all images would take up about 1GB (1,000,000 images * 1,024 pixels per image, assuming 1 byte per pixel). Alternatively I could first send you the pixel representations of the eigenfaces, which would be 24KB (24 images * 1,024 pixels per image). Then I could send the latent representations for each of the million faces, which would be roughly 24MB (1,000,000 images * 24 latent features per image, assuming 1 byte per latent feature). A massive compression.
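The arithmetic, spelled out under the same assumptions as above (1 byte per pixel and 1 byte per latent feature):

```python
num_images = 1_000_000
pixels_per_image = 32 * 32        # 1024, at 1 byte per pixel
num_eigenfaces = 24               # also the number of latent features per image

raw_bytes = num_images * pixels_per_image            # sending every pixel
eigenface_bytes = num_eigenfaces * pixels_per_image  # sending the eigenfaces once
latent_bytes = num_images * num_eigenfaces           # sending 24 numbers per image
compressed_bytes = eigenface_bytes + latent_bytes

ratio = raw_bytes / compressed_bytes
print(raw_bytes, compressed_bytes, round(ratio, 1))
```

So the whole lot shrinks by a factor of a bit over 42, and the one-off cost of sending the eigenfaces barely registers.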
6.6 Lossy Compression
There is a catch associated with this. This is a lossy compression. Very lossy. If we take an image (e.g. Hedy Lamarr) and encode it, i.e. transform from 1024D pixel space to 24D latent space, we will end up with a 24D vector of latent features, a representation in 24D latent space. If we decode that, i.e. transform it from latent space back into 1024D pixel space, we will end up with an image again. We could call this a reconstructed image. But the reconstructed image will not necessarily be identical to the original input image (e.g. Lamarr). The ‘difference’ between the original input image and the reconstructed image is the error (of this encoding-decoding). There are many different ways of measuring this ‘difference’, and it depends on the domain. For an image like this, we could simply take the absolute difference between each pair of corresponding pixels and add them all up (L1 norm), or we could measure the Euclidean distance between the two images in 1024D space (L2 norm) etc. (For a more complicated, probabilistic model it’s more common to look at the ‘difference’ between probability distributions, e.g. using something like KL divergence).
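A tiny sketch of the two simplest error measures, using an invented 4-pixel ‘image’ so the numbers are easy to follow:

```python
import math

original =      [0.0, 0.5, 1.0, 0.25]   # tiny made-up 'image' of 4 pixels
reconstructed = [0.1, 0.5, 0.7, 0.25]

# L1: sum of the absolute per-pixel differences.
l1_error = sum(abs(o - r) for o, r in zip(original, reconstructed))

# L2: straight-line (Euclidean) distance between the two points in pixel space.
l2_error = math.sqrt(sum((o - r) ** 2 for o, r in zip(original, reconstructed)))
print(l1_error, l2_error)
```

Here only two pixels differ (by 0.1 and 0.3), so the L1 error is about 0.4 and the L2 error is the square root of 0.1, roughly 0.32.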
There are two main reasons for this error:
- These eigenfaces are the eigenvectors of the original training dataset. They don’t represent every single face image ever to be conceived or imagined. If I try to transform an image (e.g. Hedy Lamarr) into this latent space, the less like my training dataset that image is, the bigger the error will be when I reconstruct. In fact we can transform any 32 x 32 BW image, any 1024D vector, into 24D latent space — even this nautilus — and calculate a 24D vector of latent features for it, just by dot producting the image with each of the eigenfaces. But if we then reconstruct a 1024D pixel representation back from that 24D representation, we will not get a nautilus, but what seems like an arbitrary face-like image (applying this to the nautilus is left as an exercise for the reader). In this case, this latent space is not capable of representing that particular class of data.
- However, even if this image of Lamarr were in my dataset, it might still not be perfectly reconstructed, because we capped the number of dimensions arbitrarily to 24D. That might not be enough dimensions to capture all of the detail required. A bit like jpeg image compression, or mp3 audio compression, we (almost arbitrarily) omitted detail which we decided was superfluous. So when reconstructing the 1024D pixel representation from the 24D latent representation, we are always likely to get something a little bit different from what we started out with.
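Both sources of error can be seen in a small experiment. This is a sketch on made-up data (64D vectors instead of 1024D images, and PCA computed via NumPy’s SVD), so the specific numbers mean nothing. What matters is the pattern: fewer axes means more error, and data unlike the training set reconstructs badly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 'dataset': 300 vectors in 64D that mostly live in a 10D subspace, plus slight noise.
basis = rng.normal(size=(10, 64))
dataset = rng.normal(size=(300, 10)) @ basis + rng.normal(scale=0.1, size=(300, 64))
mean = dataset.mean(axis=0)

# PCA via SVD of the centred data: the rows of `components` are the sorted principal axes.
_, _, components = np.linalg.svd(dataset - mean, full_matrices=False)

def reconstruction_error(vector, k):
    """Encode with the top k axes, decode again, and return the L2 error."""
    axes = components[:k]
    latent = axes @ (vector - mean)
    restored = latent @ axes + mean
    return np.linalg.norm(vector - restored)

familiar = dataset[0]                    # an item from the training data
unfamiliar = rng.normal(size=64) * 3.0   # an item unlike the training data (the 'nautilus')

errors = [reconstruction_error(familiar, k) for k in (2, 5, 10, 64)]
outsider_error = reconstruction_error(unfamiliar, 10)
print(np.round(errors, 3), round(float(outsider_error), 3))
```

The error for the familiar item shrinks as more axes are kept and vanishes when all 64 are used, while the unfamiliar item has a much larger error at the same number of axes.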
6.7 Operations in Latent Space
Let’s also remember that this 24D representation, the vector of 24 latent features, are coordinates in a 24D space. So each 32 x 32 pixel face image can be thought of as a point in 24D latent space (in addition to being a point in 1024D pixel space). Now what happens when we perform geometric operations in this 24D space? If we average two points (i.e. two latent representations of face images)?
The method I just described was using PCA and eigenfaces. PCA is a method dating back to 1901, and was applied to faces in 1987. It’s quite old, really not state of the art at all. Also PCA is a linear dimensionality reduction technique. I.e. the new (latent) features are linear combinations of the original features (e.g. pixels).
In other words:
latent_feature1 = pixel1 * K1_1 + pixel2 * K1_2+ … + pixel1024 * K1_1024
latent_feature2 = pixel1 * K2_1 + pixel2 * K2_2+ … + pixel1024 * K2_1024
latent_feature3 = pixel1 * K3_1 + pixel2 * K3_2+ … + pixel1024 * K3_1024
where all Kx_y are constants. PCA’s job is to find those constants.
So PCA won’t find complicated, intricate manifolds (i.e. crumpled pieces of paper, or intricate mountain ridges), or even slightly curved manifolds (like the surface of a bowl). It will only find completely ‘flat’ manifolds (like a flat piece of cardboard). So even though finding midpoints of multiple image representations in this new latent space will be more interesting than doing it in the 1024D space, the results will still be linear combinations and not terribly exciting. However…
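Here is a small sketch of why midpoints in a PCA latent space are ‘not terribly exciting’. With a linear encoder and decoder, decoding the midpoint of two latent points gives exactly the average of the two individual reconstructions. The eigenfaces and images below are random stand-ins, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in eigenfaces: 24 orthonormal 1024D vectors (rows), plus two stand-in 'images'.
q, _ = np.linalg.qr(rng.normal(size=(1024, 24)))
eigenfaces = q.T
image_a, image_b = rng.random(1024), rng.random(1024)

def encode(pixels):
    return eigenfaces @ pixels      # 1024D -> 24D

def decode(latent):
    return latent @ eigenfaces      # 24D -> 1024D

# Midpoint in latent space, decoded back to pixels...
decoded_midpoint = decode((encode(image_a) + encode(image_b)) / 2)
# ...is the same as the plain average of the two individual reconstructions.
average_of_reconstructions = (decode(encode(image_a)) + decode(encode(image_b))) / 2
```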
6.8 Final thoughts
I only showed and spent so much time on PCA / Eigenfaces because they’re relatively easier to visualise and understand what’s going on under the hood (compared to the ‘black-box’ of neural networks).
There are many other, totally different methods which essentially do what we want here, which is to…
…find transformations from ridiculously high dimensional input or feature space (e.g. pixel space), to a more manageable, lower dimensional, ‘meaningful’ latent space.
In the next sections I’ll talk about a few other methods which are considerably more complicated under the hood, so I won’t go into so much detail on how they work. I’ll focus mainly on the end result and how they work on a conceptual level.
But first I want to underline a few things:
- The number of dimensions in this latent space (24D) is arbitrary. We picked it when doing PCA. The more dimensions we pick, the more detail we will be able to express, but if we choose too many dimensions, we end up with more data, memory and computation to deal with than we actually need. If we choose too few dimensions, we will lose detail. The right number of dimensions completely depends on the problem we’re trying to solve.
- These eigenfaces are not ‘universal’ in any way. They are just the eigenfaces (eigenvectors) of this dataset on which we applied PCA. The more diversity we have in our dataset, the more likely it is that the resulting eigenfaces will be able to represent a wider range of faces.
- This is ‘a’ latent space, not ‘the’ latent space. There is no such thing as ‘the’ latent space. This is just an arbitrary set of reduced, latent features that we constructed for a specific, particular dataset. For the same dataset we can construct infinitely many different latent spaces, of different (or same) number of dimensions, using different methods, with different transformations to get to and from those spaces, and all for different purposes.
7. Face synthesis with (Variational) Auto-Encoders.
Here is a similar (at a very high level) demo using Auto-Encoder Neural Networks. Load this and have a play with the sliders.
Again, a detailed explanation of how auto-encoders work is not within the scope of my talk now as I mainly want to convey concepts. But a rough overview is something like this:
We have a neural network in which the layer dimensions gradually get smaller and smaller (e.g. in the image above the first layer is 8D, then 4D, then 2D in the middle; note that the number of neurons in a layer is the number of dimensions of that layer). Then we mirror the neural network back to front so that it widens again, with the output layer having the same number of neurons (i.e. dimensions) as the input layer. E.g. if this is for our 32 x 32 images, the input layer and output layer will have 1024 neurons each (8 in the image above). In between, we would have more layers that gradually reduce the number of neurons (i.e. dimensions, e.g. 500D, 200D, 100D etc). In the middle of the network, we will narrow the network down to an arbitrary small size, a ‘bottleneck’ layer, say 24 neurons (2 in the image above).
The left half of the network is our encoder. It transforms (or compresses) the 1024D pixel representation (a point in 1024D pixel space) down to 24D latent representation (a point in 24D latent space). The right half is our decoder. It transforms (or decompresses) the 24D latent representation (a point in 24D latent space) back to 1024D pixel representation (a point in 1024D pixel space).
NB. In auto-encoder notation the input vector (in this case a 1024D vector of pixel values) is often called x. The latent vector (in this case the 24D vector of latent features) is often called z. The output vector (in this case another 1024D pixel representation) is often called y.
If we were to push an image through this network, in the encoder section, x (1024D pixel space) undergoes a series of transformations with each layer (500D, 200D, 100D etc.) until it eventually gets transformed to z (24D latent space). In the decoder section, z undergoes a series of transformations with each layer (100D, 200D, 500D etc.) until it eventually gets transformed into y (1024D pixel space).
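To make the shapes concrete, here is a minimal sketch of pushing one vector through such a network in NumPy. The weights are random and untrained, the tanh non-linearity and the absence of biases are my own simplifying assumptions, and the layer widths are just the example numbers from the text. It shows the journey through the dimensions, nothing more.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths from the text: 1024 -> 500 -> 200 -> 100 -> 24 -> 100 -> 200 -> 500 -> 1024.
sizes = [1024, 500, 200, 100, 24, 100, 200, 500, 1024]
bottleneck = sizes.index(24)

# Untrained, random weights: enough to show the shapes, not to produce faces.
weights = [rng.normal(scale=0.05, size=(n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def run_layers(vector, layers):
    """Push a vector through a list of layers: matrix multiply, then a tanh non-linearity."""
    for w in layers:
        vector = np.tanh(vector @ w)
    return vector

x = rng.random(1024)                        # input: a point in 1024D pixel space
z = run_layers(x, weights[:bottleneck])     # encoder half -> 24D latent space
y = run_layers(z, weights[bottleneck:])     # decoder half -> back to 1024D
print(x.shape, z.shape, y.shape)
```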
But what are those transformations at every layer? That’s what the training is for.
Given a large training set of images, we take every image and pipe them through the network one by one. Just like in the PCA example, where every latent feature is a (linear) function of every input feature (i.e. pixel), here also every latent feature is a (non-linear) function of every input feature. And the network tries to learn those functions. While training, we try to minimise the error (or ‘cost’, or ‘loss’ or ‘objective’), the difference between the output (the reconstructed image that the auto-encoder produces) and the original input image. I.e. the difference between x and y. In other words, we feed this network a bunch of images, and the network tries to
learn how to reduce the 1024 numbers of an image, down to 24 numbers, then back up to 1024 numbers again, such that for each training image, the output 1024 numbers match, as closely as possible, the input 1024 numbers.
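For the curious, here is what that training loop can look like at toy scale. This is a sketch under heavy assumptions: 8 ‘pixels’ squeezed through a 2D bottleneck instead of 1024 through 24, made-up data, a single tanh layer on the encoder side, no biases, and plain full-batch gradient descent with hand-written back-propagation. Real auto-encoders are bigger and use a library to compute the gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 'images' of 8 'pixels' that secretly depend on only 2 hidden factors.
hidden = rng.normal(size=(200, 2))
x = np.tanh(0.5 * hidden @ rng.normal(size=(2, 8)))

# A tiny auto-encoder: 8 -> 2 (bottleneck, tanh) -> 8, starting from small random weights.
w_enc = rng.normal(scale=0.1, size=(8, 2))
w_dec = rng.normal(scale=0.1, size=(2, 8))

def loss_and_gradients(x, w_enc, w_dec):
    z = np.tanh(x @ w_enc)                  # encode: x -> z
    y = z @ w_dec                           # decode: z -> y
    diff = y - x
    loss = np.mean(diff ** 2)               # how far is the output from the input?
    # Back-propagation: the chain rule applied layer by layer, from output to input.
    grad_y = 2.0 * diff / diff.size
    grad_dec = z.T @ grad_y
    grad_pre = (grad_y @ w_dec.T) * (1.0 - z ** 2)
    grad_enc = x.T @ grad_pre
    return loss, grad_enc, grad_dec

losses = []
for step in range(3000):
    loss, grad_enc, grad_dec = loss_and_gradients(x, w_enc, w_dec)
    losses.append(loss)
    w_enc -= 0.1 * grad_enc                 # gradient descent: nudge the weights downhill
    w_dec -= 0.1 * grad_dec
print(losses[0], losses[-1])
```

The loss at the end should be well below the loss at the start: the network has found 2 numbers per ‘image’ from which the 8 ‘pixels’ can be roughly rebuilt.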
This very roughly speaking has the same effect as the eigenface / PCA approach in terms of reducing dimensions and constructing latent features, but is very different in a few ways:
- In the eigenface / PCA approach the reduction from 1024D to 24D is linear. In the auto-encoder it’s non-linear. I.e. the results of this compression cannot be replicated through simple ‘mixing’ of images in Photoshop. It learns a much more complex representation. Thus, this might be a more optimal representation (i.e. able to reconstruct images in the dataset better, and/or generalise to new images better).
- In the eigenface / PCA approach the axes (i.e. latent features) ‘meant’ something to us. We could say that latent feature 1 (i.e. axis 1) is how much an image resembles eigenface 1, latent feature 2 (i.e. axis 2) is how much an image resembles eigenface 2 etc. Using a neural network like this, this is not the case. We have no idea what the latent features (i.e. axes in latent space) mean. They are just 24 numbers which somehow represent our data. The network decides what the best features are to learn based on the training data and other constraints that we set (which I mentioned in the intro). And ultimately it learns how to transform from 1024D input features to 24D latent features.
Here is that demo again http://vdumoulin.github.io/morphing_faces/online_demo.html
(This face demo actually uses a Variational Auto-Encoder (VAE), which is a bit more complex than the vanilla Auto-Encoder which I describe above. A VAE learns and approximates parametric probability distributions, as opposed to a simple deterministic mapping, and is thus able to learn more expressive generative models. But the gist of the idea is the same).
Play with the sliders. Each slider is simply the value of the corresponding neuron in the middle ‘bottleneck’ layer, i.e. the 3rd slider is the 3rd neuron, the 3rd latent feature. (NB there are 28 sliders not 24 in this case. The author simply chose to have 28 neurons in the middle ‘bottleneck’ layer. Again, this is an arbitrary number that comes through trial and error).
How are the sliders mapped to the faces? What do they represent? And why? No one knows. The network decided (i.e. learnt) what those sliders (i.e. latent features) mean, and how to map individual pixels to the sliders in a non-linear fashion. Note that every single slider (i.e. every single latent feature) seems to affect every single pixel in some way. Also note that a human might have tried to parameterise a face with more human-readable parameters like ‘face roundness’, ‘nose size’, ‘distance between eyes’ etc. But in this demo each slider does something across the whole face which we can’t really define. Nothing is isolated. The network decided how to map the sliders (latent features) to the pixels in the most ‘optimal’ way according to the maths.
It’s worth noting though, that it is possible for us to ‘encourage’ this mapping with certain constraints: we choose the training data, we design the network architecture, we design the cost function etc. We do have control over how the network learns, but it’s not trivial to predict exactly what it will learn. And the field of trying to understand this mapping, the ‘internals’ of the network is a very hot area of research. E.g. see the work of the Evolving AI lab.
How are the faces generated? That’s the decoder. I.e. the right side of the network I showed above. Once we’ve trained the auto-encoder, the network has learnt the weights for all of the neuron connections. Then we take the right side of the network only, play with the neuron values in the middle (the bottleneck), and push those through the network to the end till it outputs 1024 numbers, the pixel representation of the image. I.e. it transforms from 28D latent space to 1024D pixel space.
Each pixel in the final image is a rather complicated non-linear function of the 28 numbers in the bottleneck layer, i.e. the 28D latent space. The maths of how the network learns that function isn’t what this talk is about. That’s pretty much standard neural network, back-propagation, gradient descent (with a bunch of bayesian probability for the VAE). Again what I’m hoping to drive home is the concept:
Taking raw data, which is very high dimensional (e.g. 1024D) and somehow, using some algorithm (of which there are many) learning a more manageable, and meaningful representation in lower dimensions (e.g. 28D)
And I’d also like you to remember, that when you’re moving a slider, you’re moving a point in space. In this particular case, we’re dealing with images of faces. So every point in that 28D space, represents a face. Each slider corresponds to an axis, and by moving that slider, you’re sliding a point along that axis.
Playing with those sliders, you’re exploring a 28D latent space, which is being transformed by the decoder into a 1024D pixel space, so that our human eyes and brains can make sense of it.
8. Geometric operations on faces in DCGAN space
Here is another example (doesn’t always load unfortunately). https://carpedm20.github.io/faces/
This example uses 100D latent space (as opposed to the 24D and 28D of the previous examples).
This is using a Deep Convolutional Generative Adversarial Network (DCGAN). At a high level, the concept is similar to everything I’ve just said, but with different implementation details. Here the network architecture is designed specifically to process images (with spatially aware 2D convolution filters, inspired by our own visual cortex, or rather, those of cats), and there are actually two networks learning together. A ‘generator’ network is generating images (similar to the previous examples), and a ‘discriminator’ network is learning to judge the ‘generator’s images and decide whether they’re ‘any good’ or not. Eventually the ‘generator’ network learns to generate ‘better’ images.
Again, my emphasis is not on how the algorithm does what it does, but rather what it can do, and what that means.
8.1 Operations in DCGAN Latent Space
This architecture allows a much more complex transformation to be learnt. There are constraints in place which force the network to learn a more ‘meaningful’ latent space.
Finally, I can ask again, if we have points in this latent space, what happens when we perform geometric operations on them?
NB. latent space is 100D, images are 64 x 64 = 4096D (for simplicity’s sake I’m ignoring colour and assuming monochrome)
3x images of ‘man with glasses’ can be seen in the top left of this slide. Each of these 3 images is a point in 4096D pixel space. We can average the 3 points in 4096D pixel space (using simple vector arithmetic operating on the 4096D vectors, i.e. pixels, of each image). This will give us a new 4096D vector, i.e. 4096 pixels, a 64 x 64 image. As I said before this would effectively just do a simple ‘image mix’, as we would get in Photoshop. Not very interesting, and you can see the resulting image above in the 5th row from the top (the bottom row). You can see they’re just 3 images overlaid with each other. That’s the result of mixing images in pixel space. I.e. mixing pixel representations of images.
But we can also average the 3 points in this 100D latent space (again using the same vector arithmetic but operating on the 100D vectors, the latent representations, of each image). This will give us a new 100D vector, a new location in the 100D latent space. These 100 numbers mean nothing to us, but we can feed them into our network, and it will transform the 100D latent features back to 4096D pixel space, using the function that it learnt during training. This will give us a new 64 x 64 image. You can see this image above, it’s the 4th row from the top (2nd from bottom). They’re not perfect, but definitely much better than a simple pixel mix.
The network has learnt a non-linear blend/morph/warp mapping from 100D latent space to 4096D pixel space.
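To make the difference between the two kinds of mixing concrete, here is a minimal sketch. It assumes a hypothetical `toy_generator` (a fixed random non-linear map) standing in for the trained DCGAN generator, and random 100D vectors standing in for the latent points of the three images. It only illustrates that averaging before the non-linear transformation is not the same as averaging after it.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM, PIXEL_DIM = 100, 64 * 64  # 100D latent space, 4096D pixel space

# Hypothetical stand-in for a trained generator: a fixed random non-linear map.
W = rng.normal(size=(PIXEL_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)

def toy_generator(z):
    """Map a 100D latent point to a 4096D 'image' with values in (-1, 1)."""
    return np.tanh(W @ z)

# Three latent points, standing in for three 'man with glasses' images.
z_points = rng.normal(size=(3, LATENT_DIM))
images = np.stack([toy_generator(z) for z in z_points])  # shape (3, 4096)

# Mixing in pixel space: average the three 4096D vectors directly.
pixel_mix = images.mean(axis=0)

# Mixing in latent space: average the three 100D vectors, then decode.
latent_mix = toy_generator(z_points.mean(axis=0))

print(pixel_mix.shape, latent_mix.shape)
print("same image?", np.allclose(pixel_mix, latent_mix))
```

Both results are 4096D vectors, i.e. 64 x 64 images, but they are different images because the mapping is non-linear.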
So let’s forget about doing anything in 4096D pixel space, and only operate in 100D latent space. We find the mid-point of 3x ‘man with glasses’ points, 3x ‘man without glasses’ points and 3x ‘woman without glasses’ points. In our 100D latent space we have 3x new points, each a latent representation of an image:
- 100D vector for avg man with glasses
- 100D vector for avg man without glasses
- 100D vector for avg woman without glasses
If we subtract the point ‘man without glasses’ from ‘man with glasses’, we get a (100D) offset vector, that is basically in the direction of ‘glasses’ (this ‘direction’ is still in 100D latent space). If we add that 100D offset vector to the 100D point that represents ‘woman without glasses’, we will move to a new location in our 100D space. If we were to transform that point in 100D latent space, back to 4096D pixel space (pushing it through the generator network), the network will output a 4096D vector, a pixel representation of a 64 x 64 image. Which happens to be of a woman with glasses.
And most importantly: This is not an image of a ‘woman with glasses’ from the training set. This is a brand new image which the network generated. Using the non-linear function that it’s learnt, to map from 100D latent space to 4096D pixel space.
What Radford et al have also done is, they’ve actually picked 8 random locations near that new point (in 100D). And transformed all of those 100D points back to 4096D pixel space (i.e. fed each of those 100 coordinates, the 100D latent features, through the generator network). And as you can see, they’re all variations on ‘women with glasses’. Basically in this space, points which are close to each other, are semantically similar. That’s why this space is also sometimes called semantic space, and why I keep referring to it as a more ‘meaningful’ space.
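The arithmetic described above fits in a few lines. This is only a sketch: the latent vectors are random stand-ins (in practice they would be the 100D points that produced the chosen example images), and the noise scale used to pick the 8 nearby locations is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM = 100

def average_point(points):
    """Mid-point of a group of latent vectors: shape (n, 100) -> (100,)."""
    return np.mean(points, axis=0)

# Hypothetical latent vectors for three examples of each kind of face.
man_glasses = rng.normal(size=(3, LATENT_DIM))
man_plain = rng.normal(size=(3, LATENT_DIM))
woman_plain = rng.normal(size=(3, LATENT_DIM))

# Offset vector pointing (roughly) in the 'glasses' direction.
glasses_direction = average_point(man_glasses) - average_point(man_plain)

# Move 'woman without glasses' along that direction.
woman_glasses = average_point(woman_plain) + glasses_direction

# Eight random locations near the new point (noise scale is an assumption).
neighbours = woman_glasses + 0.25 * rng.normal(size=(8, LATENT_DIM))

# Each row would then be pushed through the generator to get a 64 x 64 image.
print(woman_glasses.shape, neighbours.shape)
```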
And it’s worth reiterating…
Once the network is trained, all that we are doing is moving or manipulating points in 100D latent space. We don’t need to think about morphing, blending or warping images or pixels. The network has learnt how to do that and has abstracted it away from us. We just deal with 100 latent features. Once we’re done manipulating points in 100D latent space, the network transforms the 100D latent representation back into pixel space so that it means something to us.
And again, we don’t even know what these 100 latent features are. The network has decided — i.e. learnt — how best to map the high dimensional input features (i.e. pixels), to the lower dimensional latent features. We have no direct control over this. (But we do have indirect control, which I mentioned a bit in the introduction).
Here is another example, this time with smiling. Subtracting ‘avg neutral woman’ from ‘avg smiling woman’ in this 100D latent space gives us an offset vector in 100D in the direction of ‘smiling’. If we add that to ‘avg neutral man’, we have a new point in 100D which represents a smiling man. The network transforms that 100D point into 4096D pixel space, and we have an image of a smiling man. I believe this twitter bot is using something very similar.
And we didn’t teach the network what smiling is. The network learnt this — and god knows what else — from the training data.
9. Latent Anything
So I talked about using PCA and Eigenfaces to transform faces from 1024D pixel space to a 24D latent space, using simple linear transformations. I also talked about auto-encoders (and VAEs) to do something similar, but with non-linear transformations and much better results. We could play with sliders manipulating values in lower dimensional latent space, and then transform back to higher dimensional pixel space to see the results. Likewise with DCGAN, with very interesting results when we perform vector operations in lower dimensional latent space and then transform back into pixel space. Exact details of how these algorithms learn the transformations are not my priority here. The code is all linked above and there’s plenty of detailed tutorials around.
The most important point I’d like to underline, is that this concept isn’t limited to images of faces. Or even images at all. It can be applied to anything, any kind of data — as long as you can find lots of training data on whatever it is you want to work on, and you can find a way of formulating the problem of transforming that data from very high dimensional feature space to some kind of lower dimensional semantic latent space. That’s the challenge.
- collect training data
- ???? (high-D input space -> lower, but still high, D latent space)
10. Word Embeddings
If we want to feed words into a neural network, how do we do that? We can feed them in character by character, like the very popular character RNN (Recurrent Neural Network) models. But even though these are very popular and successful for fun applications, they’re rarely used in production for Natural Language Processing (NLP). For NLP what gives better results is feeding in and operating on whole words, because whole words (and phrases) have meanings, not characters. But we have tens of thousands of words in the English language. Even though most common words are limited to somewhere around 5–20K, if we want to be comprehensive and include words which are rare, niche, scientific, names, places, phrases, websites etc it can go into hundreds of thousands. In fact a well-known Google News model by Mikolov et al 2013 which trained on a corpus of 100 billion words from Google News, extracted a vocabulary of 3 Million ‘words’.
So what are we going to do, operate directly on 3M discrete inputs? That would not be very efficient. Instead we set an algorithm to learn an embedding. We get it to learn to represent each word as a vector in an arbitrary dimensional space. We pick a number of dimensions, e.g. 100, 200, 500. Generally the bigger the better, but as usual more dimensions take longer to train and require more memory and more computation. We also don’t want too many dimensions, as that defeats the point of having an embedding in the first place. Just like in the auto-encoder, we want to force the network to squeeze the information, just enough to make it find regularities in the data and learn structure.
E.g. if we pick 300 dimensions, then each word becomes a 300D vector, represented by 300 numbers. Each word becomes a point in a 300D space.
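In code, an embedding is just a table with one row of numbers per word. The sketch below uses a made-up five-word vocabulary and random numbers; the helper name `vec` and the vocabulary are assumptions for illustration, and a real model would learn the values in the table during training.

```python
import numpy as np

rng = np.random.default_rng(2)

vocabulary = ["king", "queen", "man", "woman", "milk"]  # toy vocabulary
word_to_index = {word: i for i, word in enumerate(vocabulary)}

EMBEDDING_DIM = 300
# One row per word. Random here; training would adjust these numbers.
embedding_matrix = rng.normal(size=(len(vocabulary), EMBEDDING_DIM))

def vec(word):
    """Look up the 300D vector (a point in 300D space) for a word."""
    return embedding_matrix[word_to_index[word]]

print(vec("milk").shape)
```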
This idea of mapping words to a high dimensional (~100–500D) vector is not new. The idea dates back decades (Bengio et al 2003), as do a lot of the algorithms we’re using today. But in recent years there have been many improvements in the way an algorithm learns to map a word to a vector. Some popular examples are Word2Vec (2013) and GloVe (2014).
Again, the details of how this mapping works are not within the scope of this talk, I just want to convey the concepts. But to give a rough overview, the basic premise is that words appear in specific ‘contexts’ (i.e. sets of neighbouring words). The learning algorithm goes through a big chunk of text (ideally billions of words), and looks at the neighbours (i.e. context) of each of the words. The algorithm learns to either predict the neighbours of a given word (this is a skip-gram model), or learns to predict a word based on a given set of neighbours (this is a Continuous Bag Of Words model, CBOW). E.g. the word ‘milk’ is often found near the words ‘cow’, ‘drink’, ‘bottle’, ‘breast’, ‘cheese’, ‘coffee’, ‘pint’, ‘white’, ‘lactose’, ‘cream’, ‘cereal’, ‘corn flakes’ etc. And each of those words is associated with another bunch of words (i.e. contexts), some of which are also associated with ‘milk’, but others are not.
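As a rough illustration of what ‘looking at the neighbours’ means, the sketch below builds the training examples that a skip-gram or CBOW model would be asked to predict. The function names, the example sentence and the window size of 2 are assumptions for this sketch; it only prepares the examples and does not train anything.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre word, neighbour) pairs for every word in the text."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """Return (list of neighbours, centre word) examples."""
    examples = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, centre))
    return examples

tokens = "the cow gives milk every morning".split()
pairs = skipgram_pairs(tokens, window=2)
examples = cbow_examples(tokens, window=2)

# Skip-gram: given 'milk', predict each of its neighbours.
print([p for p in pairs if p[0] == "milk"])
# CBOW: given the neighbours, predict 'milk'.
print(examples[3])
```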
As the learning algorithm goes through a massive (billions of words) corpus of text, and does this analysis for every single word, it builds quite complicated relationships between words. It may learn that the words ‘king’, ‘queen’, ‘throne’, ‘crown’ are all somehow related. But that ‘king’ and ‘man’ are also related, in a different way to the way ‘king’ and ‘queen’ are related. It may even hopefully learn that ‘king’ and ‘man’ are related in a similar way to how ‘queen’ and ‘woman’ are related.
At the end of this embedding training, the words in our ‘vocabulary’ have vectors assigned to them (in an arbitrary high dimension that we decided prior to the training). And if the training was successful, and if it was a good algorithm, then the vectors assigned to the words are not random or arbitrary, but meaningful and somehow capture these relationships between words. So that words which are related in meaning, are close to each other in the embedding space. But of course this is a high dimensional space. E.g. 300D. There are infinitely many different directions, for different kinds of relationships and meaning. The words ‘king’ and ‘queen’ might be clustered together and very close in one direction, (along with other words related to royalty, kingdoms, authority) etc. But might be further apart in other directions. While the word ‘man’ is aligned to ‘king’, and ‘woman’ aligned to ‘queen’ in another direction.
In fact, with certain algorithms (e.g. Mikolov et al’s word2vec) there is a direction for ‘gender’. i.e. the vector between ‘king’ and ‘queen’, is aligned with the vector between ‘man’ and ‘woman’, and ‘uncle’ and ‘aunt’, and ‘husband’ and ‘wife’ etc. So we can actually do geometric operations on words. Such as:
vec(‘king’) - vec(‘queen’) + vec(‘woman’) will result in a new 300D vector. If we look for the nearest data point there, it is most likely to be the vector for ‘man’. (This is also notated as “queen : king :: woman : ?”. Read as “queen is to king, as woman is to ?” and the network returns ‘? = man’).
Slight digression: Actually the nearest point is often one of the input words, e.g. in this case ‘king’, ‘queen’ or ‘woman’. So usually we deliberately ignore those words and choose the next nearest word which is not part of the operation. This is an important detail, because there is research such as this which looks for learnt biases in word embeddings. The study claims that a model such as Mikolov et al’s word2vec trained on Google News, allegedly learns biases such as returning “nurse” instead of “doctor” for the query “father is to doctor, as mother is to ?”. This of course demonstrates an incredible gender bias in the model, which it apparently learnt from the Google News dataset. But in truth word2vec does return “doctor” for that query. “Nurse” is 2nd in the list. In fact the top 5 results are (with similarity): [(‘doctor’, 0.88192165), (‘nurse’, 0.7165823), (‘doctors’, 0.68008757), (‘physician’, 0.66655892), (‘midwife’, 0.58885992)]. The authors of the study are simply reporting incorrect information. They probably didn’t realise that the input words are suppressed from the output in the software that they used. I.e. the gender bias they point out isn’t a learnt gender bias in the model, which the model learnt from the training data. The bias they point out is their user error in whatever software they used to interact with the model. Perhaps a bias in their eagerness to find bias in the model. There is of course bias in word embeddings (or any model for that matter), but this example is not an instance of it. NB. inspired by the same authors’ previous paper I made a twitter bot that explores gender bias in word embeddings https://twitter.com/wordofmathbias
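The effect of suppressing the input words is easy to see in a toy example. The 3D vectors below are hand-made for illustration (they are not taken from word2vec or any trained model), and the `analogy` helper is a hypothetical function written for this sketch.

```python
import numpy as np

# Hand-made 3D toy vectors, NOT values from a real embedding.
toy_vectors = {
    "king":   np.array([1.0,  0.2, 0.0]),
    "queen":  np.array([1.0, -0.2, 0.0]),
    "man":    np.array([0.3,  0.5, 0.7]),
    "woman":  np.array([0.0, -0.2, 1.0]),
    "throne": np.array([0.9,  0.0, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors, exclude_inputs=True):
    """Answer 'a is to b, as c is to ?' using vec(b) - vec(a) + vec(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors
                  if not (exclude_inputs and w in (a, b, c))]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# queen : king :: woman : ?
raw = analogy("queen", "king", "woman", toy_vectors, exclude_inputs=False)
filtered = analogy("queen", "king", "woman", toy_vectors)
print(raw, filtered)
```

With these toy numbers the raw nearest neighbour is one of the input words (‘woman’), and only once the inputs are excluded does the query return ‘man’. Whether a given tool applies this filtering silently is exactly the kind of detail worth checking before drawing conclusions from its output.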
In this word embedding space there are also directions for verb tenses, plural suffixes, country-capital relationships and many more. The images below are projections from 300D space to 2D.
10.1 t-Distributed Stochastic Neighbor Embedding
Usually these embeddings are in very high dimensions (e.g. 300D) so it’s very hard to visualise them. A popular method for reducing dimensions for visualisation is called t-Distributed Stochastic Neighbour Embedding (t-SNE). Unlike PCA, t-SNE is non-linear, and its aim is usually not to produce latent features — as we did with PCA, or auto-encoders, or DCGAN — but to help us visualise high dimensional data in 2D or 3D. It ‘unwraps’ the high dimensional data into 2D or 3D trying to maintain clustering and relationships as much as possible, and often does a better job of that than PCA. It’s kind of like trying to unwrap the spherical surface of the Earth from 3D into a 2D map and maintain country neighbourhood relationships.
You can browse a 2D t-SNE of word2vec embeddings on the following site, and see how the words are clustered.
Bear in mind, we are reducing dimensions, in this case from 300D to 2D, so we are losing a lot of information and relationships in various high dimensions. The same way we lose information on how Japan and US relate to each other geographically, on a 2D map of the world centred on Europe.
11. Visual-Semantic Embedding (Alignment)
Another interesting aspect about these transformations into latent spaces, is that various constraints can be put upon the training, e.g. to enforce the sharing of spaces.
The auto-encoder and DCGAN we saw earlier transformed an image from very high dimensional pixel space, into a lower dimensional latent space, 28D and 100D respectively. A simplification of what the networks do, is they encode an image, compress it down to a much lower dimensional latent representation (e.g. 28D or 100D). Then they decompress that latent representation, decode it, back to a high dimensional pixel representation. And the objective of the training, is to learn encoding & decoding functions, such that for all training images, the reconstructed image (i.e. encoded and then decoded) matches the original training image. (In reality the objectives in both models are a bit more complicated, but that’s the gist of it).
If we have a training dataset which consists of image-caption pairs (i.e. images with associated sequences of words) we can do something quite interesting. We can impose constraints during training (via the architecture and objective function) to condition an image encoder (which transforms the pixels of an image into a latent representation) and a caption encoder (which transforms the words of a caption into a latent representation) to encode into the same space. I.e. to enforce
a single latent space shared between captions and images, whereby the image encoder learns a function to transform the pixels of an image to a latent representation which matches the caption encoder’s latent representation of the associated caption.
Similarly we could train decoders, to transform from this shared latent space, back to pixel space (i.e. an image decoder, similar to the face auto-encoder or DCGAN), or back to sequences of words (i.e. a text decoder). This would mean we could take any image in pixel space, transform it into this shared latent space using an image encoder, and transform that latent representation back to text using a text decoder, i.e. to generate a caption. Or we could go the other way around and feed in text, and generate images.
These processes can be summarised as:
image -> [image encoder] -> shared latent space -> [text decoder] -> text
text -> [text encoder] -> shared latent space -> [image decoder] -> image
This associating of image-caption pairs into a single shared embedding space is called Visual-Semantic Embedding (or Alignment). A well-known example of this idea for captioning images is NeuralTalk2. Or to generate images from text (and a more recent example).
In summary, the caption encoder has learnt a function that processes a sequence of words (i.e. a caption), and transforms (i.e. encodes) that to a latent representation. The image encoder has learnt a function that processes the pixels of an image, and transforms (i.e. encodes) that to a latent representation. They’ve been trained in tandem and conditioned such that for a given image-caption pair from the training data, the image encoder outputs the same latent vector as the caption encoder.
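One way to express ‘matching image-caption pairs should land on the same point’ as a training objective is a ranking loss: each image should be closer to its own caption than to any other caption. The sketch below is an assumption about how such an objective could look, not the objective of any specific model mentioned here. The encoders are untrained random linear maps, the input features are random stand-ins, and only the image-to-caption direction is scored.

```python
import numpy as np

rng = np.random.default_rng(3)

def l2_normalise(x):
    """Scale every row to unit length, so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def ranking_loss(image_points, caption_points, margin=0.2):
    """Hinge loss over one batch; row i of each array is a matching pair."""
    scores = image_points @ caption_points.T   # similarity of every image/caption
    matched = np.diag(scores)                  # similarity of the true pairs
    # Penalise mismatched captions that score within `margin` of the true one.
    cost = np.maximum(0.0, margin - matched[:, None] + scores)
    np.fill_diagonal(cost, 0.0)
    return float(cost.sum())

# Hypothetical pre-extracted features for 4 image-caption pairs.
image_features = rng.normal(size=(4, 64))
caption_features = rng.normal(size=(4, 32))

SHARED_DIM = 16
image_encoder = rng.normal(size=(64, SHARED_DIM))    # untrained linear encoders
caption_encoder = rng.normal(size=(32, SHARED_DIM))

image_points = l2_normalise(image_features @ image_encoder)
caption_points = l2_normalise(caption_features @ caption_encoder)

untrained = ranking_loss(image_points, caption_points)
print("loss with untrained encoders:", untrained)
```

Training would adjust both encoders to push this number towards zero, which is what pulls the two kinds of data into one shared space.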
These latent representations mean nothing to us humans, the same way the 100D latent representations of the DCGAN latent space meant nothing to us. But again, the same way the DCGAN network could run another function (i.e. the decoder) on the 100D latent vector to generate an image (i.e. to transform a point in 100D latent space to a point in pixel space), we can train an image decoder to learn to transform our latent representation back to pixel space. Likewise with a text decoder.
This becomes more powerful as the word embeddings are able to better capture and reflect semantic relationships between words (e.g. word2vec or GloVe), as opposed to arbitrary embeddings or sequences of characters. Likewise using convolutional architectures to process images, so that images aren’t treated as arbitrary sequences of pixel values, but hierarchies of semantically related groups of pixels. I.e. relationships between the content of the images, are somehow reflected in their embeddings.
12. Thought Vectors
An insane example which takes this idea even further, in an almost unimaginable way, is ‘thought vectors’ — as coined by Geoff Hinton. The premise is, the same way we can learn to embed words as vectors in a high dimensional space (e.g. 300–500D), where the word vectors have some kind of semantic relationship in that space; we could do the same to entire phrases or sentences or ‘thoughts’. I.e.
We can embed an entire thought or sentence — including actions, verbs, subjects, adjectives, adverbs etc. — as a single point (i.e. vector) in a high dimensional space, semantically related to neighbouring points by proximity or direction; similar to how sentences and thoughts relate to each other, linked by a chain of reasoning.
I find this idea totally insane (-ly powerful).
13. Neural Story Teller
You give the system an image, and it writes a story to accompany the image, in the style of a romance novel. The text above was generated by this software and Samim made lots of great examples with it.
Without going into too much technical detail, I’ll briefly explain how this works in context of spaces and transformations.
13.1 Training data
There’s two independent, completely unrelated training datasets:
- A large corpus of romance novels (dataset here)
- Loads of images with associated captions (dataset here)
And there’s a number of components to the system.
13.2 Skip-thoughts Encoder
A skip-thoughts encoder is trained to transform an arbitrary length sequence of words (i.e. a sentence) to a fixed dimensional latent representation in a thought embedding space.
This is conceptually similar to the word embeddings mentioned above, except it operates on sequences of words, as opposed to single words. For this example Jamie chose 2000 dimensions for the thought embedding space.
13.3 Romance Novel Decoder
A romance novel decoder learns how to decode the 2000D latent representations in thought embedding space, back into a sequence of words, in the style of the romance novels.
This is basically the decoder half of an auto-encoder (similar to the face auto-encoder example), where the encoder is the skip-thoughts encoder. I.e. we use the skip-thoughts encoder to go through the romance novels and encode every sentence as a 2000D thought vector, so that each sentence in the romance novels is encoded to a single point in a 2000D thought embedding space. Then the decoder has to learn a function that can transform each of those 2000D representations back to the original sequence of words that they came from.
Using the auto-encoder notation:
x (original word sequence) -> [skip-thoughts encoder] -> z (2000D)
z (2000D) -> [romance novel decoder] -> y (reconstructed word sequence)
The objective for the training of the romance novel decoder (i.e. z -> y), is that for each sentence, y should match x as closely as possible.
(FYI. If you’re wondering how ‘sequences’ of variable length can be processed in these contexts, as opposed to say images with fixed number of pixels, the answer is in RNNs, particularly LSTM RNNs, similar to char-rnn, but operating on a word level, not character level).
13.4 Images & Captions: Visual Semantic Embedding
We also train a Visual Semantic Embedding (as described above) on the image-caption training dataset. After this training, we can feed an image into this model, and retrieve a caption. But the caption we retrieve is in the style of the captions of the training data, not in the style of the romance novels.
13.5 “Style Transfer”
So we apply the same skip-thoughts encoder to the retrieved caption, to transform it into the same 2000D thought embedding space as the romance novels ‘thoughts’. However, the thought vectors from the romance novels occupy a different location of the thought embedding space to the thought vectors from the image captions, and we currently have no way of correlating them.
So this gap needs to be bridged, just like we did for ‘king’ to ‘queen’, ‘man’ to ‘woman’, ‘image of man with glasses’ to ‘image of man without glasses’, to ‘image of woman with glasses’ etc. How do we bridge that gap? Turns out taking the difference between the means (i.e. the averages) of the points in each group works quite well!
Hopefully diagrams should make this all a bit clearer.
13.6 The process with pictures
Here we are in a high dimensional thought embedding space, in this example 2000D. Each point in the image above represents a ‘thought’ from our training data. We have a cloud of points in one location (lower left), representing the captions from the image-caption training data. We also have a cloud of points elsewhere (upper right), representing sentences from the romance novels. Both sets of data were encoded with the same skip-thoughts encoder. I.e. the sentences in the romance novels were transformed into this latent space with the same transformation function as the captions, but they occupy separate locations in this space, since semantically we cannot relate them to each other (yet).
We feed an image into the Visual Semantic Embedding image encoder, and it gets transformed into a latent representation in another space, another universe — the Visual Semantic Embedding space. From that space we retrieve a caption, which is in the style of the image-captions, so it’s no good to us directly. We feed that caption into the skip-thoughts encoder to transform it into this space, the 2000D thought embedding space. Let’s call this 2000D vector the chosen caption thought vector, indicated by the red dot above. (Actually, to minimise noise, it helps to retrieve the closest ‘k’ captions, where ‘k’ is a tunable parameter, e.g. 100, and then average their positions, similar to how we averaged a bunch of ‘smiling woman’ or ‘man with glasses’ vectors in the DCGAN example).
We then look at the offset vector from the mean (i.e. centre) of all of the caption thought vectors, to the mean (i.e. centre) of all of the romance novels thought vectors. Jamie calls this a kind of style vector.
Then we can offset our caption thought vector by this style vector, transforming it in the direction of the romance novels thought vectors. We’re effectively changing the style of the thought. i.e. keep the ‘thought’ of the caption, but transform the style from caption-style to romance novel-style.
After doing this ‘style transfer’, the transformed vector (still in 2000D latent space) represents a romance passage in this embedding space. But it’s still just a point in 2000D thought embedding space, meaningless to us humans. We need to transform that back into a sequence of words which we can read. Now we can use the romance novel decoder, which was trained to decode 2000D latent representations of romance novel style text, back into sequences of words that we can read and understand.
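The ‘style transfer’ step itself is plain vector arithmetic. In the sketch below the two clouds of thought vectors are random stand-ins centred on different locations, and the first k rows of the caption cloud stand in for the k retrieved captions; all names and sizes other than the 2000 dimensions and k = 100 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
THOUGHT_DIM = 2000

# Hypothetical clouds of thought vectors occupying different regions of the space.
caption_thoughts = rng.normal(loc=-1.0, size=(500, THOUGHT_DIM))
romance_thoughts = rng.normal(loc=1.0, size=(500, THOUGHT_DIM))

caption_centre = caption_thoughts.mean(axis=0)
romance_centre = romance_thoughts.mean(axis=0)

# Style vector: offset from the centre of the captions to the centre of the novels.
style_vector = romance_centre - caption_centre

# Average the k closest captions (here the first k rows stand in for retrieval).
k = 100
chosen_caption_thought = caption_thoughts[:k].mean(axis=0)

# Keep the 'thought', change the style.
shifted_thought = chosen_caption_thought + style_vector

print(np.linalg.norm(shifted_thought - caption_centre))
print(np.linalg.norm(shifted_thought - romance_centre))
```

The shifted point ends up far from the caption cloud and close to the romance cloud, which is where the romance novel decoder knows how to operate.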
The whole process can be summarised by the following journey through multiple dimensions and transformations in space
It blows my mind that this even remotely works. I hope you see why I spent the entire hour talking about SPACE. I’ve completely glossed over the maths and details of the algorithms which allow this to happen. But before going into the maths and those details, I think it’s essential — or at least very helpful — to understand these concepts, and visualise these processes in this way.
14. Meta-conclusion
In fact I’d like to demonstrate how to think about a few other popular deep learning algorithms in this way. Often these algorithms were not explicitly designed or programmed in this way, but probably with more direct, low-level, mathematical motivations. But I still prefer to always keep this high-level angle in mind too. (NB. There are some subtle, but often crucial, details which I’m omitting below, but the gist of it is there).
14.1 Deepdream
Alexander Mordvintsev, Christopher Olah and Mike Tyka.
- We start with an input image, a single point in pixel space.
- That point is transformed by a convolutional neural network to a point in a latent space (the latent space of the layer that we’re wanting to maximise).
- We push the point in latent space in a direction depending on what we want to do: push away from the origin if we want to amplify the activation of the whole layer (original deepdream); push away from the origin along a specific axis if we want to amplify the activation of a particular neuron (variation of deepdream); or feed in another ‘guide’ image and push the point in latent space towards the guide image’s point in latent space.
- We then transform that point back into input space (pixel space). i.e. we run the network backwards.
- NB. When we’re going backwards from lower dimension latent space to higher dimension input space this way, there are many possible routes that can be taken. We are basically asking ‘what point in input space maps to this point in latent space’, and it’s plausible that there is more than one. We simply find one of those points. This can be considered analogous to having a 3D object cast a 2D shadow, then we manipulate a point in the 2D shadow and try to construct a new 3D object which would have the desired 2D shadow. The solution is not unique.
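The steps above can be sketched with a single toy layer. This is a deliberately tiny stand-in, not the deepdream implementation: the ‘network’ is one random tanh layer, the sizes are small, and the helper names are assumptions. ‘Running the network backwards’ here means taking the gradient of the layer’s activation with respect to the pixels and nudging the pixels in that direction, which pushes the latent point away from the origin.

```python
import numpy as np

rng = np.random.default_rng(5)
PIXEL_DIM, LAYER_DIM = 256, 32  # small sizes for the sketch

W = rng.normal(size=(LAYER_DIM, PIXEL_DIM)) / np.sqrt(PIXEL_DIM)

def layer(x):
    """Transform a point in pixel space to a point in the layer's latent space."""
    return np.tanh(W @ x)

def dream_step(x, step_size=0.1):
    """One gradient ascent step on 0.5 * ||layer(x)||^2 with respect to x."""
    a = layer(x)
    # Chain rule: the derivative of tanh(u) is 1 - tanh(u)^2.
    gradient = W.T @ (a * (1.0 - a ** 2))
    return x + step_size * gradient

image = rng.normal(scale=0.1, size=PIXEL_DIM)
dreamed = image.copy()
for _ in range(50):
    dreamed = dream_step(dreamed)

before = np.linalg.norm(layer(image))
after = np.linalg.norm(layer(dreamed))
print(before, after)
```

Many different pixel vectors would give the same amplified activation; gradient ascent simply walks to one of them, starting from the original image.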
14.2 Style Transfer
Leon A. Gatys, Alexander S. Ecker and Matthias Bethge
- We start with two input images: a style image, and a content image. This gives us two points in input space (i.e. pixel space).
- The content image is transformed by a convolutional neural network into a latent space (i.e. one of the higher level convolution filter layers of the network). Let’s call this point C in latent space Lc.
- The style image is transformed by the same convolutional neural network into many different latent spaces (i.e. some or all of the convolution filter layers of the network) and the correlations between those points are calculated (technically a Gram matrix). This is called the Style representation of the image, and is a new point in a new latent space. Let’s call this point S in latent space Ls.
- Remember that point C is the content image transformed into a latent space Lc, and that point S is the style image transformed into a latent space Ls.
- We run the network backwards, to find a point in input space (pixel space), such that when that point is transformed into Ls it is as close to point S as possible (has that style), and when that same point is transformed into Lc it is as close to point C as possible (has that content). There are additional weighting factors to control the ‘pull’ factor of points C and S.
- Carrying on the 3D object with 2D shadow metaphor, this can be thought of as two 3D objects (a StyleObject and a ContentObject) casting two different 2D shadows from two different light sources (a StyleLight and a ContentLight). We then try to construct a new 3D object such that its shadow from the StyleLight looks like the shadow of the StyleObject (from the StyleLight), and its shadow from the ContentLight looks like the shadow of the ContentObject (from the ContentLight). Again there is no unique solution.
- Interestingly, Jamie Ryan Kiros (author of Skip-Thought Vectors and Neural Story Teller) quotes this research as inspiration for Neural Story Teller. Even though the underlying implementations are completely different, the high-level space-dimension-transformation angle is conceptually similar.
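The two ‘pulls’ towards points C and S can be written as two distance measurements added together with weights. In the sketch below the feature arrays are random stand-ins for the activations of a convolutional layer, a single layer is used for both content and style for brevity, and the weights are arbitrary; these are all assumptions for illustration rather than the settings of the original method.

```python
import numpy as np

rng = np.random.default_rng(6)

def gram_matrix(features):
    """Correlations between filter responses; features is (channels, positions)."""
    return features @ features.T / features.shape[1]

def content_loss(features, point_c):
    """How far the candidate is from point C in latent space Lc."""
    return float(np.mean((features - point_c) ** 2))

def style_loss(features, point_s):
    """How far the candidate is from point S in latent space Ls."""
    return float(np.mean((gram_matrix(features) - point_s) ** 2))

def total_loss(content_feats, style_feats, point_c, point_s,
               content_weight=1.0, style_weight=100.0):
    """Weighted sum: the weights control the 'pull' of points C and S."""
    return (content_weight * content_loss(content_feats, point_c)
            + style_weight * style_loss(style_feats, point_s))

CHANNELS, POSITIONS = 8, 16 * 16
content_features = rng.normal(size=(CHANNELS, POSITIONS))  # stands in for point C
style_features = rng.normal(size=(CHANNELS, POSITIONS))
point_s = gram_matrix(style_features)                      # point S

# Features of a candidate output image (random stand-in).
candidate = rng.normal(size=(CHANNELS, POSITIONS))
loss = total_loss(candidate, candidate, content_features, point_s)
print(loss)
```

The actual procedure then searches pixel space for an image whose features make this number as small as possible.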
15. Conclusion
To summarise, the points I hope you’ll get from this talk are:
- A vector is simply a list of numbers, which is also a point in space. (The number of numbers determines the dimensions: an n-dimensional vector is a list of n numbers, which is a point in n-dimensional space. E.g. a list of 3 numbers is a 3D vector, which is a point in 3D space. A list of 1000 numbers is a 1000D vector, which is a point in 1000D space).
- Any piece of complex data, whether it’s an image, a word, sentence, sound, gesture, molecular structure, gene, state of a Go board, input from an autonomous car’s sensors etc can be thought of as a high dimensional vector, a single point in a very high dimensional space.
- Mentally visualising high dimensional spaces is not possible. Mentally visualise in 2D or 3D, think n-D. (Use t-SNE or PCA to reduce to 2D or 3D to visualise more accurately).
- Machine Learning involves learning a function (or a series of functions, or procedures) which maps an input of any dimensions to an output of any dimensions.
- This can be thought of as transforming a point in one space, to another point in another space, quite often changing dimensions. I.e. transforming from high dimensional space to low dimensional space, or from low dimensional space to high (or from one location in a space to another location in a space of the same dimensions).
- Manifolds are a generalisation of ‘shapes’ or ‘surfaces’ in any number of dimensions. E.g. the surface of a crumpled piece of paper is a 2D manifold embedded in a 3D space.
- So we can think of Machine Learning, as learning a function to transform a point from 3D coordinates of space to the 2D coordinates on the surface of a crumpled piece of paper — or the other way around — or from one point in space to another (but in any number of dimensions, not limited to 2D or 3D).
- Deep Learning is about learning many functions which are composited. The output of one function feeds into another function which feeds into another function. We don’t need to visualise Deep Learning architectures as these huge networks with millions of nodes, we can just think of them as sequences of high-dimensional functions. And the algorithm tries to learn each of these functions.
- Each function (i.e. each layer), is a transformation from one space to another. A transformation between spaces and manifolds.
- We control the learning process with the constraints that we set: the training data we provide and optionally pre-process; the architecture we design; the hyper-parameters we pick; and the cost function we declare — how we define what the objective of the learning should be, and how we value ‘similarity’ or ‘success’.
- The ‘hidden’ layers of a network are ‘latent’ spaces. They may or may not mean anything to us, but they are spaces which may (or may not) be ‘meaningful’ to operate in. These are spaces in which we can add or subtract words or images, manipulate points and we might get interesting, ‘meaningful’ results.
- But it’s not the latent spaces themselves which are ‘important’, it’s the functions which transform between the latent spaces and input/output spaces that are important. The input/output spaces are the ones we actually care about, the ones which we interact with in the ‘real’ world: pixels, words, labels, sounds etc. That’s the data that we have and feed into the network, and ultimately what we want the network to produce.
- When the network ‘learns’, it learns those functions. It learns the functions to transform from one space to another. It learns how to transform from pixel space to a latent space in which we can add or subtract latent representations of images from each other and get meaningful results, like a latent representation of a ‘smile’ direction. The network also learns how to transform a point in that latent space back into pixel space, so that we can see the smiling image, and it means something to us. Or we can manually edit 20 numbers, move a point in 20D latent space, and the network can transform that point into pixel space so we can see the face that we just created. Or it learns how to transform words into points in 300D space, so that we can perform mathematical operations on the points to shift gender, or country. Or it learns how to transform an image into 2000 latent features, a point in 2000D latent space, and then transform that latent point into another space to generate text, a caption for the image. etc.
- All of these operations we perform in Deep Learning are essentially a journey through multiple dimensions and transformations in space, leaping from manifold to manifold.
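The idea of a deep network as a chain of functions, each one a transformation from one space to another, can be written down directly. The layer sizes and the helper `make_layer` below are assumptions for illustration, and the weights are random rather than learnt.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(7)

def make_layer(in_dim, out_dim):
    """Return a function that maps a point in in_dim space to out_dim space."""
    W = rng.normal(size=(out_dim, in_dim)) / np.sqrt(in_dim)
    b = np.zeros(out_dim)
    return lambda x: np.tanh(W @ x + b)

# 4096D pixel space -> 512D -> 100D latent space -> 10D output space
layers = [make_layer(4096, 512), make_layer(512, 100), make_layer(100, 10)]

def network(x):
    """Composite function: the output of each layer feeds into the next."""
    return reduce(lambda point, f: f(point), layers, x)

x = rng.normal(size=4096)            # a single point in pixel space
print(layers[0](x).shape)            # the same point after the first transformation
print(network(x).shape)              # and after the whole journey
```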
A very non-exhaustive list of related links
Spaces, dimensions, manifolds, PCA etc.
https://github.com/memo/ofxMSAWord2Vec (papers and more links here)
Thought Vectors / Neural Storyteller
Visual Semantic Alignment