Transformation in neural networks
There are two big ideas I want to cover:
- neural networks as general nonlinear transformers, and
- backpropagation as a way to calculate the gradients necessary to drive gradient descent learning.
In this post I’ll explore transformation via neural nets from a conceptual, intuitive perspective. If I have time I’ll cover backprop in a second post.
A simple linear transformation
If you’ve studied linear algebra, you know that you can use it as a framework for working with linear transformations:
- Matrices encode linear transformations.
- Vectors encode individual points (both pre- and post-transform).
- Matrices can also encode sets of points.
- Apply a linear transformation to a point or set of points using matrix multiplication.
- Compose linear transformations using matrix multiplication.
It turns out that neural networks offer another framework for doing the same thing (and more too). Let’s say that we want to implement a transformation that takes a vector (x, y) and produces (2x, y). In linear algebra we’d do something like this (using (3, 1) as an example):
In a neural net, we’d do this:
In the neural network diagram above, each output unit computes a linear combination of the inputs, using the connection weights as coefficients, which is exactly what matrix multiplication computes.
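To make the correspondence concrete, here's a quick sketch in NumPy. The weights are just the matrix entries from the example above:

```python
import numpy as np

# The transformation (x, y) -> (2x, y) as a matrix.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])

x = np.array([3.0, 1.0])

# Linear-algebra view: matrix-vector multiplication.
print(A @ x)    # [6. 1.]

# Neural-net view: row i of A holds the connection weights into output
# unit i, and each unit computes a weighted sum of the inputs.
output = np.array([A[0, 0] * x[0] + A[0, 1] * x[1],
                   A[1, 0] * x[0] + A[1, 1] * x[1]])
print(output)   # [6. 1.] -- same thing
```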
But this is pretty boring, in a couple of ways. First, when we say linear transformation here, we’re talking about linear transformations in the strict sense. You know the equation y = 3x + 4? In school we called it a linear equation, but (surprise) the transformation involved isn’t strictly speaking a linear transformation, because it has a translation (+4), and linear transformations don’t do that. Linear transformations always carry 0 in the source space to 0 in the target space. But it feels like we should be able to model translations.
In mathematical language, we'd like to handle affine transformations, which are basically linear transformations with translations allowed.
The other thing that’s boring about our neural net example above is that most of the phenomena we’d want to model involve nonlinear relationships. But we can’t squeeze nonlinearity out of linear transformations.
Back in math-speak, we'd say that linear transformations are closed under composition. No matter how many matrices we multiply together, we end up with another matrix, and a matrix always encodes a linear transformation.
Let’s look at how we can tweak our neural net to solve these problems. We’ll start with translations.
Introducing translations using bias inputs
Let’s continue with the examples we used above. With linear algebra, we usually handle affine transformations using vector addition:
(Incidentally, there’s another way to do this using augmented matrices, but using vector addition is more typical.)
How might we do that with a neural network? The trick is something called bias inputs:
Like before, each output unit computes a linear combination of its inputs and incoming weights. This time, though, the units also receive a constant bias input (fixed at 1), which each output unit can weight independently to achieve the effect of a translation vector. Here the first output unit weights the bias by 0, zeroing it out, while the second weights it by 4, adding 4 to that output.
By the way, these are called bias inputs because the weighted bias input determines the baseline value for the unit. You can think of it as being like the b term in the standard y = mx + b equation.
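Here's the idea in NumPy, a sketch assuming the affine map (x, y) → (2x, y + 4) from the running example. It also shows the augmented-matrix version, which is exactly what a constant bias input of 1 amounts to:

```python
import numpy as np

W = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([0.0, 4.0])   # bias weights: 0 for the first unit, 4 for the second

x = np.array([3.0, 1.0])

# Linear-algebra view: linear transformation plus a translation vector.
print(W @ x + b)           # [6. 5.]

# Neural-net view: append a constant bias input of 1 and fold the
# translation into the weight matrix.
W_aug = np.column_stack([W, b])   # each row gains its bias weight
x_aug = np.append(x, 1.0)         # the constant bias input
print(W_aug @ x_aug)              # [6. 5.]
```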
So if you wondered what bias inputs are for, now you know. Let’s turn now to true nonlinearity.
Introducing nonlinearity using activation functions
Now we have affine transformations, but we’re never going to build Terminator robots if we stop there. We need some way to introduce real nonlinearity.
Enter the activation function. The idea is that we can endow each non-input unit with a bit of post-processing, running the linear combination through a nonlinear activation function and treating that as the unit output. It looks like this:
There are many activation functions to choose from. It used to be that you just used a sigmoid by default; nowadays people reach for different functions for different purposes (ReLU and its variants being common choices). We won't get into those details here. The key point is that a nonlinear activation function is what lets us escape linearity.
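As a sketch, here's the affine layer from earlier with two common activation functions bolted on. The weights are the same illustrative values as before:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

W = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([0.0, 4.0])
x = np.array([3.0, 1.0])

z = W @ x + b        # pre-activation: the affine part, [6. 5.]
print(sigmoid(z))    # each output squashed into (0, 1)
print(relu(z))       # [6. 5.] here, since both pre-activations are positive
```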
Before we close, let’s look at one more thing: multilayer architectures.
It turns out that to do anything interesting, we want at least one so-called hidden layer of units sitting between the input and output layers. This makes it possible to learn functions that a single layer can't represent at all; the classic example is XOR, which is not linearly separable.
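To see this concretely, here's a tiny two-input network with one ReLU hidden layer that computes XOR exactly. The weights are hand-picked for illustration; training would find some equivalent solution:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked weights (one of many possible solutions):
# hidden unit 1 fires when at least one input is on,
# hidden unit 2 fires only when both inputs are on,
# and the output unit subtracts twice the second from the first.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def xor_net(x):
    h = relu(W1 @ x + b1)   # hidden layer
    return w2 @ h           # linear output unit

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x, dtype=float)))
# (0,0) -> 0.0, (0,1) -> 1.0, (1,0) -> 1.0, (1,1) -> 0.0
```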
Anyway, here’s what a feedforward network with at least one hidden layer (such a network is also known as a multilayer perceptron) looks like:
Under certain fairly general assumptions, a single hidden layer (with enough hidden units) turns out to be sufficient to make our feedforward network a universal approximator. We don't even need nonlinear activation functions on the output units; in the diagram above, I removed them.
With deep learning, the idea is to have lots of hidden layers. From a theoretical perspective they aren’t required, but it turns out that from a practical perspective, they make all the difference. Each successive layer provides support for representations more abstract than those that live at the previous layer. So for example your input layer might have a bunch of pixels. Then your next layer represents edges, and the next represents shapes, and so on. By the time you get to the final layers you might have representations for things like cat-ness or Volkswagen-hood.
Anyway we’re getting ahead of ourselves here, so I’ll stop for now.