This story summarizes my intuition about Chapter 2 (Linear Algebra) of the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Vectors and Tensors
I will start with these two crucial building blocks of any system. I am skipping scalars and matrices, as I think they are already familiar to most readers.
A vector is an array of numbers. We can think of it as a set of coordinates, with each value lying along a different axis. For example, x = [1,2] is a vector with values along two different axes.
I began with vectors in order to compare them with tensors, since this comparison always confused me. A tensor is an array of numbers arranged on a regular grid with a variable number of axes; in the book's usage, it is an array with more than two axes. So a 2x2x2 array is a tensor, while x = (1,2,3,..) is still a vector no matter how many values it holds.
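To make the difference concrete, here is a minimal sketch in plain Python (my own illustration, not code from the book). It assumes arrays are stored as regular, non-empty nested lists, and the helper name num_axes is made up for this example. The number of axes is just the nesting depth.

```python
def num_axes(array):
    """Count the axes of a regular, non-empty nested list (its nesting depth)."""
    depth = 0
    while isinstance(array, list):
        depth += 1
        array = array[0]  # step one level deeper along the first element
    return depth

scalar = 5                                      # 0 axes
vector = [1, 2]                                 # 1 axis
matrix = [[1, 2], [3, 4]]                       # 2 axes
tensor = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # 3 axes

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor", tensor)]:
    print(name, "has", num_axes(value), "axes")
```

A long vector such as [1, 2, 3, 4, 5] still has one axis; what makes something a tensor in this sense is the number of axes, not the number of values.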
One of the main operations in deep learning is matrix multiplication. Consider input data [0,1,2] and a dense layer with weight vector transposed([0.1, 0.2, 0.3]) (I transpose the weights to turn the row vector into a column vector, so that the matrix multiplication is defined).
This operation translates to A = XB, where X is the input and B is the weights, and is performed as follows:
0*0.1 + 1*0.2 + 2*0.3 = 0.8. This is called the dot product between the row vector of our data and the column vector of the weights. This is the core computation that happens when you add a dense layer to your network. It differs from what is called the Hadamard product (element-wise product), which is performed as follows:
result = [0*0.1, 1*0.2, 2*0.3] = [0, 0.2, 0.6]
For more info about the difference between matrix multiplication and the Hadamard product, link
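The two products above can be sketched in a few lines of plain Python. This is only an illustration using the same numbers as the example; the variable names are my own.

```python
x = [0, 1, 2]          # input data (row vector)
w = [0.1, 0.2, 0.3]    # dense-layer weights (column vector)

# Dot product: multiply matching elements, then add everything up (one number).
dot = sum(xi * wi for xi, wi in zip(x, w))

# Hadamard product: multiply matching elements and keep them separate (a vector).
hadamard = [xi * wi for xi, wi in zip(x, w)]

print("dot product:", dot)
print("Hadamard product:", hadamard)
```

Notice that summing the Hadamard product gives back the dot product, which is a handy way to remember how the two are related.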
Matrix multiplication is not commutative; in general,
AB != BA
However, the dot product between two vectors is commutative:
transposed(x) * y = transposed(y) * x
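A quick way to convince yourself is to try it. The sketch below uses two small matrices and two vectors that I picked arbitrarily; matmul and dot are tiny helpers written for this example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def dot(u, v):
    """Dot product of two vectors of the same length."""
    return sum(ui * vi for ui, vi in zip(u, v))

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

AB = matmul(A, B)  # swaps the columns of A
BA = matmul(B, A)  # swaps the rows of A
print("AB =", AB)
print("BA =", BA)

x = [1, 2, 3]
y = [4, 5, 6]
print("x.y =", dot(x, y), "and y.x =", dot(y, x))
```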
What are norms?
Sometimes we need to measure the size of a vector, and a norm function does exactly that. A norm maps a vector to a non-negative value, which we can read as the distance from the origin to the point x.
One of the most used norms in machine learning is the L2 norm, also called the Euclidean norm. It is computed as sqrt(transposed(x) * x); the squared L2 norm, which is often more convenient to work with, is simply transposed(x) * x.
Another norm is the L1 norm, the sum of the absolute values of the elements. It is used when the difference between zero and non-zero elements matters, because it grows at the same rate everywhere, including near zero. For example, if a value in a vector changes from 0 to 10, L1 increases by 10.
Another norm is the max norm, which is the largest absolute value of any element in the vector.
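Here is a small sketch of the three norms in plain Python. The function names and the sample vectors are my own choices for illustration.

```python
import math

def l1_norm(x):
    """Sum of absolute values."""
    return sum(abs(v) for v in x)

def l2_norm(x):
    """Euclidean length: square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in x))

def max_norm(x):
    """Largest absolute value of any element."""
    return max(abs(v) for v in x)

x = [3, -4, 0]
y = [3, -4, 10]   # same vector, but the last element moved from 0 to 10

print("L1:", l1_norm(x), "L2:", l2_norm(x), "max:", max_norm(x))
print("L1 grows by", l1_norm(y) - l1_norm(x), "when one element goes from 0 to 10")
```

The absolute values matter: the -4 contributes 4 to both the L1 norm and the max norm.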
Special kinds of matrices and vectors
A symmetric matrix is equal to its own transpose: A = transposed(A)
Orthogonal vectors are vectors with norms > 0 that are at 90 degrees to each other. This means that if vector x is orthogonal to vector y, transposed(x) * y = 0. If both vectors also have unit norm, they are called orthonormal.
From this come orthogonal matrices: square matrices whose columns (and rows) are mutually orthonormal. So if A is an orthogonal matrix, transposed(A) * A = A * transposed(A) = Identity matrix, which means the inverse of A is simply transposed(A).
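As a sanity check, here is a sketch with a matrix I chose for illustration: a 90 degree rotation, which is a standard example of an orthogonal matrix. The helpers transpose, matmul and dot are written just for this example.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Two orthogonal vectors: their dot product is 0.
x = [1, 2]
y = [-2, 1]
print("x.y =", dot(x, y))

# Q rotates the plane by 90 degrees; its columns are orthonormal.
Q = [[0, -1],
     [1, 0]]
identity = [[1, 0], [0, 1]]

print("transposed(Q) * Q =", matmul(transpose(Q), Q))
print("Q * transposed(Q) =", matmul(Q, transpose(Q)))
```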
Think of a decomposition as breaking a large object into smaller, more intuitive parts. The book gives a very good example of this: we can represent the number 12 as 2*2*3. What do these small parts tell us? For instance, that 12 is divisible by 2 and by 3 but not by 5.
The same idea applies to eigendecomposition, where we break a matrix down into eigenvectors and eigenvalues. For a matrix A, they are defined by a simple equation:
A*x = lambda*x
Let’s walk through an example to see where this comes from.
Now we have a matrix A and we multiply it by a vector x. Notice that the result is 3 times the vector, so we can rewrite the whole operation as follows.
This tells us that 3 is an eigenvalue and the vector [1,1,2] is an eigenvector. An eigenvector of matrix A is a non-zero vector x that satisfies the following equation for some number lambda:
Ax = lambda x
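There are infinitely many eigenvectors, since any non-zero multiple of an eigenvector is also an eigenvector, but only a finite number of eigenvalues (at most n for an n*n matrix). Each eigenvalue has its own set of eigenvectors. When working by hand, we find the eigenvalues first and then use them to get the eigenvectors.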
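The check A*x = lambda*x is easy to run yourself. The sketch below uses a small 2x2 matrix of my own (it is not the matrix from the example above), whose eigenvalues 3 and 1 can be worked out by hand.

```python
def matvec(a, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def scale(c, x):
    """Multiply a vector by a number."""
    return [c * xi for xi in x]

# Hypothetical matrix chosen for illustration.
A = [[2, 1],
     [1, 2]]

x1 = [1, 1]    # eigenvector for eigenvalue 3
x2 = [1, -1]   # eigenvector for eigenvalue 1

print("A*x1 =", matvec(A, x1), "and 3*x1 =", scale(3, x1))
print("A*x2 =", matvec(A, x2), "and 1*x2 =", scale(1, x2))

# Any non-zero multiple of an eigenvector is an eigenvector for the same eigenvalue.
x1_scaled = scale(5, x1)
print("A*(5*x1) =", matvec(A, x1_scaled), "and 3*(5*x1) =", scale(3, x1_scaled))
```

The last line is why there are infinitely many eigenvectors but only two eigenvalues here.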
Singular value decomposition
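SVD follows the same idea, but it factors the matrix into a product of 3 matrices, U*E*transposed(V). For a matrix A of size m*n, U is of size m*m, E is a diagonal matrix of size m*n holding the singular values, and V is of size n*n. The columns of V are the eigenvectors of transposed(A)*A, and the columns of U are the eigenvectors of A*transposed(A).
In the special case where A is symmetric, A*transposed(A) = transposed(A)*A, so U and V are built from the same eigenvectors (and U = V when A is also positive semidefinite). For a general matrix, U and V are different.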
So now we have the 3 components of SVD and we can compute it easily!
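To see the three pieces fit together, here is a sketch that verifies an SVD rather than computing one. The matrix A is my own pick, and I worked out U, E and V for it by hand (transposed(A)*A has eigenvalues 45 and 5, so the singular values are sqrt(45) and sqrt(5)). The code only multiplies the pieces back together.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def close(m1, m2, tol=1e-9):
    """True when two matrices agree entry by entry up to rounding."""
    return all(abs(p - q) < tol for r1, r2 in zip(m1, m2) for p, q in zip(r1, r2))

A = [[3, 0],
     [4, 5]]

s10, s2 = math.sqrt(10), math.sqrt(2)
U = [[1 / s10, 3 / s10],      # columns: eigenvectors of A * transposed(A)
     [3 / s10, -1 / s10]]
E = [[math.sqrt(45), 0],      # singular values on the diagonal
     [0, math.sqrt(5)]]
V = [[1 / s2, 1 / s2],        # columns: eigenvectors of transposed(A) * A
     [1 / s2, -1 / s2]]

rebuilt = matmul(matmul(U, E), transpose(V))
print("U * E * transposed(V) =", rebuilt)
print("matches A:", close(rebuilt, A))
print("U equals V:", close(U, V))
```

Since this A is not symmetric, U and V come out different, which is the general situation.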
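The trace maps a matrix to a scalar value by summing its diagonal entries, which also equals the sum of all eigenvalues of the matrix. (The determinant is the analogous product: it equals the product of all the eigenvalues.)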
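Here is a tiny check of how a matrix's eigenvalues collapse into a single scalar. It is my own illustration on a hypothetical 2x2 matrix whose eigenvalues (3 and 1) are known by hand: the sum of the diagonal entries, called the trace, equals the sum of the eigenvalues, and the determinant equals their product.

```python
# Hypothetical matrix; its eigenvalues, worked out by hand, are 3 and 1.
A = [[2, 1],
     [1, 2]]
eigenvalues = [3, 1]

# Trace: sum of the diagonal entries.
trace = sum(A[i][i] for i in range(len(A)))

# Determinant of a 2x2 matrix: ad - bc.
determinant = A[0][0] * A[1][1] - A[0][1] * A[1][0]

print("trace =", trace, "sum of eigenvalues =", sum(eigenvalues))
print("determinant =", determinant,
      "product of eigenvalues =", eigenvalues[0] * eigenvalues[1])
```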
Simple machine learning algorithm
PCA (principal component analysis) simply encodes large data as smaller data. How is that possible? We represent the same data in a lower-dimensional space so that it takes less room. The catch is that we sometimes lose important features of the data, so we want the encoding that loses as little information as possible. The PCA encoding function is simply y(x) = transposed(D)*x, where D is the decoding matrix (its columns are orthonormal) and x is a point of our original data. That raises an important question.
How do we actually get D? In plain English, we pick the D for which the L2 distance between each point xi and its reconstruction r(xi) (xi encoded into the lower dimension and decoded back) is minimal. I hear you: how could I compute r(xi) in the first place? I will tell you.
The main equation is,
D* = argmin over D of sqrt(sum over i of ||xi - r(xi)||²), subject to transposed(D)*D = Identity matrix, where r(xi) is the reconstructed point and ||.|| is the L2 norm
and in fact: r(x) = D*transposed(D)*x
so we can substitute this into the first equation and solve for the D that meets the requirement. The book works through a long derivation of this; I am only giving the main intuition behind it.
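To make this less abstract, here is a rough sketch for the simplest case: 2-D points compressed down to a single number each, so D is one unit vector. The book's derivation ends with D being the eigenvector of transposed(X)*X with the largest eigenvalue. I find that eigenvector with power iteration, which is my own substitution to keep the code short, and the data points are made up.

```python
import math

# Hypothetical 2-D data, centered on the origin and lying roughly along y = x.
points = [[2.0, 1.9], [-1.0, -1.1], [0.5, 0.6], [-1.5, -1.4]]

def top_direction(points, iterations=100):
    """Power iteration on S = transposed(X) * X; returns a unit vector d."""
    sxx = sum(p[0] * p[0] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    syy = sum(p[1] * p[1] for p in points)
    d = [1.0, 0.0]  # starting guess, assumed not orthogonal to the answer
    for _ in range(iterations):
        nx = sxx * d[0] + sxy * d[1]
        ny = sxy * d[0] + syy * d[1]
        norm = math.hypot(nx, ny)
        d = [nx / norm, ny / norm]  # renormalize so d stays a unit vector
    return d

def encode(d, x):
    """transposed(D) * x: one number per point."""
    return d[0] * x[0] + d[1] * x[1]

def reconstruct(d, x):
    """D * transposed(D) * x: back in 2-D, projected onto the line along d."""
    c = encode(d, x)
    return [d[0] * c, d[1] * c]

d = top_direction(points)
total_error = sum((x[0] - r[0]) ** 2 + (x[1] - r[1]) ** 2
                  for x in points
                  for r in [reconstruct(d, x)])

print("D =", d)
print("codes =", [encode(d, x) for x in points])
print("total squared reconstruction error =", total_error)
```

Because the points sit close to a line, one number per point is almost enough to rebuild them, and the reconstruction error stays small.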
I hope you liked this first part of a long upcoming series!