A Hacker's Guide To Deep Learning's Secret Sauces: Linear Algebra

Leon Fedden
Nov 9, 2017 · 10 min read


As deep learning witches and wizards, we have many secret sauces that we need to stir into our predictive potions. As a non-mathematician and headstrong programmer, I often develop headaches reading deep learning papers and desperately search for the source code.

This series is about the journey of digging into the basics of deep learning, from the bottom up, to complement the knowledge we develop as hackers who are used to building projects with awesome libraries like TensorFlow or Keras.

One of the main sources of information for this series is The Deep Learning Book, by the heavyweights Ian Goodfellow, Yoshua Bengio and Aaron Courville. I have taken care to structure this article like the original text, but there is code and more media, as well as differing explanations. I strongly urge you to check it out as it's free and an amazing resource! I also dip into Christopher Bishop's Pattern Recognition and Machine Learning, and Linear Algebra and Its Applications by David C. Lay, Stephen R. Lay and Judi J. McDonald.


Deep learning is built upon many mathematical foundations, and one of these important frameworks is Linear Algebra. Linear Algebra objects, such as matrices and vectors, are used to represent the inputs, outputs and weights of neural networks, with a few non-linearities sprinkled in the right places throughout the model.

There are four fundamental objects used within Linear Algebra:

  1. Scalars: Simply a single number. The type should be specified where a scalar is introduced in the text; for example, it could be a real scalar (a floating point number) or a natural number scalar (an integer).
  2. Vectors: A vector is a one-dimensional list or array of numbers. If you're familiar with matrices (read on if you're not), then it is a matrix with a single column. The order of the elements matters in a vector; see Fig 1. for more details.
  3. Matrices: These are two-dimensional arrays of numbers, which means we identify each element of the matrix with two indices. Consider them a list of lists.
  4. Tensors: A multidimensional array with more than two axes, meaning there can be any number of dimensions.
Fig 1. Element position matters in vectors (and matrices and tensors!)

In the code below we can see the objects implemented in Python using the NumPy module.
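A minimal sketch of these objects in NumPy could look something like this:

```python
import numpy as np

scalar = 3.14                                   # a single real number
vector = np.array([1.0, 2.0, 3.0])              # one-dimensional array, shape (3,)
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])                 # two-dimensional array, shape (2, 2)
tensor = np.zeros((2, 3, 4))                    # three axes, shape (2, 3, 4)

print(vector.shape, matrix.shape, tensor.ndim)  # (3,) (2, 2) 3
```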

The NumPy library is a great option for optimised linear algebra computations.

So there are some basic objects — what can we do with them?

There are many matrix operations to learn about. An important operation to note is the transpose operation. Given a matrix A with N x M dimensions, the transpose of A would have M x N dimensions.

The transpose of a matrix: $(A^T)_{i,j} = A_{j,i}$.

After making the above LaTeX equation, I realised we can also see this as a gif thanks to wikimedia:

The transpose of a matrix can be thought of as a mirror image across the main diagonal (the blue line in the gif).

A scalar can be considered as a tensor with a single value, so, because there is only one entry, the transpose of a scalar is itself!
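A quick NumPy sketch of the transpose, including the scalar case, might be:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)

A_transposed = A.T               # shape (3, 2); rows become columns
print(A_transposed)

scalar = np.array(5.0)           # a zero-dimensional array
print(scalar.T == scalar)        # True: the transpose of a scalar is itself
```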

Providing that the two tensors (or matrices/vectors) are the same shape, we can add or subtract their corresponding elements. We can also add and subtract scalars to and from tensors, whereby each element is summed with the scalar. This is useful to keep in mind, as with neural networks we sum the bias values with the weights times the inputs, so this will come in handy later!

Matrix addition. Every respective element is added to give us a new matrix.

If, on the other hand, we would like to multiply a scalar and a matrix, then the result is each respective element multiplied by the scalar. Of course, we can see that the resultant matrix keeps the original matrix's dimensionality.

Scalar matrix multiplication. Here each element is multiplied (scaled) by two.
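These element-wise and scalar operations map directly onto NumPy's broadcasting; a small sketch:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

print(A + B)      # element-wise addition: [[11 22] [33 44]]
print(B - A)      # element-wise subtraction
print(A + 5)      # the scalar is added to every element
print(2 * A)      # every element is scaled (multiplied) by two
```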

How about multiplying matrices with matrices? The most intuitive version would be the element-wise product, or the Hadamard product, which is denoted $A \odot B$.

An example of the element-wise product would be:

Each respective element of the matrices is multiplied together to give us the product. Note that the dimensions of the matrices must be the same.
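In NumPy the element-wise product is simply the * operator, for example:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

hadamard = A * B          # element-wise (Hadamard) product
print(hadamard)           # [[ 5 12] [21 32]]
```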

We can also multiply matrices with slightly different dimensionalities. Given two matrices, A and B, where A has the same number of columns as B has rows, we can multiply matrix A with B. This gives us the matrix product, the matrix C. We perform this operation as follows (for the non-mathematicians, the Greek capital sigma, which looks like an E, is the notation for the sum over the k elements): $C_{i,j} = \sum_{k} A_{i,k} B_{k,j}$

A has k columns and B has k rows; for each index pair (i, j) of C we sum, over k, the products of the elements of the ith row of A and the jth column of B.

The matrix product operation is very important to understand; with neural networks, inputs are passed to each layer (as a vector or matrix) and these are multiplied with the weights matrix. The code for the matrix product operation is essential to understand if you want to build neural networks without using high-level abstraction methods that do the matrix multiplication for you. Here is the corresponding NumPy code:
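One possible sketch, first with explicit loops over i, j and k, and then with NumPy's optimised matmul:

```python
import numpy as np

def matrix_product(A, B):
    """Naive matrix product: C[i, j] = sum over k of A[i, k] * B[k, j]."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "A's columns must match B's rows"
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for kk in range(k):
                C[i, j] += A[i, kk] * B[kk, j]
    return C

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3)
B = np.array([[7.0, 8.0],
              [9.0, 10.0],
              [11.0, 12.0]])           # shape (3, 2)

print(matrix_product(A, B))            # shape (2, 2)
print(np.matmul(A, B))                 # the optimised equivalent (also A @ B)
```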

We should note that matrix product operations have some mathematical properties that are useful, such as being distributive. This means that given some matrices A, B and C, writing A(B + C) is the same as writing AB + AC. Matrix product operations are also associative; A(BC) is the same as (AB)C.
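We can sanity-check these properties numerically (allowing for a little floating point tolerance):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

print(np.allclose(A @ (B + C), A @ B + A @ C))   # distributive: True
print(np.allclose(A @ (B @ C), (A @ B) @ C))     # associative: True
```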

Identity matrices also spring up in the textbooks! These can be of any size and are structured so that they have ones on the main diagonal and zeros everywhere else.

Identity matrices come in every size. An interesting property of identity matrices is how they affect vectors when you multiply the two together:

From this we can see that the identity matrix doesn’t change any vector when we multiply that vector by its identity matrix. This holds true for n-dimensional vectors and we can see the code for this quickly here:
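A minimal sketch of that check with np.eye:

```python
import numpy as np

I = np.eye(4)                          # 4 x 4 identity matrix
v = np.array([2.0, -1.0, 0.5, 3.0])

print(I)
print(np.allclose(I @ v, v))           # True: multiplying by I leaves v unchanged
```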

Another important building block for our foundational understanding of linear algebra is the inverse matrix. The matrix inverse of matrix A is denoted $A^{-1}$ and is defined by $A^{-1} A = I$, where I is the appropriately sized identity matrix for A.
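NumPy can compute the inverse for us (when one exists), and we can verify the definition, for example:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

A_inv = np.linalg.inv(A)                   # raises LinAlgError if A is singular
print(np.allclose(A_inv @ A, np.eye(2)))   # True: A^-1 A = I
```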

Frequently in deep learning, it can be useful to measure the size of a vector. For example, sometimes in order to prevent overfitting, we apply regularisation, otherwise known as weight decay, to neural networks to penalise their weights for being too large. We will also need to measure how large our error vectors are to penalise our neural networks in our loss function!

Functions called norms are used for measuring the size (or length) of vectors in machine learning. They map vectors to non-negative values, and return zero only when every component of the vector is zero. A well known norm, the L2 norm, otherwise known as the Euclidean norm, is calculated as follows: $\|x\|_2 = \sqrt{\sum_i x_i^2}$

The Euclidean norm of a vector is simply the sum of each component squared, which is then square rooted.

Another popular norm is the L1 norm. This is as simple as: $\|x\|_1 = \sum_i |x_i|$

L1 just means for each component comprising the vector, sum the respective absolute values.

With L1, every time a component of x moves away from zero by some amount, the L1 norm increases by exactly that same amount.

Norms are super easy!
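A sketch of both norms in NumPy, computed by hand and then with np.linalg.norm:

```python
import numpy as np

x = np.array([3.0, -4.0])

l2_manual = np.sqrt(np.sum(x ** 2))         # 5.0
l1_manual = np.sum(np.abs(x))               # 7.0

print(l2_manual, np.linalg.norm(x, ord=2))  # 5.0 5.0
print(l1_manual, np.linalg.norm(x, ord=1))  # 7.0 7.0
```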

There are also special kinds of matrices and vectors that have uses throughout machine learning.

A vector whose length is one is called a unit vector. We can obtain the length by taking the L2 norm of the vector. We can divide a nonzero vector v by its length to obtain the unit vector; this process is known as vector normalisation.
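Normalising a vector is then a one-liner, for example:

```python
import numpy as np

v = np.array([3.0, 4.0])
unit_v = v / np.linalg.norm(v)         # divide by the L2 norm (the length)

print(unit_v)                          # [0.6 0.8]
print(np.linalg.norm(unit_v))          # 1.0: a unit vector
```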

In linear algebra a diagonal matrix is any matrix in which the entries that do not comprise the main diagonal have values of zero. The identity matrix is an example of a diagonal matrix, but we are not limited to values of one on the diagonal, or square shaped matrices.

These are all examples of diagonal matrices, as every element equals zero wherever the index i is not equal to j.

A symmetric matrix is a square matrix that is equal to its transpose. We can see that if we took the transpose of any of the below matrices, we would get the same matrix back!
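Both kinds of matrices are easy to build and check in NumPy, for example:

```python
import numpy as np

D = np.diag([1.0, 5.0, -2.0])          # square diagonal matrix built from a vector
print(D)

S = np.array([[1, 7, 3],
              [7, 4, 5],
              [3, 5, 9]])
print(np.array_equal(S, S.T))          # True: S is symmetric (equal to its transpose)
```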

We also get orthogonal vectors; these are vectors at right angles to one another. If we take the dot product of two orthogonal vectors, we get zero (because the cosine of the angle between them is zero).
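A minimal matplotlib sketch of two such vectors could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

a = np.array([2.0, 1.0])
b = np.array([-1.0, 2.0])

print(np.dot(a, b))                    # 0.0: the vectors are orthogonal

# Draw both vectors from the origin.
plt.quiver([0, 0], [0, 0], [a[0], b[0]], [a[1], b[1]],
           angles='xy', scale_units='xy', scale=1, color=['r', 'b'])
plt.xlim(-3, 3)
plt.ylim(-1, 3)
plt.gca().set_aspect('equal')
plt.show()
```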

The output of the matplotlib program. We can see that the vectors are clearly at a right angle.

If the orthogonal vectors are not just orthogonal to one another, but also have unit norm, then we have a special name for them: orthonormal.

Orthogonal matrices are useful in deep learning; they can be used to initialise the weights of neural networks to help prevent problems called exploding gradients and vanishing gradients. These problems can arise in deep recurrent neural networks where there are many successive matrix multiplications. Orthogonal matrices are square matrices that return an identity matrix when multiplied by their own transpose. Here we see the relationship of the identity matrix I and the orthogonal matrix A: $A^T A = A A^T = I$

This tells us that the transpose of an orthogonal matrix is equal to its inverse matrix: $A^T = A^{-1}$.

A code equivalent of the above orthogonal matrix maths could be:
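One possible sketch, using a 2D rotation matrix (which is orthogonal) as the example:

```python
import numpy as np

theta = np.pi / 4.0
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a 2D rotation matrix is orthogonal

I = np.eye(2)
print(np.allclose(A.T @ A, I))                    # True: A^T A = I
print(np.allclose(A @ A.T, I))                    # True: A A^T = I
print(np.allclose(A.T, np.linalg.inv(A)))         # True: the transpose equals the inverse
```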

Armed with this knowledge, we move on to eigendecomposition, where we decompose a matrix into a set of eigenvectors and eigenvalues. Eigenvalues are everywhere!

Backing up a little, in real life we encounter matrices and vectors frequently; across a broad interdisciplinary range of subjects the objects we study and the things we can do with these objects can be described by vectors and linear transformations, which are represented by matrices.

So, for many interesting situations, such as geometric transformations for 3D computer game graphics, important relationships are expressed as: $y = Mx$

Here y and x are vectors, and M is a matrix.

These important representations are not always intuitive at a glance; when we look at the matrix M, it is unlikely to be obvious what it will do to the vector x when you multiply the two together. Should we need to study the iterative effect of M, i.e. applying M k times (M to the power of k), the computations can become costly if we do them in a naive way.

Recall the identity matrix: any vector multiplied by it is simply returned unchanged. For such a matrix it is easy to imagine how a vector would change after k iterations of the multiplication; there would be no change at all, no matter how many times you multiplied the vector by the identity matrix!

An eigenvector belonging to matrix M is any vector that only gets scaled when it is multiplied with M. More formally, this leads to the equation: $Mx = \lambda x$

The symbol λ is the Greek letter lambda (the one that looks like an upside down y), and it represents a real or complex number (depending on the matrix M that we are looking at).

M may be a matrix describing a geometric operation that could, in principle, stretch and rotate vectors. The eigenvector x, however, will only get stretched, not rotated.

One meaning of the word 'eigen' is characteristic. I think it makes intuitive sense that the eigenvector x is a characteristic vector of the matrix M. So eigenvectors are a way to decompose matrices in a manner that exposes their functional properties. Let us consider that matrix M has n linearly independent eigenvectors called x1, x2 and so on. We can concatenate all of the eigenvectors as the columns of a matrix X, and all of the eigenvalues can be formed into a vector λ; we can then compute the eigendecomposition of M as: $M = X \, \mathrm{diag}(\lambda) \, X^{-1}$

The decomposition reveals useful facts about the matrix; for example, if any of the eigenvalues are zero, then the matrix is singular — which means that the matrix is not invertible and therefore doesn’t have a matrix inverse.

Once we have the eigenvalues of our square matrix, we can calculate the determinant (which is the product of the eigenvalues). The determinant describes how much a transformation expands or contracts a space. If the determinant is zero, then space has entirely contracted along at least one dimension, whereas if it is one, then the transformation preserves volume. Summing up: the determinant is a method to map a square matrix to a scalar that describes the amount of change in the space after applying the matrix to it.
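A sketch of eigendecomposition and the determinant with NumPy:

```python
import numpy as np

M = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(M)   # columns of `eigenvectors` are the eigenvectors

# Reconstruct M = X diag(lambda) X^-1.
X = eigenvectors
reconstructed = X @ np.diag(eigenvalues) @ np.linalg.inv(X)
print(np.allclose(reconstructed, M))           # True

# The determinant equals the product of the eigenvalues.
print(np.linalg.det(M), np.prod(eigenvalues))  # 10.0 10.0 (up to floating point error)
```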

The last topic of this article is Principal Component Analysis (PCA). PCA is used to reduce the dimensions of a dataset whilst minimising the amount of information lost. PCA is a nice example of eigendecomposition and some of the other things we have learnt about in this article. Enjoy!
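As a parting sketch, here is one minimal way to perform PCA via eigendecomposition of the covariance matrix, on made-up data:

```python
import numpy as np

# Made-up 2D data with a strong correlation between the two features.
rng = np.random.default_rng(42)
x = rng.standard_normal(200)
data = np.column_stack([x, 2.0 * x + 0.3 * rng.standard_normal(200)])

# Centre the data and compute its covariance matrix.
centred = data - data.mean(axis=0)
covariance = np.cov(centred, rowvar=False)

# The eigenvectors of the covariance matrix are the principal components.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # eigh: the covariance matrix is symmetric
order = np.argsort(eigenvalues)[::-1]                    # sort by explained variance, largest first
components = eigenvectors[:, order]

# Project onto the first principal component: 2D -> 1D with minimal information loss.
reduced = centred @ components[:, :1]
print(reduced.shape)                                     # (200, 1)
```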

Thanks for getting this far — I really appreciate you taking the time to read my article!

I would love to hear from you, please drop me a message here or on twitter, and if you liked this article then leave some claps and give me a follow!

Cheers everyone, and keep an eye out for the next article, which has an excellent chance of covering probability theory.


Leon Fedden

Wizard nerd summoning TensorFlow, C++ and Python black magic from the void. https://leonfedden.co.uk/