Mathematics for Machine Learning: Linear Algebra

Isaac Ng
17 min read · Apr 25, 2018


This next part of my journey is about making sure I have a solid understanding of the Mathematics required for Machine Learning. These are concepts I have studied at some point in my education, so this will serve as a refresher and perhaps give me new insights as well.

I enrolled in the Mathematics for Machine Learning specialization by Imperial College London on Coursera. This course on Linear Algebra is the first part out of three. So, I took notes and practiced visualizing concepts through the use of analogies and this is what I have learnt from this course.

Vectors

Vectors can be thought of as lists of attributes. In the geometric interpretation, a vector has a length and a direction. Vectors can be scaled; the result is called a scalar multiple of the vector. Vectors can also be added together.

Vector addition is associative: (r+s)+t = r+(s+t). Subtraction behaves the same way, since subtracting a vector is just adding its negation.

Order of addition does not change resultant vector

Dot product, Modulus, and the projection of Vectors

Geometrically, the dot product of two vectors is the magnitude of one vector in the direction of the other, multiplied by the length of that other vector.

Firstly the dot product of two vectors can be computed as such:

r·s = r1s1 + r2s2 + … + rnsn

The length of a vector can also be called its size or magnitude. This is given by |r| for vector r. This magnitude can be calculated using the Pythagoras Theorem where a² + b² = c².

Example of Pythagoras Theorem with a = 3, b = 4, and c = 5

Hence, |r| = √(a²+b²) for r = [a, b].

A special case to note is the dot product of a vector by itself. The dot product of a vector with itself is simply the length of the vector squared. r·r = |r|².

Let r be the vector [r1, r2]. r·r would then be r1r1 + r2r2, or r1² + r2². Recalling the Pythagoras Theorem, this gives us the length of the vector squared, |r|².

Vector dot products are commutative, meaning r·s = s·r.

Also, they are distributive over addition, r·(s+t) = r·s + r·t.

Lastly, they are associative over scalar multiplication: r·(as) = a(r·s).

One vector’s magnitude in the other vector’s direction can also be thought of as the shadow the first vector casts on the other. For s projected onto r, this length is |s|cos(θ).

Given two vectors r and s, we can use the cosine rule to find the dot product r·s.

The cosine rule is as follows: c² = a² + b² - 2ab cos(θ). Using vector subtraction, c corresponds to |r-s|, so c² = |r-s|². Then, after substituting the vectors in, |r-s|² = |r|² + |s|² - 2|r||s|cos(θ).

As we know that r·r = |r|², (r-s)·(r-s) = r·r - s·r - r·s + s·s = |r|² - 2s·r + |s|²

Since (r-s)·(r-s) = |r-s|² similar to what was shown above, we get the following:

|r|² + |s|² -2|r||s|cos(θ) = |r|² -2s·r +|s|²

-2|r||s|cos(θ) = -2s·r

Hence we get r·s = |r||s|cos(θ) where θ is the angle between r and s.

|s|cos(θ) is the part of S that is projected in the direction of R

From the graph, and knowing cosine is given by the adjacent length divided by the hypotenuse, we get |s|cos(θ) as the scalar projection of s onto r. So r·s = |r| * scalar projection, or (r·s)/|r| = |s|cos(θ).

The scalar projection gives us the length of the projection. The vector given by the projection is the vector projection of one vector onto another. The vector projection of s on r is given by the scalar projection as a scalar multiple of a unit length vector in the direction of r.

Getting the unit length vector of a vector is just the vector itself divided by its length, r/|r|.

Vector projection = Scalar projection * unit vector in the direction of r

= [(r·s)/(|r||r|)] * r

= [(r·s)/(r·r)] * r
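
As a quick illustration, here is a minimal sketch of these formulas (assuming NumPy; the vectors r = [3, 4] and s = [2, 1] are just example values, not from the course):

```python
import numpy as np

r = np.array([3.0, 4.0])
s = np.array([2.0, 1.0])

dot = np.dot(r, s)                      # r·s = r1*s1 + r2*s2
magnitude_r = np.linalg.norm(r)         # |r| = sqrt(r·r)

scalar_proj = dot / magnitude_r         # |s|cos(θ) = (r·s)/|r|
vector_proj = (dot / np.dot(r, r)) * r  # [(r·s)/(r·r)] * r

print(dot, magnitude_r, scalar_proj, vector_proj)
```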

Bases and Natural Bases

A basis is a set of vectors that defines a coordinate space: every vector in the space can be written as a combination of the basis vectors.

Basis sets consist of basis vectors that are linearly independent, meaning no basis vector is a linear combination of the other vectors in the set. For example, for a basis set in three-dimensional space, the vectors are linearly independent when bi = x·bj + y·bk does not hold true for any scalars x and y and any assignment of the vectors as bi, bj, and bk.

Another way to determine linear independence is to set up the simultaneous equations a·v1 + b·v2 + c·v3 = [0, 0, 0] and solve for a, b, and c. If the only solution is the trivial one, a = b = c = 0, the vectors are linearly independent.
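
As a small sketch of this test (assuming NumPy; v1, v2, v3 are made-up example vectors), the vectors are independent exactly when only the trivial solution exists, i.e. when the matrix that has them as columns is of full rank:

```python
import numpy as np

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([1.0, 1.0, 0.0])
v3 = np.array([1.0, 1.0, 1.0])

# a*v1 + b*v2 + c*v3 = 0 has only the trivial solution a = b = c = 0
# exactly when the matrix whose columns are v1, v2, v3 has full rank.
M = np.column_stack([v1, v2, v3])
print(np.linalg.matrix_rank(M) == 3)   # True: linearly independent
```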

The basis vectors in a basis set must also span the space, meaning the n vectors in the set must reach every point of the n-dimensional space.

Natural Bases or Standard Bases are special bases whose vectors have unit length. An orthonormal basis vector set consists of such unit-length vectors where, additionally, all the vectors are perpendicular to each other.

Coordinate Systems and their Basis Vectors

The standard coordinate system used in two-dimensional graphs makes use of x as the horizontal coordinate and y as the vertical one. Each coordinate system uses basis vectors, the fundamental units used to measure coordinates in the system. For the standard two-dimensional coordinate system, the two basis vectors are ê1 = [1, 0] and ê2 = [0, 1].

Basis vectors in a coordinate system can be used to describe vectors. For example, a vector r described in the standard coordinate system can be stated as such: r = [3, 4] = 3ê1 + 4ê2.

Vectors can be translated from one basis into another. In an alternative basis b, we have two new basis vectors b1 and b2.

If these two basis vectors are perpendicular to each other, then it is possible to use projection to find r in the alternative basis. To check perpendicularity, we ensure that (b1·b2) = 0 or cos(θ) = 0.

To find r in the alternative basis b, we take the dot product of r with each of the two basis vectors of b. We take (r·b1)/|b1|² as R1 and (r·b2)/|b2|² as R2. The vector in the new basis would then be R = [R1, R2]. The general rule is, for each dimension j of the basis vector space, Rj = (r·bj)/|bj|².
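
Here is a small sketch of that rule (assuming NumPy; the orthogonal basis b1 = [2, 1], b2 = [-2, 4] and the vector r = [3, 4] are example values chosen so that b1·b2 = 0):

```python
import numpy as np

r = np.array([3.0, 4.0])
b1 = np.array([2.0, 1.0])
b2 = np.array([-2.0, 4.0])

assert np.dot(b1, b2) == 0            # the rule needs orthogonal basis vectors

R1 = np.dot(r, b1) / np.dot(b1, b1)   # (r·b1)/|b1|²
R2 = np.dot(r, b2) / np.dot(b2, b2)   # (r·b2)/|b2|²
R = np.array([R1, R2])

print(R)                              # r in the new basis
print(R1 * b1 + R2 * b2)              # sanity check: rebuilds the original r
```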

Basis vectors in one coordinate system can also be described in another coordinate system. The transformation of these vectors form a matrix which can then be used to translate vectors from one basis to another basis.

Matrices

A matrix is a rectangular grid of values of size m by n, where m is the number of rows and n is the number of columns.

Matrix multiplication works by multiplying each element in the i-th row of the first matrix with its corresponding element in the j-th column of the second matrix, summing the results, and placing the sum in the i-th row and j-th column of the resultant matrix. Repeat for all entries to complete the matrix multiplication.

For the matrix multiplication of an m1 by n1 matrix with an m2 by n2 matrix, the resultant matrix will be of size m1 by n2. Also, for the multiplication to be defined, n1 must equal m2.

([a, b], [c, d])·(x1, x2) = (ax1 + bx2, cx1 + dx2)

Another way to look at matrices would be to view them as transformations of basis vectors. For a vector in the coordinate system with basis vectors [1, 0] and [0, 1], a matrix multiplication with ([a, b], [c, d]) transforms the two basis vectors into [a, c] and [b, d] respectively. Effectively, this translates the vector into the alternate coordinate system with the [a, c] and [b, d] basis vectors.
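
A minimal sketch of this view (assuming NumPy; the matrix entries are example values): the columns of the matrix are exactly where the basis vectors [1, 0] and [0, 1] land.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])            # example matrix ([a, b], [c, d])

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

print(A @ e1)                         # [2, 0] -- the first column [a, c]
print(A @ e2)                         # [1, 3] -- the second column [b, d]
print(A @ np.array([3.0, 4.0]))       # [a*3 + b*4, c*3 + d*4] = [10, 12]
```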

Special Types of Matrices

Identity Matrix

Denoted by ‘I’, the identity matrix consists of ones along the leading diagonal and zeros everywhere else. The leading diagonal starts from the top left cell going to the bottom right cell.

Mirrors

These are matrices that reflect vectors across a line through the origin. All these examples assume the standard starting basis vector set of [1, 0] and [0, 1].

([0, 1], [1, 0]) reflects vectors across the leading diagonal, the line through the origin angled at 45 degrees to the horizontal (y = x).

([1, 0], [0, -1]) reflects vectors across the horizontal axis.

([-1, 0], [0, 1]) reflects vectors across the vertical axis.

([0, -1], [-1, 0]) reflects vectors across the counter-diagonal (the line y = -x).

Shears

These matrices leave one basis vector unchanged and shift the other in a direction parallel to the unchanged one.

([1, 0], [1, 1]) leaves [0, 1] unchanged and shifts the [1, 0] basis vector one unit upward, to [1, 1].

Rotations

These matrices rotate the basis vectors around the origin. The matrix corresponding to the desired rotation is given by ([cos(θ), -sin(θ)], [sin(θ), cos(θ)]) where θ is the angle of rotation anticlockwise.

The order in which these transformations are applied affects the result: matrix multiplication is not commutative.
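
A quick sketch of this (assuming NumPy), comparing a 90-degree anticlockwise rotation and an example shear applied in both orders:

```python
import numpy as np

rotation = np.array([[0.0, -1.0],
                     [1.0,  0.0]])    # anticlockwise rotation by 90 degrees
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])        # example shear

print(shear @ rotation)               # rotate first, then shear
print(rotation @ shear)               # shear first, then rotate -- different result
```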

Inverse of a Matrix

Multiplying a matrix by its inverse gives the Identity Matrix. This particular matrix multiplication is commutative: AA⁻¹ = A⁻¹A = I.

Take an example where there is a matrix A, and two vectors, r and s, where r is unknown and s is known. If Ar = s, finding the value of the vector r can be done via Gaussian Elimination and Back Substitution.

After performing Gaussian Elimination on a matrix, the matrix will be in Row Echelon Form. This is a triangular matrix where all entries below the leading diagonal have a value of zero. Solve for the vector values by back substituting the lower rows into the upper rows.

Original Matrix

Perform Gaussian Elimination:

Subtract row 1 from row 2 and row 3.

Row 3 ends up being (0, 0, -1)[c] = [-2] so we multiply it by -1.

This brings us to the Row Echelon Form.

Row Echelon Form

Perform Back Substitution:

Substituting row 3 into row 2, row 2 becomes (0, 1, 0)[b] = [4]. Substituting row 2 into row 1, row 1 becomes (1, 0, 0)[a] = [5].

After these substitutions, we will have attained the Identity Matrix.

Identity Matrix

While doing the above manipulations, make sure to perform the same operation on the vector on the right. The resultant vector will be the solution for vector r.

Using the same steps, we can find the inverse of a matrix. Using the equation AA⁻¹ = I, we set each entry in A⁻¹ as an unknown. The solution can be found one column at a time using the method detailed above.

Alternatively, simultaneously solving for all the vector solutions is also possible. First start with the matrix on the left and the identity matrix on the right. Perform all elimination and substitution actions on both the left and right matrix. When the left matrix is transformed into an identity matrix, the right matrix will be the inverse of the original matrix.
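
Here is a hedged sketch of both ideas (assuming NumPy) on a made-up 3x3 system; the actual matrices shown in the figures above are not reproduced here, but this system happens to give the same solution a = 5, b = 4, c = 2:

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0]])       # hypothetical matrix A
s = np.array([11.0, 15.0, 13.0])      # hypothetical known vector s

r = np.linalg.solve(A, s)             # Gaussian elimination + back substitution
A_inv = np.linalg.inv(A)              # conceptually: the same elimination applied to I

print(r)                              # [5. 4. 2.]
print(A_inv @ s)                      # the same r, recovered via the inverse
print(np.round(A @ A_inv, 10))        # the identity matrix
```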

Determinant of a Matrix

The determinant of a matrix A = ([a, b], [c, d]), denoted |A|, is ad - bc. It can be interpreted as the area of the parallelogram spanned by the transformed basis vectors.

For a 2x2 matrix, the inverse can be constructed by swapping the a and d entries, multiplying the b and c entries by -1, and scaling the whole matrix down by a factor equal to the determinant: A⁻¹ = (1/(ad - bc)) ([d, -b], [-c, a]).

The determinant can also be thought of as the factor by which the area enclosed by the original basis vectors has been scaled by the transformation.
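
A small sketch of both facts (assuming NumPy; a, b, c, d are example values):

```python
import numpy as np

a, b, c, d = 1.0, 2.0, 3.0, 4.0
A = np.array([[a, b],
              [c, d]])

det = a * d - b * c                       # same value as np.linalg.det(A)
A_inv = (1.0 / det) * np.array([[ d, -b],
                                [-c,  a]])

print(det, np.linalg.det(A))
print(A_inv)
print(np.round(A @ A_inv, 10))            # the identity matrix
```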

Reduced Dimensionality

Multiplying a matrix by another matrix whose vectors are not linearly independent causes a loss of information, as the vector space is reduced to a lower dimension. A matrix whose vectors are not linearly independent is called a singular matrix and does not have an inverse. Because of this singular nature, that is, the lack of an inverse, the lost information cannot be recovered by bringing the transformed space back to the higher-dimensional space.

Einstein Summation Convention

Let matrix A = ([a11, a12, … a1n], [a21, a22, … a2n] … [an1, an2, … ann]), an n by n matrix. The first number denotes the row while the second number denotes the column. Let matrix B be another n by n matrix with the same convention. AB would be the matrix multiplication of matrix A and matrix B.

Take for example a cell in the matrix in the 2nd row and 3rd column, ab23.

ab23 = a21b13 + a22b23 + … + a2nbn3

Generalizing, for the i-th row and k-th column of the product matrix, we find that abik = ∑j(aij·bjk), summing over the index j. In the Einstein Summation Convention, the summation sign is dropped and this is written as just abik = aijbjk.
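
A short sketch of the convention (assuming NumPy): np.einsum takes the index expression literally and sums over the repeated index j, which reproduces ordinary matrix multiplication.

```python
import numpy as np

A = np.random.rand(3, 3)
B = np.random.rand(3, 3)

AB_einsum = np.einsum('ij,jk->ik', A, B)   # (AB)ik = aij * bjk, summed over j
AB_matmul = A @ B

print(np.allclose(AB_einsum, AB_matmul))   # True
```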

Changing Basis

Translating from a standard basis of [1, 0] and [0, 1] to an alternative basis of [3, 1] and [1, 1] requires the vector to be transformed by a conversion matrix. A vector that is [3/2, 1/2] in the alternative basis would be ([3, 1], [1, 1])·[3/2, 1/2], or [5, 2], in the original basis.

Projection computations of the vector can also work if the basis system uses an orthonormal basis vector set.

An example exercise would be finding the 45-degree anticlockwise rotation of a vector whose coordinates are given in the alternative basis. Finding the matrix that performs the rotation directly in the alternative basis would be a difficult approach to the problem. We know the rotation matrix formula in the standard basis, so doing the rotation in the standard basis is simpler (see the sketch after the steps below).

  1. Basis change from alternative to standard basis
  2. Rotation in the standard basis using ([cos(θ), -sin(θ)], [sin(θ), cos(θ)])
  3. Basis change back to alternative basis
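
A sketch of the three steps (assuming NumPy), using the alternative basis [3, 1] and [1, 1] and the vector [3/2, 1/2] from the earlier example:

```python
import numpy as np

B = np.array([[3.0, 1.0],
              [1.0, 1.0]])                        # columns are the alternative basis vectors
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # step 2: rotation in the standard basis

v_alt = np.array([1.5, 0.5])                      # the vector, expressed in the alternative basis

# Steps 1-3 combined: convert to the standard basis (B), rotate (R),
# then convert back to the alternative basis (B inverse).
rotation_in_alt_basis = np.linalg.inv(B) @ R @ B
print(rotation_in_alt_basis @ v_alt)              # the rotated vector, still in the alternative basis
```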

Transposing a Matrix

The transposition of a matrix involves mirroring all the cells of the matrix along the leading diagonal so that each row becomes the corresponding column and vice versa. (Aᵀ)ij = Aji, where i is the row number and j is the column number.

For an orthogonal matrix A, its inverse is equal to its transpose, so computing the inverse just requires transposing the matrix: A⁻¹ = Aᵀ.

Gram-Schmidt Process

This process allows us to produce an orthonormal set of basis vectors from a set of linearly independent vectors that is not orthonormal.

This example will use a three-dimensional matrix. Start by labeling the three vectors, v1, v2, and v3. These vectors may or may not be perpendicular to one another, but we will assume that together, these three vectors span the three-dimensional space. The goal is to find three orthonormal basis vectors from these vectors and label them, e1, e2, and e3.

  1. Let e1 be v1/|v1|. The first vector is just normalized.
  2. v2 consists of the part parallel to e1 and the part perpendicular to e1. Let u2 be the perpendicular part.

v2 = (v2 ·e1)e1 + u2

u2 = v2-(v2 ·e1)e1

e2 = u2/|u2|

3. v3 consists of the parts parallel to e1 and e2 respectively and the part perpendicular to both of them.

v3 = (v3 ·e1)e1 +(v3 ·e2)e2 + u3

u3 = v3 - (v3·e1)e1 - (v3·e2)e2

e3 = u3/|u3|

This concludes the Gram-Schmidt process and gives us the three orthonormal basis vectors, e1, e2, and e3.
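
A sketch of these three steps (assuming NumPy; v1, v2, v3 are made-up vectors that span three-dimensional space):

```python
import numpy as np

v1 = np.array([1.0, 1.0, 0.0])
v2 = np.array([1.0, 0.0, 1.0])
v3 = np.array([0.0, 1.0, 1.0])

e1 = v1 / np.linalg.norm(v1)                          # step 1: normalise v1

u2 = v2 - np.dot(v2, e1) * e1                         # step 2: remove the part parallel to e1
e2 = u2 / np.linalg.norm(u2)

u3 = v3 - np.dot(v3, e1) * e1 - np.dot(v3, e2) * e2   # step 3: remove the parts parallel to e1 and e2
e3 = u3 / np.linalg.norm(u3)

E = np.column_stack([e1, e2, e3])
print(np.round(E.T @ E, 10))                          # identity: e1, e2, e3 are orthonormal
```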

Example for Basis Change and the Gram-Schmidt process

This example makes use of both concepts. The coordinate space is three-dimensional. The goal is to reflect a vector in a plane defined in an alternative basis: given a vector in the standard basis, determine its coordinates, still in the standard basis, after reflection in that plane. In this example we take the plane to be the one spanned by the first two basis vectors of the alternative basis.

Finding the matrix that mirrors the vector against this plane while working entirely in the standard basis is difficult and involves a lot of trigonometry. However, as mentioned before, the simpler way is to do the mirroring after a change of basis (see the sketch after the steps below).

  1. Use the Gram-Schmidt process to find an orthonormal basis vector set for the alternative basis.
  2. Change the vector to the alternative basis using a matrix formed from the orthonormal basis vector set.
  3. Mirror the vector in the alternative basis. Simply negate the third component: change the third row of an identity matrix from [0, 0, 1] to [0, 0, -1] and apply it.
  4. Change the vector back to the standard basis. As the initial transformation to the alternative basis uses an orthonormal basis vector set, the inverse of the transformation is just the transpose of the matrix.
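
Here is a hedged sketch of the four steps (assuming NumPy, with a made-up alternative basis whose first two vectors span the mirror plane); the small gram_schmidt helper is just the process from the previous section:

```python
import numpy as np

def gram_schmidt(vectors):
    # Orthonormalise a list of linearly independent vectors (previous section).
    basis = []
    for v in vectors:
        u = v - sum(np.dot(v, e) * e for e in basis)
        basis.append(u / np.linalg.norm(u))
    return np.column_stack(basis)

# Step 1: orthonormal basis E; its first two columns span the mirror plane.
E = gram_schmidt([np.array([1.0, 1.0, 0.0]),
                  np.array([1.0, 0.0, 1.0]),
                  np.array([0.0, 1.0, 1.0])])

T_E = np.diag([1.0, 1.0, -1.0])      # step 3: the mirror, expressed in the alternative basis

r = np.array([2.0, 3.0, 5.0])        # example vector in the standard basis
r_reflected = E @ T_E @ E.T @ r      # steps 2 and 4: E.T converts in, E converts back out
print(r_reflected)
```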

Eigen Problems

Eigen means characteristic. Under a matrix transformation, vectors may change in both scale and direction. An Eigenvector of a matrix transformation is one whose direction is unchanged by the transformation: it stays on its own span and is only scaled. The Eigenvalue of this Eigenvector is the factor by which its magnitude is scaled by the transformation. Eigenvalues can range from negative infinity to positive infinity. The special case of an Eigenvalue of zero occurs when a dimension of the space has collapsed.

Different matrix transformations have different Eigenvectors. Here, we look at the scale, the shear, and the rotation in two-dimensional space.

The unidirectional scale has two Eigenvectors, one in the direction of the scaling, with an Eigenvalue equal to the magnitude of the scaling, and the other in the perpendicular direction, with an Eigenvalue of one.

If the scale is a uniform one in all directions, all vectors are Eigenvectors with the Eigenvalues equal to the magnitude of the scaling.

The pure shear has only one Eigenvector, the basis vector that does not change. As the basis vector does not change, the magnitude is constant with an Eigenvalue of one.

A rotation has no real Eigenvectors. An exception is the 180-degree rotation, in which all vectors are Eigenvectors with Eigenvalues of negative one.

In three-dimensional space, a rotation about an axis has an Eigenvector along the axis of rotation, with an Eigenvalue of one.

Determining Eigenvectors of a matrix transformation

For an Eigenvector x, Ax = λx, where A is an n by n transformation matrix.

Manipulating the equation gives (A - λI)x = 0. Ignoring the trivial solution x = 0, we instead require that (A - λI) squashes x to zero, which means finding when det(A - λI) = 0.

Let A = ([a, b], [c, d]).

det(([a, b], [c, d]) - ([λ, 0], [0, λ])) = 0

det(([a-λ, b], [c, d-λ])) = 0

λ² - (a+d)λ + ad - bc = 0

The solutions for λ are the Eigenvalues. There may be two, one, or no real solutions. For each solution, substitute the λ value back into (A - λI)·[x1, x2] = 0 to find how the components x1 and x2 relate to each other, and hence the possible Eigenvectors.

Example relations and how to interpret them:

[0, x2] = 0 means x2 = 0, hence x = [t, 0] where t is arbitrary.

[x1-x2, x1-x2] = 0 means x1 = x2, hence x = [t, t].

[x1+4x2, x1+4x2] = 0 means x1 = -4x2, hence x = [-4t, t].
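
A sketch of the whole eigen-problem (assuming NumPy; the matrix is an example value):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])                      # example transformation matrix

eigenvalues, eigenvectors = np.linalg.eig(A)    # each column of eigenvectors is one x

print(eigenvalues)                              # [1. 3.]
for value, vector in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ vector, value * vector))   # Ax = λx holds for each pair
```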

Diagonalization and the Eigenbasis Conversion Matrix

Sometimes matrix transformations need to repeat multiple times. As the repetitions increase, the computational time taken also scales. Making use of diagonalization through the Eigenbasis Conversion Matrix reduces the computation required.

A diagonalized matrix looks like an identity matrix except that each value on the leading diagonal is replaced with the Eigenvalue of the Eigenvector represented by that column. The Eigenbasis conversion makes use of a special property of diagonal matrices: applying the matrix n times is the same as applying a single matrix whose diagonal values are each raised to the power of n.

Repeating the Diagonal Matrix Transformation

The method is primarily used when a matrix needs to be applied multiple times. Transforming the matrix into an alternative basis where the matrix is now a diagonalized matrix makes the computation process for the multiple applications of the matrix easier. The results, now in the alternative basis, can then be translated back into the original basis to get the solution.

Here is how to use the Eigenbasis Conversion Matrix for a matrix T.

  1. Solve for the Eigenvectors and their Eigenvalues of the matrix T
  2. Let C be the conversion matrix whose columns are the Eigenvectors of matrix T; this converts a vector from the alternative basis back to the standard basis
  3. Let D be the matrix T in the new alternative basis: a diagonal matrix where the value in each column is the Eigenvalue of the corresponding Eigenvector in C
  4. Find C⁻¹, which converts from the standard basis into the alternative basis
  5. Use the formula T^n = CD^nC⁻¹ to perform the desired number of applications of T
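
A sketch of the recipe above (assuming NumPy; T is an example matrix with real Eigenvalues):

```python
import numpy as np

T = np.array([[1.0, 2.0],
              [0.0, 3.0]])                   # example matrix
n = 5                                        # number of times to apply T

eigenvalues, C = np.linalg.eig(T)            # steps 1-2: Eigenvectors form the columns of C
C_inv = np.linalg.inv(C)                     # step 4: converts into the alternative basis
D_n = np.diag(eigenvalues ** n)              # step 3, raised to the power n

T_n = C @ D_n @ C_inv                        # step 5: T^n = C D^n C^-1
print(np.allclose(T_n, np.linalg.matrix_power(T, n)))   # True
```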

Page Rank Algorithm

Google’s famous search ranking algorithm makes use of click probability. The premise is that pages are connected to other pages via links, and each of the linked pages is assumed to have an equal probability of being the next page the user goes to.

Let L be the matrix containing the vectors, in columns, of each page’s link probability to the next page.

For this example of a mini internet, only four pages are present: A, B, C, and D. A links to B, C, and D. B links to A and D. C links to only D. D links to B and C. Assuming an equal chance for the user to click on each link, the probability of moving to each linked page is one divided by the number of links on the page.

LA = [0, 1/3, 1/3, 1/3]

LB = [1/2, 0, 0, 1/2]

LC = [0, 0, 0, 1]

LD = [0, 1/2, 1/2, 0]

Matrix L

Each column represents a page’s outward links, normalized to probabilities. Each row represents the inward links from other pages.

Another matrix R stores the ranks of these four pages. In this algorithm, the three things essential to calculate a particular page’s rank are, the ranks of all the other pages, information on the incoming links to the page, and total outgoing links from each page.

The formula to calculate the rank of each page is as follows. For page A, RA = ∑i(LAi * Ri). Using Einstein’s summation convention, we get R = LR.

As computing a rank requires the other ranks in the first place, we need an initial value. Each iteration, applied to all the ranks at once, brings the values closer to the actual ranks.

The initial value for each rank will be set by averaging among all the pages, 1/n where n = the number of pages. For the case we are using, we get 1/4 as the rank value for each page.

R(iteration i+1) = LR(iteration i)

Keep applying the equation until R converges, that is, until the change between iterations is below a certain threshold.

The transformation always holds the sum of R constant at one because every column of L sums to one; this is guaranteed by the probabilistic structure L is built upon. The converged R is an Eigenvector of L with an Eigenvalue of one.
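
Here is a minimal sketch of the iteration (assuming NumPy), using the four-page link matrix L from the example above:

```python
import numpy as np

L = np.array([[0,   1/2, 0, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 1, 0  ]])    # columns are LA, LB, LC, LD

n = 4
R = np.full(n, 1 / n)                 # initial ranks: 1/4 for each page

for _ in range(100):
    R_next = L @ R                    # R(iteration i+1) = L R(iteration i)
    converged = np.linalg.norm(R_next - R) < 1e-8
    R = R_next
    if converged:
        break

print(R)                              # converged ranks for pages A, B, C, D
```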

The real internet gives a much sparser matrix, as each page usually links to only a few other pages; the total number of pages vastly outnumbers the links on any one page. Hence, this simple formula requires a few additional parameters to better estimate page rankings.

Damping Factor

The algorithm above can be extended to include a damping factor, d.

This damping factor takes into account the chance that a user may choose to access a random page on the internet by typing its URL into the address bar.

By including the damping factor, the formula is changed as follows.

R(iteration i+1) = d(LR(iteration i)) + (1-d)/n, where the second term represents the probability that a user reaches a page by typing its URL instead of following a link.
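
As a small sketch (assuming NumPy and the same example matrix L; d = 0.85 is a commonly used value for the damping factor), the update line simply becomes:

```python
import numpy as np

L = np.array([[0,   1/2, 0, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 1, 0  ]])
n = 4
d = 0.85                              # damping factor
R = np.full(n, 1 / n)

for _ in range(100):
    R = d * (L @ R) + (1 - d) / n     # damped iteration
print(R)
```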
