Mathematics for Machine Learning: PCA

Isaac Ng
18 min read · Oct 14, 2018

This is the last course in the Mathematics for Machine Learning specialisation. After completing this course, I learnt how to apply Principal Component Analysis (PCA) in a practical way via Python code.

I think the most important aspect of this course is understanding how and why these algorithms work. Learning the underlying mathematical framework allows me to visualise the concepts better in my mind. I believe this knowledge will help me understand which methods are most suitable for different types of data, based on preliminary analysis of the data itself.

Principal Component Analysis

Principal Component Analysis is a form of dimensionality reduction. It analyses and then exploits the structure of the data and the correlations between the different variables within the data set.

The key goal of PCA is to achieve a more compact model with lower dimensions without losing vital information in the data set.

Outline

The PCA course is segmented into four lessons.

  1. Means & Variances
  2. Vector Spaces
  3. Lower-Dimensional Sub-spaces
  4. Derive PCA using orthogonal projections

Means & Variances

An analysis of a new data set should begin with gathering useful statistics about it. These give a general overview of the data's structure and an understanding of what further analysis is possible.

Useful basic statistics include the average of each variable in the data, the spread or variance of each variable, and the orientation of the data, captured by the covariances between variables.

The mean is the average of the data points in the data set.

The variance is the measure of spread of the data points in the data set.
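
As a quick illustration, here is a minimal NumPy sketch computing the mean and variance of each variable; the data array and printed values are made up purely for the example.

import numpy as np

# Hypothetical data set: 5 samples (rows), 2 variables (columns)
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [5.0, 10.0]])

mean = X.mean(axis=0)   # average of each variable
var = X.var(axis=0)     # spread of each variable about its mean

print(mean)  # [3. 6.]
print(var)   # [2. 8.]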

Positive Definite Matrix

For an n x n real matrix M and any non-zero column vector z of n real numbers, M is positive definite if zᵀMz > 0.

Similarly, for a Hermitian matrix M and any non-zero column vector z of n complex numbers, M is positive definite if z*Mz > 0, where z* is the conjugate transpose of z.

A Hermitian matrix is one that equals its own conjugate transpose: entries mirrored across the leading diagonal have the same real part and imaginary parts of opposite sign, i.e. they are complex conjugates of each other.

Similarly, other terms are used to define the different outcomes.

  • Positive semi-definite: zᵀMz ≥ 0 (positive or zero)
  • Negative semi-definite: zᵀMz ≤ 0 (negative or zero)
  • Negative definite: zᵀMz < 0 (strictly negative)

The identity matrix is positive definite.
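
A symmetric real matrix is positive definite exactly when all of its eigenvalues are positive, which gives a simple numerical check. The following is only an illustrative sketch with made-up matrices, not part of the course material.

import numpy as np

def is_positive_definite(M, tol=1e-10):
    """Check positive definiteness of a symmetric real matrix via its eigenvalues."""
    eigenvalues = np.linalg.eigvalsh(M)   # eigvalsh assumes M is symmetric/Hermitian
    return np.all(eigenvalues > tol)

print(is_positive_definite(np.eye(3)))                 # True: the identity matrix
print(is_positive_definite(np.array([[1.0, 2.0],
                                     [2.0, 1.0]])))    # False: eigenvalues 3 and -1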

Covariance and Correlation

The covariance between variables shows the orientation of the data points in the data set.

Different data sets may have similar means and variances for their variables and yet look completely different on a plot.

Correlation is the relationship between variables where they can be positively correlated, negatively correlated, or there is no correlation.

x & y are negatively correlated

These statistics for the data set can be organised into a data structure called the covariance matrix.

For a 2D data set, four quantities of interest are required to complete the covariance matrix: the variance of x, the variance of y, and the covariances cov[x, y] and cov[y, x].

Of these quantities, the two covariances are equal: cov[x, y] = cov[y, x].

This also means that the covariance matrix mirrors along the leading diagonal of the matrix.

The covariance matrix is a symmetric, positive semi-definite matrix.

If cov[x, y] is positive, x and y tend to move in the same direction, whether positive or negative.

If cov[x, y] is negative, x and y tend to move in opposite directions: if one moves in the positive direction, the other moves in the negative direction, and vice versa.

If cov[x, y] is 0, x and y are uncorrelated, and there is no discernible linear connection between one variable's movement and the other's.

For a 3D data set with variables x, y, and z:

Taking an example,

Linear Transformation of Means

Linear Transformation of Variances

Proof:

This is the proof for the matrix form. For the linear form, transposing is not required.

Taking an example,

Code Exercise for Mean/Covariance & Effect of Linear Transformation
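
The course notebook itself is not reproduced here, but a minimal sketch of the same ideas might look like the following: compute the mean and covariance of a (made-up) data set, apply an affine transformation y = Ax + b, and confirm that E[Ax + b] = A E[x] + b and Cov[Ax + b] = A Cov[x] Aᵀ.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # hypothetical data: 1000 samples, 2 variables

A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
b = np.array([1.0, -1.0])
Y = X @ A.T + b                          # affine transformation of every sample

mean_X = X.mean(axis=0)
cov_X = np.cov(X, rowvar=False, bias=True)   # rows are samples, columns are variables

# Effect of the linear transformation on the statistics:
# E[Ax + b] = A E[x] + b  and  Cov[Ax + b] = A Cov[x] A^T
print(np.allclose(Y.mean(axis=0), A @ mean_X + b))                   # True
print(np.allclose(np.cov(Y, rowvar=False, bias=True), A @ cov_X @ A.T))  # True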

Orthogonality

Orthogonality is the perpendicularity of vectors or variables in a vector space.

For vectors in a vector space, we will look at three things:

  1. Length of vectors
  2. Angles between vectors
  3. Distance between vectors

To obtain these three values, we will use inner products.

Dot product

Example:

The distance between vectors x and y:

Let us go through a few more examples.

Example:

Inner Product

The inner product is the generalisation of the dot product.

For vectors a and b, the dot product is computed with the formula a · b = aᵀb, the sum of the element-wise products of a and b.

For an inner product, any matrix M satisfying the inner product properties can be used in the formula <a, b> = aᵀMb.

The three defining properties are bilinearity, positive definiteness, as well as symmetry.

Bilinearity:

Positive Definite:

Symmetry:

Let us look at the general notation for the dot product and the inner product.

The dot product of vectors x and y is written xᵀy; the general inner product replaces the identity matrix with another matrix M, giving <x, y> = xᵀMy.

If M is a symmetric, positive definite matrix, this is a valid inner product.

Let us look at an example.

Another example,

Therefore, it does not satisfy the positive definite criterion.

For an inner product, <x, x> should only equal zero when x = 0.

Taking an example,

Computing Lengths and Distances

The length of a vector x is ‖x‖ = √<x, x>. This is also known as the norm of x.

Inner products are positive definite, so this square root is always defined. The length of a vector depends on the choice of inner product.

The dot product uses the identity matrix as its inner product matrix, while a general inner product may use any matrix satisfying the inner product criteria.

Taking an example,

Dot product

Inner product

Properties of Norm of x

For the Cauchy-Schwarz inequality, equality holds only when x and y are collinear.

Proof:

Distance between two vectors

The distance between two vectors is equal to the length of their difference vector.

Looking at another example,

Generating Sets and Basis

The goal is to find a minimal set of vectors such that every other vector in the space can be represented as a linear combination of them.

Angles between Vectors & Orthogonality

Two vectors are orthogonal when the inner product between them is equal to 0. However, this is only with respect to the inner product used.

Using a different inner product, we may find that the same two vectors are no longer orthogonal with respect to it.

As we move on through this course, the aim will be to find basis vectors which are all orthogonal to each other.

Code: Program to compute Distances and Angles
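
The notebook is not reproduced here; as a minimal sketch, lengths, distances and angles under a general inner product <x, y> = xᵀMy can be computed as below. The vectors and the matrix M are made up, and M = I recovers the ordinary dot product.

import numpy as np

def inner(x, y, M):
    """Inner product <x, y> = x^T M y for a symmetric, positive definite M."""
    return x @ M @ y

def length(x, M):
    return np.sqrt(inner(x, x, M))

def distance(x, y, M):
    return length(x - y, M)

def angle(x, y, M):
    cos_theta = inner(x, y, M) / (length(x, M) * length(y, M))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

I = np.eye(2)                       # the dot product
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # a different (still positive definite) inner product

print(angle(x, y, I))   # pi/2: orthogonal with respect to the dot product
print(angle(x, y, M))   # pi/3: not orthogonal with respect to M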

Code: Machine Learning Library in Python, Sklearn
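
As a minimal illustration of the library (not the course notebook itself), scikit-learn can load the Iris data set and fit the k-nearest-neighbours classifier discussed in the next section in a few lines; the split and the choice of k = 5 are arbitrary.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris flower data set used throughout this lesson
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Fit a k-nearest-neighbours classifier and score it on held-out data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))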

K Nearest Neighbours

The K Nearest Neighbours method is a machine learning algorithm for prediction. The model it uses for its predictions is its data set.

Let us examine one such data set. The Iris flower data set is a structured data set that contains four variables describing the flower's physical characteristics and one variable which denotes the class of the flower.

Variables: Sepal Length, Sepal Width, Petal Length, Petal Width

Classes: Iris Setosa, Iris Versicolour, Iris Virginica

When we look towards a practical implementation of the KNN Method for prediction in code, the basic outline of the implementation has three steps.

  1. Compute Distances to all points in the set
  2. Find the k ‘closest’ data points
  3. Find majority class of these k ‘closest’ data points (for prediction)

Code: Implementing K Nearest Neighbours
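
A from-scratch sketch following the three steps above might look like this; the function name and the choice of k are illustrative, and the Euclidean distance stands in for whichever distance the notebook uses.

import numpy as np
from sklearn.datasets import load_iris

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the class of x_query by majority vote among its k nearest neighbours."""
    # 1. Compute distances to all points in the set
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Find the k 'closest' data points
    nearest = np.argsort(distances)[:k]
    # 3. Find the majority class of these k 'closest' data points
    return np.bincount(y_train[nearest]).argmax()

iris = load_iris()
X, y = iris.data, iris.target
print(knn_predict(X, y, X[0]))   # predicts class 0 (Setosa) for the first flower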

Inner Products for Functions & Random Variables

There are two types of variables that the inner products concept can be applied to.

  • Discrete interval variable
  • Continuous functions

For the most part, the variables we dealt with previously fall under the first category, discrete variables. The concept of the inner product can also be generalised to continuous functions, which have an infinite number of possible values.

Think of inner products as a way to give geometry to the vector space we use them on. Some of these vector spaces may appear abstract and unfamiliar compared to traditional Cartesian geometry. The inner product provides a value which can then be used to compare notions of length and angle, the magnitudes of certain dimensions, or how similar or dissimilar two vectors are.

Functions are vectors

Think of functions as vectors in a vector space in which you can perform addition and scalar multiplication.

Each dimension for the function vector space can be thought of as a component function.

Example:

All functions in this set are orthogonal to each other with respect to the inner product defined by the integral of their product from -pi to pi.
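
As a rough numerical check of this idea, the inner product of two functions can be approximated by discretising the integral; for example, sin and cos, and sine waves of different frequencies, come out orthogonal over [-pi, pi]. The grid size below is arbitrary.

import numpy as np

def function_inner(f, g, a=-np.pi, b=np.pi, n=100000):
    """Approximate the inner product <f, g> = integral of f(x) g(x) dx over [a, b]."""
    x = np.linspace(a, b, n)
    dx = (b - a) / n
    return np.sum(f(x) * g(x)) * dx

print(function_inner(np.sin, np.cos))                    # ~0: orthogonal
print(function_inner(np.sin, np.sin))                    # ~pi: squared 'length' of sin
print(function_inner(np.sin, lambda x: np.sin(2 * x)))   # ~0: different frequencies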

Random Variables

For two random variables, x and y, which are uncorrelated, cov[x, y] = 0, so var[x + y] = var[x] + var[y].

Likewise, as with the continuous functions before, we can obtain geometric properties of these random variables by applying the inner product.

This means that the covariance, viewed as an inner product on random variables, is symmetric, positive definite, and bilinear.

Broadcasting Numpy Operations

For arrays with multiple dimensions, operations between these arrays may result in errors due to the differing dimension sizes.

When the array dimensions differ, the array operation may not work. The basic guide to successful array operations is that, for every dimension of the two arrays, the sizes must either be equal or one of them must be 1.

Examples:

To compute the Euclidean norm of an array along a certain axis:

np.linalg.norm(x, axis=1)

A 4 x 2 x 3 array, with the norm taken over axis 1 (the second axis), reduces to 4 x 3 (or 4 x 1 x 3 with keepdims=True).
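
A short sketch of both points, with made-up shapes:

import numpy as np

A = np.ones((4, 2, 3))
b = np.arange(3)            # shape (3,)

# Broadcasting: trailing dimensions must be equal or 1, so (4, 2, 3) + (3,) works
print((A + b).shape)        # (4, 2, 3)

# This would fail: (4, 2, 3) + (2,) -> shapes are not compatible
# A + np.arange(2)

# Norm along axis 1 removes that axis: (4, 2, 3) -> (4, 3)
print(np.linalg.norm(A, axis=1).shape)                   # (4, 3)
# keepdims retains it with size 1: (4, 1, 3)
print(np.linalg.norm(A, axis=1, keepdims=True).shape)    # (4, 1, 3)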

Orthogonal Projections

Usually, high-dimensional data can be compressed into a lower-dimensional subspace while still retaining most of its information. Many of the dimensions may contribute only a small percentage of the data set's variance, so they can be dropped to reduce the overall dimensional complexity.

For example, a data set with a hundred dimensions may have 3 main dimensions which make up 90% of the variance in the data set. Hence this data set can possibly be compressed to these 3 main dimensions while retaining most of the data.

Projection of 2D vectors onto a 1D subspace

This shows how the vector is projected onto a 1D subspace U with basis b.

Projection of x onto U: π(x) = (bᵀx / bᵀb) b, using the dot product as the inner product.

x is projected onto the point of U closest to it, i.e. along the shortest distance from x to the subspace.

The difference vector between x and its orthogonal projection on U is orthogonal to the basis b of U.

The projection matrix P = bbᵀ / (bᵀb) is the transformation matrix that is applied to x to project it onto U.

Let us take an example,

Another example,

Yet another example,

The information lost in the projection is measured by the reconstruction error, the distance between x and its projection. It is what is lost by compressing to a lower-dimensional subspace.
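
A minimal sketch of a 1D projection and its reconstruction error, using a made-up basis vector b and point x:

import numpy as np

b = np.array([2.0, 1.0])          # basis of the 1D subspace U
x = np.array([1.0, 2.0])          # vector to project

# Projection matrix P = b b^T / (b^T b); the projection of x onto U is P x
P = np.outer(b, b) / (b @ b)
x_proj = P @ x

# The difference vector is orthogonal to b, and its length is the reconstruction error
print(x_proj)                      # [1.6 0.8]
print((x - x_proj) @ b)            # ~0: orthogonal to the basis
print(np.linalg.norm(x - x_proj))  # reconstruction error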

Projections onto n-D subspaces

Generalising:

The basis matrix B of the subspace is a D x M matrix, where D is the number of dimensions of the original space and M is the number of dimensions of the subspace; the projection matrix is then P = B(BᵀB)⁻¹Bᵀ, which is D x D.

Comparison of 1D and n-D projections

Scalar multiplication is commutative while matrix multiplication is not, hence:

Projections allow approximate solutions for problems where there is no exact solution.

Where b does not lie in the span of the columns of A, projection gives the best approximate solution to Ax = b.

Taking an example:

Workings:

Another example:

Workings:

Yet another example:

Working:

Last example:

Code: Project data onto lower-dimensional subspaces
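
Not the course notebook itself, but a minimal sketch of projecting onto an M-dimensional subspace with basis matrix B, using P = B(BᵀB)⁻¹Bᵀ; the basis and the vector are made up.

import numpy as np

# Basis of a 2D subspace of R^3 (the columns of B), and a vector to project
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
x = np.array([6.0, 0.0, 0.0])

# Projection matrix P = B (B^T B)^{-1} B^T
P = B @ np.linalg.inv(B.T @ B) @ B.T
x_proj = P @ x

print(x_proj)                 # [4. 2. -2.]: projection onto the span of B's columns
print(B.T @ (x - x_proj))     # ~0: the residual is orthogonal to every basis vector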

Code: Least squares for Boston housing prices
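
The Boston housing exercise itself is not reproduced here; as a stand-in, this sketch solves a least-squares problem on synthetic data, where Aθ is the orthogonal projection of b onto the span of A's columns.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical over-determined system: more equations (samples) than unknowns
A = rng.normal(size=(100, 3))                       # features
true_theta = np.array([2.0, -1.0, 0.5])
b = A @ true_theta + 0.1 * rng.normal(size=100)     # noisy targets: b is not in span(A)

# Least squares finds theta minimising ||A theta - b||^2,
# i.e. A theta is the orthogonal projection of b onto the span of A's columns
theta, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(theta)          # close to [2.0, -1.0, 0.5]

# Equivalent normal-equations form: theta = (A^T A)^{-1} A^T b
print(np.linalg.solve(A.T @ A, A.T @ b))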

Principal Component Analysis (PCA)

Principal component analysis is an algorithm for linear dimensionality reduction of data sets.

The viability of this algorithm is based upon these premises.

  1. High dimensional data often lie mostly on lower dimensional subspaces
  2. Many of the dimensions in the data are highly correlated

The objective of PCA is to decrease the number of dimensions used while achieving the lowest average reconstruction error.

Groups

Vector Spaces (Real-valued)

Vector subspaces

Inherits Abelian group properties, distributivity, associativity, neutral element

Orthogonal Complement

Orthogonal Decomposition

Problem Setting

Minimise the average squared reconstruction error

Matrix Multiplication Recap.

Gradient of a Linear Model

Proof:

Alternatively,

Example:

Removing all derivatives equal to 1,

Another example:

Optimal projection parameters

Assumptions made for the choice of parameters:

Explanation: The basis vectors are orthonormal, so the inner product between two distinct basis vectors is 0; the only exception is the inner product of a basis vector with itself, which is 1.

To find the minimum loss,

Hence, the optimal coordinates are the orthogonal projections of the original data point onto each basis vector in the principal subspace.

Optimal Basis Vectors

Next we need to determine the optimal basis vectors.

Rephrasing the loss function:

Therefore we get,

Proof for trace():

Recap for Lagrange Multipliers

Example,

PCA Steps: Find orthonormal basis minimising loss (The principal subspace)

The covariance matrix S has eigenvectors which are orthogonal to each other, and the magnitude of each eigenvalue corresponds to the variance of the data in the direction of its eigenvector.
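
One way to see why eigenvectors of S appear, sketched with the Lagrange multiplier recap above (a derivation sketch, not the course's exact working): maximise the variance bᵀSb of the data projected onto a basis vector b, subject to the constraint bᵀb = 1.

L(b, λ) = bᵀSb − λ(bᵀb − 1)

∂L/∂b = 2Sb − 2λb = 0, which gives Sb = λb.

So the optimal b is an eigenvector of S, and the variance captured along it is bᵀSb = λ; keeping the eigenvectors with the largest eigenvalues minimises the reconstruction error.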

General Case:

Let’s look at the general case,

Advice: Make sure data is centered around 0, i.e. E[X] = 0

Optional: Divide all dimensions by their standard deviation. This ensures that the values are unit free and Var[X] = 1 for every dimension.

Units may be deceiving.

Correlations remain intact after these steps.

Project any x* onto subspace u

Revise values:

PCA in High Dimensions

The higher the dimensionality, the more computationally intensive the algorithm will be. To reduce this computational load, we can work with a smaller matrix that has the same non-zero eigenvalues.

Case:

The rank of a matrix is the number of linearly independent rows (or columns) in the matrix. For PCA, this is the effective dimensionality of the data set.

If the number of data points N is smaller than the dimensionality of the data, the data will not be full rank: it has at most N − 1 non-zero eigenvalues. The covariance matrix is therefore not full rank, and linear independence cannot be established among all the dimensions.

The number of non-zero eigenvalues is at most N − 1 because the data is centred, with the mean of each dimension set to 0. The centred data points therefore sum to zero, so any one of them can be computed from the rest, removing one degree of freedom.

The computation can thus be reduced: instead of the D x D covariance matrix, we can eigendecompose a much smaller N x N matrix with the same non-zero eigenvalues and recover the principal components from it.
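
A sketch of this trick as I understand it (shapes and tolerances are illustrative): for N data points in D dimensions with N < D, eigendecompose the N x N matrix XXᵀ/N and map its eigenvectors back through Xᵀ to obtain eigenvectors of the covariance XᵀX/N.

import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 500                         # far fewer data points than dimensions
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)                 # centre the data

# Eigendecompose the small N x N matrix instead of the D x D covariance
M = (X @ X.T) / N
eigvals, C = np.linalg.eigh(M)

# Keep only directions with non-zero eigenvalues (at most N - 1 of them)
keep = eigvals > 1e-10
B = X.T @ C[:, keep]                   # each column X^T c is an eigenvector of S
B = B / np.linalg.norm(B, axis=0)      # normalise to unit length
eigvals = eigvals[keep]

S = (X.T @ X) / N
print(np.allclose(S @ B, B * eigvals))   # True: same non-zero eigenvalues as S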

Code: Principal Component Analysis (PCA)
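
Again not the course notebook, but a compact sketch of the algorithm described above: centre the data, eigendecompose the covariance matrix, keep the top eigenvectors as the principal subspace, and project. The example data is synthetic.

import numpy as np

def pca(X, num_components):
    """Project centred data onto the top eigenvectors of its covariance matrix."""
    # Centre the data (optionally also divide each dimension by its standard deviation)
    mean = X.mean(axis=0)
    Xc = X - mean

    # Covariance matrix and its eigendecomposition
    S = (Xc.T @ Xc) / len(X)
    eigvals, eigvecs = np.linalg.eigh(S)          # eigh: S is symmetric

    # Sort eigenvectors by decreasing eigenvalue and keep the principal subspace
    order = np.argsort(eigvals)[::-1]
    B = eigvecs[:, order[:num_components]]        # D x M basis of the principal subspace

    # Orthogonal projection of each data point onto the principal subspace
    Z = Xc @ B                                    # coordinates in the subspace
    X_reconstructed = Z @ B.T + mean              # back in the original space
    return X_reconstructed, B, eigvals[order]

# Hypothetical example: 2D data that mostly varies along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(200, 2))
X_rec, B, eigvals = pca(X, num_components=1)
print(np.mean(np.linalg.norm(X - X_rec, axis=1) ** 2))   # average squared reconstruction error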

Other perspectives of PCA

PCA can also be viewed as a linear autoencoder that minimises the reconstruction error. An autoencoder in general can be linear or nonlinear; neural networks are examples of nonlinear encoders.

Latent Variable Model

All of the following perspectives lead to the same solution; determine which to use based on the properties of the data:

  • Minimise squared reconstruction error
  • Minimise autoencoder loss
  • Maximise mutual information
  • Maximise variance of projected data
  • Maximise likelihood in latent variable model
