Understanding your data: Principal Component Analysis

Davide Fiorino
Published in DeepDreaming · Oct 30, 2019
3D representation of a dataset of 28x28 grayscale images obtained through PCA. On the left the “Fashion MNIST” dataset, on the right the original MNIST digit dataset.

In this article, I'm going to explain what PCA is, how it is derived, how to implement it in Python, and what kind of applications it has in the world of machine learning and data science.

What is PCA?

Data are often very high-dimensional. For example, a grayscale image with a resolution of 500 x 500 pixels, where each pixel is a light-intensity value, can be viewed as a point in a 250,000-dimensional space. In this representation, each component represents the amount of light in one pixel of the image.

You can imagine a set of images of the same size as points in this very high-dimensional space. If we could observe this dataset, we would notice that the data points are not distributed homogeneously in the space; they would probably lie near a hyperplane. We can ask ourselves: do we really need all these dimensions to accurately represent our data?

Furthermore, high-dimensional spaces suffer from the "curse of dimensionality", a well-known problem in data science. This is where principal component analysis (PCA) comes into play.

Instead of trying to craft yet another definition that says the same thing, I would like to quote the following definition of PCA and explain it.

Principal component analysis is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

Let’s try to unpack this definition. In the following, I will assume that the reader knows some basic linear algebra and statistics.

An orthogonal transformation is a transformation of the space whose transformation matrix is orthogonal, which means that every vector (column and row) of the matrix is orthogonal to all the others and has unit norm. In other words, the column vectors of the matrix constitute an orthonormal basis. Such a transformation visually corresponds to either a rigid rotation or an improper rotation (a rotation followed by a flip). In particular, an orthogonal transformation preserves the lengths of vectors and the angles between vectors. In our case, the vectors are data points, so applying PCA won't break the relationships between them, leaving the original information contained in the dataset unaltered.

The "possibly correlated variables" the definition talks about are the original dimensions of the dataset, while the "uncorrelated variables" are the new dimensions that PCA will find.

The correlation between two random variables is defined as:
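
$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$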

where cov(X, Y) is the covariance between the two variables, which measures both how much the two variables are linearly related to each other and the scale of these variables, and σ_X, σ_Y are their standard deviations. So, the correlation is the covariance normalized by the product of the standard deviations, in order to measure only how much the variables are related (without being affected by the scale of each separate variable).

Some data distributions in 2D and the correlation between the two dimensions. As you can see from the last row, correlation is insensitive to non-linear relationships.

Why do we prefer uncorrelated dimensions? Imagine we want to represent the well-being of a person in terms of temperature and humidity in a binary classification problem. We might represent our dataset using these two variables (dimensions), where each sample is a point in this 2D space with a label of 1 if the person is feeling good or 0 if the person is uncomfortable. Now suppose we are analyzing this problem in a geographical area where temperature and humidity are highly correlated: we may find that when the temperature goes up, the humidity goes down and vice versa. In this case, having two dimensions does not allow us to represent anything different from what we can represent with a single dimension that incorporates the behavior of both temperature and humidity. So, we can basically reduce our representation from 2D to 1D without really losing any information. Notice that since correlation only accounts for linear relationships, to make this toy example fit PCA I should specify: "when the temperature goes up, the humidity goes down and vice versa, with a linear relationship". This also means that PCA only removes linear dependencies between components; other techniques exist to remove non-linear dependencies.

The last ingredient of our definition is variance. The variance of a random variable tells us how far the values of the variable are from its mean. Now, with a little bit of intuition, we should be able to understand why we want a lot of variance along the components of our dataset. I want to quote these two sentences from the Deep Learning book (I. Goodfellow, Y. Bengio, A. Courville, MIT Press, 2016):

Learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying “the sun rose this morning” is so uninformative as to be unnecessary to send, but a message saying “there was a solar eclipse this morning” is very informative.

For this reason, variance is a measure of information: the more the data are different from the expected value, the more they are telling us “unexpected” things, thus allowing us to learn more from them.

To recap, PCA allows us to transform a dataset into a new set of dimensions with zero covariance between them. Furthermore, it tells us how much variance each component carries. Knowing that, we might decide to discard the components that carry less information, thus reducing the dimensionality of our dataset without losing much information. Powerful, right?

As a side note, PCA can be viewed as an unsupervised learning algorithm, because the new components are computed purely from the original dataset, without any additional supervision (i.e. labels) during training.

In the image above, the dataset was originally expressed in terms of x1 and x2. After PCA, which can be seen as a rigid rotation, the new space is identified by z1 and z2. The maximum variance of the data is along z1, therefore removing z2 would not make us lose much information. (Image credits: Deep Learning, I. Goodfellow, Y. Bengio, A. Courville, MIT Press, 2016.)

Obtaining the principal components

There are mainly two ways of obtaining the principal components of a dataset: by performing the eigenvalue decomposition of the data covariance (or correlation) matrix, or by performing the singular value decomposition (SVD) of the design matrix. Here we follow the first path.

We have already talked about correlation and covariance; the covariance matrix of a random vector X is the matrix whose (i, j) entry is the covariance between the i-th and j-th components:
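
$$\Sigma_{ij} = \mathrm{cov}(X_i, X_j) = \mathbb{E}\big[(X_i - \mathbb{E}[X_i])\,(X_j - \mathbb{E}[X_j])\big]$$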

So if X is a D-dimensional vector, the covariance matrix will be a D by D matrix. The covariance matrix (S) of the components (features) of a dataset (X) can be computed as:
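
$$S = \frac{1}{N}\, X X^{T}$$

(some references use 1/(N-1) instead of 1/N),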

where X is a D×N design matrix in which each row represents a feature (already centered, i.e. with its mean subtracted) and each column represents a sample. Each data sample is therefore represented by a D-dimensional vector. Notice that if the random variables of the vector are uncorrelated, the covariance matrix will be diagonal.

It turns out that we can obtain the principal components of our dataset by finding the eigenvalues and eigenvectors of the covariance matrix of the dataset. Each eigenvector will represent a component (a direction) in the new space, and its eigenvalue will be equal to the variance of the data along that direction. So, we can decide how many (and which) eigenvectors to keep in order to preserve enough variance in our data, effectively defining a subspace of the original data space.

We can obtain our transformation matrix (B) as the matrix whose columns are the eigenvectors of the components that we want to keep. By applying the transformation B to a data vector (x) we obtain its projection onto the subspace spanned by the columns of B, known as the code (c). By applying the inverse transformation to the code we obtain the reconstruction of the original vector in the original space (x̄). Since the columns of B are orthonormal, the transpose of B acts as its (pseudo-)inverse, and we can write:
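
$$c = B^{T} x, \qquad \bar{x} = B\, c = B\, B^{T} x$$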

In this case, we suppose we keep M components (with M < D), so the code will be M-dimensional. The accuracy of this approximation depends on how many components we choose to keep. If M = D, then the reconstruction will be equal to the original data vector.
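
As a sanity check of the formulas above, here is a minimal NumPy sketch (not the implementation used later in the article) that follows the same D×N convention:

```python
import numpy as np

def pca_fit(X, M):
    """PCA via eigendecomposition of the covariance matrix.

    X is a D x N design matrix (each column is a sample),
    M is the number of principal components to keep.
    """
    # Center the data (subtract the mean of each feature)
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean

    # Covariance matrix S (D x D)
    S = (Xc @ Xc.T) / X.shape[1]

    # Eigendecomposition; eigh is appropriate because S is symmetric
    eigvals, eigvecs = np.linalg.eigh(S)

    # Sort by decreasing eigenvalue and keep the first M eigenvectors
    order = np.argsort(eigvals)[::-1]
    B = eigvecs[:, order[:M]]          # D x M transformation matrix
    variances = eigvals[order[:M]]     # variance carried by each kept component
    return B, mean, variances

def pca_transform(X, B, mean):
    """Project the centered data onto the principal subspace: c = B^T x."""
    return B.T @ (X - mean)            # M x N codes

def pca_reconstruct(C, B, mean):
    """Map codes back to the original space: x_bar = B c + mean."""
    return B @ C + mean                # D x N reconstructions

# Toy usage: 1000 random samples in 784 dimensions, keep 50 components
X = np.random.randn(784, 1000)
B, mu, var = pca_fit(X, M=50)
X_bar = pca_reconstruct(pca_transform(X, B, mu), B, mu)
```

Note that np.linalg.eigh returns the eigenvalues in ascending order, which is why they are re-sorted before selecting the top M.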

Let’s derive PCA (optional)

If you are curious about where PCA comes from, here are the high-level steps of its derivation. The goal is to find a new orthonormal basis for a subspace of dimension M that gives the best representation of the original data. We will define what "best" means for us through a convex cost function J, which takes its minimum value when we use the orthonormal basis that matches our criteria.

We can write each original data vector as a linear combination of the basis vectors of the original space. Here we are assuming that the basis vectors form an orthonormal basis.
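
$$x_n = \sum_{d=1}^{D} \beta_{dn}\, b_d$$

where b_1, …, b_D are the orthonormal basis vectors and β_dn is the coordinate of the n-th data vector x_n along b_d.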

Our approximation after PCA would be:
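
$$\bar{x}_n = \sum_{m=1}^{M} \beta_{mn}\, b_m$$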

where we are effectively using only a subspace of the original space.

We will phrase this as an optimization problem: finding the principal components to represent our data means minimizing the average squared distance between the original data vectors and their reconstructions.
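
$$J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \bar{x}_n \|^{2}$$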

So we should find the partial derivatives of J with respect to the parameters and set them equal to 0 (here we use the chain rule):
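
$$\frac{\partial J}{\partial \beta_{in}} = \frac{\partial J}{\partial \bar{x}_n}\,\frac{\partial \bar{x}_n}{\partial \beta_{in}} = -\frac{2}{N}\,(x_n - \bar{x}_n)^{T} b_i = 0, \qquad i = 1, \dots, M$$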

If we do the math, we obtain:
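
$$\beta_{in} = x_n^{T} b_i, \qquad i = 1, \dots, M$$

that is, the optimal coordinates are the orthogonal projections of the data vectors onto the basis vectors we keep.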

Knowing this expression of the parameters and exploiting the orthonormality of the basis vectors, we can express the difference between an original data vector and its reconstruction as:
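
$$x_n - \bar{x}_n = \sum_{j=M+1}^{D} \left( x_n^{T} b_j \right) b_j$$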

After some computation we can obtain this expression of J:
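
$$J = \frac{1}{N}\sum_{n=1}^{N} \sum_{j=M+1}^{D} \left( x_n^{T} b_j \right)^{2} = \sum_{j=M+1}^{D} b_j^{T} \left( \frac{1}{N}\sum_{n=1}^{N} x_n x_n^{T} \right) b_j$$

(here, as before, the data are assumed to have zero mean).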

Looking carefully, we can identify the expression of the covariance matrix S, so:
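
$$J = \sum_{j=M+1}^{D} b_j^{T} S\, b_j$$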

By using the trace operator we get:
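
$$J = \mathrm{tr}\!\left( \left( \sum_{j=M+1}^{D} b_j b_j^{T} \right) S \right)$$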

where the quantity that multiplies S is the projection matrix onto the subspace that we ignore. This projection matrix takes the data covariance matrix S and projects it onto the orthogonal complement of the principal subspace. Therefore, minimizing this loss is equivalent to minimizing the variance of the data that lie in the subspace we ignore. In other words, we are interested in retaining as much variance as possible after the projection.

For simplicity, let's solve the optimization problem in R², keeping only one component. If we call b2 the component we want to discard, we will have:
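
$$\min_{b_2}\; J = b_2^{T} S\, b_2 \qquad \text{subject to} \qquad b_2^{T} b_2 = 1$$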

We can solve this constrained optimization using Lagrange multipliers:
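
$$\mathcal{L}(b_2, \lambda) = b_2^{T} S\, b_2 + \lambda \left( 1 - b_2^{T} b_2 \right)$$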

We will obtain that b2 must be an eigenvector of S with eigenvalue λ,
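
$$S\, b_2 = \lambda\, b_2 \qquad \Longrightarrow \qquad J = b_2^{T} S\, b_2 = \lambda\, b_2^{T} b_2 = \lambda,$$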

therefore, minimizing J is equivalent to choosing the eigenvector corresponding to the smallest eigenvalue as the component to discard.

In general, this means that we want to minimize the sum of the eigenvalues of the components we discard, thus keeping the eigenvectors associated with the largest eigenvalues as principal components.

Implementing PCA

We will now implement PCA in Python using NumPy and scikit-learn. We will apply PCA to the Fashion MNIST dataset, which contains images of clothes and accessories from Zalando. The training set contains 60,000 samples, and each sample is a 28x28 grayscale image, so each image lives in a 784-dimensional space.

As a first step, we normalize the data, subtracting the mean value of each feature and dividing by its standard deviation. This step is necessary because the goal of PCA is to identify the components with maximum variance: unnormalized features (e.g. features measured in different units) can distort the comparison of variance across components.
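
With scikit-learn, this normalization can be done with StandardScaler (here X is assumed to be an N×D array with one flattened image per row):

```python
from sklearn.preprocessing import StandardScaler

# X has shape (N, 784): one flattened 28x28 image per row.
# Standardize each feature to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
```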

The provided Python program does mainly two things:

1) It plots the component-wise and cumulative explained variances against the number of dimensions. The cumulative explained variance is the percentage of the total variance that remains in the dataset when keeping the first N components. The component-wise variance is the percentage of variance contained in each component.

2) It shows one of the original images from the dataset, the same image reconstructed using only the first N principal components, and the same image reconstructed using only the last N components.

With fewer than 200 components (horizontal axis), the cumulative explained variance is already greater than 95% of the total.
The image obtained using 120 components (center) looks very similar to the original, even though it is represented in a subspace whose number of dimensions is only 15.3% of the original. On the contrary, the last 120 components carry almost no information.

To obtain the explained variances and the image reconstructed with the first N components, I used the standard PCA available in sklearn. Since it is not possible to obtain the last-components image directly with sklearn's PCA, I also implemented PCA from scratch using the formulas illustrated above. Since the design matrix (X) is given as N×D, in my implementation of PCA I transposed the matrix before applying the computations, to match the convention used above. The code is available in the Jupyter notebook below.
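
The notebook itself is not reproduced here, but a minimal sketch of the two steps described above might look like the following. I am assuming the dataset is loaded through Keras' fashion_mnist helper; the original notebook may load and plot the data differently:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.datasets import fashion_mnist  # assumed data loader

# Load and flatten: each 28x28 image becomes a 784-dimensional row vector.
(X_train, _), _ = fashion_mnist.load_data()
X = X_train.reshape(len(X_train), -1).astype(float)   # shape (60000, 784)

# Normalize every feature to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# 1) Component-wise and cumulative explained variance.
pca_full = PCA().fit(X_std)
plt.plot(pca_full.explained_variance_ratio_, label="component-wise")
plt.plot(np.cumsum(pca_full.explained_variance_ratio_), label="cumulative")
plt.xlabel("number of components")
plt.ylabel("explained variance ratio")
plt.legend()
plt.show()

# 2a) Reconstruction of one image using only the first N principal components.
N = 120
pca_n = PCA(n_components=N).fit(X_std)
first_n = pca_n.inverse_transform(pca_n.transform(X_std[:1]))

# 2b) Reconstruction using only the last N components (PCA from scratch).
# Following the convention above, the design matrix is X_std.T (784 x 60000).
S = np.cov(X_std.T)                        # 784 x 784 covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
B_last = eigvecs[:, :N]                    # eigenvectors of the N smallest eigenvalues
last_n = (B_last @ (B_last.T @ X_std[:1].T)).T   # reconstruction B B^T x

# Show the original (standardized) image next to the two reconstructions.
images = [(X_std[0], "original"),
          (first_n[0], f"first {N} components"),
          (last_n[0], f"last {N} components")]
for i, (img, title) in enumerate(images):
    plt.subplot(1, 3, i + 1)
    plt.imshow(img.reshape(28, 28), cmap="gray")
    plt.title(title)
    plt.axis("off")
plt.show()
```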

Where is PCA useful?

Here are some useful applications of PCA:

1) Reducing the size of a dataset before applying a machine learning algorithm. This reduces problems due to the curse of dimensionality, provides faster training and may reduce overfitting.

2) Compressing data to occupy less memory. For example, to compress an image we can divide it into small patches and apply PCA to those patches.

3) Filtering noisy data. It turns out that the noise is mostly contained in the low-variance components, so representing the data with a smaller subset of high-variance components preserves the signal and discards most of the noise (see the sketch after this list).

4) Visualizing high-dimensional data. The highest-dimensional space that humans are able to visualize is a 3D space, so we can always apply PCA to keep the three most significant components and plot them.

5) Feature engineering: understanding important features and patterns in the data. By carefully analyzing the principal components, one might try to understand the meaning of each component and discover the most important features in a dataset.
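
To make point 3 concrete, here is a small sketch of PCA-based noise filtering with scikit-learn; the toy data and the number of components kept are made up purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data: 500 samples that really live on a 5-dimensional subspace of R^50,
# corrupted with isotropic Gaussian noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
clean = latent @ mixing
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# Keep only the high-variance components and project back:
# the discarded low-variance components contain mostly noise.
pca = PCA(n_components=5).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))

print("mean squared error before filtering:", np.mean((noisy - clean) ** 2))
print("mean squared error after filtering: ", np.mean((denoised - clean) ** 2))
```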

Sources

The derivation of PCA was adapted from the Coursera course "Mathematics for Machine Learning: PCA", Imperial College London.
