PCA for image reconstruction, from scratch

Pranjall Kumar
13 min read · Nov 5, 2021


Image reconstruction using PCA, Image by author.

Today I want to show you the power of Principal Component Analysis (PCA). It is a technique for reducing the dimensionality of data and increasing interpretability while minimizing information loss.

That being said, let us see how this magic happens! I will walk through Python code for implementing PCA from scratch. This will help you understand the concept in greater detail and will also facilitate practical learning. To see the power of PCA visually, I have chosen image data. Even though the data consists of various kinds of anime waifus, there are still statistical patterns in the dataset that PCA can learn, letting it represent the images using fewer features.

So let's begin! I am presenting code that I developed as a Teaching Assistant during my Master’s degree, with the help and guidance of Prof. Viswanath Gopalakrishnan, for students to understand and implement PCA in the subject Mathematics for Machine Learning. I am using a subset of the dataset available on Kaggle. The dataset can be found here: Dataset. I am assuming you are using Google Colab. However, it can also be done offline on a powerful system in a Jupyter Notebook, using plotly offline if required. I would also recommend using dark mode for the best experience.

The first step would be to mount your Drive to Google Colab. It can be easily done using the following code:

#code to mount drive
from google.colab import drive
drive.mount('/content/drive')

Just follow the instructions to grant permission and you will be ready for the next step, which is to upload the dataset. Kindly note that I have made a folder named ‘PCA’ in MyDrive and uploaded the dataset in that folder.

Now we will import the required libraries. Just use the snippet given below. Here we are also setting plotly’s default renderer to ‘colab’ so that our plotly plots are visible in the notebook.

#importing libraries
import os
import sys
import cv2 as cv
import numpy as np
import plotly.io as pio
import plotly.graph_objs as go


from PIL import Image
from skimage import color
from plotly import subplots
from sklearn.model_selection import train_test_split


#setting the renderer to colab
pio.renderers.default = "colab"

Now we will load the dataset. But before I proceed, I would like to provide a word of caution. I have set the number of images to 500 and the image dimensions to 64x64 so that you can run the code flawlessly within the resource limits provided by Google Colab. However, if you are running it locally on a powerful system with 32 GB of RAM or more, you can change these values to get better results. For example, you can use all 1000 images provided in the dataset, or increase the image dimensions to 128x128.

All that being said, now here is the code for loading the images. I will explain the code in detail, step by step.

#Loading data and reducing size to 64 x 64 pixels
IMG_DIR = '/content/drive/MyDrive/PCA/Dataset'
X = []
X_flat = []
count = 1
size = 64
total = 500
print("Loading...")
for img in os.listdir(IMG_DIR):
    if count == total + 1:
        break
    sys.stdout.write("\r" + str(count) + " / " + str(total))
    sys.stdout.flush()
    img_array = cv.imread(os.path.join(IMG_DIR, img), cv.IMREAD_GRAYSCALE)
    img_pil = Image.fromarray(img_array)
    img_64x64 = np.array(img_pil.resize((size, size), Image.ANTIALIAS))
    X.append(img_64x64)
    img_array = img_64x64.flatten()
    X_flat.append(img_array)
    count += 1
print()
print("Done!")
  • The location of the dataset is loaded into the variable ‘IMG_DIR’.
  • Two lists ‘X’ and ‘X_flat’ are created for storing images.
  • A counter ‘count’ is initialized to 1 to keep track of which image is being loaded.
  • ‘size’ is initialized to 64 so as to keep the dimensions of the image 64x64.
  • ‘total’ is initialized to 500 so as to load only 500 images from the dataset.
  • Now each image in the dataset is accessed using the for loop.
  • An ‘if’ condition is used to keep track of the number of images already loaded. If ‘count’ becomes greater than ‘total’, the loop is terminated.
  • Constant feedback on which image is being loaded is given using the two sys statements.
  • Each image, one by one, is read into the variable ‘img_array’ using OpenCV’s imread() function. Also, note that each image is converted to grayscale.
  • The image (presently a numpy array) is converted to a PIL image using the Image.fromarray() function. This is done to help resize the image.
  • The image is then resized to 64x64 using the antialiasing filter and converted back to a numpy array.
  • It is then appended to the list ‘X’.
  • A flattened version of each image is also stored in the list ‘X_flat’, where a 64x64 image is reshaped to a (4096,) numpy array. We will be using ‘X_flat’ for further processing; ‘X’ is just for us to see how the images look for now. (A quick sanity check on the loaded data follows this list.)
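As a quick sanity check (a couple of lines I am adding for illustration, not part of the original notebook), you can print the shapes right after loading:

#quick sanity check on the loaded data
print(len(X), X[0].shape)            #expected: 500 (64, 64)
print(len(X_flat), X_flat[0].shape)  #expected: 500 (4096,)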

I hope the above points made the code clear. Now let's check whether the images have loaded correctly and get an idea of what they look like. Below is the snippet to show the first 25 images that were loaded into ‘X’.

#visualizing some images
size = 5
count = 0
fig = subplots.make_subplots(rows = size, cols = size,
                             vertical_spacing = 0.06, horizontal_spacing = 0.02)
for row in range(size):
    for col in range(size):
        fig.add_trace(go.Image(z = color.gray2rgb(X[count])),
                      row = row + 1, col = col + 1)
        count += 1
fig["layout"].update(title = "Anime Images", template = "plotly_dark", height = 900)
fig.show()

Once you run the code, you will see 25 images of some really pretty waifus! Kindly note that each of them is a grayscale image and plotly doesn’t plot single-channel grayscale images directly. So I used the gray2rgb() utility in skimage to give the grayscale images 3 channels, like a standard RGB image, since the default color model in go.Image() is RGB. If you don’t understand this plotting of traces in plotly, I would highly recommend looking at my other article on plotly. It will take you all the way from beginner to advanced plotting in no time! The article can be accessed here: A complete introduction to plotly, from beginner to advanced. Here is the expected output.

25 waifus for you to enjoy in grayscale, Image by author.

Ok! Let's not distract ourselves with these pretty faces and quickly move on to the main agenda for today. Now I am converting ‘X_flat’ into a numpy array in the snippet below. You will see an output of (500, 4096). Please remember, one flattened image was (4096,) and I have stacked all 500 loaded images together. The dimensions of the dataset might fool you into believing that there are 4096 features, but kindly note that each image is treated as a feature here. Therefore there are 500 features (which will be reduced to 250 by the test-train split further down the line). [Question 1: Can you tell what will happen if we take 4096 as features once you have gone through the entire concept?]

#converting X_flat to numpy array
X_flat = np.asarray(X_flat)
X_flat.shape

Output: (500, 4096)

Now let's do the test-train split. Do note that the split has to be 50–50; you can see that the ‘test_size’ attribute is set to 0.5. Also note that I have taken the transpose, to match the natural understanding of columns being features and rows being data points. Use the ‘random_state’ attribute to get the same split as mine. [Question 2: Can you answer why we need to split things into equal halves once you have gone through the entire concept?]

#Test-Train split
X_train, X_test = train_test_split(X_flat,
                                   test_size = 0.5, random_state = 69)
X_train = X_train.T
X_test = X_test.T
print(X_train.shape)
print(X_test.shape)

Output:
(4096, 250)
(4096, 250)

So now our data is ready for PCA. Let's start implementing it! But first, let me give you a brief idea of how PCA works. PCA can be broken down into 5 steps, and I will go through them one by one. It can be a lot to take in if you are very new to this topic, but I request you to be patient and read again if required. I have tried to use precise language to put the idea forward clearly. Unfortunately, there is a lot of reading ahead!

  1. Standardization: We standardize the data to let each feature have an equal contribution to the analysis. Imagine a regression line being fitted to data where the x-axis values differ by small amounts, like 0.3 to 0.9 (small variation), and the y-axis values differ by amounts in the thousands (large variation). Such a line will be heavily skewed towards the y-axis and will not be able to follow the variance of the data along the x-axis. Thus we need to bring all the features to a comparable scale so that none of them dominates. In standardization, we transform the feature values to have mean 0 and standard deviation 1.
  2. Covariance: We build the covariance matrix of the data to find relationships between features. Highly correlated features carry redundant information, and the covariance matrix gives us the insight needed to remove such redundancy.
  3. Eigenvalues and Eigenvectors: This step helps us determine the principal components of the data. Principal components are new feature values constructed as linear combinations, or mixtures, of the initial feature values. This is done in such a way that the new feature values are uncorrelated and most of the information is packed into the first component, then most of the remaining information into the next, and so on. This way we can drop the less important principal components to reduce the dimensionality of the data. Geometrically, eigenvectors (characteristic vectors) are vectors that do not change direction when the linear transformation is applied, and eigenvalues represent the amount by which they are scaled. So once we find the eigenvectors of the covariance matrix, we get the directions of the data that explain the most variance, ranked by their eigenvalues (a large eigenvalue means more variance along that eigenvector’s direction). And since the covariance matrix is symmetric, the eigenvectors are perpendicular. How convenient! So we can use them as the axes to describe the dataset. Thus we sort the eigenvectors by their eigenvalues in descending order to get the important principal components first.
  4. Feature vectors: These are the vectors that make the cut, i.e., the selected principal components that we consider sufficient to represent the data without losing much information.
  5. Projection to principal components: So far we have just found the directions that best explain the data. Now we reorient the data from the original axes to the ones described by the principal components. This will give us the data with reduced dimensions.

Phew! That seems like a lot of work! But if you understand the math properly and use the Python concept of vectorization, coding all this will be a breeze. If you are not familiar with these concepts, I would recommend an in-depth reading using resources by Gilbert Strang. So let's not wait anymore and get our hands dirty!
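Before building it piece by piece, here is a compact, purely illustrative walk-through of the five steps on a tiny random matrix. None of this is the article's code (names like toy_data are mine); it is just meant to show how little NumPy is actually needed:

#toy walk-through of the five PCA steps (illustrative sketch only)
import numpy as np

rng = np.random.default_rng(0)
toy_data = rng.normal(size = (100, 3))          #100 samples, 3 features

#step 1: standardize each feature
Z = (toy_data - toy_data.mean(axis = 0)) / toy_data.std(axis = 0)

#step 2: covariance matrix of the standardized data
S = np.cov(Z, rowvar = False)                   #shape (3, 3)

#step 3: eigenvalues/eigenvectors, sorted by decreasing eigenvalue
vals, vecs = np.linalg.eigh(S)
order = np.argsort(-vals)
vals, vecs = vals[order], vecs[:, order]

#step 4: keep the top k eigenvectors as the feature vector
k = 2
B = vecs[:, :k]

#step 5: project the data onto the selected principal components
Z_reduced = Z @ B                               #shape (100, 2)
print(Z_reduced.shape, vals)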

The following snippets were fill-in-the-blanks exercises for students to complete, which were then automatically checked using Jupyter Notebook’s nbgrader. But I present the complete code to you.

First is a function to do standardization. (Step 1)

#function for standardizing images
def Standardize(X):
    #calculate the mean of each column, mu
    mu = np.mean(X, axis = 0)

    #subtract the mean from X
    X = X - mu

    #calculate the standard deviation of each column, std
    std = np.std(X, axis = 0)

    #handle columns with zero standard deviation
    std_filled = std.copy()
    std_filled[std == 0] = 1.0

    #calculate the standardized X, called Xbar
    #(X is already mean-centered, so we only divide by the standard deviation)
    Xbar = X / std_filled

    return Xbar, mu, std
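A quick way to convince yourself the function behaves as intended (a small check of my own, not part of the original notebook): after standardization, every column should have mean close to 0 and standard deviation close to 1.

#quick check of Standardize() on random data
demo = np.random.rand(100, 5)
demo_bar, demo_mu, demo_std = Standardize(demo)
print(np.allclose(demo_bar.mean(axis = 0), 0))   #True
print(np.allclose(demo_bar.std(axis = 0), 1))    #True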

Then a function to calculate eigenvalues and eigenvectors and sort them based on eigenvalues in descending order. (Step 3)

#function for calculating eigenvalues and eigenvectors
def eig(S):
    #calculate the eigenvalues and eigenvectors
    #(np.linalg.eigh is used because S is symmetric)
    eig_val, eig_vec = np.linalg.eigh(S)

    #sorting them in decreasing order of eigenvalue
    sorted_eig = np.argsort(-eig_val)
    eig_val = eig_val[sorted_eig]
    eig_vec = eig_vec[:, sorted_eig]

    return (eig_val, eig_vec)
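As a tiny illustrative check (mine, not part of the original notebook): the symmetric matrix [[2, 1], [1, 2]] has eigenvalues 3 and 1, and eig() should return them in exactly that order.

#tiny check of eig() on a 2x2 symmetric matrix
demo_S = np.array([[2.0, 1.0],
                   [1.0, 2.0]])
demo_vals, demo_vecs = eig(demo_S)
print(demo_vals)         #[3. 1.]
print(demo_vecs[:, 0])   #roughly [0.707, 0.707], up to sign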

Then comes the function to project the data (Step 5). I know I am jumping steps, but as I said, these were the pieces of code I expected students to write, so I have kept them together here as functions. Steps 2 and 4 are taken care of in the correct order in the snippet that follows this one.

#function for the projection matrix
def projection_matrix(B):
    #calculate the projection matrix P
    P = B @ B.T

    return P
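Why does B @ B.T work as a projection? Because the eigenvectors returned by np.linalg.eigh are orthonormal, so P = B Bᵀ is an orthogonal projection onto the space spanned by the selected components: it is symmetric and applying it twice changes nothing. A small illustrative check (my own, not part of the original notebook) using random orthonormal columns:

#checking the properties of the projection matrix
demo_B, _ = np.linalg.qr(np.random.rand(6, 2))   #2 orthonormal columns in 6-D space
demo_P = projection_matrix(demo_B)
print(demo_P.shape)                              #(6, 6)
print(np.allclose(demo_P, demo_P.T))             #True: symmetric
print(np.allclose(demo_P @ demo_P, demo_P))      #True: idempotent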

Now the main function to perform PCA.

#implementing PCA
def PCA(X, num_components):
    #calculate the data covariance matrix S
    #(np.cov treats each row of X as a variable)
    S = np.cov(X)

    #now find the eigenvalues and corresponding eigenvectors of S
    #using the eig() function implemented above
    eig_vals, eig_vecs = eig(S)

    #select the first num_components eigenvectors
    U = eig_vecs[:, :num_components]

    #to reconstruct the images from the lower-dimensional representation,
    #we first need the projection matrix (implemented earlier),
    #which projects our input data onto the vector space spanned
    #by the selected eigenvectors
    P = projection_matrix(U) #projection matrix

    return P
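One thing worth keeping in mind (an illustrative shape check with made-up dimensions): np.cov(X) treats each row of X as a variable, so for an input with 5 rows the covariance matrix and the returned projection matrix are both 5x5, regardless of num_components. That is why we will later pass Xbar_train.T, which has 250 rows, and multiply Xbar_test (of shape 4096 x 250) by the resulting 250 x 250 matrix.

#shape check: P is always (rows of X) x (rows of X)
demo_X = np.random.rand(5, 100)   #5 "features", 100 observations
demo_P = PCA(demo_X, 2)
print(demo_P.shape)               #(5, 5)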

Note how all the code falls into place, following each step in order. Now, let's use all these functions that we have made. First, you guessed it, we standardize the data.

#standardizing the data
Xbar_train, mu_train, std_train = Standardize(X_train)
Xbar_test, mu_test, std_test = Standardize(X_test)

Now I am writing a function to calculate the Mean Squared Error (MSE), which we will use to measure the reconstruction error.

#function for mean squared error
def mse(predict, actual):
    return np.square(predict - actual).sum(axis = 1).mean()
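As a quick worked example (mine, for illustration): for an all-zeros prediction against an all-ones target with 3 columns, each row contributes a squared-error sum of 3, so the mean over rows is 3.

#worked example for mse()
print(mse(np.zeros((2, 3)), np.ones((2, 3))))   #3.0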

Now, using the training data, I build a projection matrix that projects the given data into the eigenvector feature space. This is done with the line projection = PCA(Xbar_train.T, num_component). Then I use this projection matrix, built from the training data, to project the test data into the same eigenvector space and see what happens, using the line reconst = Xbar_test @ projection. Can unseen data (an unknown anime waifu) also be reconstructed using this projection matrix? Let's find out!

#calculating loss and reconstructing images
loss = []
reconstructions = []
max_components = len(X_train.T)
print("Processing...")
animation = np.arange(1, max_components + 1, 1)
for num_component in range(1, max_components + 1):
    sys.stdout.write("\r" + str(animation[num_component - 1]) + \
                     " / " + str(max_components))
    sys.stdout.flush()
    projection = PCA(Xbar_train.T, num_component)
    reconst = Xbar_test @ projection
    error = mse(reconst, Xbar_test)
    reconstructions.append(reconst)
    loss.append((num_component, error))
print()
print("Done!")

You can see I am gradually increasing the number of principal components from 1 all the way to the maximum (250 in this case, since the data had 250 features to start with). I am simultaneously recording the reconstruction loss so that we can plot a graph. Let us see the graph and understand what happened.

#visualizing mse vs number of principal components
loss = np.asarray(loss) #convert the list of (num_component, mse) tuples for slicing
trace = go.Scatter(x = loss[:, 0], y = loss[:, 1])
data = [trace]
fig = go.Figure(data)
fig.update_layout(title = "MSE vs number of principal components",
                  xaxis_title = "Number of principal components",
                  yaxis_title = "MSE", template = "plotly_dark")
fig.show()

MSE vs number of principal components, Image by author.

You can clearly see the MSE decrease as I increase the number of principal components used to build the projection matrix. That means the quality of the reconstructed images gets better and better as we include more and more principal components. The MSE drops almost to zero when all 250 principal components are used.

Let's find out how our waifus are looking! I can’t wait!

First, remember that the data must be unstandardized; otherwise, the images won't have the correct pixel values. We can do it using the following simple code.

#unstandardizing the reconstructed images
reconstructions = np.asarray(reconstructions)
reconstructions = reconstructions * std_test + mu_test
loss = np.asarray(loss)
print("Done")

Once done, we can have a look at our waifus using the following code.

#plotting the reconstructed images
col = 5
row = 2
size = 10
skip = 20
images = 3


component = 0
titles = []
for loop in range(size):
    titles.append(str(component) + " components")
    component += skip


figs = [0]*images
for num in range(images):
    figs[num] = subplots.make_subplots(rows = row,
                                       cols = col, subplot_titles = titles)
    component = 0
    for r in range(row):
        for c in range(col):
            figs[num].add_trace(go.Image(
                z = color.gray2rgb(
                    reconstructions[component, :, num].reshape(64, 64))),
                row = r + 1, col = c + 1)
            component += skip
    figs[num]["layout"].update(title = "Reconstructed Images",
                               template = "plotly_dark")
for num in range(images):
    figs[num].show()

The output must look something like this:

Image by author
Image by author
Image by author

3 waifus are ready! You can see I am able to get some adorable faces using just 180 principal components. That means around 70 features (250 − 180) were mostly redundant. Imagine finding out that about 30% of the features you are providing to your model are redundant! This information can help the model as a whole attain better performance.

We all love anime; I hope you now associate PCA with it too! I hope I made this topic easy and interesting to understand. PCA is a difficult yet very important algorithm that helps more often than not. I would recommend further reading in Linear Algebra if you didn't fully understand what was happening. Use resources from Gilbert Strang to get a concrete idea of linear transformations, eigenvalues and eigenvectors and their significance, projections, etc.

That is all from me for this time. Kindly excuse me now. Do answer the two questions in the comments.

I hope you gained something from this article and I wish you the best of luck.

Bye!

Special thanks to Prof. Viswanath Gopalakrishnan for helping me fix major issues in the code.
