Understanding The Math Behind Dimension Reduction in Facial Recognition (3)

Read my beginner-friendly proofs to explore the applications of linear algebra

Xiaoli Jin
The Startup
7 min read · Aug 18, 2019


Facial recognition usually requires reducing image dimensions. In the first article of this series, I introduced methods to reduce dimensions while preserving full variance. To achieve that, I talked about covariance matrices, orthogonal bases, the Spectral Theorem, eigenvalues, and eigenvectors. In the second article, I dove into the proof underlying Principal Component Analysis (PCA), a common way to reduce dimensions while minimizing the loss of variance. In that article, I demonstrated why eigenvectors with large eigenvalues always correspond to dimensions with large variances.

As the final piece of the series, this article consists of two parts. First, I will apply PCA to the Middlebury College example introduced in the first article, so that readers can see PCA at work. Next, I will talk about the limitations of PCA and introduce an alternative. Much of this article builds on the earlier proofs, so if you haven’t checked out my first two articles, I highly encourage you to read them before scrolling down.

Recall that in the Middlebury College example, the communication director stored 3,000 student photos, each of which had 1,600 × 1,200 pixels. He recently saw a photo in a local newspaper and wanted to know which student it belonged to. Without any dimension reduction, he would need to work in a face space R of 1,600 × 1,200 = 1,920,000 dimensions. Now that we have learned PCA, we want to construct a subspace M that preserves at least 90% of R’s variance (the 90% requirement is arbitrary; substitute any number that fits your application). Implicit in this goal is that we want the dimension of M to be as small as possible. Next, we create 3,000 new photos by projecting the old photos from R onto M. The communication director would then work with the new photos, whose dimension is far smaller than that of the old ones.

To construct M, we go through the following three steps (if a matrix, lemma, or equation is not defined here, it is probably covered in the first two articles).

1) We compute the total variance of the face space R. Let σ² denote the total variance of R, let dᵢ denote the iᵗʰ column of the variance matrix A, and let λ₁, λ₂, …, λₙ denote the eigenvalues of the covariance matrix C; then:

Note that we don’t need to assemble the entire covariance matrix C, since we are only interested in its diagonal entries.
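To make step 1 concrete, here is a minimal NumPy sketch. Everything in it is a stand-in: it assumes, as in the first article, that the columns of A are the mean-subtracted photo vectors and that the total variance equals the trace of C = AAᵗ / n; the toy dimensions replace the real 1,920,000 × 3,000 matrix.

```
import numpy as np

# Toy stand-in for the real data; in the Middlebury example A would be
# 1,920,000 x 3,000. Assumption: each column of A is one photo with the
# average face already subtracted, as set up in the first article.
rng = np.random.default_rng(0)
n_pixels, n_photos = 4096, 300
A = rng.random((n_pixels, n_photos))
A -= A.mean(axis=1, keepdims=True)           # subtract the average face

# Total variance sigma^2 = trace(C), with C assumed to be A @ A.T / n_photos.
# We never form the huge C: trace(A @ A.T) is just the sum of all squared
# entries of A, i.e. the sum of ||d_i||^2 over the columns d_i of A.
sigma_sq = np.sum(A ** 2) / n_photos
print(f"total variance of the face space: {sigma_sq:.4f}")
```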

2) We compute the eigenvalues of Cᵗ in descending order. Recall that Lemma 2 proves every eigenvalue of C is also an eigenvalue of Cᵗ. We stop when the sum of the eigenvalues we have computed reaches 90% of σ². Assume we stop after computing m eigenvalues; then m needs to satisfy the following:

For each λᵢ, find the corresponding eigenvector xᵢ of Cᵗ.
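Continuing the same toy setup, here is a sketch of step 2. It assumes that Cᵗ plays the role of the small n × n matrix AᵗA / n from the usual eigenface trick, whose nonzero eigenvalues coincide with those of the huge C (Lemma 2); if the second article defines Cᵗ differently, adjust accordingly.

```
import numpy as np

# Same assumed setup as the step 1 sketch: columns of A are mean-subtracted
# photos; the real matrix would be 1,920,000 x 3,000.
rng = np.random.default_rng(0)
A = rng.random((4096, 300))
A -= A.mean(axis=1, keepdims=True)
n_photos = A.shape[1]
sigma_sq = np.sum(A ** 2) / n_photos          # total variance from step 1

# Assumption: "C^t" is the small n x n matrix A.T @ A / n, which shares its
# nonzero eigenvalues with the huge covariance matrix C.
small = A.T @ A / n_photos
eigvals, eigvecs = np.linalg.eigh(small)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]             # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the smallest m whose eigenvalues sum to at least 90% of sigma^2.
m = int(np.searchsorted(np.cumsum(eigvals), 0.9 * sigma_sq)) + 1
x = eigvecs[:, :m]                            # eigenvectors x_1 ... x_m of C^t
print(f"m = {m} eigenvalues reach 90% of the total variance")
```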

3) We convert each xᵢ to its counterpart in C. Let tᵢ denote the corresponding eigenvector of C. According to equation 12, tᵢ is computed as:

t₁, t₂, …, tₘ form a basis for the subspace M we desired. Mission complete! You may have seen the terms eigenface and eigenface subspace elsewhere. In this case, M is the eigenface subspace and each eigenvector tᵢ is an eigenface. We have m eigenfaces in total.
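Here is a sketch of step 3 under the same assumptions. I do not reproduce equation 12 from the second article; the code assumes it takes the usual eigenface form tᵢ = A xᵢ, followed by normalization so that each eigenface has unit length.

```
import numpy as np

# Same assumed setup: columns of A are mean-subtracted photos, and
# x_1 ... x_m are the top eigenvectors of the small matrix A.T @ A / n.
rng = np.random.default_rng(0)
A = rng.random((4096, 300))
A -= A.mean(axis=1, keepdims=True)
eigvals, eigvecs = np.linalg.eigh(A.T @ A / A.shape[1])
order = np.argsort(eigvals)[::-1]
m = 10                                        # pretend step 2 gave m = 10
x = eigvecs[:, order[:m]]

# Assumed form of equation 12: each eigenface is t_i = A @ x_i, then
# normalized. The columns of T are the m eigenfaces spanning M.
T = A @ x
T /= np.linalg.norm(T, axis=0, keepdims=True)
print(T.shape)                                # (n_pixels, m)
```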

The above steps lay out the theoretical foundation for constructing M. To compute M in a practical setting, we can diagonalize Cᵗ and rank its eigenvalues in decreasing order. In situations where diagonalization proves computationally expensive (which is, unfortunately, most cases), we can use the power method to approximate the eigenvector with the largest eigenvalue and the deflation method to approximate the subsequent eigenvectors, ranked by eigenvalue. Given the scope of this article, I leave the details of these approximations to the reader.
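For readers who want a starting point, below is a minimal, unoptimized sketch of the power method with deflation on a small symmetric matrix. It illustrates the idea only; in practice you would use a library eigensolver or a more careful deflation scheme.

```
import numpy as np

def power_method(S, iters=1000, tol=1e-10):
    """Approximate the dominant eigenvalue/eigenvector of a symmetric matrix S."""
    v = np.random.default_rng(0).standard_normal(S.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = S @ v
        v = w / np.linalg.norm(w)
        new_lam = v @ S @ v                 # Rayleigh quotient
        if abs(new_lam - lam) < tol:
            return new_lam, v
        lam = new_lam
    return lam, v

def top_eigenpairs(S, k):
    """Deflation: repeatedly strip off the dominant eigenpair and re-run."""
    S = S.copy()
    pairs = []
    for _ in range(k):
        lam, v = power_method(S)
        pairs.append((lam, v))
        S -= lam * np.outer(v, v)           # remove that component, then repeat
    return pairs

# Toy check against NumPy's exact solver.
rng = np.random.default_rng(1)
M = rng.random((50, 50))
S = M @ M.T                                 # symmetric positive semidefinite
approx = [lam for lam, _ in top_eigenpairs(S, 3)]
exact = np.sort(np.linalg.eigvalsh(S))[::-1][:3]
print(np.round(approx, 4), np.round(exact, 4))
```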

So far we have constructed the subspace M, which has dimension m. We are now going to create 3,000 new photos by projecting the old photos from R onto M. To do so, we represent each photo as a linear combination of the m eigenfaces in M. Recall that the Spectral Theorem guarantees our eigenfaces are orthogonal to each other. Let vᵢ be the iᵗʰ old photo (vᵢ is a normalized vector representation of the old photo), and let f₁, f₂, …, fₘ be the eigenfaces in M. According to the formula for orthogonal projection:

equation 19

The left-hand side of equation 19 is the old photo, while the right-hand side is the new photo, which preserves 90% of the old photo’s variance.

Let Ωᵢ be a weight vector of length m that stores the projection coefficients of vᵢ onto the m eigenfaces. If we take the inner product to be the dot product, we have:
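Assuming the m eigenfaces are stored as the orthonormal columns of a matrix F and the mean-subtracted photos as the columns of V (both hypothetical names), all 3,000 weight vectors come from a single matrix product:

```
import numpy as np

# Hypothetical setup: F holds the m eigenfaces as orthonormal columns,
# V holds the mean-subtracted photos as columns (toy sizes again).
rng = np.random.default_rng(0)
n_pixels, n_photos, m = 4096, 300, 10
F, _ = np.linalg.qr(rng.random((n_pixels, m)))   # stand-in orthonormal eigenfaces
V = rng.random((n_pixels, n_photos))
V -= V.mean(axis=1, keepdims=True)

# Column i of Omega is the weight vector of photo i: its dot products
# with the m eigenfaces, i.e. the coefficients in equation 19.
Omega = F.T @ V                                  # shape (m, n_photos)

# The right-hand side of equation 19 (the "new" photo) for photo 0:
new_photo_0 = F @ Omega[:, 0]
```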

Now we pull up the photo Middlebury’s communication director found in the newspaper. We first normalize this photo by subtracting the average face k̄ described in the first article. We then compute Ω_new, the vector of projection coefficients of the newspaper photo onto the m eigenfaces. To find out which student it belongs to, we calculate the squared Euclidean distance ϵᵢ² between Ω_new and each of the 3,000 Ωᵢ:

The newspaper photo is considered to belong to student i if ϵᵢ² falls below some chosen threshold or is the smallest among the 3,000 distances.
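In the same hypothetical setup, matching the newspaper photo is a nearest-neighbour search over the 3,000 weight vectors. The sketch below uses the common variant that accepts the closest student only if the distance also clears a threshold; the threshold value is arbitrary.

```
import numpy as np

# Stand-ins for quantities computed earlier: eigenfaces F, the students'
# weight vectors Omega, and the average face from the first article.
rng = np.random.default_rng(0)
n_pixels, n_photos, m = 4096, 300, 10
F, _ = np.linalg.qr(rng.random((n_pixels, m)))
Omega = rng.random((m, n_photos))
avg_face = rng.random(n_pixels)

# Normalize the (hypothetical) newspaper photo and project it onto M.
newspaper = rng.random(n_pixels)
omega_new = F.T @ (newspaper - avg_face)

# Squared Euclidean distance to every student's weight vector.
eps_sq = np.sum((Omega - omega_new[:, None]) ** 2, axis=0)

threshold = 5.0                              # arbitrary, application-specific
best = int(np.argmin(eps_sq))
if eps_sq[best] < threshold:
    print(f"closest match: student {best} (distance {eps_sq[best]:.3f})")
else:
    print("no student is close enough to count as a match")
```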

Recall that the original student photos have 1,600 × 1,200 pixels. If we let m = 10 (i.e., pick 10 principal components), we reduce the complexity of the computation to 10 / (1,600 × 1,200) = 10 / 1,920,000 ≈ 0.0005% of the original level. That is why dimension reduction is almost indispensable in the field of image recognition.

Additional Reading: An Alternative to PCA

The eigenface approach comes with its own disadvantages, especially when we take facial expressions into account. PCA may do a great job distinguishing between different people, but its performance declines when it comes to class separation, e.g., separating two groups of photos, each featuring the same person with different facial expressions. This is where Linear Discriminant Analysis (LDA) can help.

LDA searches for the linear combination of variables that best separates two classes. Like PCA, LDA reduces dimensions, but with a strong focus on maximizing separability among classes. Generally speaking, LDA picks a dimension that:

  1. Maximizes the distance between the means of projected classes.
  2. Minimizes the variance within each projected class.

To see LDA in action, suppose we want to separate Class A and Class B. Each class contains a set of photos featuring one person under different facial expressions. Just like PCA, we start by constructing the covariance matrix and finding its eigenbasis. To select appropriate dimensions from the eigenbasis, we let μ_A and μ_B be the means of Class A and Class B when projected onto a candidate dimension, and let σ_A² and σ_B² be the variances of the two classes under that projection. Among all available dimensions, we iteratively select the dimension that satisfies:
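The criterion described above is presumably the standard two-class Fisher ratio, (μ_A − μ_B)² / (σ_A² + σ_B²): a large ratio means the projected means are far apart relative to the spread within each class. Here is a small sketch that scores candidate dimensions by that ratio; the data and the choice of coordinate axes as candidates are illustrative assumptions.

```
import numpy as np

def fisher_score(a_proj, b_proj):
    """Two-class Fisher ratio: (gap between means)^2 / (sum of within-class variances)."""
    return (a_proj.mean() - b_proj.mean()) ** 2 / (a_proj.var() + b_proj.var())

# Hypothetical data: rows are photos, columns are projection coefficients.
rng = np.random.default_rng(0)
class_a = rng.normal(0.0, 1.0, size=(100, 5))
class_b = rng.normal(0.8, 1.0, size=(100, 5))

# Candidate dimensions: here simply the 5 coordinate axes; in the eigenface
# setting they would be directions taken from the covariance eigenbasis.
scores = [fisher_score(class_a[:, j], class_b[:, j]) for j in range(class_a.shape[1])]
best_dim = int(np.argmax(scores))
print(f"dimension {best_dim} separates the classes best (Fisher score {scores[best_dim]:.3f})")
```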

This is a very basic introduction to LDA. LDA does not always outperform PCA. For example, PCA often performs better than LDA when the distinguishing factors of the data lie in the variance rather than in the means of the projection coefficients. Readers should exercise their own discretion and pick the right model for their data.

That’s all for this series on dimension reduction. If you have any questions, comments, or suggestions about what you want to see next, please leave a message and I will get back to you ASAP.

