Understanding The Math Behind Dimension Reduction in Facial Recognition (1)

Read my beginner-friendly proofs to explore the application of linear algebra

Xiaoli Jin
The Startup
10 min read · Jun 5, 2019


Dimension reduction is a common technique in computer science that most of you have probably run into in one way or another, but not everyone understands how it is derived mathematically. In this series of articles, I will walk you through the linear algebra behind dimension reduction through a simple example of facial recognition. Besides understanding the math, you will also see how dimension reduction can be helpful in simplifying the task of image recognition. In particular, I will talk about eigenvectors and eigenvalues, the Spectral Theorem, and Principal Component Analysis.

Why Dimension Reduction is Needed

To refresh your memory on dimension reduction, let’s start with a hypothetical example. The Communication Department of Middlebury College stores 3,000 photos, one taken of each Middlebury College student. The Department Director recently found a picture of a student in a local newspaper, and she wanted to know whether that student belongs to Middlebury.

Designing an efficient model for facial recognition is hard, thanks to the multidimensional nature of images. If we confine the scope of our discussion to black-and-white images, we can think of an image of m × n pixels as a matrix filled with real numbers between 0 (black) and 255 (white). Each matrix has a dimension of m × n, with each pixel representing a dimension.

If we assume each student photo is 1,600 × 1,200 in dimension, and we compare the new image with each of the 3,000 photos pixel by pixel, we would have to do 3,000 × 1,600 × 1,200 = 5,760,000,000 comparisons. Our goal is to reduce the dimension of student photos without losing (or with minimal loss of) the variance captured in them. Before I delve into the specific steps, I want to first introduce the basic concept of variance.

An Intuitive Overview of Variance

Variance is the average of the squared difference from the mean. For instance, given a list of real numbers x₁ … xₙ with mean x̄, we define the variance as:

σ² = (1/n) Σᵢ (xᵢ − x̄)²

In this application, variance is what distinguishes one person’s flat nose from another person’s hooked nose. We want the variance among the 3,000 photos of Middlebury students to be as high as possible, since high variance improves the accuracy of recognition. Intuitively, variance grows with the number of dimensions: it is easier to identify the photo of your friend if you are given the complete image rather than just the pixels around the nose area. Yet some dimensions may be redundant. For example, pixels next to each other may have very similar values, as they capture the same kind of variance of the face.
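If you want to see the formula in action, here is a minimal NumPy sketch (the numbers are made up purely for illustration) that computes the variance of a short list of pixel values by hand and checks it against NumPy’s built-in population variance:

    import numpy as np

    # A short made-up list of pixel intensities.
    x = np.array([12.0, 40.0, 25.0, 31.0])

    # Variance: the average of the squared differences from the mean.
    variance = np.mean((x - x.mean()) ** 2)

    # np.var uses the same 1/n population formula, so the two values agree.
    assert np.isclose(variance, np.var(x))
    print(variance)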

Recall that in our example, the Director started with 1,600 × 1,200 = 1,920,000 dimensions. In the remaining part of this article, I will demonstrate how a change of basis can help us shrink the number of dimensions to t < 1,920,000 without losing any variance. In the next article, I will demonstrate how Principal Component Analysis can further reduce the number of dimensions to m ≪ t while preserving most of the variance.

Construction of Covariance Matrix

Let’s start with constructing a real vector space that contains all student images. We convert each student image matrix into a row vector by concatenating its pixel rows one after another. For example, if K₁ represents the image matrix of student 1, then its face vector is:

k₁ = (first row of K₁, second row of K₁, … , last row of K₁), a row vector of length 1,600 × 1,200 = 1,920,000

Next, we combine the row vectors k₁ … k₃₀₀₀ together to form a 3,000 × 1,920,000 matrix K_all. Each column of K_all represents a dimension (corresponding to a pixel) and each row represents a face vector kᵢ. Together, the row vectors of K_all live in our face space ℝ¹⁹²⁰⁰⁰⁰.
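As a concrete sketch of this step, the NumPy snippet below flattens a batch of tiny made-up grayscale images (stand-ins for the real 1,600 × 1,200 photos) into row vectors and stacks them into a matrix shaped like K_all:

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins for the student photos: 3,000 grayscale images of m x n pixels.
    # Tiny 4 x 5 images are used so the example runs instantly.
    num_students, m, n = 3000, 4, 5
    images = rng.integers(0, 256, size=(num_students, m, n)).astype(float)

    # Concatenate each image row by row into a row vector k_i, then stack the
    # row vectors to form K_all (one face vector per row).
    K_all = images.reshape(num_students, m * n)
    print(K_all.shape)  # (3000, 20)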

To underscore the differences across images and to simplify future computations, we first compute the average face vector k̄, i.e. the mean value of each pixel across all 3,000 face images. Then we normalize each face vector kᵢ by subtracting the average face vector from it. Let vᵢ be the iᵗʰ face difference vector after normalization, then:

vᵢ = kᵢ − k̄

Similarly, we combine the face difference vectors v₁ … v₃₀₀₀ to form the variance matrix A. Each column of A represents a dimension and each row represents a face difference vector vᵢ.
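Continuing in the same spirit, here is a small NumPy sketch (again with made-up data standing in for the 3,000 × 1,920,000 matrix) of how the average face vector and the variance matrix A can be computed:

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in for K_all: 3,000 face vectors with 20 pixels each.
    K_all = rng.integers(0, 256, size=(3000, 20)).astype(float)

    # Average face vector: the per-pixel mean across all 3,000 face vectors.
    mean_face = K_all.mean(axis=0)

    # Face difference vectors v_i: subtract the average face from each face vector k_i.
    A = K_all - mean_face

    # Every column (pixel dimension) of the variance matrix A now has mean zero.
    assert np.allclose(A.mean(axis=0), 0.0)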

To maximize variance with the fewest dimensions, we want each dimension to be independent, i.e. the variance captured in dimension x should not overlap with the variance captured in dimension y. In mathematics, covariance is the concept that measures the joint variability of two dimensions. The covariance formula states that:

Cov(x, y) = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)   (equation 1)

In our case, since we have normalized our face vectors, x̄ = ȳ = 0, so we can directly compute the covariance between dimensions x and y through the dot product of columns x and y of the variance matrix A. Let d_x and d_y denote the xᵗʰ and the yᵗʰ columns of A (corresponding to pixels x and y). Also let α be 1/3,000, the inverse of the number of rows in A. Then:

Cov(x, y) = α (d_x · d_y) = α d_xᵗ d_y   (equation 2)
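To convince yourself that the column dot product really does give the covariance once the means are zero, here is a quick NumPy check on synthetic zero-mean data (np.cov with bias=True uses the same 1/n convention as equation 2):

    import numpy as np

    rng = np.random.default_rng(1)
    n_rows = 3000
    A = rng.normal(size=(n_rows, 20))
    A = A - A.mean(axis=0)                 # zero-mean columns, like the variance matrix A
    alpha = 1.0 / n_rows

    x, y = 3, 7                            # two pixel dimensions (columns of A)
    cov_xy = alpha * (A[:, x] @ A[:, y])   # equation 2: alpha * d_x . d_y

    expected = np.cov(A[:, x], A[:, y], bias=True)[0, 1]
    assert np.isclose(cov_xy, expected)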

To lay out the covariance between every two dimensions in matrix A, we construct the covariance matrix C. Let dᵢ denote the iᵗʰ column of matrix A:

Cᵢⱼ = Cov(i, j), the covariance between dimensions i and j of A   (equation 3)

Combining equations 2 and 3, we have:

Cᵢⱼ = α dᵢᵗ dⱼ for every pair of dimensions i and j, i.e. C = α AᵗA   (equation 4)

The covariance matrix C is symmetric, as we can infer from equation 2 that:

Cov(x, y) = α d_xᵗ d_y = α d_yᵗ d_x = Cov(y, x)

The diagonal entries of C indicate the variance on each dimension.
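Putting equations 2 to 4 together, the whole covariance matrix can be produced with one matrix product. The sketch below (again on synthetic zero-mean data) checks that α AᵗA is symmetric, that its diagonal holds the per-pixel variances, and that it matches NumPy’s own covariance routine:

    import numpy as np

    rng = np.random.default_rng(2)
    n_rows = 3000
    A = rng.normal(size=(n_rows, 20))
    A = A - A.mean(axis=0)
    alpha = 1.0 / n_rows

    # Equation 4: the covariance matrix as a single matrix product.
    C = alpha * A.T @ A

    # C is symmetric and its diagonal holds the variance of each dimension.
    assert np.allclose(C, C.T)
    assert np.allclose(np.diag(C), np.var(A, axis=0))

    # np.cov expects variables in rows, so pass A.T; bias=True matches the 1/n convention.
    assert np.allclose(C, np.cov(A.T, bias=True))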

Change to Orthogonal Basis: The Spectral Theorem

Once we have constructed the covariance matrix C, we want to find a new orthogonal basis under which Cov(x, y) = 0 for any two distinct dimensions x and y represented in the new basis. This new basis is optimal: since there is no overlap of variance across dimensions, the variance captured on each dimension is fully utilized. Thanks to the Spectral Theorem and the symmetry of C shown above, we can perform a change of basis on C:

C = Q D Qᵗ   (equation 5)

where D is a real diagonal matrix and Q is an orthogonal matrix. The eigenvalues of C appear on the diagonal of D. The columns of Q, q₁, q₂, …, q₁₉₂₀₀₀₀, are the corresponding orthonormal eigenvectors, and they compose the new orthonormal basis we desire.

Through this change of basis, q₁, q₂, …, q₁₉₂₀₀₀₀ form a new orthonormal basis of our face space. According to the Spectral Theorem, this orthonormal basis happens to be an eigenbasis of C. Moreover, because D is diagonal, the covariance between any two distinct dimensions represented in this basis is 0, which is exactly the property we wanted.
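In code, the decomposition of equation 5 is one call away. The sketch below builds a small covariance matrix from synthetic data and uses np.linalg.eigh (which is designed for symmetric matrices) to recover Q and D:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.normal(size=(300, 12))
    A = A - A.mean(axis=0)
    C = A.T @ A / A.shape[0]

    # eigh returns real eigenvalues (the diagonal of D) and orthonormal
    # eigenvectors (the columns of Q) for a symmetric matrix.
    eigenvalues, Q = np.linalg.eigh(C)
    D = np.diag(eigenvalues)

    # Equation 5: C = Q D Q^t, with Q orthogonal (Q^t Q = I).
    assert np.allclose(C, Q @ D @ Q.T)
    assert np.allclose(Q.T @ Q, np.eye(C.shape[0]))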

So far we have eliminated covariance between dimensions. Yet careful readers may notice that we still have the same number of dimensions, since the new orthogonal basis consists of 1,920,000 eigenvectors and each eigenvector spans a dimension. If so, how does constructing an orthogonal eigenbasis help us in dimension reduction? It turns out that we can safely remove from the orthogonal eigenbasis every eigenvector that corresponds to a zero eigenvalue, since such eigenvectors do not contribute to the total variance. To verify this claim, I will proceed to prove that the variance on the dimension spanned by each eigenvector is equal to the eigenvalue associated with that eigenvector (Lemma 1), so that a zero eigenvalue indicates zero variance. If you don’t want to see the detailed proof, you can jump to the next section.

Lemma 1: Let u be an eigenvector from the orthonormal eigenbasis of the matrix C = α AᵗA, where α is the inverse of the number of rows in A. Let λ be the eigenvalue corresponding to u. Let v₁ … vₙ be the normalized row vectors of A. Let σᵤ² be the variance of v₁ … vₙ on the dimension spanned by u. We claim that σᵤ² = λ.

To prove Lemma 1, let pᵢ be the projected weight of vᵢ onto the dimension spanned by u. According to the orthogonal projection formula, under the Euclidean inner product:

pᵢ = (vᵢ · u) / (u · u) = vᵢ · u, since u has unit length

Let p be the projected measurement vector that contains p₁ … pₙ:

p = (p₁, p₂, …, pₙ)ᵗ = A u   (equation 6)

Since v₁ … vₙ have been normalized, v₁ + … + vₙ = 0. Therefore:

p̄ = (1/n)(p₁ + … + pₙ) = (1/n)(v₁ + … + vₙ) · u = 0   (equation 7)

According to the variance formula:

σᵤ² = (1/n) Σᵢ (pᵢ − p̄)² = (1/n) Σᵢ pᵢ² = α (p · p)   (equation 8)

Combining equations 6 and 8, we have:

σᵤ² = α (Au) · (Au)   (equation 9)

Recall that for any two vectors v and w of the same length:

v · w = vᵗ w   (equation 10)

Combining equations 9 and 10, we have:

(Au) · (Au) = (Au)ᵗ (Au) = uᵗ AᵗA u   (equation 11)

Combining equations 9 and 11, and using the facts that Cu = λu and uᵗu = 1 (u is a unit eigenvector of C), we have:

σᵤ² = α uᵗAᵗAu = uᵗ(αAᵗA)u = uᵗCu = uᵗ(λu) = λ(uᵗu) = λ

Having proved Lemma 1, we have now confirmed that the removal of eigenvectors with zero eigenvalue does not affect variance. Moreover, the total variance captured in the face space is the sum of all eigenvalues of C.
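Lemma 1 is also easy to check numerically. In the sketch below, the rows of a synthetic zero-mean matrix play the role of the face difference vectors vᵢ; for every unit eigenvector u of C, the variance of the projected weights Au comes out equal to the corresponding eigenvalue:

    import numpy as np

    rng = np.random.default_rng(4)
    n_rows = 500
    A = rng.normal(size=(n_rows, 10))
    A = A - A.mean(axis=0)           # rows play the role of the v_i
    C = A.T @ A / n_rows             # C = alpha * A^t A

    eigenvalues, Q = np.linalg.eigh(C)

    # Lemma 1: for each unit eigenvector u, the variance of the projected
    # weights p = A u equals the corresponding eigenvalue.
    for lam, u in zip(eigenvalues, Q.T):
        p = A @ u                    # equation 6
        assert np.isclose(np.var(p), lam)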

Conversion between AᵗA and AAᵗ

So far we have proved that eigenvectors with zero eigenvalue do not contribute to variance. If we remove t eigenvectors with zero eigenvalue from the eigenbasis of C, the subspace T spanned by the remaining eigenvectors would only have a dimension of 1,920,000 − t, but T would capture the same amount of variance as our original face space, which has a dimension of 1,920,000. The question is: how do we maximize t and thus reduce as many dimensions as possible?

Let Cᵗ = αAAᵗ. We notice that Cᵗ only has a dimension of 3,000 × 3,000, significantly smaller than that of C = αAᵗA. Thus we proceed by exploring the relationship between Cᵗ and C. According to the Spectral Theorem, Cᵗ also possesses an orthonormal eigenbasis, for it is clear that Cᵗ is symmetric:

(αAAᵗ)ᵗ = α(Aᵗ)ᵗAᵗ = αAAᵗ

If we can prove that every eigenvector of C with a non-zero eigenvalue has a corresponding eigenvector of Cᵗ with the same eigenvalue, we can show that the sum of eigenvalues of C is the same as the sum of eigenvalues of the much smaller Cᵗ. In light of Lemma 1, we can then demonstrate that the eigenbasis of Cᵗ captures the same amount of variance as the eigenbasis of C. Hence, we can maximize t up to 1,920,000 − 3,000 = 1,917,000. Consequently, we can form a subspace T of dimension 3,000 that carries the same amount of variance as our original 1,920,000-dimensional face space.
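Before proving the claim, we can sanity-check it numerically. The sketch below uses a small synthetic matrix (30 “photos” with 200 “pixels” each, standing in for 3,000 and 1,920,000) and confirms that the big matrix αAᵗA has at most 30 non-zero eigenvalues, and that those coincide with the eigenvalues of the small matrix αAAᵗ:

    import numpy as np

    rng = np.random.default_rng(5)
    n_rows, n_cols = 30, 200              # stand-ins for 3,000 photos and 1,920,000 pixels
    A = rng.normal(size=(n_rows, n_cols))
    A = A - A.mean(axis=0)
    alpha = 1.0 / n_rows

    C_big   = alpha * A.T @ A             # the huge covariance matrix C
    C_small = alpha * A @ A.T             # the much smaller matrix C^t

    eig_big   = np.sort(np.linalg.eigvalsh(C_big))[::-1]
    eig_small = np.sort(np.linalg.eigvalsh(C_small))[::-1]

    # The big matrix has at most n_rows non-zero eigenvalues, and they
    # coincide with the eigenvalues of the small matrix.
    assert np.allclose(eig_big[:n_rows], eig_small, atol=1e-10)
    assert np.allclose(eig_big[n_rows:], 0.0, atol=1e-10)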

We proceed to prove the above claim that every eigenvector in C with a non-zero eigenvalue has a corresponding eigenvector in Cᵗ with the same eigenvalue:

Lemma 2: If q is an eigenvector of α AᵗA (α ≠ 0) with a non-zero eigenvalue λ, then Aq is an eigenvector of α AAᵗ with the same eigenvalue λ.

To prove Lemma 2, let q be an eigenvector of α AᵗA that corresponds to a non-zero eigenvalue λ. Then:

(αAAᵗ)(Aq) = A(αAᵗA)q = A(λq) = λ(Aq)

so Aq is an eigenvector of αAAᵗ with eigenvalue λ, provided that Aq ≠ 0.

We still need to prove that Aq ≠ 0. If Aq = 0, then:

λq = (αAᵗA)q = αAᵗ(Aq) = 0

However, the fact that q is an eigenvector with a non-zero eigenvalue λ implies that q ≠ 0 and λ ≠ 0, so λq ≠ 0. Hence a contradiction has arisen. Therefore, Aq ≠ 0. The proof is complete.

By the same logic, if q is an eigenvector of Cᵗ with a non-zero eigenvalue λ, then Aᵗq is the corresponding eigenvector of C, since (αAᵗA)Aᵗq = Aᵗ(αAAᵗ)q = Aᵗλq = λ(Aᵗq). Now let x₁ … x₃₀₀₀ be the orthonormal eigenbasis of the much smaller Cᵗ. In light of Lemma 2, we can convert x₁ … x₃₀₀₀ back to eigenvectors of C by multiplying each with Aᵗ. Let tᵢ be the eigenvector of C that shares the same eigenvalue with xᵢ in Cᵗ, then:

tᵢ = Aᵗ xᵢ   (equation 12)

As a result, t₁ … t₃₀₀₀ span the subspace T we introduced before, which captures the same amount of variance as our original face space. If our goal is to reduce dimension without losing any variance, we can stop here and let T be our final eigenface subspace. We call each eigenvector t₁ … t₃₀₀₀ an eigenface.
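Here is a compact sketch of the whole conversion on synthetic data: diagonalize the small matrix αAAᵗ, map each of its eigenvectors xᵢ back to tᵢ = Aᵗxᵢ as in equation 12, and verify that each tᵢ is indeed an eigenvector of the big covariance matrix C with the same (non-zero) eigenvalue:

    import numpy as np

    rng = np.random.default_rng(6)
    n_rows, n_cols = 30, 200              # stand-ins for 3,000 photos and 1,920,000 pixels
    A = rng.normal(size=(n_rows, n_cols))
    A = A - A.mean(axis=0)
    alpha = 1.0 / n_rows

    C       = alpha * A.T @ A             # big covariance matrix
    C_small = alpha * A @ A.T             # small matrix C^t

    # Orthonormal eigenbasis x_1 ... x_n of the small matrix.
    eigenvalues, X = np.linalg.eigh(C_small)

    # Equation 12: eigenfaces t_i = A^t x_i, eigenvectors of C with the same eigenvalue.
    for lam, x in zip(eigenvalues, X.T):
        if lam > 1e-10:                   # Lemma 2 only covers non-zero eigenvalues
            t = A.T @ x
            assert np.allclose(C @ t, lam * t)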

Summary and Preview of The Next Article

In this article, our goal was to preserve the entire variance of the face space. As a trade-off, we still needed to include dimensions with relatively small variances, which increased computational difficulty. If we are willing to accept a minor loss of variance in exchange for major gains in computational efficiency, we can further reduce the dimension of the eigenface subspace by selecting only the dimensions with the largest variances. In the next article, I will introduce how to do that through Principal Component Analysis (PCA). As in this article, I will focus on the mathematical proof behind PCA. To conclude this series, I will also talk about how to project the original student photos onto the eigenface subspaces that we will create through PCA.

