ML model data prep series

Dimensionality Reduction Techniques — PCA, LDA and SVD

Indraneel Dutta Baruah · Published in Nerd For Tech · Oct 7, 2023

Let’s learn about PCA, LDA, and SVD: their pros and cons, when to use each, and how to implement them in Python.

Dimensionality reduction plays a pivotal role in data analysis and machine learning, offering a strategic solution to the challenges posed by high-dimensional datasets. As datasets grow in size and complexity, the number of features or dimensions often becomes unwieldy, leading to increased computational demands, potential overfitting, and diminished model interpretability. Dimensionality reduction techniques provide a remedy by capturing the essential information within the data while discarding redundant or less informative features. This process not only streamlines computational tasks but also aids in visualizing data trends, mitigating the risk of the curse of dimensionality, and improving the generalization performance of machine learning models. Dimensionality reduction finds applications across various domains, from image and speech processing to finance and bioinformatics, where extracting meaningful patterns from vast datasets is crucial for making informed decisions and building effective predictive models.

In this blog, we will delve into three powerful dimensionality reduction techniques — Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Singular Value Decomposition (SVD). Our exploration will not only elucidate the underlying algorithms of these methods but also provide their respective advantages and disadvantages. We will accompany theoretical discussions with practical implementations in Python, offering hands-on guidance for applying PCA, LDA, and SVD to real-world datasets. Whether you’re a novice seeking an introduction to dimensionality reduction or a seasoned practitioner looking to enhance your understanding, this blog is crafted to cater to all levels of expertise.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique widely used in data analysis and machine learning. Its primary goal is to transform high-dimensional data into a lower-dimensional representation, capturing the most important information.

Here’s the motivation for PCA:

As we aim to identify patterns within datasets, it’s desirable for the data to be spread out across each dimension, and we want those dimensions to be independent of one another. Let’s revisit some fundamental concepts. Variance is a measure of variability: it quantifies how dispersed the dataset is. Mathematically, it is the average squared deviation from the mean:

var(x) = (1/n) Σᵢ (xᵢ − x̄)²

Covariance quantifies the degree to which corresponding elements in two sets of ordered data move in the same direction. The covariance between variables x and y is

cov(x, y) = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ),

where xᵢ and yᵢ are the values for the ith data point, and x̄ and ȳ denote their respective means. Now, let’s explore this concept in matrix form. If we have a matrix X with dimensions m×n, containing n data points each with m dimensions (with each dimension centered on its mean), the covariance matrix can be computed as

C = (1/n) X Xᵗ

Please note that the covariance matrix contains -
1. variance of dimensions as the main diagonal elements
2. covariance of dimensions as the off-diagonal elements

As mentioned earlier, our goal is to ensure that the data is widely dispersed, indicating high variance across its dimensions, and to eliminate correlated dimensions, meaning that the covariance among dimensions should be zero, signifying their linear independence. Consequently, the aim is to transform the data so that its covariance matrix exhibits the following characteristics:
1. Significant values as the main diagonal elements.
2. Zero values as the off-diagonal elements.
Hence, the original data points must be transformed to achieve a covariance matrix resembling a diagonal matrix. This process of transforming a matrix into a diagonal matrix is referred to as diagonalization, and it constitutes the primary motivation behind Principal Component Analysis (PCA).
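
As a quick, minimal sketch (the variable names here are my own, purely for illustration), the following NumPy snippet computes a covariance matrix laid out exactly as described above, with variances on the main diagonal and covariances off the diagonal:

import numpy as np

# Toy data matrix X laid out as in the text: m = 3 dimensions (rows), n = 5 data points (columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))

# Center each dimension by subtracting its mean across the data points
X_centered = X - X.mean(axis=1, keepdims=True)

# Covariance matrix (m x m); dividing by n - 1 gives the sample estimate that np.cov uses
cov_matrix = X_centered @ X_centered.T / (X.shape[1] - 1)

print(np.allclose(cov_matrix, np.cov(X)))  # True: np.cov treats rows as variables by default
print(np.diag(cov_matrix))                 # variance of each dimension (main diagonal)
print(cov_matrix)                          # off-diagonal entries are the pairwise covariances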

Here’s how PCA works:

1. Standardization

Standardize the data when features are measured in diverse units. This entails subtracting the mean and dividing by the standard deviation for each feature. Failure to standardize data with features of varying scales can result in misleading components.

2. Compute the Covariance Matrix

Calculate the covariance matrix of the standardized data, as discussed earlier.

3. Calculate Eigenvectors and Eigenvalues

Determine the eigenvectors and eigenvalues of the covariance matrix.

Eigenvectors represent the directions (principal components), and eigenvalues represent the magnitude of variance in those directions. To understand what eigenvectors and eigenvalues are, you can go through this video.

4. Sort Eigenvalues

Sort the eigenvalues in descending order. The eigenvectors corresponding to the highest eigenvalues are the principal components that capture the most variance in the data. To understand why, please refer to this blog.

5. Select Principal Components

Choose the top k eigenvectors (principal components) based on the explained variance needed. Typically, you aim to retain a significant portion of the total variance, such as 85% (the explained variance ratio of each component is its eigenvalue divided by the sum of all eigenvalues). How explained variance is calculated can be found here.

6. Transform the Data

Now we can transform the original data using the selected eigenvectors. Let P be the k×m matrix whose rows are the top k eigenvectors. If the original data X consists of n data points with m dimensions, then:

X : m×n
P : k×m
Y = PX : (k×m)(m×n) = k×n

Hence, the transformed matrix Y contains the same n data points, now described by k dimensions.
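
To tie the six steps together, here is a minimal from-scratch sketch in NumPy (the function and variable names are my own, not a library API), following the same orientation as above: features as rows, data points as columns.

import numpy as np

def pca_from_scratch(X, k):
    # X has shape (m, n): m features (rows), n data points (columns)
    # Step 1: standardize each feature (subtract mean, divide by standard deviation)
    X_std = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    # Step 2: covariance matrix of the standardized data (m x m)
    cov = X_std @ X_std.T / (X.shape[1] - 1)

    # Step 3: eigenvalues and eigenvectors (eigh, since the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 4: sort eigenpairs by decreasing eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Explained variance ratio of each component: its eigenvalue over the sum of all eigenvalues
    explained_variance_ratio = eigenvalues / eigenvalues.sum()

    # Step 5: P is the k x m matrix whose rows are the top k eigenvectors
    P = eigenvectors[:, :k].T

    # Step 6: transform, Y = PX -> shape (k, n)
    Y = P @ X_std
    return Y, explained_variance_ratio

# Example: reduce 4-dimensional data (10 points) to 2 dimensions
rng = np.random.default_rng(42)
X = rng.normal(size=(4, 10))
Y, evr = pca_from_scratch(X, k=2)
print(Y.shape)         # (2, 10)
print(np.cumsum(evr))  # cumulative explained variance

In practice you would use scikit-learn’s PCA, shown next, which wraps these steps and handles the bookkeeping for you.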

Pros:

1. Dimensionality Reduction:
PCA effectively reduces the number of features, which is beneficial for models that suffer from the curse of dimensionality.

2. Feature Independence:
Principal components are orthogonal (uncorrelated), meaning they capture independent information, simplifying the interpretation of the reduced features.

3. Noise Reduction:
PCA can help reduce noise by focusing on the components that explain the most significant variance in the data.

4. Visualization:
The reduced-dimensional data can be visualized, aiding in understanding the underlying structure and patterns.

Cons:

1. Loss of Interpretability:
Interpretability of the original features may be lost in the transformed space, as principal components are linear combinations of the original features.

2. Assumption of Linearity:
PCA assumes that the relationships between variables are linear, which may not be true in all cases.

3. Sensitive to Scaling:
PCA is sensitive to the scale of the features, so standardization is often required.

4. Outliers Impact Results:
Outliers can significantly impact the results of PCA, as it focuses on capturing the maximum variance, which may be influenced by extreme values.

When to Use:

1. High-Dimensional Data:
PCA is particularly useful when dealing with datasets with a large number of features to mitigate the curse of dimensionality.

2. Collinear Features:
When features are highly correlated, PCA can be effective in capturing the shared information and representing it with fewer components.

3. Visualization:
PCA is beneficial when visualizing high-dimensional data is challenging. It projects data into a lower-dimensional space that can be easily visualized.

4. Linear Relationships:
When the relationships between variables are mostly linear, PCA is a suitable technique.

Python Implementation

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data (important for PCA)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Apply PCA
pca = PCA()
X_train_pca = pca.fit_transform(X_train_std)

# Calculate the cumulative explained variance
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

# Determine the number of components to keep for 85% variance explained
n_components = np.argmax(cumulative_variance_ratio >= 0.85) + 1

# Apply PCA with the selected number of components
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Display the results
print("Original Training Data Shape:", X_train.shape)
print("Reduced Training Data Shape (PCA):", X_train_pca.shape)
print("Number of Components Selected:", n_components)

In this example, PCA() is initially applied without specifying the number of components, which means it will keep all of them. The cumulative explained variance is then calculated using np.cumsum(pca.explained_variance_ratio_), the number of components required to explain at least 85% of the variance is determined, and PCA is applied again with that number of components. Please note that PCA is fit only on the training data and then used to transform the test data.
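
As a quick follow-up, continuing from the code above (the classifier choice here is purely illustrative and not part of the original example), the reduced features can be fed to any downstream model just like ordinary features:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train a classifier on the PCA-reduced training data and evaluate on the reduced test data
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train_pca, y_train)
print("Test accuracy on PCA features:", accuracy_score(y_test, clf.predict(X_test_pca)))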

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) serves as a technique for both dimensionality reduction and classification, aiming to optimize the distinction between various classes within a dataset. LDA is particularly prevalent in supervised learning scenarios where the classes of data points are predetermined. While PCA is an “unsupervised” algorithm that disregards class labels and finds the principal components that maximize the dataset’s variance, LDA takes a “supervised” approach: it computes “linear discriminants”, the directions that serve as axes to maximize the separation between multiple classes. To see how LDA works in practice, let’s walk through the calculations on the famous “Iris” dataset from the UCI Machine Learning Repository, which contains measurements for 150 iris flowers from three different species.

There are three classes in the Iris dataset:

  1. Iris-setosa (n=50)
  2. Iris-versicolor (n=50)
  3. Iris-virginica (n=50)

There are four features in the Iris dataset:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm

Steps in LDA:

  1. We will start off with the computation of the mean vectors mi, (i=1,2,3) of the 3 different flower classes:
Mean Vector class 1: [ 5.006  3.418  1.464  0.244]

Mean Vector class 2: [ 5.936 2.77 4.26 1.326]

Mean Vector class 3: [ 6.588 2.974 5.552 2.026]

Each vector contains the mean of the 4 features in the dataset for the specific class.

2. Compute the within-class scatter matrix (Sw), which represents the spread of data within each class:

Sw = Σᵢ Sᵢ, where Sᵢ = Σ (x − mᵢ)(x − mᵢ)ᵗ summed over the samples x belonging to class i, and mᵢ is the mean vector of class i.

In our example, it will look like the following:

within-class Scatter Matrix:
[[ 38.9562 13.683 24.614 5.6556]
[ 13.683 17.035 8.12 4.9132]
[ 24.614 8.12 27.22 6.2536]
[ 5.6556 4.9132 6.2536 6.1756]]

3. Compute the between-class scatter matrix (Sb), which represents the spread between different classes:

Sb = Σᵢ Nᵢ (mᵢ − m)(mᵢ − m)ᵗ, where m is the overall mean, and mᵢ and Nᵢ are the mean vector and sample size of class i.

In our example, it will look like the following:

between-class Scatter Matrix:
[[ 63.2121 -19.534 165.1647 71.3631]
[ -19.534 10.9776 -56.0552 -22.4924]
[ 165.1647 -56.0552 436.6437 186.9081]
[ 71.3631 -22.4924 186.9081 80.6041]]

4. Compute the eigenvalues and eigenvectors of Sw⁻¹Sb (similar to PCA). In our case, we have 4 eigenvalues and eigenvectors:

Eigenvector 1:
[[-0.2049]
[-0.3871]
[ 0.5465]
[ 0.7138]]
Eigenvalue 1: 3.23e+01

Eigenvector 2:
[[-0.009 ]
[-0.589 ]
[ 0.2543]
[-0.767 ]]
Eigenvalue 2: 2.78e-01

Eigenvector 3:
[[ 0.179 ]
[-0.3178]
[-0.3658]
[ 0.6011]]
Eigenvalue 3: -4.02e-17

Eigenvector 4:
[[ 0.179 ]
[-0.3178]
[-0.3658]
[ 0.6011]]
Eigenvalue 4: -4.02e-17

5. Sort the eigenvectors by decreasing eigenvalue and pick the top k. We then construct our d×k-dimensional eigenvector matrix (let’s call it W; here 4×2) from the 2 most informative eigenpairs. We get the following matrix in our example:

Matrix W:
[[-0.2049 -0.009 ]
[-0.3871 -0.589 ]
[ 0.5465 0.2543]
[ 0.7138 -0.767 ]]

6. Use the matrix W (4 X 2 matrix) to transform our samples onto the new subspace via the equation: Y = X*W, where X is the original data frame in matrix format (150 X 4 matrix in our case) and Y is the transformed dataset (150 X 2 matrix). Please refer to this blog for more details.
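
To make these steps concrete, here is a minimal NumPy sketch that reproduces the scatter-matrix computation on the Iris data (variable names are my own; this illustrates the steps rather than reproducing scikit-learn’s implementation):

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # X: 150 x 4 feature matrix, y: three classes
n_features = X.shape[1]
overall_mean = X.mean(axis=0)

# Steps 1-3: class means, within-class scatter Sw, between-class scatter Sb
Sw = np.zeros((n_features, n_features))
Sb = np.zeros((n_features, n_features))
for c in np.unique(y):
    Xc = X[y == c]
    mean_c = Xc.mean(axis=0)
    Sw += (Xc - mean_c).T @ (Xc - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    Sb += Xc.shape[0] * (diff @ diff.T)

# Step 4: eigenvalues and eigenvectors of Sw^-1 Sb
eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(Sw) @ Sb)

# Step 5: sort by decreasing eigenvalue and keep the top 2 eigenvectors as W (4 x 2)
order = np.argsort(eigenvalues.real)[::-1]
W = eigenvectors[:, order[:2]].real

# Step 6: project the samples onto the new subspace, Y = XW -> 150 x 2
Y = X @ W
print(Sw.round(4))
print(Y.shape)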

Pros:

1. Maximizes Class Separation:
LDA is designed to maximize the separation between different classes, making it effective for classification tasks.

2. Dimensionality Reduction:
Like PCA, LDA can be used for dimensionality reduction, but with the advantage of considering class information.

Cons:

1. Sensitivity to Outliers:
LDA is sensitive to outliers, and the presence of outliers can affect the performance of the method.

2. Assumption of Normality:
LDA assumes that the features within each class are normally distributed, and it may not perform well if this assumption is violated.

3. Requires Sufficient Samples:
LDA may not perform well with a small number of samples per class. Having more samples improves the estimation of class parameters.

When to Use:

1. Classification Tasks:
LDA is beneficial when the goal is to classify data into predefined classes.

2. Preserving Class Information:
When the goal is to reduce dimensionality while preserving information that is relevant for discriminating between classes.

3. Normality Assumption Holds:
LDA performs well when the assumption of normal distribution within each class is valid.

4. Supervised Dimensionality Reduction:
When the task requires dimensionality reduction with the guidance of class labels, LDA is a suitable choice.

Python Implementation

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Generate a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (important for LDA)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize LDA and fit on the training data
lda = LinearDiscriminantAnalysis()
X_train_lda = lda.fit_transform(X_train, y_train)

# Calculate explained variance ratio for each component
explained_variance_ratio = lda.explained_variance_ratio_

# Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)

# Find the number of components that explain at least 75% of the variance
n_components = np.argmax(cumulative_explained_variance >= 0.75) + 1

# Transform both the training and test data to the selected number of components
X_train_lda_selected = lda.transform(X_train)[:, :n_components]
X_test_lda_selected = lda.transform(X_test)[:, :n_components]

# Print the number of components selected
print(f"Number of components selected: {n_components}")

# Now, X_train_lda_selected and X_test_lda_selected can be used for further analysis or modeling

This example uses the make_classification function from scikit-learn to generate a synthetic dataset, then splits the data into training and test sets. It standardizes the features, initializes the LDA model, and fits it on the training data. Finally, it selects the number of components based on the desired explained variance and transforms the training and test data accordingly. Note that LDA yields at most (number of classes − 1) linear discriminants, so with the two classes used here only one component is available.

Singular Value Decomposition (SVD)

Singular Value Decomposition is a matrix factorization technique widely used in various applications, including linear algebra, signal processing, and machine learning. It decomposes a matrix into three other matrices, allowing for the representation of the original matrix in a reduced form. The decomposition techniques and proofs are explained here.

Steps in SVD:

1. Decomposition of Matrix
Given a matrix M of size m x n (or a data frame with m rows and n columns), SVD decomposes it into three matrices:
M = U Σ Vᵗ,
where U is an m x r matrix with orthonormal columns, Σ is an r x r diagonal matrix, and V is an n x r matrix with orthonormal columns (so Vᵗ is r x n); r is the rank of the matrix M.
The diagonal elements of Σ are the singular values of the original matrix M, arranged in descending order. The columns of U are the left singular vectors of M; these vectors form an orthogonal basis for the column space of M. The columns of V are the right singular vectors of M; these vectors form an orthogonal basis for the row space of M. Please read this to dive deep into the maths behind it.

2. Reduced Form (Truncated SVD)
For dimensionality reduction, a truncated version of SVD is often used. Keep only the top k largest singular values in Σ, together with the corresponding k columns of U and the corresponding k rows of Vᵗ. A reduced k-dimensional representation B of the data can then be computed as:

B = Uₖ Σₖ, or equivalently
B = M Vₖ,

where Σₖ is the k x k diagonal matrix holding the top k singular values, Uₖ contains the corresponding k columns of U, and Vₖ contains the corresponding k columns of V (the top k rows of Vᵗ, transposed). The product Uₖ Σₖ Vₖᵗ also gives a rank-k approximation of the original matrix M. For more details, you can refer here.
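
Here is a minimal NumPy sketch of the truncation step (names are my own), keeping the top k singular values and projecting the data onto a k-dimensional subspace:

import numpy as np

# M: m data points (rows) x n features (columns)
rng = np.random.default_rng(0)
M = rng.normal(size=(100, 20))

# Thin SVD: U (m x r), s (r singular values in descending order), Vt (r x n)
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the top k singular values and the corresponding singular vectors
k = 5
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Reduced k-dimensional representation: B = U_k * Sigma_k, equivalently B = M * V_k
B = U_k * s_k                  # broadcasting scales each column of U_k by its singular value
B_alt = M @ Vt_k.T
print(np.allclose(B, B_alt))   # True

# Rank-k approximation of the original matrix
M_approx = U_k @ np.diag(s_k) @ Vt_k
print(M.shape, B.shape, M_approx.shape)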

Pros:

1. Dimensionality Reduction
SVD allows for dimensionality reduction by retaining only the most significant singular values and vectors.

2. Data Compression
SVD is used in data compression tasks, reducing the storage requirements of a matrix.

3. Noise Reduction
By using only the most significant singular values, SVD can help reduce the impact of noise in the data.

4. Numerical Stability
SVD is numerically stable and well-suited for solving linear equations in ill-conditioned systems.

5. Orthogonality
The matrices U and V in the SVD decomposition are orthogonal, preserving the relationships between the rows and columns of the original matrix.

6. Applications in Recommender Systems
SVD is widely used in collaborative filtering for recommender systems.

Cons:

1. Computational Complexity:
Computing the full SVD for large matrices can be computationally expensive.

2. Memory Requirements:
Storing the full matrices U, Σ, and V can be memory-intensive, especially for large matrices.

3. Sensitivity to Missing Values:
SVD is sensitive to missing values in the data, and handling missing values requires specialized techniques.

When to Use SVD:

1. Dimensionality Reduction:
When the goal is to reduce the dimensionality of the data while preserving its essential structure.

2. Recommender Systems:
In collaborative filtering-based recommender systems, SVD is used to identify latent factors that capture user-item interactions.

3. Data Compression:
In scenarios where large datasets need to be compressed or approximated.

4. Numerical Stability:
When solving linear equations in ill-conditioned systems, SVD provides numerical stability.

5. Signal Processing:
In signal processing, SVD is used for noise reduction and feature extraction.

6. Topic Modeling:
SVD is employed in topic modeling techniques such as Latent Semantic Analysis (LSA).

Python Implementation

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Generate a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (important for SVD)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize SVD and fit on the training data
svd = TruncatedSVD(n_components=X_train.shape[1] - 1) # Use one less component than the feature count
X_train_svd = svd.fit_transform(X_train)

# Calculate explained variance ratio for each component
explained_variance_ratio = svd.explained_variance_ratio_

# Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)

# Find the number of components that explain at least 75% of the variance
n_components = np.argmax(cumulative_explained_variance >= 0.75) + 1

# Transform both the training and test data to the selected number of components
X_train_svd_selected = svd.transform(X_train)[:, :n_components]
X_test_svd_selected = svd.transform(X_test)[:, :n_components]

# Print the number of components selected
print(f"Number of components selected: {n_components}")

# Now, X_train_svd_selected and X_test_svd_selected can be used for further analysis or modeling

This example uses the make_classification function to generate a synthetic dataset, splits the data into training and test sets, and standardizes the features. It then initializes the TruncatedSVD model, fits it on the training data, and selects the number of components based on the desired explained variance. Finally, it transforms both the training and test data accordingly.

Conclusion

The choice between Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Singular Value Decomposition (SVD) depends on the specific objectives and characteristics of the data. Here are general guidelines on when to use each technique:

1. PCA (Principal Component Analysis)

Use Cases:
1. When the goal is to reduce the dimensionality of the dataset.
2. In scenarios where capturing global patterns and relationships within the data is crucial.
3. For exploratory data analysis and visualization.

2. LDA (Linear Discriminant Analysis)

Use Cases:
1. In classification problems where enhancing the separation between classes is important.
2. When there is a labeled dataset, and the goal is to find a projection that maximizes class discrimination.
3. LDA is particularly effective when the assumption of normally distributed classes and equal covariance matrices holds.

3. SVD (Singular Value Decomposition)

Use Cases:
1. When dealing with sparse data or missing values.
2. In collaborative filtering for recommendation systems.
3. SVD is also applicable in data compression and denoising.

We should also keep the following considerations in mind:
Unsupervised vs Supervised Learning: PCA is unsupervised, while LDA is supervised. Choose based on the availability of labeled data.

Class Separability: If the goal is to improve class separability, LDA is preferred. PCA and SVD focus on overall variance.

Data Characteristics: The characteristics of your data, such as linearity, class distribution, and presence of outliers, influence the choice.

Application-Specific Requirements: Consider the specific requirements of your application, such as interpretability, computational efficiency, or handling of missing data.

In summary, PCA is suitable for unsupervised dimensionality reduction, LDA is effective for supervised problems with a focus on class separability, and SVD is versatile, catering to various applications including collaborative filtering and matrix factorization. The choice depends on the nature of your data and the goals of your analysis.

Do you have any questions or suggestions about this blog? Please feel free to drop in a note.

Thank you for reading!

If you, like me, are passionate about AI, Data Science, or Economics, please feel free to add/follow me on LinkedIn, Github and Medium.
