Uncovering Patterns and Simplifying Complexity: A Guide to PCA

Berna Yilmaz
12 min read · Jun 20, 2024

Introduction

In the rapidly evolving field of data science, simplifying complex data while retaining essential information is crucial. Dimensionality reduction is a key technique for achieving this, making data analysis more efficient and insightful. As datasets grow larger and more complex, the challenges of high-dimensional data — often termed the “curse of dimensionality” — become increasingly significant. This complexity not only hampers data visualization and exploration but also leads to computational inefficiencies and potential model overfitting.

Among the strategies to address high-dimensional data, Principal Component Analysis (PCA) stands out as one of the most effective. PCA transforms the original data into orthogonal components that capture the maximum variance, making the data interpretation more meaningful. This article explores PCA, discussing its methodologies, applications, and the intuitive concepts that make it a powerful tool. By the end of this discussion, readers will understand how PCA can uncover hidden patterns, enhance data processing, and support informed decision-making across various domains.

Principal Component Analysis (PCA)

What is PCA?

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of values of linearly uncorrelated variables. These new variables, called principal components, are derived in such a way that the first principal component has the highest possible variance (it accounts for as much of the variability in the data as possible), and each succeeding component, in turn, has the highest variance possible under the constraint that it is orthogonal to the preceding components. This transformation redefines the data, providing a new basis for analysis which can simplify the complexity inherent in large datasets.

Throughout this article, we will walk through PCA on a small drug classification dataset, so let's first load the data and encode its categorical columns:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Load the drug dataset
data = pd.read_csv("drug200.csv.xls")
data.head()

X = data.iloc[:, :-1]  # Features
y = data.iloc[:, -1]   # Labels

print("Unique values before factorization:", y.unique())

# Factorize the target labels into integer codes
y, mapping = pd.factorize(y)

# Print the mapping between string labels and integer codes
print("Mapping:")
for index, value in enumerate(mapping):
    print(f"'{value}' --> {index}")

# Encode the categorical features: Sex, BP, and Cholesterol
categorical_features = ["Sex", "BP", "Cholesterol"]
encoder = LabelEncoder()

for feature in categorical_features:
    X[feature] = encoder.fit_transform(X[feature])

How PCA Works:

The mathematics behind PCA involves several key steps, centered around the covariance matrix of the data. Here’s how it works:

Standardization:

Before applying PCA, it is common practice to standardize the data. This involves transforming each feature to have zero mean and unit variance. The standardization of a feature x can be represented as:

z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the feature x.

# Standardizing the data
X_standardized = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

Standardization is crucial in PCA for the following reasons:

  1. Equal Weighting: Features with larger scales can dominate the principal components. Standardizing ensures each feature contributes equally (see the sketch after this list).
  2. Comparable Units: Standardization transforms features to a common scale (mean = 0, standard deviation = 1), making them comparable.
  3. Enhanced Convergence: Algorithms converge faster on standardized data, improving computational efficiency.
  4. Accurate Covariance: Standardization ensures the covariance matrix reflects true relationships among features, not biased by differing scales.
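
To make the first point concrete, here is a minimal sketch of my own (synthetic data, not the drug dataset) showing how an unscaled, large-magnitude feature hijacks PC1, while standardization restores balanced loadings:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two correlated features on very different scales (synthetic data)
rng = np.random.default_rng(0)
small = rng.normal(0, 1, 200)                    # small-scale feature
large = 1000 * small + rng.normal(0, 200, 200)   # correlated, but roughly 1000x larger in scale
toy = np.column_stack([small, large])

pca_raw = PCA(n_components=2).fit(toy)
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))

print("PC1 loadings (raw):         ", pca_raw.components_[0])  # essentially (0, ±1): the large-scale feature dominates
print("PC1 loadings (standardized):", pca_std.components_[0])  # roughly (±0.71, ±0.71): both features contribute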

Covariance Matrix Computation:

PCA computes the covariance matrix of the data, which helps in understanding how the variables in the dataset vary from the mean with respect to each other.

The covariance matrix, Σ, of a dataset provides insights into how each pair of variables in the dataset varies from the mean with respect to each other. If X is a matrix representing our standardized data (with each row representing a sample and each column a feature), then the covariance matrix is given by:

Σ = (1 / (n − 1)) X^T X

where n is the number of data points, X^T is the transpose of X, and X is the standardized data matrix.
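
As a quick sanity check (a sketch of my own, not part of the original walkthrough), this matrix formula reproduces NumPy's covariance estimate on our standardized data:

# Verify the covariance formula against np.cov (sketch)
n = X_standardized.shape[0]
manual_cov = (X_standardized.T @ X_standardized) / (n - 1)
print(np.allclose(manual_cov, np.cov(X_standardized, rowvar=False)))  # expected: True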

For our dataset, the covariance matrix is computed below, after a brief recap of variance and covariance.

Variance

The variance of a feature F is calculated as:

Var(F) = (1 / (n − 1)) Σᵢ (Fᵢ − F̄)²

where:

  • n is the number of data points
  • Fᵢ is the i-th data point of feature F
  • F̄ is the mean of feature F

Covariance

The covariance between two features F1 and F2 is calculated as:

Cov(F1, F2) = (1 / (n − 1)) Σᵢ (F1ᵢ − F̄1)(F2ᵢ − F̄2)

where:

  • n is the number of data points
  • F1ᵢ is the i-th data point of feature F1
  • F2ᵢ is the i-th data point of feature F2
  • F̄1 is the mean of feature F1
  • F̄2 is the mean of feature F2

  • Positive Covariance: Indicates that as one variable increases, the other variable tends to increase.
  • Negative Covariance: Indicates that as one variable increases, the other variable tends to decrease.
  • Variance: Measures the spread of each variable individually.

This structure will help you understand the relationships between different features in your dataset and how they vary with each other.
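
To ground these scalar definitions, here is a tiny toy example of my own (not from the original notebook) checking them against NumPy's built-ins:

# Toy check of the variance and covariance formulas (ddof=1 to match np.cov)
f1 = np.array([2.0, 4.0, 6.0, 8.0])
f2 = np.array([1.0, 3.0, 2.0, 5.0])
n = len(f1)

var_f1 = np.sum((f1 - f1.mean()) ** 2) / (n - 1)
cov_f1_f2 = np.sum((f1 - f1.mean()) * (f2 - f2.mean())) / (n - 1)

print(var_f1, np.var(f1, ddof=1))        # both ≈ 6.667
print(cov_f1_f2, np.cov(f1, f2)[0, 1])   # both ≈ 3.667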

# Computing the covariance matrix
cov_matrix = np.cov(X_standardized, rowvar=False)
cov_matrix

From our covariance matrix,

  • The diagonal elements (≈1.005) represent the variances of each standardized feature. They are approximately 1, as expected after standardization; the small excess over 1 comes from np.cov using the sample (n − 1) denominator while the data was scaled with the population standard deviation.
  • The covariances are generally small, indicating weak linear relationships between most pairs of features.
  • The most notable covariance is between BP and Cholesterol (-0.1382), suggesting some level of inverse relationship.
  • These insights can help in understanding the interdependencies among features, which is useful in feature selection and data preprocessing steps before applying PCA or other dimensionality reduction techniques.

Eigendecomposition:

The next step is calculating the eigenvectors and eigenvalues of this covariance matrix. The eigenvectors(v) determine the directions of the new feature space (principal components) and the eigenvalues(λ) determine their magnitude (the amount of variance each principal component carries).

  • In essence, the eigenvectors represent the principal components (directions of maximum variance), while the eigenvalues represent the amount of variance carried in each principal component.
  • For the covariance matrix Σ, we solve for the vectors v that satisfy the equation below. The eigenvectors are sorted by their corresponding eigenvalues in descending order.

Σ v = λ v

Here, λ represents the eigenvalue and v represents the eigenvector associated with λ. This equation states that for each eigenvalue, there is a direction (eigenvector) along which the transformation Σ acts by merely scaling (by the factor λ).

The eigenvalues are obtained by requiring that the determinant of Σ − λI equals zero:

det(Σ − λI) = 0

Expanding this determinant gives you a polynomial in λ (called the characteristic polynomial), and the roots of this polynomial are the eigenvalues. Once the eigenvalues are known, they are substituted back into the equation Σv=λv to solve for the eigenvectors.
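
For a concrete feel (a small example of my own, not taken from the article's dataset): if Σ were the 2×2 matrix [[2, 1], [1, 2]], then det(Σ − λI) = (2 − λ)² − 1 = λ² − 4λ + 3, whose roots are λ = 3 and λ = 1. Substituting λ = 3 back into Σv = λv gives an eigenvector proportional to (1, 1), and λ = 1 gives one proportional to (1, −1).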

# Performing eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
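
As a quick check (my own sketch), every eigenpair returned by np.linalg.eig should satisfy Σv = λv; and since a covariance matrix is symmetric, np.linalg.eigh is an equally valid choice that guarantees real eigenvalues:

# Verify Σv = λv for each eigenpair (sketch, not in the original notebook)
for i in range(len(eigenvalues)):
    lhs = cov_matrix @ eigenvectors[:, i]       # Σ v
    rhs = eigenvalues[i] * eigenvectors[:, i]   # λ v
    print(f"Eigenpair {i+1} satisfies Σv = λv:", np.allclose(lhs, rhs))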

(If you want to see how these calculations are carried out in more detail from a linear algebra perspective, and to better understand the concepts of eigenvalue and eigenvector, I recommend Prof. Gilbert Strang's lecture recordings on YouTube :)

https://www.youtube.com/watch?v=cdZnhQjJu4I )

Component Selection:

To reduce the dimensionality, we select the top k (here k = 3) eigenvectors, which correspond to the largest k eigenvalues. This selection captures the most significant variance and information within the dataset. If we denote the matrix of selected eigenvectors as V_k, then the transformed data T is given by:

T = X V_k

Here, T represents the data transformed into the new feature space with reduced dimensions, and V_k contains the eigenvectors corresponding to the top k eigenvalues.

# Sorting the eigenvalues and eigenvectors in descending order
sorted_indices = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[sorted_indices]
sorted_eigenvectors = eigenvectors[:, sorted_indices]

# Displaying eigenvalues and eigenvectors
print("Eigenvalues and corresponding eigenvectors (sorted in descending order of eigenvalues):")
for i, eigenvalue in enumerate(sorted_eigenvalues):
    print(f"Eigenvalue {i+1}: {eigenvalue:.4f}")
    print(f"Eigenvector {i+1}: {sorted_eigenvectors[:, i]}")
    print()

# Extracting the top three principal components
top_eigenvalues = sorted_eigenvalues[:3]
top_eigenvectors = sorted_eigenvectors[:, :3]

# Calculating loadings (eigenvectors scaled by the square root of their eigenvalues)
loadings = top_eigenvectors * np.sqrt(top_eigenvalues)

# Feature names for labeling the loadings (taken from the columns of X)
feature_names = X.columns

# Displaying loadings and feature contributions to each principal component
print("Loadings for the top 3 principal components:")
for i in range(3):
    print(f"Principal Component {i+1}:")
    for j in range(loadings.shape[0]):
        print(f"  {feature_names[j]}: {loadings[j, i]:.4f}")
    print(f"Total features contributing to PC{i+1}: {loadings.shape[0]}")  # Number of features in each PC
    print()

The eigenvalues and corresponding eigenvectors provide insight into the principal components and the variance they explain.

  1. Eigenvalue 1: 1.2984 — This component explains the largest portion of the variance.
  2. Eigenvalue 2: 1.0818 — The second principal component, explaining the next largest variance.
  3. Eigenvalue 3: 0.9867
  4. Eigenvalue 4: 0.8820
  5. Eigenvalue 5: 0.7763 — The least significant in terms of variance explained.

Each eigenvector corresponds to a principal component and indicates the direction in which the data varies the most.

  • Eigenvector 1: Dominated by high absolute values across the features, indicating a strong influence from multiple features.
  • Eigenvector 2: Significant contributions from features with negative and positive values, showing diverse directionality.
  • Eigenvector 3–5: Contribute to capturing the variance orthogonal to the previous components, focusing on remaining patterns in the data.

Hence, the eigenvalues indicate how much variance each principal component captures from the original data. Eigenvector coefficients show the weight of each feature in the principal component, guiding interpretation of feature importance.

These steps form the core of PCA, allowing for a mathematical transformation of the data from a high-dimensional space to a lower-dimensional space that captures the most significant variance directions.
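
As a cross-check (a sketch of my own, not part of the original notebook), the manual eigendecomposition pipeline should agree with scikit-learn's PCA up to the sign of each component, since distinct eigenvalues determine each direction only up to a sign flip:

# Compare the manual projection with scikit-learn's PCA (sketch)
from sklearn.decomposition import PCA

pca_check = PCA(n_components=3)
X_sklearn = pca_check.fit_transform(X_standardized)
X_manual = np.dot(X_standardized, top_eigenvectors)

# Components may differ only by sign, so compare absolute values
print(np.allclose(np.abs(X_sklearn), np.abs(X_manual)))   # expected: True
print("Explained variance ratio:", pca_check.explained_variance_ratio_)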

Axes: The axes, labeled as PC1, PC2, and PC3, correspond to the first, second, and third principal components respectively. These components are the new features created by PCA, derived from the original data’s features but transformed to highlight the directions of maximum variance.

PC1: It is influenced by several features, not just ‘Age’. According to the loadings, ‘Na_to_K’ has the highest positive loading (0.6007), followed by ‘Cholesterol’ (0.4229), while ‘Age’, ‘Sex’, and ‘BP’ have negative loadings. So, ‘Na_to_K’ is the most significant feature for PC1.

PC2: It has contributions from ‘Sex’ (-0.6533), ‘Cholesterol’ (-0.5679), and ‘BP’ (0.4813). ‘Sex’ has the highest negative loading, indicating it accounts for a significant portion of the variance orthogonal to PC1.

PC3: It is influenced by ‘Age’ (-0.6063), ‘Na_to_K’ (-0.5638), and ‘Cholesterol’ (0.4234). ‘Age’ has the highest negative loading, capturing additional variance orthogonal to both PC1 and PC2.

In summary:

  • PC1 is most strongly influenced by ‘Na_to_K’.
  • PC2 is most influenced by ‘Sex’.
  • PC3 is most influenced by ‘Age’.

Therefore, each principal component captures different aspects of the data variance, making PCA a powerful tool for dimensionality reduction and revealing underlying patterns in the data.


# Calculate total variance explained by the top three components
total_variance = sum(eigenvalues)
variance_explained = [(val / total_variance) * 100 for val in sorted_eigenvalues[:3]]
total_variance_explained = sum(variance_explained)

# Display the variance explained
print("Variance explained by the top 3 components:")
print(f"PC1: {variance_explained[0]:.2f}%")
print(f"PC2: {variance_explained[1]:.2f}%")
print(f"PC3: {variance_explained[2]:.2f}%")
print(f"Total Variance Explained by Top 3 Components: {total_variance_explained:.2f}%")

# Make sure the length of 'y' matches the number of samples in 'X_standardized'
if len(y) != X_standardized.shape[0]:
    print("Mismatch detected: Adjusting 'y' to match 'X_standardized'.")
    y = y[:X_standardized.shape[0]]  # Adjust 'y' if necessary

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

# Project the data using the top eigenvectors
X_projected = np.dot(X_standardized, top_eigenvectors)

# 3D scatter plot of the data in the space of the first three principal components
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
custom_colors = ['red', 'green', 'blue', 'purple', 'orange']
cmap = mcolors.ListedColormap(custom_colors)
scatter = ax.scatter(X_projected[:, 0], X_projected[:, 1], X_projected[:, 2], c=y, cmap=cmap)

# Set labels and title
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('Clusters of Drug on PCA components')

# Build a legend mapping each drug label to its color
unique_labels = sorted(set(y))
legend_elements = [plt.Line2D([0], [0], marker='o', color='w', label=label, markerfacecolor=color)
                   for label, color in zip(unique_labels, custom_colors)]
plt.legend(handles=legend_elements, title='Label')
plt.show()

This 3D scatter plot visualizes the dataset after applying Principal Component Analysis (PCA) to reduce its dimensionality and capture the essential variations using three principal components (PC1, PC2, PC3), which together explain approximately 67% of the variance. The plot demonstrates how PCA effectively segregates different drug categories (labeled from 0 to 4) in the reduced-dimensional space.

Each axis in the plot represents one of the principal components, and each data point is colored based on its drug category label, showing distinct clustering patterns among different groups. This clear separation in the PCA-transformed space suggests that these components capture significant differences among the categories, which might be crucial for tasks like classification or detailed data analysis.

The visualization emphasizes the utility of PCA in exploratory data analysis by reducing complexity and revealing hidden patterns, making it easier to understand underlying data structures and enhancing the efficiency of subsequent analytical tasks.

Original Dataset vs PCA-Transformed Data

# Select three specific features for the example: 'Age', 'Na_to_K', and 'BP' (after encoding)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

selected_features = ['Age', 'Na_to_K', 'BP']
X_selected = X[selected_features]

# Standardize the selected data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_selected)

# Apply PCA, reducing to two dimensions for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plotting: a 3D view of the original features next to the 2D PCA projection
fig = plt.figure(figsize=(16, 6))
ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122)

# Original Data
sc1 = ax1.scatter(X_scaled[:, 0], X_scaled[:, 1], X_scaled[:, 2], c=y, cmap='viridis', edgecolor='k', s=50)
ax1.set_xlabel(selected_features[0])
ax1.set_ylabel(selected_features[1])
ax1.set_zlabel(selected_features[2])
ax1.set_title('Original Data in Feature Space')
plt.colorbar(sc1, ax=ax1, pad=0.1)

# PCA-transformed Data
sc2 = ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
ax2.set_xlabel('Principal Component 1')
ax2.set_ylabel('Principal Component 2')
ax2.set_title('Data in PCA Projected Space')
plt.colorbar(sc2, ax=ax2, pad=0.1)

plt.show()

# Print the variance explained by the principal components
print("Variance explained by PC1: {:.2f}%".format(pca.explained_variance_ratio_[0] * 100))
print("Variance explained by PC2: {:.2f}%".format(pca.explained_variance_ratio_[1] * 100))

The above plots illustrate the transformation of the dataset through PCA:

Original Data in Feature Space:

Axes: The plot showcases the original features: Age, Na_to_K, and BP.

Data Distribution: Points represent observations, colored according to their labels (0 to 4). The data is dispersed across the three-dimensional space, with visible clustering patterns.

Data in PCA Projected Space:

Axes: The transformed data is plotted against the first two principal components (PC1 and PC2).

PCA Transformation: The data points are more spread out along PC1 and PC2, which capture the most significant variations.

Variance Explained:

  • PC1 explains 39.54% of the variance.
  • PC2 explains 32.11% of the variance.
  • Together, PC1 and PC2 explain 71.65% of the total variance, showing that these two components retain most of the dataset’s essential information.

Here, my aim is to visually observe the changes that occur when PCA is applied to a dataset. By comparing the original feature space with the PCA projected space, we can see how PCA effectively spreads out the data along new axes that capture the most variance. This visualization helps us understand the underlying structure of the data and how PCA can simplify complex datasets while retaining crucial information.

Applications of PCA:

PCA is utilized across various fields and applications, showcasing its versatility in handling different types of data:

  • Exploratory Data Analysis: PCA is often used in exploratory data analysis to visualize the structure of the data and detect patterns, trends, and outliers.
  • Predictive Models: In predictive modeling, PCA is used to speed up the training time and combat overfitting by reducing the number of features in the model.
  • Image Compression: PCA can reduce the dimensionality of image data by transforming the original pixels into a smaller set of values, saving storage space while preserving key features of the image (a brief sketch follows below).
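
To illustrate the image compression idea, here is a minimal sketch of my own (not from the original article) using scikit-learn's built-in 8×8 digits dataset: each image is flattened to 64 pixel values, compressed to 16 principal components, and then reconstructed.

# PCA-based image compression sketch: 64 pixels -> 16 components -> reconstruction
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
images = digits.data                                   # shape (1797, 64), one flattened image per row

pca_img = PCA(n_components=16)
compressed = pca_img.fit_transform(images)             # compact codes, shape (1797, 16)
reconstructed = pca_img.inverse_transform(compressed)  # back to 64 pixel values per image

print("Variance retained: {:.1f}%".format(pca_img.explained_variance_ratio_.sum() * 100))
print("Mean squared reconstruction error:", np.mean((images - reconstructed) ** 2))

Storing the 16 component scores per image (plus the shared component matrix) takes far less space than the 64 raw pixel values, at the cost of a small reconstruction error.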

Conclusion

Principal Component Analysis (PCA) is an invaluable technique for simplifying high-dimensional datasets by transforming them into principal components that maximize variance. This process reduces data complexity, making it easier to explore, visualize, and uncover significant patterns. PCA is particularly beneficial for tasks such as clustering, visualization, and feature reduction in predictive modeling. By retaining the most critical information with fewer variables, PCA enhances our understanding of data structure, supports better decision-making, and improves the efficiency of subsequent data processing steps. In essence, PCA not only aids in dimensionality reduction but also highlights key relationships within the data, facilitating insightful data-driven strategies.

For a deeper exploration of the codes and extended examples, visit my GitHub repository.

https://github.com/berna14y/LogReg_PCA_ADMM_NMF/blob/main/ML_final_project.ipynb

Thank you for reading, and happy learning!

RESOURCES

https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf
