Kernel PCA: From Uncovering the Hidden Patterns of Wealth to Turning Noisy Images into Clear Ones

Fateme Akbari · Published in The Power of AI · Oct 3, 2023 · 5 min read

Have you ever wondered if there are commonalities among the world’s top billionaires? If so, this blog is for you! Get ready to dive into the fascinating world of data science and Python programming as we explore the power of Kernel Principal Component Analysis (Kernel PCA). By leveraging Kernel PCA, we will unravel hidden patterns and structures in non-linear data, gaining valuable insights from our datasets. I highly recommend checking out the guided project on Cognitive Class for a deeper understanding of Kernel PCA.

In this blog, we will challenge stereotypes and examine whether investment bankers and college dropout entrepreneurs in the tech industry truly dominate this exclusive group. While we can’t definitively answer this question, we can use Kernel PCA to analyze the richest people in the world and determine if any patterns emerge.

Source: https://www.gcu.edu/blog/gcu-experience/facts-about-rich-americans

Kernel PCA is an extension of traditional Principal Component Analysis (PCA). Whereas PCA can only capture linear relationships, Kernel PCA excels at uncovering complex patterns and structures in non-linear data. It implicitly maps the data into a higher-dimensional feature space where non-linear relationships become linear; thanks to the kernel trick, this mapping never has to be computed explicitly. This enables Kernel PCA to capture intricate structures and similarities that would otherwise remain hidden.
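To make this concrete, here is a minimal sketch (my addition, not part of the guided project) using scikit-learn’s make_circles toy dataset: two concentric circles are not linearly separable, and linear PCA can only rotate them, while an RBF Kernel PCA projection pulls the two classes apart.

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
import matplotlib.pyplot as plt

# Two concentric circles: a classic non-linear structure
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA can only rotate the data, so the circles stay circles
X_pca = PCA(n_components=2).fit_transform(X)

# RBF Kernel PCA implicitly maps to a higher-dimensional space first
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

fig, axs = plt.subplots(1, 2, figsize=(10, 4))
axs[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y)
axs[0].set_title("PCA projection")
axs[1].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
axs[1].set_title("Kernel PCA (RBF) projection")
plt.show()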

Source: https://ml-lectures.org/docs/structuring_data/ml_without_neural_network-2.html

Beyond Billionaires
But that’s not all! In addition to analyzing billionaires worldwide, we will also utilize Kernel PCA to denoise images. This application showcases the versatility of this unsupervised learning technique and demonstrates its relevance beyond the realm of financial data analysis. By understanding how Kernel PCA can turn noisy images into clean ones, you’ll gain a broader perspective on its potential applications in various domains.

Awesome! Now that you have some understanding of Kernel PCA and its capabilities, let’s jump into coding!

Installing Required Libraries

# All Libraries required for this lab are listed below. 
!pip install pandas==1.3.4 numpy==1.21.6 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==1.0.2 scipy==1.7.3 joblib==1.3.2 threadpoolctl==3.1.0

Importing Required Libraries

# Suppress warnings from using an older version of sklearn:
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline


from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


warnings.filterwarnings('ignore')

Application 1: Kernel PCA to Predict if You’re the Richest Person in the World

“The World’s Billionaires” is an annual ranking documenting the net worth of the world’s wealthiest people, compiled and published every March by the American business magazine Forbes. A copy of the dataset hosted for the course is loaded below.

The features available from the dataset are:

  • Rank
  • Name
  • Net Worth — their net worth in billions USD
  • Age
  • Country
  • Source — their source of income
  • Industry — sector/industry/market segment in which each billionaire has made their fortune

Step 1: Preparing Data

We load the dataset and take a look to see if it is loaded properly.

# Download the dataset and read it into a Pandas dataframe
df=pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/billionaires.csv',index_col="Unnamed: 0")
df.head()

Now, we assign the target feature that the model is going to predict:

y=df['rank']
y.head()

Next, we drop all the features that are not useful for the task:

df.drop(columns=['name','networth','source'],inplace=True)
df.head()

Since our categorical variables “country” and “industry” are not ordinal, meaning their categories have no inherent order, we use one-hot encoding to convert their levels into dummy variables. The remaining variables in the dataset don’t require encoding, so we specify remainder="passthrough" to keep them in the output unchanged.

# One-hot encode the categorical columns; pass the rest through unchanged
one_hot = ColumnTransformer(
    transformers=[("one_hot", OneHotEncoder(), ['country', 'industry'])],
    remainder="passthrough")
data = one_hot.fit_transform(df)

# Strip the transformer prefixes (e.g. "one_hot__country_...") so each
# column is named after just the category level or original feature
names = one_hot.get_feature_names_out()
column_names = [name[name.find("_") + 1:] for name in
                [name[name.find("__") + 2:] for name in names]]
new_data = pd.DataFrame(data.toarray(), columns=column_names)
new_data.head()

Step 2: Applying PCA and Kernel PCA

Let’s define a Kernel PCA object and fit it to this new data:

# fit_inverse_transform=True learns a mapping back to the original space;
# alpha is the ridge regularization used when learning that inverse map
kernel_pca = KernelPCA(kernel="rbf", fit_inverse_transform=True, alpha=0.1)
kernel_score = kernel_pca.fit_transform(new_data)

We also want to compare with PCA:

# For comparison: plain linear PCA, keeping all components
pca = PCA()
score_pca = pca.fit_transform(new_data)
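As a quick sanity check (my aside, not part of the original lab), it is worth inspecting how many components each method kept: with no n_components specified, PCA keeps min(n_samples, n_features) components, while Kernel PCA keeps every component with a non-zero eigenvalue, which can be as many as the number of samples.

print("Kernel PCA scores shape:", kernel_score.shape)
print("PCA scores shape:", score_pca.shape)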

Step 3: Comparing PCA and Kernel PCA for the Prediction Task

To compare Kernel PCA with PCA, we fit the same linear model (Ridge with alpha=0) on each set of projected scores and evaluate the R² (coefficient of determination) on a held-out test set. Kernel PCA comes out ahead here with the higher test R²:

# Train/test split on the Kernel PCA scores, then fit a linear model
X_train, X_test, y_train, y_test = train_test_split(kernel_score, y, test_size=0.4, random_state=0)
lr = Ridge(alpha=0).fit(X_train, y_train)
print(f"Test set R^2 score for Kernel PCA: {lr.score(X_test, y_test)}")

# Same evaluation on the plain PCA scores
X_train, X_test, y_train, y_test = train_test_split(score_pca, y, test_size=0.4, random_state=0)
lr = Ridge(alpha=0).fit(X_train, y_train)
print(f"Test set R^2 score for PCA: {lr.score(X_test, y_test)}")
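A side note on the model choice: Ridge with alpha=0 applies no regularization, so it is effectively ordinary least squares. The sketch below (my variation, not from the lab) produces the same baseline with LinearRegression, using the PCA split created above.

from sklearn.linear_model import LinearRegression

# Equivalent baseline: Ridge(alpha=0) reduces to ordinary least squares
ols = LinearRegression().fit(X_train, y_train)
print(f"Test set R^2 score (OLS on PCA scores): {ols.score(X_test, y_test)}")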

Application 2: Noise Reduction with Kernel PCA

Next, we’ll take advantage of dimensionality reduction to compare how well PCA and Kernel PCA remove noise from images while still retaining their key features.

Step 1: Preparing Data

First, we load the train and test data, which were created from the USPS digits dataset by adding noise.

X_train_noisy = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/X_train_noisy.csv').to_numpy()
X_test_noisy = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/X_test_noisy.csv').to_numpy()

Below is a helper function for plotting the digits, which we’ll reuse at the end. In the resulting plot, you can see that the noisy digits are fairly hard to make out:

# Helper function for plotting the digit images
def plot_digits(X, title):
    """Small helper function to plot 100 digits."""
    fig, axs = plt.subplots(nrows=10, ncols=10, figsize=(8, 8))
    for img, ax in zip(X, axs.ravel()):
        # Each row of X is a flattened 16x16 grayscale image
        ax.imshow(img.reshape((16, 16)), cmap="Greys")
        ax.axis("off")
    fig.suptitle(title, fontsize=24)

plot_digits(X_test_noisy, "Noisy test images")

Step 2: Fitting PCA & Kernel PCA Objects

First, we create a PCA object called pca and fit it to the noisy training set X_train_noisy. We do the same for a KernelPCA object called kernel_pca, using an RBF kernel.

Note that for PCA, n_components has to be between 0 and min(n_samples, n_features), which here is 256 since each image is 16 × 16 = 256 pixels. Kernel PCA is instead bounded by the number of training samples, which is why it can keep 400 components.

pca = PCA(n_components=35)
pca.fit(X_train_noisy)

kernel_pca = KernelPCA(n_components=400, kernel="rbf", gamma=0.01, fit_inverse_transform=True, alpha=0.1)
kernel_pca.fit(X_train_noisy)
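Before reconstructing, a quick diagnostic (my addition, not part of the lab): PCA exposes explained_variance_ratio_, so we can check how much of the variance in the noisy training set the 35 linear components retain. Kernel PCA has no direct analogue of this ratio, since the variance lives in the implicit feature space.

retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 35 PCA components: {retained:.2%}")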

Step 3: Reconstructing the Digits

Next, we’ll compare PCA and Kernel PCA for noise reduction in the original images. We’ll apply the inverse transformation to the lower-dimensional transformed data to assess their effectiveness. This analysis will provide insights into their respective capabilities in removing noise and restoring image clarity.

# Project the noisy test images into the reduced space, then map them back
X_hat_pca = pca.inverse_transform(pca.transform(X_test_noisy))
X_hat_kpca = kernel_pca.inverse_transform(kernel_pca.transform(X_test_noisy))
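Since the clean, pre-noise images are not loaded in this lab, we can’t compute a true denoising error here. As a rough numeric check (my addition), we can at least measure how far each reconstruction is from its noisy input; a smaller value only means a more faithful reconstruction of the noisy data, so the visual comparison in the next step remains the real test.

# Mean squared difference between reconstructions and the noisy inputs
print("PCA reconstruction MSE vs. noisy input:",
      np.mean((X_test_noisy - X_hat_pca) ** 2))
print("Kernel PCA reconstruction MSE vs. noisy input:",
      np.mean((X_test_noisy - X_hat_kpca) ** 2))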

Step 4: Visualizing Denoised Digit Images

Finally, let’s visualize the images reconstructed from PCA and Kernel PCA! Which one reveals digit images with less noise?

plot_digits(X_hat_pca, "Reconstructed Test Set (PCA)")
plot_digits(X_hat_kpca, "Reconstructed Test Set (Kernel PCA)")

Fantastic! You have acquired valuable skills that will empower you on your data science journey. By implementing Kernel PCA, you have gained a solid understanding of non-linear dimensionality reduction and its applications in prediction tasks as well as image denoising. Enroll in the guided project on Cognitive Class to apply what you’ve learned!

Stay connected and continue your data science journey by following me on Medium or LinkedIn. As part of the IBM Skills Network, I also create super interesting content on AI and data science. Happy learning!
