Unsupervised Learning — Principal Component Analysis (PCA)

Nishant Kumar · Analytics Vidhya · Mar 29, 2020

Namaste from India! (corona effect). Lend me your ears and eyes for a deep dive into PCA.


Let's take the example of a COVID-19 dataset where the number of data points is very small compared to the number of features or variables, which leads to the curse of dimensionality; here PCA comes as a saviour. Principal component analysis is a dimensionality reduction technique that identifies correlations and patterns in a dataset so that it can be transformed into a dataset of significantly lower dimension without losing much of the important information.
So, it’s a feature extraction technique: it combines the input variables/features in a specific way so that we can drop the “least important” variables while still retaining the most valuable parts of all of them. The “new” variables after PCA are all uncorrelated with one another. This is a benefit because linear models assume the predictors are not collinear; if we decide to fit a linear regression model with these “new” variables, that assumption is satisfied by construction.

In fact, every principal component is ALWAYS orthogonal (perpendicular) to every other principal component. Because the principal components are orthogonal to one another, they are linearly uncorrelated. In general, the nth principal component of a dataset is orthogonal to all of the preceding principal components of the same dataset, not just the (n − 1)th.
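A quick way to see this in practice (a minimal sketch; the Iris data here is just an illustrative stand-in, any numeric dataset works) is to check that the component matrix scikit-learn returns is orthonormal:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                     # any numeric dataset works here
pca = PCA().fit(X)

# Rows of components_ are the principal axes; for an orthonormal set,
# V @ V.T should be (numerically) the identity matrix.
V = pca.components_
print(np.allclose(V @ V.T, np.eye(V.shape[0])))   # True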

PCA

Principal component analysis is a method that rotates the dataset so that the rotated features are statistically uncorrelated. This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data.
PCA is a way of reducing the number of independent variables in a dataset and is particularly applicable when the ratio of data points to variables is low. It forms linear combinations of the variables such that each resulting combination captures as much of the remaining variance as possible.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. PCA is mostly used as a tool in exploratory data analysis and for making predictive models; it is often used, for example, to visualize genetic distance and relatedness between populations. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or by singular value decomposition of the data matrix, usually after a normalization step on the initial data. The normalization of each attribute consists of mean centering (subtracting the variable's mean from each data value so that its empirical mean is zero) and, possibly, scaling each variable's variance to make it equal to 1; see z-scores.
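As a small illustration of that normalization step (a sketch on toy data; the array shapes and values are my own assumptions, not from the post):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # toy data

# Mean centering: subtract each column's mean so the empirical mean is 0.
X_centered = X - X.mean(axis=0)

# Optional variance scaling: divide by each column's standard deviation
# so every variable has unit variance (this is the z-score).
X_scaled = X_centered / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))   # ~0 for every column
print(X_scaled.std(axis=0).round(6))    # ~1 for every column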

=> Highly correlated features mean redundant information (multicollinearity), so one variable of such a pair can be dropped.

=> PCA removes inconsistencies, redundant data, and highly correlated features.

=> It is non-parametric and helps against overfitting caused by high dimensionality. It can also filter noisy datasets, e.g. in image compression (see the sketch below).
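As a hedged sketch of that noise-filtering idea (the digits dataset, the noise level, and the choice of 16 components are all illustrative assumptions): fit PCA on noisy data, keep only the leading components, and reconstruct; the discarded components carry mostly noise.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
rng = np.random.default_rng(42)
X_noisy = X + rng.normal(scale=2.0, size=X.shape)   # add Gaussian noise

# Keep only the top components and reconstruct back to pixel space.
pca = PCA(n_components=16).fit(X_noisy)
X_denoised = pca.inverse_transform(pca.transform(X_noisy))

# Compare reconstruction error against the clean images.
print(np.mean((X_noisy - X) ** 2))      # error before filtering
print(np.mean((X_denoised - X) ** 2))   # typically lower after filtering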

PCA steps:
1. Standardization of the data:
Scaling the data is a major pre-processing step in PCA: standardize each variable with the z-score (e.g. scikit-learn's StandardScaler) so that all variables have a similar range and their variances are comparable.
z = (value − mean) / standard deviation
2. Computing the covariance matrix: a measure of how each variable is associated with every other variable.

3. Calculating eigenvectors and eigenvalues:
Eigenvectors give the directions in which our data are dispersed; they are vectors whose direction doesn't change under the linear transformation.
Eigenvalues give the relative importance of these directions; each denotes the scale (variance) along its respective eigenvector.
PCA combines our predictors and allows us to drop the eigenvectors that are relatively unimportant.
4. Computing principal components:
Principal components are vectors that are linearly uncorrelated, each capturing part of the variance in the data. From the principal components, the top p with the most variance are picked.
The new set of variables is obtained by ordering the eigenvectors by their eigenvalues in descending order; the eigenvector with the highest eigenvalue is the most significant and forms the first principal component.
5. Reducing the dimension of the data by selecting the best components without significant information loss (see the NumPy sketch of all five steps below).
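Here is how the five steps look traced by hand with NumPy (a sketch on a tiny toy matrix; the data and variable names are my own, not from the post):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Step 1: standardize (z-score each column).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data.
C = np.cov(Z, rowvar=False)

# Step 3: eigenvectors (directions) and eigenvalues (their importance).
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: order eigenvectors by eigenvalue, descending; the one with the
# highest eigenvalue forms the first principal component.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: keep the top-p components and project the data onto them.
p = 1
X_reduced = Z @ eigvecs[:, :p]
print(eigvals)       # variance explained by each component
print(X_reduced)     # data in the reduced space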

Python code sample (assuming X is a pandas DataFrame of features and y the target):

# Imports needed for the snippets below
import numpy as np
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Scaling: z-score every column so the variances are comparable
XScaled = X.apply(zscore)
XScaled.head()

# Covariance matrix of the scaled data
covMatrix = np.cov(XScaled, rowvar=False)
print(covMatrix)

# PCA with 6 components
pca = PCA(n_components=6)
pca.fit(XScaled)

# Eigenvalues
print(pca.explained_variance_)
# Eigenvectors
print(pca.components_)
# Percentage of variation explained by each eigenvector
print(pca.explained_variance_ratio_)

# Based on those values, choose the 3 best components and then
# use regression with these component features
pca3 = PCA(n_components=3)
pca3.fit(XScaled)
print(pca3.components_)
print(pca3.explained_variance_ratio_)
Xpca3 = pca3.transform(XScaled)

regression_model_pca = LinearRegression()
regression_model_pca.fit(Xpca3, y)
print(regression_model_pca.score(Xpca3, y))
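One common way to decide how many components to keep (a sketch; the 95% threshold is an illustrative assumption, not a rule from the post) is the cumulative explained-variance ratio:

# Fit with all components, then see how variance accumulates.
pca_full = PCA().fit(XScaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
print(cum_var)

# Smallest number of components retaining at least 95% of the variance.
n_keep = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_keep)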

References:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
https://medium.com/@raghavan99o/principal-component-analysis-pca-explained-and-implemented-eeab7cb73b72
https://medium.com/@aptrishu/understanding-principle-component-analysis-e32be0253ef0
