Understanding Principal Component Analysis, Part 1

Anup Bhande
5 min read · Apr 13, 2018


Using only descriptive analysis, or univariate or bivariate inferential analysis, is not enough to obtain the information needed for decision-making on data sets consisting of many variables. In such cases, multivariate analysis can help extract the main information from data sets with a large number of variables and provide additional details that support the decision process. Principal Component Analysis (PCA) is one of these multivariate analysis methods.

Principal Component Analysis (PCA) is a method of data processing through which we can extract a small number of synthetic variables, called principal components, from a large number of variables. PCA projects the data onto components that are mutually uncorrelated and ordered by variance. But before proceeding directly to PCA, let's look at the mathematical terms that will help in understanding it better. This blog covers the mathematical terms that are necessary to know before diving into PCA.

Various Mathematical Terminology

This section gives a brief overview of the various mathematical definitions we need to know before understanding the process of Principal Component Analysis (PCA). Each definition is given separately. Since computation is cheap nowadays there is no need to memorize these formulas, but understanding them shows how and why the decision factors, and thus our outcome, are affected.

a) Statistics: This entire subject deals with large data sets, in which we have to find relationships between individual data points.

1. Standard Deviation: The standard deviation shows us, on average, how far the data points are from the mean, irrespective of whether the difference is positive or negative. In mathematical form it is given as

s = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}

where 's' is the standard deviation, '(n-1)' is used because we have a sample of the data set (if we use the entire data set we use 'n'), X_i is a single data point and X̄ is the mean of the data. The larger the standard deviation, the more the data points are spread out.
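As a minimal sketch (not from the original post), the sample standard deviation can be computed in Python with NumPy; the data values are illustrative, and ddof=1 selects the (n-1) sample convention:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample standard deviation: divides by (n - 1), hence ddof=1.
s = np.std(data, ddof=1)

# The same value computed directly from the formula above.
s_manual = np.sqrt(((data - data.mean()) ** 2).sum() / (len(data) - 1))

print(s, s_manual)  # both print the same value
```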

2. Variance: It shows how the data is spread in the data set. The mathematical representation is given by

s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}

The terms are the same as for the standard deviation; the variance is simply the square of the standard deviation, and s² is the symbol for variance.
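Continuing the sketch above, the sample variance in NumPy, again with the (n-1) convention:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample variance is the squared sample standard deviation.
variance = np.var(data, ddof=1)
print(np.isclose(variance, np.std(data, ddof=1) ** 2))  # True
```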

3. Covariance: Standard deviation and variance are one-dimensional: they are calculated using a single attribute. But when the problem requires measuring how much two or more attributes change with respect to each other, they are of no direct use.

So, in order to measure a data set at the two-dimensional level, covariance is used. If you calculate the covariance between one attribute and itself you get the variance, but calculated between two attributes (x, y) you get the covariance. The number of attributes can be many. The mathematical representation of covariance is given by

cov(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

where 'cov(X, Y)' is the symbol of the covariance, X and Y are the attributes, and (n-1) again reflects a sample subset. When the covariance is positive, it shows that as one attribute increases the other also increases, and vice versa. If it is negative, then as one attribute increases the other decreases, and vice versa. If the answer is zero, the two attributes have no linear relationship (they are uncorrelated).
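A small illustrative sketch of the two-attribute case; the x and y values are made up:

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

# Sample covariance straight from the formula, with (n - 1) below.
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# np.cov returns the full covariance matrix; entry [0, 1] is cov(x, y).
print(cov_xy, np.cov(x, y)[0, 1])  # both agree

# Covariance of an attribute with itself is just its variance.
print(np.isclose(np.cov(x, y)[0, 0], np.var(x, ddof=1)))  # True
```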

4. Covariance Matrix: Above we saw how to calculate covariance. It is useful in two dimensions, but if the data set has more than two dimensions then we need to calculate more than one covariance, e.g. for three dimensions (x, y, z) the covariances are cov(x, y), cov(y, z), cov(z, x). Thus for an n-dimensional data set the number of distinct covariances grows to n(n-1)/2, where 'n' is the number of attributes.

So, a way to present all the covariances between the dimensions is to calculate them and put them in matrix form. For three dimensions it is given as

C = \begin{pmatrix} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{pmatrix}

Note that the diagonal entries are the variances of the individual attributes, and the matrix is symmetric, since cov(a, b) = cov(b, a).
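As a sketch, NumPy's np.cov builds this matrix directly when each row of the input is one attribute; the numbers are illustrative:

```python
import numpy as np

# Three attributes (x, y, z) with four observations each.
x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])
z = np.array([1.0, 0.8, 0.9, 0.2])

# Each row is one variable, so np.cov returns the 3x3 covariance matrix.
C = np.cov(np.vstack([x, y, z]))
print(C.shape)              # (3, 3)
print(np.allclose(C, C.T))  # True: the covariance matrix is symmetric
```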

5. Eigenvectors: An eigenvector can be defined using the following mathematical representation:

A v = \lambda v

If the given matrix is A and v is the vector, then as long as this condition is satisfied, 'v' is said to be an eigenvector. For the two products above to be equal, some conditions have to be fulfilled. The conditions are:

1) The given matrix 'A' has to be a square matrix.

2) The eigenvector 'v' must be a non-zero vector.

3) The scalar 'λ' must correspond to a non-zero vector 'v'.

In the above equation both values, i.e. λ and v, are unknown. The equation states that we are looking for a vector 'v' such that 'v' and 'Av' point in the same direction. It should also be noted that this isn't true for most vectors: typically 'Av' does not point in the same direction as 'v'.

6. Eigenvalue: To define the eigenvalue we use the above equation: the 'λ' mentioned there is nothing but the eigenvalue, the scalar factor by which the eigenvector is scaled.
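A short sketch of both definitions in NumPy; the 2x2 matrix is illustrative:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # a square matrix, as condition 1) requires

# np.linalg.eig returns the eigenvalues and the eigenvectors (as columns).
eigenvalues, eigenvectors = np.linalg.eig(A)

# Check A v = lambda v for every eigenpair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))  # True each time
```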

Conclusion:

In this blog we first saw why univariate or bivariate inferential analysis alone won't always solve complex problems that are multivariate in nature, and that multivariate analysis, and PCA in particular, is the alternative. To prepare for a closer look at Principal Component Analysis, we then studied some of the mathematical terms behind it. In the next blog we will see the explanation of Principal Component Analysis (PCA) itself.
