Principal Component Analysis (PCA) Simplified
Link to the code: Github
Imagine you had a dataset with 1000 features. Visualizing all these features and trying to explain the relationships between them would be a nightmare. Moreover, your model runs the risk of overfitting. Overfitting, in simple terms, means that your model has memorized the patterns in your training data so closely that it does not perform well when given new data.
The number of features in a dataset is referred to as its dimensionality. A dataset with a lot of features is said to be high-dimensional, and a dataset with a small number of features is said to be low-dimensional. To solve the problem above, we need to transform a high-dimensional dataset into a low-dimensional one. This is called dimension reduction.
Please note that dimension reduction is not the same as deleting columns. It mathematically transforms the information in your columns so that most of it is captured using fewer columns. For instance, if you had two highly correlated features, you could combine them into a single new feature.
Benefits of Dimension Reduction
- Consumption of less computational resources.
- Faster running models.
- Improvement of your model performance.
- Better Data Visualization.
One of the most popular techniques for dimension reduction is Principal Component Analysis (PCA).
PCA was invented in 1901 by Karl Pearson and is still widely used today, having proven to be an effective tool for dimension reduction.
We will walk through PCA in two ways:
- Manually calculating and generating the principal components: PCA has a mathematical approach to it. We will generate the principal components manually in order to fully understand the concept.
- Using the scikit-learn library: we'll leverage the scikit-learn library, which generates the principal components for us automatically. This is what you will ideally use when creating a machine learning model, but it is important to understand the concept first using the manual method.
Steps to perform PCA
- Standardization
- Covariance Matrix
- Eigen Decomposition
- Sort by Eigenvalues
- Choose your Principal Components
When analyzing data, we often deal with datasets whose features vary greatly in magnitude and units. For instance, you could be dealing with features measured in kg, km, grams, cm, etc. If you apply machine learning techniques to these features as they are, your algorithm could, for instance, treat 100 grams as greater than 1 kg, which is not true, and it will give us wrong predictions.
We therefore need to scale our features before applying any algorithm to them. This means that when dealing with variables such as weight (0–10000 grams), age (0–100 years), and salary (0–8000 USD), feature scaling brings them onto a comparable scale. The result depends on the technique used: min-max scaling maps each feature to the range (0, 1), while standardization gives each feature a mean of 0 and a standard deviation of 1.
Please note: Standardization might not be necessary for PCA if the scale of your variables is consistent across variables.
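As a minimal sketch of the two scaling techniques mentioned above, the snippet below applies standardization and min-max scaling to a small made-up table. The sample values and the function names `standardize` and `min_max_scale` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sample: columns are weight (grams), age (years), salary (USD).
data = np.array([
    [1200.0, 25.0, 3000.0],
    [8500.0, 47.0, 5200.0],
    [4300.0, 33.0, 7800.0],
    [9900.0, 61.0, 1500.0],
])


def standardize(X):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def min_max_scale(X):
    """Rescale each column so its smallest value is 0 and its largest is 1."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)


standardized = standardize(data)   # every column: mean 0, standard deviation 1
scaled = min_max_scale(data)       # every column: values between 0 and 1
```

Either way, no single feature dominates just because it is measured in larger numbers.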
After you standardize your dataset, the next step is to create the covariance matrix. To understand the covariance matrix, we first need to understand variance and covariance.
Variance: measures how much your data is spread out.
In the image below, we have x-variance and y-variance, where x-variance shows how much the data is spread in the horizontal direction and y-variance shows how much it is spread in the vertical direction. Just by looking at the image, the x-variance is higher than the y-variance because the data is much more spread out along the horizontal axis.
Covariance: this is a measure that describes the relationship between two variables. With variance, you cannot see a relationship since you are using just one variable. However, when you combine two variables, you can get information on how they relate to each other as well as the direction of that relationship, i.e. if a variable x is increasing, will a variable y increase, decrease, or remain unchanged?
The covariance between two variables can be positive, negative, or zero. This can be viewed as your data having a positive correlation, a negative correlation, or no linear correlation, as in the image below.
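To make the two definitions concrete, here is a small sketch that computes sample variance and sample covariance by hand. The arrays `x` and `y` are made-up values, and the helper names are assumptions for illustration; both helpers divide by n - 1, which is the same convention `np.cov` uses by default.

```python
import numpy as np

# Hypothetical paired measurements.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])


def sample_variance(a):
    """Average squared distance from the mean, dividing by n - 1."""
    return np.sum((a - a.mean()) ** 2) / (len(a) - 1)


def sample_covariance(a, b):
    """Average product of the two variables' deviations from their means."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)


var_x = sample_variance(x)        # 10.0: x is widely spread
var_y = sample_variance(y)        # 2.5: y is less spread
cov_xy = sample_covariance(x, y)  # 4.0: positive, so y tends to rise with x
```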
Covariance Matrix: this is a square matrix that shows the variance of each variable and the covariance between each pair of variables in a dataset. If we have variables X, Y, and Z with their values as shown below, then we would first calculate the variance of X, Y, and Z, which is 80.3, 33.037, and 142.5 respectively.
Z has the highest variance and Y has the lowest variance.
Once you have the variance of each variable, you calculate the covariances and create a covariance matrix as shown in the image above.
cov(X,Y) is -13.865: this is a negative number, which means that as X increases, Y decreases, and vice versa.
cov(X,Z) is 14.25: this is a positive number, which means that as X increases, Z increases, and vice versa.
cov(Y,Z) is -39.525: this is a negative number, which means that as Y increases, Z decreases, and vice versa.
For our 3 variables, the covariance matrix therefore has the variance of each variable on its diagonal, while the off-diagonal terms show the covariance between each pair of variables.
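For reference, the standard layout of a covariance matrix for three variables can be written out as follows, together with the same matrix filled in with the values quoted above. The symbol Σ is introduced here for illustration only. The matrix is symmetric because cov(X,Y) and cov(Y,X) are the same number.

```latex
\Sigma =
\begin{bmatrix}
\operatorname{var}(X)   & \operatorname{cov}(X,Y) & \operatorname{cov}(X,Z) \\
\operatorname{cov}(Y,X) & \operatorname{var}(Y)   & \operatorname{cov}(Y,Z) \\
\operatorname{cov}(Z,X) & \operatorname{cov}(Z,Y) & \operatorname{var}(Z)
\end{bmatrix}
=
\begin{bmatrix}
80.3    & -13.865 & 14.25   \\
-13.865 & 33.037  & -39.525 \\
14.25   & -39.525 & 142.5
\end{bmatrix}
```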
Now that you have the covariance matrix, the next step is called eigen decomposition, which is simply the process of producing eigenvalues and eigenvectors. Often an eigenvalue is found first, then an eigenvector is found to help us get the principal components (the new variables we get as combinations or mixtures of the initial variables).
Eigenvectors tell us the directions along which our dataset varies. If your dataset has 2 variables, say the age of a person and their income, you would expect 2 eigenvectors, each describing a direction in the age-income plane. You would also have 2 eigenvalues, which indicate the amount of variance along each eigenvector.
If you had a two-dimensional dataset as shown below, you would have two vectors, u and v. In order of hierarchy, u would be your first eigenvector (the direction in which the data varies the most) and v the second (the direction of greatest variance among those perpendicular to the first eigenvector). If you had a third eigenvector, it would be the one whose direction has the greatest variance among those perpendicular to the first two, and so on. Eigenvalues, in simple English, are the lengths of your arrows, which explain the amount of variance along each vector.
Please visit this link to get more understanding of eigenvectors and values.
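As a rough sketch of eigen decomposition in practice, the snippet below feeds the covariance values quoted earlier (variances 80.3, 33.037, 142.5 and the three covariances) into NumPy. Using `np.linalg.eigh` is a choice made here because the covariance matrix is symmetric; the variable names are assumptions.

```python
import numpy as np

# Covariance matrix built from the variances and covariances quoted earlier
# (row and column order: X, Y, Z).
cov_matrix = np.array([
    [80.3,    -13.865,  14.25],
    [-13.865,  33.037, -39.525],
    [14.25,   -39.525, 142.5],
])

# eigh is intended for symmetric matrices. It returns the eigenvalues in
# ascending order and the matching eigenvectors as the columns of a matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Every pair satisfies the defining property: C v = lambda * v.
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    assert np.allclose(cov_matrix @ v, eigenvalues[i] * v)

print(eigenvalues)
print(eigenvectors)
```

The eigenvalues add up to the sum of the three variances, so together they account for all the variance in the data.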
Sorting by Eigenvalues and Choosing Principal Components
Principal components are the new variables we get as combinations or mixtures of the initial variables. Your dataset is multiplied by the eigenvectors to get the principal components.
Eigenvectors with the lowest eigenvalues contain the least information about the distribution of the data, and those are the ones we would ideally drop. So if you have, say, 3 eigenvalues and you keep the two highest, you end up with 2 dimensions for your dataset. Once you decide on the eigenvalues, you multiply your data by the corresponding eigenvectors to get the principal components. For instance, if you had age, height, and weight contributing to obesity, you would get three eigenvectors, each a weighted combination of all three features rather than a single feature. You would keep the two eigenvectors with the highest eigenvalues, drop the one with the lowest, and multiply your data by the two you kept. The resulting principal components will be the new features in your dataset.
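The selection step can be sketched as follows. The 3x3 covariance matrix is made up, and the helper name `top_k_components` is an assumption for illustration. The helper sorts the eigenvalues from highest to lowest, reports the share of total variance each one explains, and keeps the eigenvectors belonging to the top k.

```python
import numpy as np


def top_k_components(eigenvalues, eigenvectors, k):
    """Keep the k eigenvectors with the largest eigenvalues.

    Returns the kept eigenvectors (as columns) and the fraction of the total
    variance explained by every eigenvalue, sorted from highest to lowest.
    """
    order = np.argsort(eigenvalues)[::-1]          # indices, largest first
    sorted_values = eigenvalues[order]
    sorted_vectors = eigenvectors[:, order]
    explained_ratio = sorted_values / sorted_values.sum()
    return sorted_vectors[:, :k], explained_ratio


# Hypothetical covariance matrix for three features.
cov_matrix = np.array([
    [4.0, 2.0, 0.6],
    [2.0, 2.0, 0.5],
    [0.6, 0.5, 1.0],
])
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

W, explained_ratio = top_k_components(eigenvalues, eigenvectors, k=2)
print(explained_ratio)            # share of variance per component
print(explained_ratio[:2].sum())  # share retained by the two kept components
```

Looking at the retained share is one common way to decide how many components are enough.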
Manually calculating and generating the principal components
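1. Load your data.
import numpy as np

# X is a placeholder name for the sample data: 10 rows, 3 features.
X = np.array([[6., 3., 2.],
              [3., 2., 7.],
              [5., 4., 2.],
              [1., 4., 3.],
              [7., 3., 1.],
              [5., 1., 8.],
              [4., 2., 2.],
              [8., 6., 6.],
              [6., 3., 2.],
              [7., 1., 1.]])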
2. Standardize and compute the covariance matrix.
3. Perform eigen decomposition to get the eigenvalues and eigenvectors.
4. Sort by eigenvalues to get the eigenvalues and eigenvectors in order of significance.
5. Multiply your standardized data by the chosen eigenvectors to get the principal components.
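The steps above can be sketched end to end with NumPy, reusing the ten sample rows listed under step 1. This is one possible implementation rather than the original code: the variable names, the choice of `np.linalg.eigh`, and keeping two components (`k = 2`) are all assumptions.

```python
import numpy as np

# 1. Load the data (the ten rows from step 1; X is a placeholder name).
X = np.array([[6., 3., 2.], [3., 2., 7.], [5., 4., 2.], [1., 4., 3.],
              [7., 3., 1.], [5., 1., 8.], [4., 2., 2.], [8., 6., 6.],
              [6., 3., 2.], [7., 1., 1.]])

# 2. Standardize each column, then compute the covariance matrix.
#    rowvar=False tells NumPy that columns (not rows) are the variables.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_std, rowvar=False)

# 3. Eigen decomposition (eigh suits symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# 4. Sort from the largest eigenvalue to the smallest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 5. Keep the top k eigenvectors and project the standardized data onto them.
k = 2
W = eigenvectors[:, :k]
principal_components = X_std @ W

print(principal_components.shape)  # (10, 2)
```

Each row of `principal_components` is one of the original ten observations, now described by two new features instead of three.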
PCA using Scikit-Learn
1. Load your data: we will use scikit-learn's built-in wine dataset.
2. Standardize your dataset and fit PCA on it.
By setting the number of components to 2 (n_components=2), you're asking PCA to find the two components that best explain the variability in the data.
You can set a preferred number of components based on your requirements but two is usually the simplest to interpret and visualize on a scatterplot.
3. Output your new dimensions.
If you had other feature(s) in your dataset such as the target variable, you will then concatenate the new features to your other features in the dataset and now build your machine learning model.
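The three steps above might look like the following with scikit-learn. This is a sketch, not the original notebook: it assumes the wine data is loaded with `load_wine`, and the column names `PC1`, `PC2`, and `target` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Load the data: 178 wines described by 13 numeric features.
wine = load_wine()
features = pd.DataFrame(wine.data, columns=wine.feature_names)
target = pd.Series(wine.target, name="target")

# 2. Standardize the features, then fit PCA and keep two components.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# 3. Output the new dimensions and attach the target column to them.
pca_df = pd.DataFrame(components, columns=["PC1", "PC2"])
final_df = pd.concat([pca_df, target], axis=1)

print(pca.explained_variance_ratio_)  # share of variance kept by each component
print(final_df.head())
```

`final_df` now has two feature columns plus the target, ready for plotting on a scatterplot or for training a model.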
When to use PCA
- When you want to reduce the number of your variables but are not able to clearly identify the variables you want to remove.
- When you want to make sure your variables are uncorrelated with each other.