# Introduction

This post reviews principal component analysis (PCA), a feature (or dimensionality) reduction technique. Features are the measured attributes whose values describe each sample in the training data. For example, suppose we want a machine learning (ML) model that predicts the price of a house from its other characteristics. Those characteristics are the features. Features are also called dimensions, because each feature can be treated as an axis of the space in which the samples live.

PCA finds the most important directions in the feature space, called principal components, and it relies on variance to do so. The bottom line is that PCA tells us how much of the variance of the original data is retained if M principal components (the most essential features) are used in the model. After this introduction, a natural question is: why do we use PCA?

PCA is a feature reduction technique, so it leads to a smaller model: it reveals that some features barely affect our predictions. Knowing this, we can collect less data, which makes the model more efficient and faster to compute. In a nutshell, PCA reduces the required computation while keeping the quality of the predictions.

At this point, we know why PCA is useful: the model needs fewer calculations. This feature reduction mechanism identifies the essential features that preserve the variance of the data. The next step is to see how it works and why PCA is always tied to variance. To do that, we first need to review variance, covariance, and the covariance matrix. A good understanding of eigenvalues and eigenvectors is also essential.

# Background

This section reviews the fundamental concepts needed to understand PCA fully.

## Variance and Standard Deviation

Variance is a measure of dispersion (scattering), which measures how far a set of numbers is spread out from their average value.

The following example shows how red and blue datasets are dispersed from their average. Standard deviation is the square root of the variance.
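As a rough numeric companion to these definitions, the sketch below computes the variance and standard deviation of two small sets. The values of `red` and `blue` are made up for illustration, and the helper `variance` is a hypothetical name that uses the population form (dividing by n).

```python
import numpy as np

# Hypothetical toy data: two sets with the same mean (5) but different spread
red = np.array([4.0, 5.0, 6.0])
blue = np.array([1.0, 5.0, 9.0])


def variance(values):
    """Population variance: the mean squared distance from the average."""
    return np.mean((values - np.mean(values)) ** 2)


red_var, blue_var = variance(red), variance(blue)

# Standard deviation is the square root of the variance
red_std, blue_std = np.sqrt(red_var), np.sqrt(blue_var)

print("red :", red_var, red_std)
print("blue:", blue_var, blue_std)
```

Both sets share the same average, yet `blue` has the larger variance because its points lie farther from that average.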

## Covariance

Covariance is co + variance. When the “co” prefix is attached to a word, it adds a collective meaning to that word. Covariance is a measure of the joint variability of two variables.

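Different covariance values show how the data is dispersed with respect to the two variables: a positive value means they tend to increase together, a negative value means one tends to decrease as the other increases, and a value near zero means there is no linear trend between them.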
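The sign of the covariance can be seen in a minimal sketch. The helper `covariance` and the toy arrays are hypothetical, and the sample form (dividing by n - 1) is an assumption chosen to match the default of `np.cov`.

```python
import numpy as np


def covariance(x, y):
    """Sample covariance: sum of the products of deviations, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)


# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0])
rising = np.array([2.0, 4.0, 6.0, 8.0])    # grows as x grows
falling = np.array([8.0, 6.0, 4.0, 2.0])   # shrinks as x grows

cov_rising = covariance(x, rising)     # positive
cov_falling = covariance(x, falling)   # negative

print(cov_rising, cov_falling)
```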

## Correlation

Correlation refers to the degree to which a pair of variables is linearly related. It is the covariance of the two variables divided by the product of their standard deviations.

The following figure shows what the value of the correlation can reveal. Whenever we talk about correlation, we should keep in mind that it only captures linear relationships.
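The point about linearity can be illustrated with a small sketch. The function `correlation` and the toy data are hypothetical, and the Pearson form is assumed: a perfectly linear relation gives a correlation of 1, while a parabola that is fully determined by x gives 0 on this symmetric data.

```python
import numpy as np


def correlation(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    cov_xy = np.cov(x, y)[0, 1]   # sample covariance (n - 1 in the denominator)
    return cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))


# Hypothetical toy data
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
linear = 3.0 * x + 1.0   # perfectly linear in x
parabola = x ** 2        # depends on x, but not linearly

r_linear = correlation(x, linear)
r_parabola = correlation(x, parabola)

print(r_linear, r_parabola)
```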

## Covariance Matrix

The covariance matrix is a square matrix that gives the covariance between each pair of elements of a given set of variables (here, the features). Every covariance matrix is symmetric, and its main diagonal contains the variances, because the covariance of a variable with itself is the variance of that variable.
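Both properties can be checked numerically. The sketch below assumes a randomly generated dataset with features on rows, which is the layout `np.cov` expects by default. The data and variable names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 3 features (rows) x 50 samples (columns)
data = rng.normal(size=(3, 50))

# np.cov treats each row as one variable, so the result is 3 x 3
cov_matrix = np.cov(data)

# Property 1: the matrix equals its own transpose
is_symmetric = np.allclose(cov_matrix, cov_matrix.T)

# Property 2: the main diagonal holds the (sample) variance of each feature
diagonal_is_variance = np.allclose(np.diag(cov_matrix), np.var(data, axis=1, ddof=1))

print(cov_matrix.shape, is_symmetric, diagonal_is_variance)
```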

## Eigenvalues and Eigenvectors

Eigenvectors of a transformation (matrix) are the vectors whose direction is unchanged by that transformation, and the eigenvalues are the factors by which those vectors are scaled.

If we keep all the eigenvectors in a matrix so that each column is an eigenvector, we get a transformation matrix whose axes are the eigenvectors.
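A small sketch can make this concrete. The matrix `A` is a hypothetical symmetric example, and `np.linalg.eigh` is used on the assumption that the matrix is symmetric, as covariance matrices are.

```python
import numpy as np

# Hypothetical symmetric 2 x 2 matrix (covariance matrices are always symmetric)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is NumPy's routine for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)

# For every column v of `eigenvectors`, A only scales v by its eigenvalue
checks = [np.allclose(A @ eigenvectors[:, i], eigenvalues[i] * eigenvectors[:, i])
          for i in range(len(eigenvalues))]

# Using the eigenvectors as axes turns A into a diagonal matrix of eigenvalues
diagonalized = eigenvectors.T @ A @ eigenvectors

print(eigenvalues, checks)
```

In the coordinate system spanned by the eigenvectors, the transformation is a pure scaling along each axis, which is what PCA later exploits.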

# How does PCA work?

At this point, we have reviewed all the required fundamental concepts to grasp how PCA works.

## Step 1: Mean vector is calculated

In the first step, the mean of all samples is calculated for each feature. This is needed because we will later build a new coordinate system from the eigenvectors of the covariance matrix, and the origin of that coordinate system is shifted to the mean of the dataset. If our dataset has N features, the mean vector has N elements.
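This step might look as follows in NumPy. The tiny dataset is made up, and placing features on rows is an assumption of this sketch.

```python
import numpy as np

# Hypothetical dataset: N = 2 features (rows) x 4 samples (columns)
dataset = np.array([[1.0, 2.0, 3.0, 4.0],
                    [10.0, 20.0, 30.0, 40.0]])

# One mean per feature, so the mean vector has N elements
mean_vector = dataset.mean(axis=1)

# Subtracting the mean shifts the data so that its center sits at the origin
centered = dataset - mean_vector[:, np.newaxis]

print(mean_vector)
print(centered.mean(axis=1))
```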

## Step 2: Covariance matrix of the dataset

In this step, the covariance matrix of the dataset is calculated. This matrix captures the dispersion of the data, which is why we say that variance is the key to PCA. If we have N features, the covariance matrix is N×N.
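One way to see where the N×N shape comes from is to build the matrix by hand from the centered data. This is a sketch under assumptions: features on rows, random placeholder data, and the sample form with n - 1 in the denominator, which is also the default of `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: N = 4 features (rows) x 100 samples (columns)
dataset = rng.normal(size=(4, 100))

n_samples = dataset.shape[1]
centered = dataset - dataset.mean(axis=1, keepdims=True)

# (N x samples) times (samples x N) gives an N x N matrix
covariance_matrix = centered @ centered.T / (n_samples - 1)

print(covariance_matrix.shape)
```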

## Step 3: Calculating Eigenvectors and Eigenvalues of the Dataset

Next, the eigenvectors and eigenvalues of the covariance matrix are calculated, which gives an N×N matrix of eigenvectors. The eigenvectors are then sorted by the size of their eigenvalues. This matters because each eigenvalue is the variance of the data along its eigenvector, which we call a principal component (PC). For example, the following formula shows the share of the variance kept by each eigenvector in terms of its eigenvalue. It is called the variation portion.
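A standard way to write the variation portion is sketched below. The notation is an assumption of this sketch: λ_i is the eigenvalue of the i-th principal component and N is the number of features.

$$
\text{variation portion}_i = \frac{\lambda_i}{\sum_{j=1}^{N} \lambda_j}
$$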

The following formula shows how much variance is kept by keeping the first k principal components. It is called the cumulative variance.
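A standard way to write the cumulative variance is sketched below. The notation is an assumption of this sketch: the eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_N are sorted from largest to smallest, and k is the number of principal components that are kept.

$$
\text{cumulative variance}_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{N} \lambda_j}
$$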

After visualizing the effect of keeping a specific number of principal components, the figure would look as follows:
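The numbers behind such a plot can be computed with a short sketch. The function name `explained_variance`, the eigenvalues, and the 90% threshold are all hypothetical choices for illustration.

```python
import numpy as np


def explained_variance(eigenvalues):
    """Return (variation portion per component, cumulative variance), largest eigenvalue first."""
    sorted_vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    portion = sorted_vals / sorted_vals.sum()
    return portion, np.cumsum(portion)


# Hypothetical eigenvalues of a 4-feature covariance matrix
portion, cumulative = explained_variance([0.5, 6.0, 1.5, 2.0])

# Smallest number of components that keeps at least 90% of the variance
k = int(np.searchsorted(cumulative, 0.90) + 1)

print(portion)
print(cumulative)
print(k)
```

Plotting `cumulative` against the number of components gives the usual curve that rises quickly and then flattens, and `k` marks where it first crosses the chosen threshold.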

## Step 4: Transforming to the latent space

After deciding how many principal components are needed to keep the desired share of the variance of the original data, the following formula shows how the original data is transformed. The mean vector is subtracted first because the eigenvectors of the covariance matrix describe the spread of the data around its mean. So, the dataset is shifted so that its mean sits at the origin of the new coordinate system.
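A standard way to write this transformation is sketched below. The symbols are assumptions of this sketch: x is one sample with N features, μ is the mean vector, W is the N×k matrix whose columns are the k kept eigenvectors, and y is the k-dimensional transformed sample.

$$
y = W^{\top} (x - \mu)
$$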

For transferring back to the original space:
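A standard way to write this inverse mapping is sketched below. The symbols are assumptions of this sketch: y is the k-dimensional transformed sample, W is the N×k matrix whose columns are the k kept eigenvectors, μ is the mean vector, and x̂ is the reconstructed sample.

$$
\hat{x} = W y + \mu
$$

When k is smaller than N, x̂ is only an approximation of the original sample, because the variance along the dropped eigenvectors is lost.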

The following example shows how to do PCA with Python and the NumPy library.

```python
import numpy as np


def mean_vec(dataset):
    """Mean of every feature over all samples (features are on rows)."""
    mean_vector = []
    for i in range(dataset.shape[0]):
        tmp_mean_vector = np.mean(dataset[i, :])
        mean_vector.append(tmp_mean_vector)
    mean_vector = np.array(mean_vector)
    return mean_vector


def pca(dataset, k):
    '''
    step 1: mean of features of all samples (it is assumed that features are on rows)
    step 2: calculating covariance matrix
    step 3: calculating eigenvalues and eigenvectors and sorting based on the eigenvalues
    step 4: choosing k vectors to form the transformation matrix
    step 5: transforming the dataset with the transformation matrix (done after the call below)
    '''
    # step 1
    mean_vector = mean_vec(dataset)

    # step 2 (np.cov treats each row as one feature)
    covariance_matrix = np.cov(dataset)

    # step 3 (eigh is used here because the covariance matrix is symmetric)
    eig_val_cm, eig_vec_cm = np.linalg.eigh(covariance_matrix)

    # Make a list of (eigenvalue, eigenvector) tuples
    eig_pairs = [(np.abs(eig_val_cm[i]), eig_vec_cm[:, i]) for i in range(len(eig_val_cm))]
    # Sort the (eigenvalue, eigenvector) tuples from high to low eigenvalue
    eig_pairs.sort(key=lambda x: x[0], reverse=True)

    # step 4: the k leading eigenvectors become the columns of the transformation matrix
    matrix_w = np.hstack([eig_pairs[i][1].reshape(dataset.shape[0], 1) for i in range(k)])

    return matrix_w, mean_vector


# consider you have a spreadsheet of data as dataset, with features on rows;
# random placeholder data (12 features x 200 samples) stands in for it here
dataset = np.random.default_rng(0).normal(size=(12, 200))

matrix_w, mean_vector = pca(dataset, 10)

# step 5: center every feature, then project; the result has shape (k, number of samples)
transformed_dataset = matrix_w.T.dot(dataset - mean_vector[:, np.newaxis])
```
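Going back from the latent space to the original space can be sketched in the same style. Everything here is an assumption for illustration: the random placeholder data, the helper name `round_trip`, and the features-on-rows layout. The sketch projects the data onto the first k principal components, maps it back, and measures how much is lost.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 5 features (rows) x 200 samples (columns)
dataset = rng.normal(size=(5, 200))
mean_vector = dataset.mean(axis=1, keepdims=True)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(dataset))
order = np.argsort(eigenvalues)[::-1]   # indices of the eigenvalues, largest first


def round_trip(k):
    """Project onto the first k principal components, then map back to the original space."""
    w = eigenvectors[:, order[:k]]           # N x k transformation matrix
    latent = w.T @ (dataset - mean_vector)   # k x samples
    return w @ latent + mean_vector          # N x samples (reconstruction)


# Mean squared reconstruction error for two choices of k
error_k2 = np.mean((dataset - round_trip(2)) ** 2)
error_k5 = np.mean((dataset - round_trip(5)) ** 2)

print(error_k2, error_k5)
```

Keeping all five components reproduces the data up to rounding error, while keeping only two leaves a visible reconstruction error. That error is the price paid for the lower dimensionality.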

To understand the PCA code, read it several times and try experimenting with it. To sum up, the following figure shows how PCA can reduce the dimensionality to 2 by using the red and blue vectors as the basis of a new coordinate system. The variance along the green vector is negligible, so it is the candidate to be dropped.
