A Simplified Guide to Acquiring a Certificate and Enhancing Your Resume: Master Feature Engineering in Kaggle — lesson 5/6 (Principal Component Analysis)

Mohaddeseh Tabrizian
8 min read · Jul 22, 2023

From Midjourney

Introduction

One of the essential skills for a machine learning engineer is feature engineering. To acquire this skill, the Kaggle course is an excellent resource where you can learn and earn a certification. Having a certification from Kaggle is a valuable addition to your resume.

However, during my experience with Kaggle courses, I encountered difficulties in understanding the concepts and examples provided. Realizing that others might face similar challenges, I took the initiative to simplify the course for you. I have researched and provided explanations for the topics that may have been unclear (I have left the concepts that were simple enough to understand untouched).

After reading this article, your next step is to put your learning into practice by solving the exercises in the lessons. Once you have completed them, you can obtain your certificate. Here is the link to lesson 5 of the 6-lesson feature engineering tutorial on Kaggle: https://www.kaggle.com/code/ryanholbrook/principal-component-analysis

Principal Component Analysis (PCA) is a statistical technique commonly used in data analysis to simplify and summarize a dataset. It aims to reduce the dimensionality of the data while retaining as much of its variability as possible.

In simpler terms, PCA helps us understand and represent complex data in a more concise way by identifying the most important patterns and relationships within the dataset. It does this by creating new variables, called principal components, which are linear combinations of the original variables. These principal components capture the maximum amount of variation in the data.

(Technical note: PCA is typically applied to standardized data. With standardized data “variation” means “correlation”. With unstandardized data “variation” means “covariance”. All data in this course will be standardized before applying PCA.)

What is standardized data?

Standardized data (sometimes loosely called normalized data) is data that has gone through standardization, a preprocessing step that transforms the values of the variables in a dataset onto a common scale, typically by subtracting each variable’s mean and dividing by its standard deviation so that every variable ends up with mean 0 and standard deviation 1.
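
To make this concrete, here is a minimal sketch of standardization in Python. The column names and values are made up for illustration; scikit-learn’s StandardScaler does essentially the same thing.

import pandas as pd

# Hypothetical data with two numeric columns
df = pd.DataFrame({
    "Height": [0.10, 0.14, 0.09, 0.12],
    "Diameter": [0.35, 0.45, 0.30, 0.40],
})

# Standardize: subtract each column's mean and divide by its standard deviation,
# so every column ends up with mean 0 and standard deviation 1
X_std = (df - df.mean()) / df.std()
print(X_std)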

What is Correlation?

Correlation is a statistical measure that indicates the strength and direction of the relationship between two variables. It quantifies the degree to which two variables are linearly related.

Correlation is typically represented by a correlation coefficient, which ranges from -1 to 1. A positive correlation coefficient indicates a positive relationship, meaning that as one variable increases, the other variable tends to increase as well. A negative correlation coefficient indicates a negative relationship, meaning that as one variable increases, the other variable tends to decrease.

What is Covariance?

Covariance is a measure of how two random variables change or vary together. It is used to quantify the relationship between two variables and determines whether they move in the same direction (positive covariance) or in opposite directions (negative covariance). A positive covariance indicates that when one variable increases, the other tends to increase as well, while a negative covariance indicates that when one variable increases, the other tends to decrease.
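
The technical note above said that for standardized data “variation” means correlation, while for unstandardized data it means covariance. Here is a small sketch (reusing the hypothetical columns from before) showing that once the data is standardized, the covariance matrix and the correlation matrix are the same:

import pandas as pd

df = pd.DataFrame({
    "Height": [0.10, 0.14, 0.09, 0.12],
    "Diameter": [0.35, 0.45, 0.30, 0.40],
})

print(df.corr())  # correlation matrix of the original data
print(df.cov())   # covariance matrix of the original data

# After standardization, the covariance matrix equals the correlation matrix
X_std = (df - df.mean()) / df.std()
print(X_std.cov())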

Principal component analysis

Using a dataset called abalone (an abalone is a sea creature much like a clam or an oyster), I will explain “axes of variation”, which help in understanding how Principal Component Analysis works. The simplest explanation of “axes of variation” is that instead of describing the data with the original features, we describe it with its axes of variation, and these axes of variation become the new features. We’ll just look at a couple of features for now: the ‘Height’ and ‘Diameter’ of the abalones’ shells. The point of looking at these features is to describe the ways the abalone tend to differ from one another. Notice that instead of describing abalones by their ‘Height’ and ‘Diameter’, we could just as well describe them by their ‘Size’ and ‘Shape’ (better options, in my opinion).

How do we construct these new features (for example, in the abalone dataset)?

From Kaggle

Analysis of the figure: Often we can give names to these axes of variation, even though the new features created by PCA may not have direct interpretations in terms of the original features. In the picture, Shape and Size are the new features. The axis we call the “Size” component contrasts small height and small diameter (lower left) with large height and large diameter (upper right). The axis we call the “Shape” component contrasts small height and large diameter (flat shape) with large height and small diameter (round shape).

These new features are just linear combinations (weighted sums) of the original features, for example:

df["Size"] = 0.707 * X["Height"] + 0.707 * X["Diameter"]

df["Shape"] = 0.707 * X["Height"] - 0.707 * X["Diameter"]

(df is short for DataFrame; remember, our dataset is stored in this variable.)
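
If you are curious where weights like these come from, here is a minimal sketch using scikit-learn. The file name is a hypothetical placeholder rather than the lesson’s exact code, but on standardized ‘Height’ and ‘Diameter’ the fitted loadings come out at ±0.707 (possibly with flipped signs):

import pandas as pd
from sklearn.decomposition import PCA

abalone = pd.read_csv("abalone.csv")  # hypothetical path to the abalone dataset
X = abalone[["Height", "Diameter"]]

# Standardize the features before applying PCA
X_scaled = (X - X.mean()) / X.std()

# Fit PCA; each row of components_ holds the loadings of one component
pca = PCA()
pca.fit(X_scaled)
print(pca.components_)  # roughly [[0.707, 0.707], [0.707, -0.707]]; signs may flip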

From Kaggle

Analysis of the figure: These new features (Size and Shape) tell us that an increase in the size of the abalone doesn’t have much effect on its shape. But in the left picture we saw that when the height increases, the diameter tends to increase too.

These new features are called the principal components of the data. The weights themselves are called loadings. There will be as many principal components as there are features in the original dataset: if we had used ten features instead of two, we would have ended up with ten components.

Why are we using 0.707 in the Size and Shape equations above, and what does it mean?

From Kaggle

A component’s loadings tell us what variation it expresses through signs (+/-) and magnitudes (values).

This table of loadings is telling us that in the Size component, Height and Diameter vary in the same direction (same sign), but in the Shape component they vary in opposite directions (opposite sign). In each component, the loadings are all of the same magnitude and so the features contribute equally in both. As for the value 0.707 itself: it is approximately 1/√2. PCA scales the loadings of each component so that they form a unit-length vector, and with two equally weighted features that gives 0.707² + 0.707² ≈ 1.

What other things can we learn from PCA?

From Kaggle

Analysis of the figure: PCA also tells us the amount of variation in each component. We can see from the figures that there is more variation in the data along the Size component than along the Shape component. PCA makes this precise through each component’s percent of explained variance.

The Size component captures the majority of the variation between Height and Diameter. It’s important to remember, however, that the amount of variance in a component doesn’t necessarily correspond to how good it is as a predictor (it depends on what you’re trying to predict).

What is Explained Variance?

Explained variance, also known as explained variation, is a concept used in statistics and data analysis to represent the proportion of variance in a dependent variable that can be explained or accounted for by an independent variable or set of independent variables.

What is Cumulative Variance?

Cumulative variance is the cumulative sum of explained variances across multiple independent variables. In other words, it represents the proportion of total variance in the dependent variable that can be explained by a set of independent variables.

Both explained variance and cumulative variance are important measures when analyzing the relationship between variables and assessing the predictive power of independent variables in a statistical model. These measures help determine the extent to which the independent variable(s) contribute to the variability in the dependent variable.
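
In the PCA setting, scikit-learn exposes both quantities directly. Here is a minimal sketch (again with the hypothetical abalone file and columns) that prints each component’s explained variance ratio and the cumulative variance:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

abalone = pd.read_csv("abalone.csv")  # hypothetical path
X = abalone[["Height", "Diameter"]]
X_scaled = (X - X.mean()) / X.std()

pca = PCA()
pca.fit(X_scaled)

# Proportion of the total variance captured by each component
print(pca.explained_variance_ratio_)

# Cumulative variance: running total across the components
print(np.cumsum(pca.explained_variance_ratio_))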

PCA for Feature Engineering

There are two ways you could use PCA for feature engineering:

  • Just getting an idea from PCA to choose our features: Since the components tell you about the variation, you could compute the MI scores for the components and see what kind of variation is most predictive of your target. (MI score stands for Mutual Information score, a measure that quantifies the amount of information two variables share, that is, their mutual dependence. In the context of feature selection or variable importance assessment, the MI score is often used to evaluate the relevance of a feature (independent variable) in relation to the target variable (dependent variable); a higher MI score indicates a stronger relationship between the two, suggesting that the feature contains more useful information for predicting the target.) That could give you ideas for kinds of features to create, for example a product of ‘Height’ and ‘Diameter’ if ‘Size’ is important, or a ratio of ‘Height’ and ‘Diameter’ if ‘Shape’ is important. You could even try clustering on one or more of the high-scoring components. (A short code sketch of this workflow appears after the use cases below.)
  • Using components themselves as features: often the components can be more informative than the original features. Here are some use-cases:

Dimensionality reduction: When your features are highly redundant (for example, the multicollinearity problem: multicollinearity refers to a situation in which two or more independent variables in a statistical model are highly correlated with each other. In other words, there is a strong linear relationship between the independent variables, making it difficult to separate their individual effects on the dependent variable, and this can be problematic for several reasons), PCA will partition the redundancy out into one or more near-zero-variance components, which you can then drop since they contain little or no information. In this way, PCA helps in removing redundant features.

Anomaly detection: Unusual variation, not apparent from the original features, will often show up in the low-variance components. These components could be highly informative in an anomaly or outlier detection task.

Noise reduction: A collection of sensor readings will often share some common background noise. PCA can sometimes collect the (informative) signal into a smaller number of features while leaving the noise alone, thus boosting the signal-to-noise ratio.

Decorrelation: Some ML algorithms struggle with highly correlated features (they depend on each other too much). PCA transforms correlated features into uncorrelated components, which could be easier for your algorithm to work with, meaning the algorithm can learn better.
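
To make both approaches concrete, here is a minimal sketch with hypothetical file, feature, and target names. It builds the components as new features and scores them with mutual information against the target; it illustrates the idea rather than reproducing the lesson’s exact code.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression

abalone = pd.read_csv("abalone.csv")  # hypothetical path
features = ["Height", "Diameter"]     # hypothetical feature subset
X = abalone[features]
y = abalone["Rings"]                  # hypothetical target column

# Standardize, then fit PCA and transform the features into components
X_scaled = (X - X.mean()) / X.std()
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Wrap the components in a DataFrame so they can be used as new features
component_names = [f"PC{i + 1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=component_names, index=X.index)

# Approach 1: score each component with mutual information against the target
mi_scores = mutual_info_regression(X_pca, y)
print(pd.Series(mi_scores, index=component_names))

# Approach 2: use the components themselves as model inputs
X_with_components = X.join(X_pca)
print(X_with_components.head())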

Notes to remember:

  • PCA only works with numeric features, like continuous quantities or counts.
  • PCA is sensitive to scale. It’s good practice to standardize your data before applying PCA, unless you know you have good reason not to (a small pipeline sketch follows this list).
  • Consider removing or constraining outliers, since they can have an undue influence on the results.
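
One convenient way to respect the second note is to chain standardization and PCA together. Here is a minimal sketch using a scikit-learn Pipeline, with the same hypothetical columns as before:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

abalone = pd.read_csv("abalone.csv")  # hypothetical path
X = abalone[["Height", "Diameter"]]   # numeric features only

# Chain standardization and PCA so the scaling step is never forgotten
pca_pipeline = make_pipeline(StandardScaler(), PCA())
X_pca = pca_pipeline.fit_transform(X)
print(X_pca[:5])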

Don’t forget to read the example provided at the end of the lesson:))
