Dimensionality reduction with Factor Analysis on Student Performance data

A dimensionality reduction technique with interpretable outputs

Alexandre Henrique
Geek Culture
Nov 15, 2021



Often, the datasets we use for data analysis and Machine Learning tasks have several variables. Using such datasets as they are can damage model performance and significantly increase training time or make it extremely difficult to analyze data and get insights from it. Exploratory Factor Analysis (FA) is a dimensionality reduction technique that attempts to group intercorrelated variables together and to produce interpretable outputs.

In this post, we are going to talk about the general idea behind factor analysis and understand it through a hands-on approach, using the Student Performance Data Set to classify whether or not a student will succeed in their math class.

Terms and General idea

In order to apply Factor Analysis, we must make sure the data we have is suitable for it. The simplest approach is to look at the correlation matrix of the features and identify groups of intercorrelated variables. If some features are correlated with a degree of more than 0.3, it may be worth using Factor Analysis. Groups of highly intercorrelated features will be merged into one latent variable, called a factor.

Hence, the obtained factor creates a new dimension that “explains” the group of features that composes it. The projection of an observation onto a factor is called a factor score, and the correlation of a variable with a factor is called a factor loading. If we sum the squares of the factor loadings of a variable, we get a quantity called communality, which ranges from 0 to 1 and measures how much of the variance of that variable is explained by the factors.

Thus, the factor scores can be used in regression and classification tasks as new features of a space with fewer dimensions. On the other hand, factor loadings are especially useful to measure the importance of a particular variable to a factor.
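As a tiny numerical illustration (not from the original analysis): if a variable has loadings of 0.7, 0.2, and 0.1 on three factors, its communality is 0.7² + 0.2² + 0.1² = 0.54, i.e. the factors explain about 54% of its variance:

    import numpy as np

    # Hypothetical loadings of a single variable on three factors
    loadings = np.array([0.7, 0.2, 0.1])

    # Communality: the sum of squared loadings (0.54 here)
    communality = np.sum(loadings ** 2)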

Procedure

The following diagram shows a step-by-step procedure for performing Factor Analysis. As you surely expect, the starting point is to preprocess the data: feature encoding, feature selection, removing outliers (FA models are very sensitive to outliers), and whatever else you deem necessary.

Then, we enter the phase of checking whether or not the data is suitable for FA. I mentioned before that you can simply look at the correlation matrix of the features. However, this is rather simplistic; if you are taking things seriously, you should also check whether the sample size is adequate and perform some statistical tests.

After assessing data quality, now comes what we are really looking forward to doing: fitting a FA model. As is the case with Principal Component Analysis (PCA), we don’t know beforehand how many dimensions we need to boil our dataset down to in order to retain a fair amount of the variance of the features. Therefore, we need a criterion for choosing the number of factors (dimensions) for the dataset; later on, we’ll talk about the Scree Plot and the Kaiser criterion.

The main advantage of using FA instead of PCA is that the outputs are much easier to interpret. So, naturally, the last step of FA is to interpret the factors using the information of which variables each of them tries to explain.

Factor Analysis Step-by-Step diagram

Predicting Student Performance

As an example, we are going to apply the process described in the last diagram to the Student Performance Dataset, interpret the output factors and use them in a classification task to predict student grades. Detailed information about the dataset and the complete source code can be found in this Kaggle notebook.

The dataset originally has 33 variables and 395 students from two Portuguese schools, observed in their Math class. The features include student grades and demographic, social, and school-related information. We have the first-period (G1), second-period (G2), and final (G3) grades, but we’ll try to predict G3 without using G1 and G2: these variables are highly correlated with G3, and we want to capture exclusively the relationship of the other variables with G3.

Assumptions — Assessing data suitability

To be suitable for factor analysis, a dataset must satisfy several assumptions:

  1. Normality: features with a normal distribution considerably improve the results of statistical tests. Moreover, this makes it possible to generalize the results of the analysis beyond the collected sample.
  2. Linear relations: there must not be a perfect correlation between any pair of variables; if there is, drop one variable from each such pair.
  3. Factorability: check that at least some variables of the dataset are correlated and can be turned into coherent factors.
  4. Sample size: should be large enough to yield reliable estimates. Ideally, the dataset should have at least 20 records per variable.

For detailed information on this topic, access this page.

Factorability

Factorability is one of the most important assumptions for FA. There are 3 ways of assessing factorability:

Correlation matrix

To verify whether the data is suitable for FA, you can check if there are at least some correlations > 0.3. If so, the FA algorithm will be able to find groups of intercorrelated variables. One limitation of this approach is that, as the number of variables in the dataset increases, it becomes practically impossible to keep track of the relationships among variables.
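A minimal sketch of this check (assuming the encoded features live in a pandas DataFrame called df, a name of my own, not necessarily the notebook’s):

    import pandas as pd

    corr = df.corr().abs()

    # For each variable, count how many other variables it correlates with above 0.3
    mask = (corr > 0.3) & (corr < 1.0)
    print(mask.sum().sort_values(ascending=False))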

Bartlett’s Test of Sphericity

Bartlett’s Test of Sphericity tests the null hypothesis that the correlation matrix is an identity matrix; for FA to be applicable, the p-value of the test should be significant (p < 0.05). The p-value obtained for the Student Performance Dataset was 0.
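With the factor_analyzer package, the test can be run roughly as follows (a sketch reusing the same df as above):

    from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

    chi_square, p_value = calculate_bartlett_sphericity(df)
    print(p_value)  # should be below 0.05 for FA to be applicable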

Kaiser-Meyer-Olkin (KMO) Test

The KMO test measures the sampling adequacy of the data, both for the dataset as a whole and for each individual variable. KMO values range between 0 and 1. There is no general agreement on a cut-off, but a KMO value below 0.5 is commonly considered inadequate.

The overall KMO value of the dataset was 0.489, which is not good. However, after removing all variables with KMO < 0.5, the overall KMO rose to 0.639:
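A sketch of how this could be done with factor_analyzer (again assuming the encoded features are in df; the variable names are mine, not the notebook’s):

    from factor_analyzer.factor_analyzer import calculate_kmo

    kmo_per_variable, kmo_total = calculate_kmo(df)
    print(kmo_total)  # overall KMO of the full dataset

    # Keep only the variables with an individual KMO of at least 0.5 and recompute
    df_kmo = df.loc[:, kmo_per_variable >= 0.5]
    _, kmo_total_after = calculate_kmo(df_kmo)
    print(kmo_total_after)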

Choosing the number of factors

Now that we have a suitable dataset, it is time to fit a model. The factor_analyzer package is probably the best choice you’ll have. Its interface is based on Scikit-Learn estimators, so the code follows the same logic.

After fitting the model, you’ll have access to the eigenvalues of the correlation matrix of the features. According to the Kaiser criterion, the number of factors your model should have is the number of eigenvalues greater than 1.

That being the case, you first fit a model with the number of factors equal to the number of variables:
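A minimal sketch with factor_analyzer (assuming the filtered features from the KMO step are in df_kmo; rotation is left out in this first pass):

    from factor_analyzer import FactorAnalyzer

    # First pass: as many factors as variables, no rotation, just to get the eigenvalues
    fa = FactorAnalyzer(n_factors=df_kmo.shape[1], rotation=None)
    fa.fit(df_kmo)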

Then, you’ll inspect how many eigenvalues are greater than 1:
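Continuing the sketch above:

    # First array returned: eigenvalues of the original correlation matrix
    eigenvalues, _ = fa.get_eigenvalues()

    # Kaiser criterion: keep one factor per eigenvalue greater than 1
    n_factors = (eigenvalues > 1).sum()
    print(n_factors)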

and retrain the model with that number of factors:
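For example (the varimax rotation here is my own assumption, chosen because it keeps the factors easy to interpret; the notebook may use a different rotation):

    fa = FactorAnalyzer(n_factors=n_factors, rotation='varimax')
    fa.fit(df_kmo)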

Another widely used method for selecting the number of factors is the Scree Plot. It is a visual tool: plot the eigenvalues against the number of factors and count how many eigenvalues are greater than 1. You can also do this programmatically, as I did above.
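One possible way to draw it, reusing the eigenvalues computed in the sketch above:

    import matplotlib.pyplot as plt

    plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, 'o-')
    plt.axhline(y=1, linestyle='--')  # Kaiser cut-off
    plt.xlabel('Factor number')
    plt.ylabel('Eigenvalue')
    plt.title('Scree Plot')
    plt.show()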

Scree Plot

The original dataset, after variable encoding, had 38 features. Then, after removing those considered inadequate by the KMO test, we were left with 22 features. Finally, using the Kaiser criterion, we decreased the number of features (in this case, factors) to 9.

Factors Interpretation

Once we have the new model, we must interpret the factors. In fact (I want to stress this point), that is the whole point of doing dimensionality reduction with Factor Analysis: having an interpretable dataset with fewer features.

Factor loadings heatmap

The figure above shows the factor loadings heatmap. The strength of the relationship between each variable and each factor is given by its factor loading.

Factor loadings range from -1 to 1 and can be interpreted as the correlation of the variable with the factor. Therefore, we assign each variable to the factor it is most correlated with in absolute value. By doing so, we have:
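One way to build this grouping programmatically (a sketch reusing fa, df_kmo, and n_factors from the earlier snippets; the details may differ from the figure below):

    import pandas as pd

    loadings = pd.DataFrame(fa.loadings_,
                            index=df_kmo.columns,
                            columns=[f'Factor{i}' for i in range(n_factors)])

    # Assign each variable to the factor it loads most strongly on (in absolute value)
    assignment = loadings.abs().idxmax(axis=1)
    print(assignment.sort_values())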

Variables grouped into factors

The groups of variables somehow make sense. Take a look at which variables are grouped under each factor:

  • Factor0: Workday and weekend alcohol consumption.
  • Factor1: Mother’s and father’s education, and whether or not the student attended nursery school.
  • Factor2: Whether the student has internet access at home and whether the mother works at home.
  • Factor3: Student address and home-to-school travel time.
  • Factor4: Weekly study time, number of past class failures, and whether the student wants to pursue higher education.
  • Factor5: Family educational support and whether or not the student took extra paid classes.
  • Factor6: Student’s age and whether or not the student has extra educational school support.
  • Factor7: Free time after school and how frequently the student goes out with friends.
  • Factor8: Student’s sex, parents’ cohabitation status, whether or not the student takes extra-curricular activities, and their number of absences.

Once we are satisfied with the interpretation of the factors, we can obtain the new dataset by applying the transformation to the original dataset:
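With factor_analyzer, this is a call to transform (a sketch; X_fa is my name for the new, lower-dimensional feature matrix):

    import pandas as pd

    # Factor scores become the new feature matrix with only n_factors columns
    X_fa = pd.DataFrame(fa.transform(df_kmo),
                        columns=[f'Factor{i}' for i in range(n_factors)],
                        index=df_kmo.index)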

Binary and Multiclass classification results

Finally, with the new dataset, we can perform supervised learning tasks. There are 3 types of tasks we can do:

  1. Binary classification: try to predict if a student will succeed or not.
  2. Multiclass classification: try to classify the student performance considering the 5-level classification based on the Erasmus grade conversion system: fail, sufficient, satisfactory, good, and very good.
  3. Regression: try to predict the student's final grade.

In this study, I only performed the first two classification tasks and compared the results to those reported in the original paper and to the results obtained using the dataset as it is.
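A rough sketch of the binary setup (the pass mark of 10 on G3 is the usual convention for this dataset; the classifier and evaluation scheme below are illustrative choices of mine, and g3 is assumed to be the final-grade column kept aside before FA):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Binary target: pass if the final grade G3 is at least 10
    y_binary = (g3 >= 10).astype(int)

    clf = RandomForestClassifier(random_state=0)
    print(cross_val_score(clf, X_fa, y_binary, cv=5).mean())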

Binary classification results

As we can see, using Factor Analysis proved to be more than worth it in the binary classification setup. We managed to improve performance by 11.2% compared to the results of the original paper.

Multiclass classification results

Meanwhile, in the multiclass classification setup, we also managed to improve on the accuracy reported in the original paper, even though the result wasn’t better than the one obtained using the dataset without factor analysis. Nevertheless, you must remember that the dataset with FA has considerably fewer features, so training time and CPU resource consumption are much lower.

Thanks for your time, I would appreciate your feedback. The source code with comments and the complete exploratory factor analysis are available here. Happy coding!
