Factor Analysis in Python - Characterising Companies Based on Financial Metrics During Covid-19

Quantitative Python
The Startup

--

Note: All code is available in the repo: https://quanp.readthedocs.io/en/latest/tutorials.html

Previously, the author performed principal component analysis (PCA) on the financial metrics of all the S&P 500 companies and found that the first five PCs carried most of the explained variance. This article does not replicate that work; it is recommended to read the previous article first before proceeding.

A principal component (dimension) from PCA can be viewed as a factor spanning a space made up of a set of features. Fundamentally, PCA and the closely related factor analysis (FA) combine variables that are correlated with one another, but largely independent of other subsets of variables, into components/factors. Both PCA and FA summarise patterns of correlation among observed variables and reduce a large number of observed variables (features/dimensions) to a smaller number of components/factors. Frequently, such analyses produce an operational definition of an underlying process through the correlations/contributions (loadings) of the observed variables on a factor/component (Tabachnick & Fidell, 2013).

Here, we will dive deeper into the first five PCs/factors and their respective underlying features.

1. Download the data

Here, we retrieve the 505 S&P 500 member companies listed on Wikipedia and pull a list of fundamental metrics for each company from the TD Ameritrade API (all functions are available in the quanp tools).
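For readers who prefer not to use the quanp helpers, the rough equivalent of the first step is sketched below with plain pandas; the column names ('Symbol') are an assumption based on the current layout of the Wikipedia table, and the TD Ameritrade request (which needs an API key) is only indicated in a comment.

import pandas as pd

# Pull the S&P 500 constituents table straight from Wikipedia
sp500_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
sp500 = pd.read_html(sp500_url)[0]  # first table on the page
tickers = sp500['Symbol'].tolist()  # column name assumed from the current page layout
print('{} tickers retrieved'.format(len(tickers)))

# Fundamental metrics for each ticker would then be requested from the
# TD Ameritrade API (an API key is required); the quanp helpers wrap these steps.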

The list of potentially useful fundamental variables/features:

2. Simple feature preprocessing

Normally distributed variables are not strictly required for factor analysis (FA)/principal component analysis (PCA). However, if the variables are normally distributed, the resulting solution is usually enhanced (Tabachnick & Fidell, 2013). Here, as in the previous article, we apply only two simple and standard preprocessing steps: a log(x+1) transformation followed by standardisation scaling. (Technical note: PCA is typically applied to standardised data. With standardised data, “variation” means “correlation”; with unstandardised data, “variation” means “covariance”. All data in this article are standardised before applying PCA.)
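The quanp toolkit presumably provides helpers for these steps; as a minimal sketch, the same two transformations can be done with numpy and scikit-learn, assuming the raw fundamentals sit in a DataFrame called df_raw (a hypothetical name).

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df_raw: pd.DataFrame) -> pd.DataFrame:
    """log(x+1)-transform each metric, then standardise to zero mean/unit variance."""
    logged = np.log1p(df_raw)
    scaled = StandardScaler().fit_transform(logged)
    return pd.DataFrame(scaled, index=df_raw.index, columns=df_raw.columns)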

3. Dimensionality reduction using PCA

PCA applies an orthogonal rotation, so the extracted components/factors are uncorrelated with each other. Running PCA reduces the dimensionality of the data, reveals the main axes of variation, and denoises the data. Previously, we found that the ‘elbow’ point of the PCA variance ratios suggests that at least the first five PCs (up to PC5) are useful for characterising the companies.
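As a quick sanity check of that elbow claim, the explained variance ratio per component can be plotted directly; a minimal sketch with scikit-learn, assuming the preprocessed feature matrix can be exported from the AnnData object with adata.to_df() (as is done later in this article).

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = adata.to_df()        # preprocessed fundamentals exported from the AnnData object
pca_full = PCA().fit(X)  # keep all components

# Scree-style plot of the explained variance ratio; look for the elbow (around PC5)
plt.plot(range(1, len(pca_full.explained_variance_ratio_) + 1),
         pca_full.explained_variance_ratio_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.show()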

For instance, we also found that the Information Technology, Financials, and Energy sectors can be separated along PC1, from low to high.

qp.pl.pca(adata, color=['GICS_Sector'], size=50,
          groups=['Financials', 'Energy', 'Information Technology']);

4. Dissecting PCA outputs

As noted above, both PCA and FA combine variables that are correlated with one another, but largely independent of other subsets of variables, into components/factors, and the loadings of the observed variables on each component/factor provide an operational definition of the underlying process (Tabachnick & Fidell, 2013).

To calculate the loading of each component, we need the PCA eigenvectors (the cosines of rotation of the variables into the components) and eigenvalues (the proportion of overall variance explained): loadings are the eigenvectors scaled to the respective eigenvalues, and represent the covariances between variables and components. To obtain these, we exported the data from the AnnData object to a pandas DataFrame and re-ran the PCA using sklearn.decomposition.PCA.

from sklearn.decomposition import PCA

# Export the AnnData object to a pandas DataFrame
df_fundamental_anndata = adata.to_df()

# Apply PCA, keeping the first five components
n = 5
pca = PCA(n)

# Fit and transform the preprocessed data with the PCA above
pca_values = pca.fit_transform(df_fundamental_anndata)

We can calculate the PCA loadings for each component using the formula loadings = eigenvectors × √(eigenvalues):

import numpy as np
import pandas as pd

# Loadings = eigenvectors scaled by the square roots of the eigenvalues
pca_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
pca_loading_matrix = pd.DataFrame(
    pca_loadings,
    columns=['PC{}'.format(i) for i in range(1, n + 1)],
    index=df_fundamental_anndata.columns)

# Label each variable with the PC it loads on most strongly and sort
pca_loading_matrix['Highest_loading'] = pca_loading_matrix.idxmax(axis=1)
pca_loading_matrix = pca_loading_matrix.sort_values('Highest_loading')

From the heatmap below, we can see some patterns in the loadings explaining each PC, but they are not very obvious. This is because the goal of PCA is to extract the maximum variance from a dataset with a few orthogonal components, providing only an empirical summary of the dataset. We therefore proceed to factor analysis to see whether the loadings reveal clearer underlying patterns for each factor.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(25, 5))

# Plot the heatmap of the PCA loading matrix (PCs x variables)
ax = sns.heatmap(pca_loading_matrix.drop('Highest_loading', axis=1).T,
                 vmin=-1, vmax=1, center=0,
                 cmap=sns.diverging_palette(220, 20, n=200),
                 square=True, annot=True, fmt='.2f')

ax.set_yticklabels(ax.get_yticklabels(), rotation=0);

5. Factor analysis (FA)

In contrast to PCA, the goal of FA (with an orthogonal rotation) is to reproduce the correlation matrix with a few orthogonal factors.

Testing the factorability of R

A matrix that is ‘factorable’ should include several sizable correlations. The expected size depends, to some extent, on N (larger sample sizes tend to produce smaller correlations), but if no correlation exceeds 0.30, the use of FA is questionable because there is probably nothing to factor-analyse.
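A quick way to check this rule of thumb is to count how many off-diagonal correlations exceed 0.30; a minimal sketch on the preprocessed DataFrame built earlier.

import numpy as np

corr = df_fundamental_anndata.corr()

# Count each off-diagonal pair once and check how many exceed |r| = 0.30
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
n_sizable = int((upper.abs() > 0.30).sum().sum())
print('{} variable pairs with |r| > 0.30'.format(n_sizable))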

Bartlett's test of sphericity is a not particularly useful test of the hypothesis that the correlations in a correlation matrix are zero: the test is likely to be significant with a substantial/large sample size even if the correlations are very low. It is therefore recommended only when there are fewer than five samples per variable. In our case, we have 23 variables and 505 samples (companies) in total, so a maximum of about 115 samples would meet this test's recommended sample size; strictly speaking, we do not qualify for the test. We nevertheless run it to see whether it is significant. The p-value is 0 (significant), although, given the sample size, this is not a reliable conclusion.

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

# Bartlett's test of sphericity returns (chi-square statistic, p-value)
chi_square, p_value = calculate_bartlett_sphericity(df_fundamental_anndata)
print('Bartlett-sphericity Chi-square: {}'.format(chi_square))
print('Bartlett-sphericity P-value: {}'.format(p_value))

Alternatively, the Kaiser-Meyer-Olkin (KMO) test (Kaiser's measure of sampling adequacy) is the ratio of the sum of squared correlations to the sum of squared correlations plus the sum of squared partial correlations. The value approaches 1 if the partial correlations are small. Values of 0.6 and above are recommended for a good FA.

from factor_analyzer.factor_analyzer import calculate_kmo

# KMO returns per-variable scores and an overall model score
kmo_all, kmo_model = calculate_kmo(df_fundamental_anndata)
print('KMO score: {}'.format(kmo_model))

Estimating the number of factors and checking for variables with communalities > 0.2

Here, as we are only interested in how many significant factors to use for the following work, we ran a simple factor analysis without any rotation. In fact, we could simply use the estimate based on the PCA variance ratios (eigenvalues) above.

Again, we see that the elbow point on the scree plot falls at about F5. Next, the communalities of the variables/features are inspected to see whether the variables are well defined by the solution. Communalities indicate the percentage of variance in a variable that overlaps with variance in the factors. Ideally, we should drop variables with low communalities, for example those below 0.2. Here, we found that only 22 of 34 variables had communalities >0.2. However, for the exploratory purpose of this dataset, we did not remove those variables.
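A minimal sketch of this step with the factor_analyzer package: fit an unrotated factor analysis, inspect the eigenvalues for the scree plot, and check the communalities against the 0.2 cutoff (fixing five factors here is an assumption taken from the elbow above).

import pandas as pd
import matplotlib.pyplot as plt
from factor_analyzer import FactorAnalyzer

# Unrotated factor analysis, used only to estimate the number of factors
fa_unrotated = FactorAnalyzer(n_factors=5, rotation=None)
fa_unrotated.fit(df_fundamental_anndata)

# Scree plot of the eigenvalues of the correlation matrix
eigenvalues, _ = fa_unrotated.get_eigenvalues()
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o')
plt.xlabel('Factor')
plt.ylabel('Eigenvalue')
plt.show()

# Communalities: the share of each variable's variance explained by the factors
communalities = pd.Series(fa_unrotated.get_communalities(),
                          index=df_fundamental_anndata.columns)
print('{} variables with communality > 0.2'.format(int((communalities > 0.2).sum())))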

Factor analysis with Varimax (orthogonal) rotation and Maximum Likelihood factor extraction method

We can revisit the correlation matrix plot of the features/variables of this dataset, shown below. The underlying variables within a given factor (see the following figure) are usually highly correlated with one another but only weakly correlated with the underlying variables in other factors.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(25, 25))

# Plot the heatmap of the correlation matrix (lower triangle only)
corr = df_fundamental_anndata.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(corr,
                 vmin=-1, vmax=1, center=0,
                 cmap=sns.diverging_palette(220, 20, n=200),
                 mask=mask, square=True,
                 annot=True, fmt='.2f')
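The article does not show the code that builds the fa_loading_matrix used in the next heatmap; a minimal sketch with factor_analyzer follows, assuming five factors, Varimax rotation and maximum-likelihood extraction as described in the heading above. The heatmap of this loading matrix is then plotted below.

from factor_analyzer import FactorAnalyzer
import pandas as pd

# Five factors, Varimax (orthogonal) rotation, maximum-likelihood extraction
fa = FactorAnalyzer(n_factors=5, rotation='varimax', method='ml')
fa.fit(df_fundamental_anndata)

# Loading matrix: variables as rows, factors as columns
fa_loading_matrix = pd.DataFrame(
    fa.loadings_,
    columns=['FA{}'.format(i) for i in range(1, 6)],
    index=df_fundamental_anndata.columns)

# Label each variable with the factor it loads on most strongly and sort
fa_loading_matrix['Highest_loading'] = fa_loading_matrix.idxmax(axis=1)
fa_loading_matrix = fa_loading_matrix.sort_values('Highest_loading')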
plt.figure(figsize=(25, 5))

# Plot the heatmap of the FA loading matrix (factors x variables)
ax = sns.heatmap(fa_loading_matrix.drop('Highest_loading', axis=1).T,
                 vmin=-1, vmax=1, center=0,
                 cmap=sns.diverging_palette(220, 20, n=200),
                 square=True, annot=True, fmt='.2f')

ax.set_yticklabels(ax.get_yticklabels(), rotation=0);

As a rule of thumb, only variables with a loading of 0.32 and above are interpreted; the greater the loading, the more the variable is a pure measure of the factor (a short sketch applying this cutoff follows the list). Comrey and Lee (1992) suggest that loadings of:

>71% (50% overlapping variance) are considered excellent;
>63% (40% overlapping variance) very good;
>55% (30% overlapping variance) good;
>45% (20% overlapping variance) fair; and
>32% (10% overlapping variance) poor.
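As a quick application of this rule of thumb, loadings below the 0.32 cutoff can be blanked out before interpretation; a minimal sketch on the loading matrix built above.

# Blank out loadings below the 0.32 cutoff before interpreting the factors
loadings_only = fa_loading_matrix.drop('Highest_loading', axis=1)
interpretable = loadings_only.where(loadings_only.abs() >= 0.32)
print(interpretable.round(2))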

In summary, based on the heatmap above:

  • Factor 1 (FA1) seems to capture the operational profitability or valuation of a company;
  • FA2 correlates more with the volatility of a company's share price/market capitalisation;
  • FA3 seems to capture long-term debt obligations;
  • FA4 seems to capture solvency and/or short-term debt obligations (i.e. quick/current ratio); and lastly,
  • FA5 seems to be correlated with operational profitability, specifically a company's gross-margin performance.

Please see further indicators/metrics related to each canonical fundamental factor used to evaluate the fundamental performance of a company.

6. Final author's opinion

Factor analysis (FA) seems to be a better dimensionality-reduction approach than PCA (which is more of an empirical summary) when we are interested in a theoretical solution uncontaminated by unique and error variability, and when we have designed the study around underlying constructs that are expected to produce scores on the observed variables. Some machine learning algorithms struggle with highly correlated features. In this FA study, we used Varimax rotation (an orthogonal rotation, so all the factors are uncorrelated with each other); the resulting factors could therefore serve as more useful latent variables in a factor model for multiple linear or logistic regression, e.g. to predict a positive or negative return in a stock's price performance. We will try out these uncorrelated latent variables/factors in the next article to see how well they predict such events.
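As a preview of that idea (not the author's next-article code), the factor scores from the fitted model could be fed into a simple classifier; a hedged sketch in which y_return_up is a hypothetical 0/1 label for negative/positive price return.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Factor scores per company: uncorrelated latent variables from the fitted FA model
factor_scores = fa.transform(df_fundamental_anndata)

# y_return_up is a hypothetical 0/1 label for negative/positive price return
clf = LogisticRegression(max_iter=1000)
cv_accuracy = cross_val_score(clf, factor_scores, y_return_up, cv=5)
print('Mean CV accuracy: {:.2f}'.format(cv_accuracy.mean()))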

References:

  1. Loadings vs eigenvectors in PCA: when to use one or another? https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another
  2. Making sense of principal component analysis, eigenvectors & eigenvalues https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/35653#35653
  3. Tabachnick, B. G., & Fidell, L. S. Using Multivariate Statistics, Sixth Edition. Pearson, 2013. ISBN-13: 978-0205956227.

The author works as a full-time computational biologist. He loves exploring different industries and marrying techniques across domains.