Feature Engineering — Principal Component Analysis on News Headline Embeddings

Haykaz Aramyan
LSEG Developer Community
3 min read · Jul 12, 2023

Overview

The full article can be found on LSEG’s Developer Portal where we discuss the workflow in detail.

One of the fundamental stages of a successful Artificial Intelligence (AI) solution is feature engineering. Its goal is to maximize the model's predictive power by transforming raw data into insight-rich, well-designed feature sets. One very common dimensionality reduction technique is Principal Component Analysis (PCA), which helps us uncover the underlying drivers hidden in our data by summarising huge feature sets into a few principal components.

This guide will use PCA as a practical tool for reducing the dimensionality of feature embeddings of news headlines derived from BERT-RNA, a financial language model created by LSEG Labs. We will show how PCA can impact the performance of the ML model. We will also discuss several approaches for selecting the optimal number of principal components.

Article in Brief

Describing the dataset

Here we describe the Financial PhraseBank dataset by Malo et al., which has been used in the article. The dataset consists of 4,845 news headlines carefully labeled by 16 experts and M.Sc. students with financial services backgrounds. We then used LSEG Labs' financial language modelling tools to obtain feature embeddings of those headlines.

Financial PhraseBank Dataset
Embeddings from LSEG BERT-RNA

Selecting the optimal number of principal components

Multiple approaches exist for selecting the optimal dimensionality, ranging from NLP heuristics based on the average or maximum length of the text input to statistical methods such as the explained variance of the principal components or the reconstruction Root Mean Square Error (RMSE).

In this section, we present several techniques:

a. Calculating the average sentence length and using that as the number of components for PCA.
b. Plotting the cumulative explained variance and taking the number of components that explains 95% of the variance.
c. Plotting the RMSE and selecting the number of components at the steepest decline in RMSE.
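Techniques (b) and (c) can be sketched with scikit-learn. The array below is a random stand-in for the actual BERT-RNA headline embeddings; the shape (1,000 headlines × 768 embedding dimensions) and the candidate component counts are assumptions for illustration only:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the BERT-RNA headline embeddings:
# 1,000 headlines, 768-dimensional embedding vectors.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 768))

# (b) Cumulative explained variance: smallest number of components
# whose cumulative explained variance ratio reaches 95%.
pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cum_var, 0.95)) + 1

# (c) Reconstruction RMSE for a few candidate component counts:
# project to n components, reconstruct, and measure the error.
rmse = {}
for n in (32, 128, 512):
    p = PCA(n_components=n).fit(X)
    X_hat = p.inverse_transform(p.transform(X))
    rmse[n] = float(np.sqrt(np.mean((X - X_hat) ** 2)))

print(n_95, rmse)
```

As a shortcut for (b), scikit-learn also accepts a float, e.g. `PCA(n_components=0.95)`, which selects the number of components explaining 95% of the variance directly.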

Impact of PCA on predictive power of the model and the training time

To get an initial measure of the impact of dimensionality reduction of news headline embeddings through PCA, we train multiple logistic regression models using different numbers of principal components of the feature space. Additionally, we track the training time for each model and plot it alongside the training and test accuracies. This allows us to better understand the impact of dimensionality reduction on the computational demands of the models.
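A minimal sketch of this experiment, using a synthetic labelled dataset in place of the actual headline embeddings and sentiment labels (the feature dimension, class count, and component grid are illustrative assumptions, not the article's settings):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 sentiment classes, 768-dimensional features.
X, y = make_classification(n_samples=1500, n_features=768, n_informative=64,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

results = {}
for n in (16, 64, 256):
    # Fit PCA on the training split only, then project both splits.
    pca = PCA(n_components=n).fit(X_train)
    X_tr, X_te = pca.transform(X_train), pca.transform(X_test)

    # Time only the classifier training step.
    start = time.perf_counter()
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
    elapsed = time.perf_counter() - start

    results[n] = {"train_acc": clf.score(X_tr, y_train),
                  "test_acc": clf.score(X_te, y_test),
                  "train_time_s": elapsed}

for n, r in results.items():
    print(n, r)
```

Fitting PCA on the training split alone, rather than on the full dataset, keeps the test accuracies free of information leakage.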

For interpretations of the results and codes, please visit the main article on LSEG’s Developer Portal.

Downloads

Related Blueprints
