Data Engineering - Data Exploration for M&A predictive modeling

Published in LSEG Developer Community

4 min read, May 17, 2023

Overview

The full article, available on LSEG’s Developer Portal, discusses the Data Exploration phase of an AI pipeline. This phase lets us uncover insights from the data and identify areas or patterns worth exploring further. It also reveals problems the datasets might have, including dataset imbalance, multi-collinearity, missing values, and outliers.

This guide walks through the exploration of the M&A dataset we acquired during the Data Ingestion phase, described in our Blueprint Data Ingestion for M&A predictive modeling.

Article in Brief

Getting high-level information about the dataset

Here we describe the general structure of the dataframe using the info and describe methods in Pandas. These give us basic statistics and meta information such as the number of non-null values per feature, the dtype of each column, and memory usage.
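As a minimal sketch of this step, the snippet below calls info and describe on a small synthetic dataframe. The column names, the values, and the Label column are assumptions made for illustration; they are not the actual M&A dataset from the full article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the M&A dataset (hypothetical columns and values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Gross Profit Margin": rng.normal(35, 12, 200),
    "Revenue": rng.lognormal(6, 1, 200),
    "Label": rng.integers(0, 2, 200),  # assumed encoding: 1 = target, 0 = non-target
})

# Blank out 10% of one column so the non-null counts are informative.
missing_rows = df.sample(frac=0.1, random_state=0).index
df.loc[missing_rows, "Gross Profit Margin"] = np.nan

# Column dtypes, non-null counts per feature, and memory usage.
df.info()

# Count, mean, standard deviation, min, quartiles, and max per numeric column.
summary = df.describe()
print(summary)

# Missing values per column, a quick complement to info().
missing = df.isna().sum()
print(missing)
```

Comparing the count row of describe with the total number of rows is a quick way to spot features with missing values.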

Deep dive into individual feature distributions

To get more insight into the individual feature distributions, we generate plots that let us draw conclusions visually. Here, we plot the distribution of gross profit margin.

In practice, during the Data Engineering phase of building AI models, we would want to repeat this exercise for all of the features. We stop here, however, as the purpose of this guide is just to showcase the major tools and workflows.
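A distribution plot of this kind could be sketched with Matplotlib as follows. The data is synthetic, and the bin count and the median marker are illustrative choices rather than the settings used in the full article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic gross profit margin values (hypothetical, in percent).
rng = np.random.default_rng(1)
df = pd.DataFrame({"Gross Profit Margin": rng.normal(35, 12, 500)})

values = df["Gross Profit Margin"].dropna()

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(values, bins=30, edgecolor="black")
# Mark the median to make any skew easier to see.
ax.axvline(values.median(), color="red", linestyle="--", label="median")
ax.set_xlabel("Gross Profit Margin")
ax.set_ylabel("Number of companies")
ax.set_title("Distribution of gross profit margin")
ax.legend()
fig.tight_layout()

# A numeric companion to the plot: skewness near 0 suggests a symmetric distribution.
print("skewness:", values.skew())
```

Calling fig.savefig with a file name, or plt.show() in an interactive session, renders the figure.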

Make statistical inferences on the dataset

Among other statistical exploration tools, we need to check our features for multi-collinearity, which can make the coefficient estimates of regression models unstable. We can do that by looking at a correlation matrix and by conducting a Variance Inflation Factor (VIF) analysis, which measures the amount of multi-collinearity in a set of multiple regression variables.

According to the correlation heatmaps, quite a few features are correlated with each other. For example, Market Capitalization is highly correlated with Revenue, and EV to Sales with Price to Sales.
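The correlation check can be sketched as below. Instead of drawing a heatmap, this version lists the feature pairs whose absolute correlation exceeds a cutoff. The synthetic data is constructed so that two pairs are strongly correlated, and the 0.8 cutoff is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features (hypothetical values); two pairs are correlated by construction.
rng = np.random.default_rng(2)
n = 300
revenue = rng.lognormal(6, 1, n)
price_to_sales = rng.lognormal(0.5, 0.5, n)
df = pd.DataFrame({
    "Revenue": revenue,
    "Market Capitalization": 2 * revenue + rng.normal(0, 50, n),
    "Price to Sales": price_to_sales,
    "EV to Sales": 1.2 * price_to_sales + rng.normal(0, 0.1, n),
    "Gross Profit Margin": rng.normal(35, 12, n),
})

# Pairwise Pearson correlations between all features.
corr = df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Illustrative cutoff; NaN entries (if any) compare as False and drop out here.
high = pairs[pairs.abs() > 0.8].sort_values(ascending=False)
print(high)
```

The same corr matrix can be passed to a heatmap function for the visual version of this check.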

The VIF test can help us determine which one to drop. The higher a variable's VIF, the more of its variance is explained by the other variables, and so the more redundant it is. If we iteratively remove the variables with the highest VIF, the VIF values of the remaining variables decrease.
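The iterative removal can be sketched as follows. This version computes each VIF as the corresponding diagonal entry of the inverse correlation matrix, which equals 1 / (1 - R²) from regressing that feature on the others with an intercept. It is not necessarily how the full article computes VIF. The helper names compute_vif and drop_high_vif, the synthetic data, and the threshold of 5 (a common rule of thumb) are all assumptions.

```python
import numpy as np
import pandas as pd

def compute_vif(features: pd.DataFrame) -> pd.Series:
    """VIF per feature, taken from the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns, name="VIF")

def drop_high_vif(features: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Repeatedly drop the feature with the highest VIF until all are at or below threshold."""
    kept = features.copy()
    while kept.shape[1] > 1:
        vif = compute_vif(kept)
        if vif.max() <= threshold:
            break
        kept = kept.drop(columns=vif.idxmax())
    return kept

# Synthetic features (hypothetical values); two pairs are collinear by construction.
rng = np.random.default_rng(3)
n = 300
revenue = rng.lognormal(6, 1, n)
price_to_sales = rng.lognormal(0.5, 0.5, n)
df = pd.DataFrame({
    "Revenue": revenue,
    "Market Capitalization": 2 * revenue + rng.normal(0, 50, n),
    "Price to Sales": price_to_sales,
    "EV to Sales": 1.2 * price_to_sales + rng.normal(0, 0.1, n),
    "Gross Profit Margin": rng.normal(35, 12, n),
})

print(compute_vif(df))       # VIF before any removal
reduced = drop_high_vif(df)
print(compute_vif(reduced))  # VIF after removal
```

Recomputing VIF after every removal matters because dropping one member of a collinear pair lowers the VIF of the other.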

Another statistical inference that can be drawn from our dataset, and which is especially useful for AI models addressing classification problems, is the significance of the differences in feature means between the two classes. A significant difference could indicate potential feature importance for our AI model. A t-test is a standard tool for making statistical inferences about the difference in means between two groups.

For interpretations of the results and the code, please visit the main article on LSEG’s Developer Portal.
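A per-feature test of this kind can be sketched with scipy.stats.ttest_ind. The data is synthetic and the Label encoding is an assumption. Welch's variant (equal_var=False) is used here because it does not assume equal variances in the two classes; the full article may use a different variant.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic two-class data (hypothetical values).
rng = np.random.default_rng(4)
n = 400
label = rng.integers(0, 2, n)  # assumed encoding: 1 = target, 0 = non-target
df = pd.DataFrame({
    "Label": label,
    # Class means differ by construction for this feature...
    "Gross Profit Margin": rng.normal(35, 12, n) + 8 * label,
    # ...and are identical by construction for this one.
    "Price to Sales": rng.lognormal(0.5, 0.5, n),
})

rows = []
for col in ["Gross Profit Margin", "Price to Sales"]:
    target = df.loc[df["Label"] == 1, col].dropna()
    non_target = df.loc[df["Label"] == 0, col].dropna()
    # Welch's t-test on the difference in means between the two classes.
    t_stat, p_value = stats.ttest_ind(target, non_target, equal_var=False)
    rows.append({"feature": col, "t_stat": t_stat, "p_value": p_value})

results = pd.DataFrame(rows).set_index("feature")
print(results)
```

A small p-value suggests the feature's mean differs between the classes, which makes it a candidate for a useful predictor.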

Related Blueprints

The related article on LSEG’s Developer Portal focuses on the ingestion phase of an AI pipeline analysing corporate events. In particular, it discusses how to ingest M&A and Fundamental & Reference data for target and non-target companies via the Refinitiv Data Platform (RDP) API.