Application of SAS EM for Machine Learning: Firm Collapse Prediction

Shikhar Kanaskar
Shikhar’s Data Science Projects
7 min read · Jan 8, 2023

Firm collapse prediction has been a subject of interest for almost a century, and it still ranks among the most studied topics in economics. The aim of predicting financial distress is to develop a model that combines various econometric measures and allows one to foresee the financial condition of a firm. The purpose of bankruptcy prediction is to assess the financial condition of a company and its future prospects within the context of long-term operation on the market.

What is SAS EM?

SAS Enterprise Miner (EM) is a data mining and machine learning software package developed by SAS Institute. It is a powerful tool for data scientists and analysts to build predictive models and conduct data mining on large datasets. With SAS EM, you can import data from various sources, explore and visualize the data, build and compare different models, and deploy the best-performing models for scoring new data. SAS EM provides a wide range of algorithms and techniques for supervised learning (e.g., decision trees, logistic regression, neural networks) and unsupervised learning (e.g., clustering, association rules). It also has many built-in functions for data preprocessing, feature selection, and model evaluation, which make it an efficient and user-friendly platform for data science projects.

Is SAS Enterprise Miner outdated? Why is it important to know?

It is true that the field of data science and machine learning is constantly evolving, and newer software tools and platforms are being developed all the time. However, this does not necessarily mean that older tools such as SAS Enterprise Miner (EM) are completely outdated or no longer relevant. Here are a few reasons why it might still be important to know SAS EM:

  1. Widely used in industry: SAS EM is a well-established software package that is widely used in many industries, including finance, healthcare, and marketing. Knowing how to use SAS EM can make you a valuable asset to these organizations.
  2. Large user base: There is a large community of SAS EM users, and many resources are available online for learning and troubleshooting. This can make it easier to find help or guidance when working with SAS EM.
  3. Comprehensive feature set: SAS EM provides a wide range of tools and features for data preprocessing, modeling, evaluation, and deployment. It is a comprehensive platform that can handle many different types of data science projects.
  4. Scalability: SAS EM is designed to handle large datasets and can scale up to meet the demands of complex projects. This can be particularly useful when working with big data.

In summary, while it is always important to be aware of the latest developments in the field, SAS Enterprise Miner (EM) is still a valuable tool that is widely used and has a comprehensive feature set.

Implementation: Predicting Firm Collapse using various models and ensemble techniques

Data files can be found at https://www.kaggle.com/competitions/bankruptcy-classification-project/data (a short Python loading sketch follows the file list below):

  • bankruptcy_Train.csv — the training set with 64 predictors and 1 target variable
  • bankruptcy_Test_X.csv — the test set with ID and 64 predictors
  • bankruptcy_sample_submission.csv — the sample submission with ID and the predicted probability of firm bankruptcy
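As a quick orientation outside SAS EM, here is a minimal pandas sketch for loading these files. The target column name ("class") is an assumption; adjust it to the actual header in the training file.

```python
import pandas as pd

# Load the competition files; the target column name "class" is an
# assumption — check the actual header of bankruptcy_Train.csv.
train = pd.read_csv("bankruptcy_Train.csv")    # 64 predictors + 1 target
test_X = pd.read_csv("bankruptcy_Test_X.csv")  # ID + 64 predictors

print(train.shape, test_X.shape)
print(train["class"].value_counts(normalize=True))  # inspect class imbalance
```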

Exploratory Data Analysis

Filtering and replacement to improve the data

Importance of filtering:

  • Filtering the data to remove outliers
  • The method used is Rare Values (Percentage), i.e., if a value occurs in fewer than 0.01% of observations, it is removed
  • The default filtering method is Standard Deviation from the Mean

Importance of Replacement:

  • The limits of each attribute are set based on standard deviations from the mean
  • Replacement values are computed for observations outside those limits (a rough Python equivalent of both steps is sketched after this list)
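SAS EM handles these steps through its Filter and Replacement nodes. The sketch below is a rough pandas approximation of the same logic, not the SAS implementation: the 0.01% rare-value threshold comes from the settings above, while the three-standard-deviation limit is an illustrative assumption.

```python
import pandas as pd

def filter_rare_values(df: pd.DataFrame, col: str, min_freq: float = 0.0001) -> pd.DataFrame:
    """Drop rows whose value in `col` occurs in fewer than 0.01% of rows,
    mimicking the Rare Values (Percentage) filter described above."""
    freq = df[col].value_counts(normalize=True)
    rare = freq[freq < min_freq].index
    return df[~df[col].isin(rare)]

def replace_outliers(df: pd.DataFrame, col: str, n_std: float = 3.0) -> pd.DataFrame:
    """Clip values beyond mean +/- n_std standard deviations; the
    three-sigma limit here is an illustrative assumption."""
    mu, sigma = df[col].mean(), df[col].std()
    df[col] = df[col].clip(lower=mu - n_std * sigma, upper=mu + n_std * sigma)
    return df
```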

Initial Model

  • The data is divided into training and validation sets in a 70%:30% ratio (see the split sketch after this list)
  • In this step we select the appropriate models
  • In the next step, we fine-tune each model individually to bring out its best performance
  • In the last step, we ensemble them together
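SAS EM does this with its Data Partition node; a stratified scikit-learn equivalent might look like the following, reusing the `train` frame loaded earlier (the target column name is again an assumption).

```python
from sklearn.model_selection import train_test_split

# 70%/30% train-validation split, stratified on the imbalanced target;
# the column name "class" is an assumption.
X = train.drop(columns=["class"])
y = train["class"]

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)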

Ensembling the four models increases the validation ROC

Ensemble methods are machine learning techniques that combine the predictions of multiple models to make more accurate predictions. The idea behind ensemble methods is to create a strong learner by aggregating the predictions of multiple weak learners. A weak learner is a model that is slightly better than random guessing, while a strong learner is a model that performs significantly better than random guessing.

There are several ways to combine the predictions of multiple models, such as averaging, voting, or weighting. The specific method used depends on the type of ensemble and the goal of the modeling.
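To make the averaging idea concrete, here is a hedged scikit-learn sketch standing in for the SAS EM nodes used in this project: MLPClassifier doubles for both the Neural Network and HP Neural nodes, and the hyperparameters are illustrative rather than the tuned SAS settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Rough scikit-learn stand-ins for the four SAS EM model nodes.
models = [
    GradientBoostingClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
]

for m in models:
    m.fit(X_tr, y_tr)

# Combine by averaging the event probabilities — the same combination
# rule the Ensemble node applies.
val_probs = np.mean([m.predict_proba(X_val)[:, 1] for m in models], axis=0)
```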

The validation receiver operating characteristic (ROC) curve is a useful evaluation metric for imbalanced datasets because it does not depend on the class distribution. In an imbalanced dataset, the class with a minority of instances (the “minority class”) is often the one of interest, and it is important to have a metric that can accurately evaluate the performance of a model on this class.

The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. The TPR is the proportion of positive instances that are correctly classified as positive, while the FPR is the proportion of negative instances that are incorrectly classified as positive. The area under the ROC curve (AUC) is a measure of the overall performance of a classifier. A model with a high AUC achieves a high TPR while keeping the FPR low, i.e., it correctly classifies a high proportion of positive instances while misclassifying few negative ones.
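Computed on the ensembled validation probabilities from the earlier sketch, the curve and its area take only a few lines in scikit-learn:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# TPR and FPR at every threshold, plus the area under the curve,
# using the ensembled validation probabilities from the previous sketch.
fpr, tpr, thresholds = roc_curve(y_val, val_probs)
auc = roc_auc_score(y_val, val_probs)
print(f"Validation AUC: {auc:.3f}")
```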

One advantage of using the ROC curve for evaluation is that it is not affected by the class balance of the dataset. In an imbalanced dataset, a model that simply predicts the majority class all the time can achieve a high accuracy, but this does not necessarily mean that the model is good at predicting the minority class. The ROC curve and AUC provide a more nuanced evaluation of the model’s performance, taking into account the TPR and FPR for both classes.

In summary, the validation ROC curve and AUC are useful evaluation metrics for imbalanced datasets because they are not distorted by the class imbalance and provide a more nuanced evaluation of a model’s performance.

Key takeaways

→ Gradient Boost, Neural Network, Logistic Regression and HP Neural were each tuned individually to provide a decent ROC score on the validation data

→ When all four models are ensembled together, they provide a noticeably better validation ROC of 93.8%

→ The ensemble technique of averaging the event probabilities combines the weak learners and maximizes their predictive power

Final Model

The final model ensembles the previously created models.

Multi-level ensemble to improve accuracy
  • Creating 5 ensembles using the previously created models
  • The important change is that the seed for every model is varied, so that overfitting on the training data can be avoided
  • This method should be effective on test data, as it combines several weak and strong learners and maximizes their predictive power (a sketch of the seed-varying idea follows this list)
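A rough Python analogue of this multi-level, seed-varying ensemble is below; the seed values and member models are hypothetical choices, not the SAS EM configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Train the same member models under several seeds and average everything —
# a rough analogue of the multi-level SAS EM ensemble described above.
seeds = [11, 23, 37, 51, 73]  # five ensembles, matching the count above
all_probs = []
for seed in seeds:
    members = [
        GradientBoostingClassifier(random_state=seed),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed),
    ]
    for m in members:
        m.fit(X_tr, y_tr)
    all_probs.append(np.mean([m.predict_proba(X_val)[:, 1] for m in members], axis=0))

final_probs = np.mean(all_probs, axis=0)
```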

Assessing the final model

The validation ROC is 95.2%, based on an ensemble of several learners trained with different seeds so as to avoid overfitting.

  • We can see from the public and private leaderboards that this model provided a better ROC because it was not overfit on the training data

Learnings:

→ In the case of imbalanced data, we need to look at the ROC as the performance criterion

→ Randomizing seeds across ensemble members improves performance on test data

→ Neural networks and gradient boosting are two strong model families for cases where the data has interactions and class imbalance

→ Greedy algorithms such as single decision trees did not perform well in such cases

Final Rank on Private Leaderboard

The rank on the public leaderboard was 13th, and the final rank on the private leaderboard was 3rd among 45 teams from the Purdue University MS BAIM program.
