Building your ML Models like a Pro with SAS Viya

Jose Tan
Published in Project Plutonium
8 min read · Apr 18, 2021

Discover how to leverage SAS Viya to build financial credit scoring models with ease (so you can focus your time on the analysis and save some wrinkles from forming on your forehead!)

Photo by Tim Gouw on Unsplash

If you are a budding data scientist like us, struggling to install multiple machine learning packages to fit your problem, you may have come to the right place! SAS Viya is an AI, analytics and data management platform that is changing the way we build our solutions in a modern and scalable architecture. This article serves as an introduction to the world of SAS Viya through the context of a credit scoring project.

Meet the Team

Our team members are Ngoh Wei Jie, Wes, Yohana Meiliana and Jose Tan. We are students from Singapore Management University, majoring in Business Analytics. In the module IS454 Applied Enterprise Analytics, we were introduced to SAS Viya and related tools that helped us understand the integrated process of managing analytical models for optimal performance over their lifespan. We were mentored by Prof Seema Chokshi and SAS Instructor Ms. Gemma, whose guidance and support throughout are deeply appreciated.

Our Project

By Wes Ngoh, Yohana Lee and Jose Tan

Project Plutonium — A proposed strategy on how we can reduce customer defaults and maximize interest earned from loans.

(PS: Plutonium is among the most dangerous elements on the periodic table, and to maximize profit we want to avoid engaging customers who are likely to default. Hence the project title, Project Plutonium.)

Problem Definition

Credit risk management has always been a primary focus for financial institutions. To help interpret a customer’s risk level, the institution assigns a credit rating that reflects the debtor’s creditworthiness.

Objective

Using SAS Viya, we aim to create a personalized and intelligible risk parameter model by deploying the following techniques:

  • Scoring customers based on their profiles with a common scorecard, built through Logistic Regression, to determine each customer’s creditworthiness.
  • Classifying customers based on various attributes by experimenting with and deploying multiple Tree-Based classifiers.
  • Predicting the probability of credit default by constructing a multilayer perceptron Neural Network.

The main business objectives for Project Plutonium are:

  1. Safeguard against credit risk
  2. Maximize interest earned from loans

Approach Description

In summary, our approach will take on the following tasks:

  1. Data Pre-processing: handling missing data, data transformation, level encoding and variable selection
  2. Fitting of Models: we chose the techniques mentioned above because they are used extensively in financial institutions
  3. Further Fine-tuning of Models: hyperparameter tuning using the auto-tuning function
  4. Evaluation of Models: the champion model is determined by the KS (Youden) index
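Since the KS (Youden) index drives the choice of champion model, it may help to see how such a statistic can be computed. The sketch below assumes the common definition, the maximum of (true positive rate minus false positive rate) over all score cutoffs; SAS Viya's exact computation may differ. The function names and the toy labels and scores are hypothetical and are not taken from our dataset.

```python
def ks_youden(y_true, y_score):
    """Return (ks, cutoff): the largest TPR - FPR over all score cutoffs.

    y_true holds 1 for the event (e.g. a bad customer) and 0 otherwise;
    y_score holds the predicted probability of the event.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best_ks, best_cutoff = 0.0, None
    for cutoff in sorted(set(y_score)):
        # Count events and non-events flagged at this cutoff.
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= cutoff and y == 0)
        ks = tp / positives - fp / negatives
        if ks > best_ks:
            best_ks, best_cutoff = ks, cutoff
    return best_ks, best_cutoff


def misclassification_rate(y_true, y_score, cutoff=0.5):
    """Share of records whose predicted class differs from the actual class."""
    wrong = sum(1 for y, s in zip(y_true, y_score) if (s >= cutoff) != (y == 1))
    return wrong / len(y_true)


# Hypothetical toy data, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
ks, cutoff = ks_youden(labels, scores)
print(f"KS (Youden) = {ks:.4f} at cutoff {cutoff}")
print(f"Misclassification rate = {misclassification_rate(labels, scores):.4f}")
```

A KS close to 1 means the model separates the two classes almost perfectly at some cutoff, while a value near 0 means the score distributions of the two classes overlap heavily.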

Data Exploration

Using SAS Visual Analytics, we explored the dataset and identified highly correlated columns as well as the relationships between our target variable (GB) and the other columns.

Correlation Matrix in SAS Visual Analytics
Automated Explanation Feature in SAS Visual Analytics

Data Analysis

Solution Pipeline

Our solution pipeline includes training Logistic Regression, Gradient Boosting, Decision Tree, Forest and Neural Network models.

Logistic Regression

We proceeded to build a logistic regression credit-scoring model using scorecards. There are primarily two basic functions of scorecards: Application and Behavioral. We used both functions to create a balanced scorecard that correctly classifies customers into bad and good credit customers.

To accomplish this in SAS Viya, Interactive Grouping and Scorecard nodes were created in the pipeline.

Ranking of variables based on Information Value

Interactive Grouping groups input variables into bins that are eventually used as input variables for predictive modeling. We have used the variables’ information value (IV) to rank their importance for variable selection.

Scorecard table

Weight of Evidence (WOE) is computed for each bin of the grouped variables and subsequently used to assign a score to each bin in the scorecard. The WOE is calculated from the proportions of good events and bad events at each group level.
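To make the two quantities concrete, here is a minimal sketch of how WOE and IV can be computed from per-bin counts. It assumes the common convention WOE = ln(share of goods / share of bads) and IV = sum over bins of (share of goods minus share of bads) times WOE; the sign convention used by the Interactive Grouping node may differ. The function name, the bin labels and the counts are hypothetical, and every bin is assumed to contain at least one good and one bad customer.

```python
import math


def woe_iv(bins):
    """Compute Weight of Evidence per bin and the variable's Information Value.

    bins maps a bin label to a (n_good, n_bad) tuple of customer counts.
    Returns ({label: woe}, iv).
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())

    woe, iv = {}, 0.0
    for label, (good, bad) in bins.items():
        dist_good = good / total_good  # share of all goods in this bin
        dist_bad = bad / total_bad     # share of all bads in this bin
        woe[label] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[label]
    return woe, iv


# Hypothetical counts for an AGE variable grouped into three bins.
age_bins = {"<25": (100, 50), "25-40": (300, 30), ">40": (400, 20)}
woe, iv = woe_iv(age_bins)
for label, value in woe.items():
    print(f"{label:>6}: WOE = {value:+.4f}")
print(f"Information Value = {iv:.4f}")
```

Under this convention, a negative WOE marks a bin with a higher-than-average share of bad customers, and a larger IV indicates a variable that separates good from bad customers more strongly.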

Using the interactive grouping and the newly WOE-transformed dataset, the model achieved a final KS (Youden) score of 0.2432 with a misclassification rate of 0.05.

Tree-Based Modelling

Tree-Based models are a popular choice in financial institutions because they require less feature engineering and are easy to interpret. The tree-based models built for this case study are Decision Tree, Forest and Gradient Boosting.

Results of Variable selection

Variable selection was used to reduce the number of inputs for modelling, given that we have 25 inputs in total. The node finds and selects the best variables for analysis using unsupervised and supervised methods. We selected the following criteria: Unsupervised selection and Fast Supervised selection. With the combination criterion of ‘Selected by at least 1’, a total of 11 inputs were selected.
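The ‘Selected by at least 1’ rule can be read as a vote count across the selection methods: a variable is kept if at least one method picked it. The sketch below illustrates that reading only and is not the node's actual implementation. The function name is hypothetical, and the assignment of variables to each method is made up for illustration.

```python
def combine_selections(selections, min_votes=1):
    """Keep variables chosen by at least `min_votes` selection methods.

    selections maps a method name to the set of variables it selected.
    """
    votes = {}
    for chosen in selections.values():
        for var in chosen:
            votes[var] = votes.get(var, 0) + 1
    return sorted(var for var, count in votes.items() if count >= min_votes)


# Hypothetical picks per method; not the actual output of our pipeline.
picks = {
    "unsupervised": {"AGE", "TMJOB1", "PRODUCT"},
    "fast_supervised": {"AGE", "PROF", "BUREAU"},
}
print(combine_selections(picks, min_votes=1))  # union of both methods
print(combine_selections(picks, min_votes=2))  # only variables both agree on
```

Raising `min_votes` makes the rule stricter, which trades a smaller input set against the risk of dropping a variable that only one method considers useful.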

Results of before and after Auto-tuning

All models were run with and without auto-tuning to find the best combination of hyperparameters. Based on the KS (Youden) index, the models performed better with auto-tuning.

Summary Results for Tree-Based Models with and without Auto-tuning

Forest performed the best among the tree-based models used for this project with a KS Youden of 0.8406 and a misclassification rate of 0.0467.

Neural Network

Lastly, our team sought to tap the advantages of neural networks, such as their computational power, to build a credit scoring model in SAS Viya.

Results of model initialization

Neural networks have many parameters to set up and initialize. Our Neural Network settings are based on the Best Practices and Other Tips documentation provided by SAS. Some settings were then tweaked to get the best results.

We have also used the limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm as our optimization method as it converges faster and is easier to use than Stochastic Gradient Descent (SGD).

Auto-tuning best configuration

By invoking the auto-tuning property, the application searches for the best combination of values in different properties such as the number of hidden layers, the number of hidden units for each hidden layer, the L1 regularization and L2 regularization parameters.

The number of neurons in the hidden layer is determined by Latin hypercube sampling. Our best model was a Multilayer Perceptron (MLP) with two layers, an Input Layer and an Output Layer. The Input Layer consists of 53 input variables, and the hidden layer has 0 neurons. The Output Layer consists of 2 neurons.
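For readers unfamiliar with Latin hypercube sampling, the idea is to split each hyperparameter's range into as many equal strata as there are candidate configurations and to draw exactly one value from each stratum, so the candidates cover the whole range evenly. The sketch below is a plain-Python illustration of that idea, not SAS Viya's auto-tuning implementation. The function name, the hyperparameter names and their ranges are all hypothetical.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples configurations with one value per stratum per dimension.

    bounds maps a hyperparameter name to a (low, high) range.
    """
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in bounds.items():
        # One random point inside each of n_samples equal-width strata.
        points = [
            low + (high - low) * (i + rng.random()) / n_samples
            for i in range(n_samples)
        ]
        # Shuffle so strata are paired randomly across dimensions.
        rng.shuffle(points)
        columns[name] = points
    return [{name: columns[name][i] for name in bounds} for i in range(n_samples)]


# Hypothetical search ranges, for illustration only.
search_space = {"hidden_units": (0, 100), "l1": (0.0, 0.01), "l2": (0.0, 0.01)}
for config in latin_hypercube(5, search_space):
    units = int(config["hidden_units"])  # unit counts must be whole numbers
    print(units, round(config["l1"], 5), round(config["l2"], 5))
```

Compared with purely random draws, this guarantees that no part of a hyperparameter's range is left unexplored, even with only a handful of candidate configurations.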

Finally… Results!

Result of models

The champion model for this project is Forest. It was chosen for having the highest KS (Youden) on the Validate partition (0.84). The Forest model correctly classified 95.33% of the validation data.

Most Important Variables for Champion Model

The five most important factors are AGE, TMJOB1, PRODUCT, PROF, BUREAU.

Our team was not surprised that Forest was our best-performing model. However, we did not expect the neural network to be the worst-performing one, given its capability to model highly complex problems. After further research, we understood that neural networks are most appropriate for pure prediction tasks. The goal of our case study, however, is to classify customers into Good or Bad categories to determine their creditworthiness.

Our Prototype

Overview of solution

Following our model building, our team created a web application prototype as a proof of concept to showcase the business value that the model can immediately provide to existing and potential clients. The web application helps clients quickly assess their credit risk exposure to bad customers and lets them generate a target prediction (Good/Bad) based on certain demographics.

The four main functionalities are: Upload File, Dashboard, PlutoAnalyze and PlutoPredict.

Upload file

Users can upload files (.csv) of customer records to identify customers with Good/Bad risk.

Dashboard displaying overview of dataset ingested

Upon uploading the file, the page will direct the user to a dashboard. The user can get an overview and gain an understanding of the dataset.

PlutoAnalyze

Users can then gain deeper insights about the statistical relationship between different attributes and the predicted value through various charts.

Models behind PlutoPredict

Lastly, users can predict a customer’s credit risk by changing various inputs such as Profession and Nationality. The prediction is based on the champion model built previously in our solution pipeline! Convenient, isn’t it?

Our SAS Viya Experience

Overall, our team had loads of fun building models, creating dashboards and exploring the features offered by SAS Viya and Visual Analytics. It was astonishing to see how automation ties with analytics to create powerful, predictive machine learning models in seconds.

Moving forward, we hope to sync the SAS Machine Learning pipeline directly with SAS Visual Analytics to replicate the models on both platforms. Additionally, our team would like to build a SAS-powered Web Application with end-to-end authentication to display the dashboard for higher management, and to automate model training for accurate real-time prediction. In this manner, with every new record uploaded into the application, the model will be retrained on the latest dataset to offer the best performance!

We hope that you have enjoyed reading our process on solving credit rating issues as much as we have enjoyed working on this project. We also wish that you now have a better idea of what you can achieve with SAS Viya.
Cheers!
