Predicting Loan Grades with a Neural Network: A Machine Learning Pipeline on AWS

James Andersen
Published in Universal Mind · Sep 8, 2017 · 10 min read

In a previous article we looked at predicting interest rates and loan grades using the managed AWS Machine Learning service. While AWS Machine Learning offers a convenient way to build and use regression models without building an underlying processing pipeline, sometimes more customization or control is needed. For example, you may want to use a more complex deep learning model that performs better on your dataset. Here's a high-level look at the flow we'll explore in this article.

As with the previous article, relevant source code is available in a GitHub repo.

Working on a Model Locally

An inevitable first step in a machine learning pipeline is getting familiar with the data. This will likely involve data cleaning and some exploratory data analysis. Jupyter notebooks have become a common way to share the results of this analysis, as they allow step-by-step explanation of interesting findings in a dataset alongside the output of the code written to analyze or process the data. Check out a sample Jupyter notebook for our loan data set here.
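
As an illustration, a first pass at the data might look something like the sketch below. The file name and the "grade" column are placeholders, not specifics from the repo.

import pandas as pd

# Load the raw loan data; the file name here is a placeholder
loans = pd.read_csv("loans.csv")

# Get a feel for the size and types of the data
print(loans.shape)
print(loans.dtypes)

# Summary statistics for numeric columns and a count of missing values per column
print(loans.describe())
print(loans.isnull().sum())

# Distribution of the target we want to predict
print(loans["grade"].value_counts())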

Data Preparation

As we’re no longer using the AWS Machine Learning service, we have a little extra work to do to prepare our data to be fed into the model:

  • One-hot Encoding — We’re trying to predict a non-numeric loan grade category: “A” through “G”. Neural networks don’t do letters, only numbers. So rather than give the network a letter to predict, we transform each of the possible category values into an “on/off” switch represented by either a zero or a one. For any single loan record only one value will be “on”.
  • Feature Scaling — This is a transformation of the continuous numeric features of our data so they’re all on the same scale. It was previously performed by the normalize() function in the AWS Machine Learning service. Scaling prevents features measured on a larger scale, e.g. a credit limit measured in thousands of dollars, from outweighing features on a smaller scale, like the number of open accounts. A sketch of both steps follows this list.
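
Here’s roughly how those two steps might look with pandas and scikit-learn; the column names are illustrative placeholders rather than the exact fields in the repo.

import pandas as pd
from sklearn.preprocessing import StandardScaler

loans = pd.read_csv("loans.csv")  # placeholder file name, as in the earlier sketch

# One-hot encode the target: "A" through "G" become 7 columns of 0/1 values,
# exactly one of which is 1 ("on") for any single loan record
grade_onehot = pd.get_dummies(loans["grade"], prefix="grade")

# Scale the continuous numeric features so they share a common scale
numeric_cols = ["credit_limit", "annual_income", "open_accounts"]  # placeholder column names
scaler = StandardScaler()
scaled_features = scaler.fit_transform(loans[numeric_cols])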

Creating a Model

As we get more familiar with the data we’ll start composing a model to infer/predict useful information from it; in our case we’re going to revisit predicting a loan letter grade on the scale A through G. We’ll use Keras, a Python module which acts as a higher level API over some of the other popular deep learning libraries (Tensorflow, CNTK and Theano), to build this model. Unlike the linear models the AWS Machine Learning service uses, we’ll create a neural network that accepts our input data, passes it through hidden layers and outputs a final layer of 7 values corresponding to the likelihood that the input represents a loan in each of the possible loan grade categories A through G. Thanks to the non-linear activation functions between nodes in the network, this model can discover more complex relationships in the data we’re looking at and, as we’ll see, can deliver more accurate predictions.

Keras Neural Network Model for Lending Club Data
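
The full model definition is in the repo; as a rough sketch of that kind of structure, a network like this can be built with the Keras Sequential API. The layer sizes and dropout rate below are illustrative, not the tuned values.

from keras.models import Sequential
from keras.layers import Dense, Dropout

num_features = 50  # placeholder: the number of input columns after encoding and scaling

model = Sequential()
model.add(Dense(64, activation="relu", input_dim=num_features))  # first hidden layer
model.add(Dropout(0.2))                                          # regularization
model.add(Dense(32, activation="relu"))                          # second hidden layer
model.add(Dense(7, activation="softmax"))                        # one output per loan grade A-G

# Categorical cross-entropy pairs with the softmax output layer;
# accuracy is tracked as a metric during training
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])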

I used Keras and Jupyter from a local docker container (gw000/keras-full) with a volume mapped to a local directory containing the loan data. To avoid lengthy processing of the entire dataset on my local machine, I sampled a subset of 50K records from the full 400K+ dataset. This allowed me to get a rough sense of how adjusting various hyperparameters (e.g. number of hidden layers, number of hidden nodes in each layer, batch size, number of epochs, etc.) was impacting the model’s performance without the delay of crunching the entire dataset each time.

While optimizing ML model hyperparameters is well outside the scope of this article, graphing the accuracy and loss of the model on the training set alongside a held-out validation set helps us see both how well the model is performing overall and how much value there is in each successive training iteration: as the accuracy and loss start to level out, each additional epoch yields a smaller improvement. The history object returned by Keras’ model.fit() method provides the data we need to generate these graphs:
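
Continuing the sketch above, something along these lines will produce the training curves; X_train and y_train stand in for the prepared feature matrix and one-hot encoded labels, and newer Keras versions use the history keys "accuracy"/"val_accuracy" instead of "acc"/"val_acc".

import matplotlib.pyplot as plt

# Train while holding out 20% of the data for validation
history = model.fit(X_train, y_train,
                    validation_split=0.2,
                    epochs=45,
                    batch_size=128)

# Accuracy on the training set vs. the validation set, per epoch
plt.plot(history.history["acc"], label="train")
plt.plot(history.history["val_acc"], label="validation")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

# Loss curves follow the same pattern
plt.plot(history.history["loss"], label="train")
plt.plot(history.history["val_loss"], label="validation")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()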

Accuracy and loss graphs of the training and validation set over 45 epochs

GPU Training on a Shoestring

With some data analysis done and a model structure that seems suitable based on our local testing, it’s time to train the model with our full dataset. Training a neural network involves a lot of mathematical computation for which GPUs are well suited. However, high-powered GPUs aren’t cheap; renting on-demand GPU instances on which to train our model with a large dataset is a cost-effective approach because we’ll only pay for the time we need and won’t incur the cost of purchasing GPU hardware.

However, if we can be somewhat flexible in when our training takes place we can be even more cost efficient by training on a “spot” GPU instance. AWS offers their unused EC2 compute capacity at a rate that is typically much lower than the on-demand pricing; for the p2.xlarge instance it’s generally less than ⅓ the on-demand price (see chart).

Spot instance pricing for p2.xlarge instance in us-west-2 region

To use spot instances, you specify a bid amount and the instance types you want to use. You are charged the spot rate for your instances as long as the rate remains below your bid amount. If/when the rate exceeds your bid amount, AWS will automatically terminate your instances. If our model training gets interrupted we can either restart it later or write logic to resume from where we left off. For our loan grade demo project, we’ll keep it simple and assume that we’ll restart the training if we get interrupted; this loan data doesn’t actually take all that long to train.

So we’ve got cheap access to a GPU instance… What machine image can we use to do our training? Fortunately, Amazon provides the Amazon Deep Learning AMI, a public virtual machine image that packages several of the most popular open source deep learning frameworks including Keras and Tensorflow, along with GPU drivers compatible with Amazon’s P2 GPU instance types. This is a useful AMI at a great price: free.

Using spot instances is a little more work than simply launching a new EC2 instance but we can use a CloudFormation template to automate the process for us. The template will take in a few parameters and then provision a GPU instance and automate the training of our model. Here are the inputs and a quick description of what each controls in the creation of our model training stack:

  • InstanceType — Which variant of the GPU family do you want to run your workload on?
  • SpotBidPrice — How much are you willing to spend per hour for your instance? The instance type and spot bid price will be used in creating a “Spot Fleet Request” which will launch your instance (provided your bid is above the going “market” rate for the instance type)
  • KeyPairName — Key pair used to access the instance (if needed)
  • SourceCidr — What IP address range will be allowed to access the instance? Your launched instances will be added to a new security group which allows SSH and port 8888 (Jupyter default port) access only to the IP address range you specify. SSH access will require the use of the KeyPair specified by KeyPairName.
  • GitRepo — What Git repo contains your model logic/code?
  • GitBranch — What branch to pull
  • RunScript — A repository relative path to a shell script that will kick off your training logic. Once your instance is launched, a user data initialization script will handle some setup to publish logs from your instance to CloudWatch. It will also pull down code from a public facing Git repo and branch and execute a shell script to kick off your model training logic. This shell script will NOT be run as root.
  • OutputBucket — An S3 bucket that your instance will be given access to read from and write to. Your RunScript will need to pull in data from some location; this approach assumes it will be available in an S3 bucket. The AWS CLI will be available to pull down the data and later to copy the trained model and any other training artifacts back into the bucket.
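
If you’d rather launch the stack from code than from the console, the same parameters can be passed with boto3. This is a sketch only; the stack name, template URL and parameter values are placeholders for your own.

import boto3

cloudformation = boto3.client("cloudformation", region_name="us-west-2")

# Parameter names match the template inputs described above; values are placeholders
response = cloudformation.create_stack(
    StackName="loan-grade-training",
    TemplateURL="https://s3.amazonaws.com/your-bucket/training-template.yaml",
    Capabilities=["CAPABILITY_IAM"],  # likely required since the stack creates IAM roles
    Parameters=[
        {"ParameterKey": "InstanceType", "ParameterValue": "p2.xlarge"},
        {"ParameterKey": "SpotBidPrice", "ParameterValue": "0.30"},
        {"ParameterKey": "KeyPairName", "ParameterValue": "my-key-pair"},
        {"ParameterKey": "SourceCidr", "ParameterValue": "203.0.113.0/24"},
        {"ParameterKey": "GitRepo", "ParameterValue": "https://github.com/your-org/your-model-repo.git"},
        {"ParameterKey": "GitBranch", "ParameterValue": "master"},
        {"ParameterKey": "RunScript", "ParameterValue": "train/run-training.sh"},
        {"ParameterKey": "OutputBucket", "ParameterValue": "your-training-bucket"},
    ],
)
print(response["StackId"])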

To summarize, the CloudFormation template is going to create a GPU spot instance inside a VPC, pull down a Git repo and then execute a shell script from the repo for us. That script will perform the training and then save any results out to an S3 bucket for later use. At a minimum, in order to use the model subsequently for making predictions, we’ll want to save it to an HDF5 file with model.save().
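
At the end of the training script that might look roughly like the following sketch; the bucket and key names are placeholders, and the upload could just as easily be done with the AWS CLI from the shell script itself.

import boto3

# Persist the trained model (architecture, weights and optimizer state) to HDF5
model.save("loan-grade-model.h5")

# Copy the artifact to the OutputBucket so it survives instance termination
s3 = boto3.client("s3")
s3.upload_file("loan-grade-model.h5", "your-training-bucket", "models/loan-grade-model.h5")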

If you happen to be following along in your own account, don’t forget to shut down the CloudFormation stack when training is complete to avoid ongoing charges. This can be automated but that’s a little outside the scope of this article.

How’d We Do?

When running on the entire dataset, the Keras model was able to achieve an F1 score of 0.9435, a significant improvement over the 0.51 score of the AWS Machine Learning service model. Here are the confusion matrices from the AWS Machine Learning service and the neural network side by side:

Comparison of the confusion matrices for AWS Machine Learning and a fairly simple neural network. The neural network model has a much higher F1 score on the same dataset.

Using Our Trained Model

At this point we’ve trained model parameters with our large dataset and saved our work (the trained model) out to S3. Taking a step back, we still haven’t improved any business process or customer experience one iota. We need to deploy the model into our systems and start using it to make loan grade predictions.

There are several options here:

  • Use the model from an internally hosted/deployed application
  • Deploy the model to a fleet of EC2 instances
  • Build a docker image with our saved model and add it to an ECS cluster
  • Create a serverless Lambda function to perform predictions using our model
  • etc.

Your own context will drive the right option; for this article we’ll look at the Lambda option. As a serverless compute option, Lambda not only scales seamlessly but it can also be triggered by a variety of sources in an AWS infrastructure. For example:

  • When combined with an API Gateway trigger the Lambda becomes an HTTP microservice
  • When using the AWS SDK, the Lambda function can be invoked directly from other application logic (see the sketch after this list)
  • The Lambda function could be invoked by DynamoDB streams to update new records as they are inserted into a table
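
As an example of that second option, invoking the function directly with the AWS SDK for Python might look roughly like this; the function name and payload fields are placeholders, since the real payload shape is whatever your handler expects.

import json
import boto3

lambda_client = boto3.client("lambda")

# Illustrative payload; the real fields depend on the features your model expects
payload = {"features": [10000, 65000, 7]}

response = lambda_client.invoke(
    FunctionName="loan-grade-predictor",  # placeholder function name
    Payload=json.dumps(payload).encode("utf-8"),
)
prediction = json.loads(response["Payload"].read())
print(prediction)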

Building a Lambda Deployment Package

However, we do have a bit of a challenge with Lambda; the maximum package size for Lambda is 50 MB compressed. If you follow the instructions for creating a Python deployment package and install Keras and Tensorflow with pip, you’ll find the resultant package is quite a bit larger than 50 MB compressed; well outside the Lambda package size limit.

To get around this we have to do some surgical removal of anything in the package that is nonessential at runtime. Have a look at build-lambda-pkg.sh which scripts the process of collecting Keras and a minimal set of runtime dependencies for inclusion in a Python Lambda deployment package. Getting under the 50 MB limit was a somewhat painful process of trial and error. However, we can use the handy lambci/lambda docker images, which simulate an actual Lambda runtime, to quickly collect the runtime files we need. Try launching the docker container from the directory where build-lambda-pkg.sh resides:

docker run -it -v `pwd`:/tmp/deploy/ lambci/lambda:build-python3.6 bash /tmp/deploy/build-lambda-pkg.sh

It will start a bash shell in a container and run the script, which installs Keras, Tensorflow, etc., and then strips out any unnecessary bytes before creating a keras-tf-runtime.zip file with the runtime needed for deployment. Where possible, the Python source is stripped out and compiled Python bytecode (.pyc) is left instead to reduce the initial Lambda warm-up time.

Deploying to Lambda

With our keras-tf-runtime.zip file created we’re almost done. We now just need to add in the actual Lambda function code that will grab our model from S3, load it and generate loan grade predictions from it. The handler looks like this:
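
The actual handler.py is in the repo; in outline it does roughly the following. The bucket and key names and the event format are placeholders here, and the model file is cached under /tmp (the only writable path in Lambda) so warm invocations skip the download.

import json
import os

import boto3
import numpy as np
from keras.models import load_model

MODEL_BUCKET = "your-training-bucket"      # placeholder
MODEL_KEY = "models/loan-grade-model.h5"   # placeholder
MODEL_PATH = "/tmp/loan-grade-model.h5"

GRADES = ["A", "B", "C", "D", "E", "F", "G"]

model = None  # cached across warm invocations

def handler(event, context):
    global model
    if model is None:
        # Download and load the model only on a cold start
        if not os.path.exists(MODEL_PATH):
            boto3.client("s3").download_file(MODEL_BUCKET, MODEL_KEY, MODEL_PATH)
        model = load_model(MODEL_PATH)

    # Assumes the caller sends features preprocessed the same way as the training data
    features = np.array([event["features"]])
    probabilities = model.predict(features)[0]
    grade = GRADES[int(np.argmax(probabilities))]

    return {"statusCode": 200,
            "body": json.dumps({"grade": grade,
                                "probabilities": probabilities.tolist()})}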

Check out deploy-lambda.sh and lambda.template, which automate a test of the Lambda by adding handler.py to our runtime zip file, uploading the zip file to S3 and then generating an API Gateway endpoint which triggers the Lambda function. Finally, deploy-lambda.sh uses curl to call the API Gateway endpoint and generate predictions.

Parting Thoughts

A machine learning pipeline is not a trivial undertaking. Getting a well-trained model may take many train-test iterations and there is still the question of how to incorporate new data into your pipeline. On the deployment side, the Lambda approach demonstrated here does have a challenge with a somewhat slow “cold start” response time of several seconds. You can mitigate this to some extent by dialing up the memory size (which also dials up the CPU capacity), but subsequent requests to a “warm” instance perform quite quickly.

The goal here is to provide an outline and some useful code/templates as a starting place for a custom machine learning model that can be deployed on the AWS cloud and incorporated into your real-world applications.
