Gone Modelin’

In Part 1, I described the beginning of my Python journey. Today I’ll dive more into the details of the model I am creating. All private information has been removed from axes, tables, and descriptions.

The Data

I am working with nearly two years of customer data, limited at the front end by when we started to record customer credit data and at the back end by the time lapse necessary to use 12-month default as the response variable. For predictor variables, I have a combination of customer attributes, vehicle attributes, and deal structure, including fields with binary, numeric, and categorical values.

The Exploration

In Part 1, I discussed the first piece of data exploration: identifying the shape of the data. The distributions varied widely from field to field. Credit score is close to normally distributed within the data:

[Figure: customer credit scores]
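For readers following along, a distribution plot like this is a one-liner with matplotlib. The data below is synthetic, since the real scores are private; the parameters are just an illustrative near-normal shape:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the private customer credit scores:
# a roughly normal distribution, clipped to the standard score range.
rng = np.random.default_rng(0)
credit_score = rng.normal(loc=620, scale=60, size=5000).clip(300, 850)

plt.hist(credit_score, bins=40, edgecolor="black")
plt.xlabel("Credit score")
plt.ylabel("Number of customers")
plt.title("Distribution of customer credit scores")
plt.show()
```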

In other areas, there were clear effects of policy, like in down payment:

[Figure: customer down payment]

You can see clear spikes in down payment at regular intervals, the result of internal policy: a standard minimum across most of the inventory, plus increased down payments based on credit worthiness or vehicle price.
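Spikes like these are easy to confirm numerically by counting the most common amounts. A minimal sketch, with made-up values (the `down` column name is borrowed from the model formula later in the post):

```python
import pandas as pd

# Hypothetical down payments; the real data is private. The round
# numbers mimic the policy-driven spikes visible in the histogram.
df = pd.DataFrame({"down": [500, 500, 1000, 500, 1500,
                            1000, 500, 2000, 1000, 500]})

# The most frequent amounts reveal the policy minimums directly.
top_amounts = df["down"].value_counts().head(5)
print(top_amounts)
```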

I have also spent time exploring the relationships between predictor variables and the response variable, default and repossession within 12 months. One of the most surprising examples was vehicle price. Here you can see the same histogram as earlier, with default rate overlaid:

Default rates appear to decline as vehicle price increases. Does having a nicer car make someone more likely to pay, despite the higher monthly payment, or is something else going on? Here is the same chart, broken out by our internal credit ratings (increasing in credit worthiness from top-left to bottom-right):

[Figure: vehicle prices by credit worthiness]

It looks like the relationship exists, especially at lower credit worthiness, but it may be exaggerated by better customers both defaulting at lower rates and purchasing more expensive cars. Given our down payment policy, it is also possible that car price has a neutral effect and that the larger down payments on pricier cars drive the lower default rates. We'll need something more sophisticated to tease apart the relationships between these variables and default rates.
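Before reaching for a model, one way to quantify the eyeballed pattern is to compute the default rate within price bins for each credit tier. A sketch on synthetic data: `repo_in_12` matches the response variable used later in the post, while `vehicle_price` and `credit_tier` are illustrative names, not the real columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the real customer data is private.
rng = np.random.default_rng(1)
n = 2000
vehicle_price = rng.uniform(4000, 20000, n)
credit_tier = rng.integers(1, 5, n)  # 1 = lowest credit worthiness
# Default probability falls with price and with better credit,
# mirroring the pattern described in the charts.
p_default = 0.35 - 0.00001 * vehicle_price - 0.04 * credit_tier
repo_in_12 = rng.random(n) < p_default

df = pd.DataFrame({"vehicle_price": vehicle_price,
                   "credit_tier": credit_tier,
                   "repo_in_12": repo_in_12})

# Default rate within each price bin, split by internal credit rating.
df["price_bin"] = pd.cut(df["vehicle_price"], bins=5)
rates = df.groupby(["credit_tier", "price_bin"],
                   observed=True)["repo_in_12"].mean()
print(rates)
```

A table like this makes it clear whether the price effect survives within each credit tier, or vanishes once credit worthiness is held fixed.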

The Model

I am using sklearn to create the prediction model. After importing the appropriate libraries, my next step is to build the response and predictor matrices with dmatrices from patsy and split the data into training and testing sets:

from patsy import dmatrices

# Build the response (y) and predictor (X) matrices from the model formula.
y, X = dmatrices(
    'repo_in_12 ~ brand + poi + mo_pmt + amt_fin + call + auto_hist'
    ' + emp_length + annual_inc + ltv + down + def_down + acv + vdc'
    ' + trade_roll + pack + gross_profit + purch_type + credit_score',
    df, return_type='dataframe')

# df carries a 'train'/'test' index label, which dmatrices preserves,
# so .loc splits both matrices consistently.
X_train = X.loc['train']
X_test = X.loc['test']
y_train = y.loc['train']
y_test = y.loc['test']

I scaled the data using StandardScaler from sklearn and fit the model on the training data. The model assigned the following coefficients:

[Figure: feature coefficients]
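The post doesn't name the estimator, but since the fitted model reports per-feature coefficients, here is a sketch of the scale-then-fit step assuming a logistic regression. The data and feature names are synthetic placeholders, not the real model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic training data; the real features are private. Logistic
# regression is assumed here because the model reports coefficients.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 3))       # e.g. credit_score, down, mo_pmt
y_train = (X_train[:, 0] + rng.normal(size=500)) > 0

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # zero mean, unit variance per feature

model = LogisticRegression()
model.fit(X_scaled, y_train)

# One coefficient per feature; scaling first makes magnitudes comparable.
for name, coef in zip(["credit_score", "down", "mo_pmt"], model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

Because every feature is standardized, the coefficient magnitudes can be compared directly as a rough measure of influence.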

The next step will be to incorporate feature evaluation, selection, and combination to prune this list down to only the most significant and predictive features.
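One common pruning approach, sketched here as an assumption rather than the author's stated plan, is L1-regularized selection with sklearn's SelectFromModel: the L1 penalty drives uninformative coefficients to zero, and the selector keeps whatever survives.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 candidate features, only the first two matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 8))
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=600)) > 0

# L1 regularization zeroes out uninformative coefficients;
# SelectFromModel retains only the features with non-zero weights.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the retained features
```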
