In Part 1, I described the beginning of my Python journey. Today I’ll dive more into the details of the model I am creating. All private information has been removed from axes, tables, and descriptions.
I am working with nearly two years of customer data, limited at the front end by when we started to record customer credit data and at the back end by the time lapse necessary to use 12-month default as the response variable. For predictor variables, I have a combination of customer attributes, vehicle attributes, and deal structure, including fields with binary, numeric, and categorical values.
In Part 1, I discussed the first piece of data exploration: identifying the shape of the data. The distributions varied widely from field to field. Credit score has a near-normal distribution within the data:
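A quick way to check a field for near-normality before plotting is to compare its mean, median, and skewness. This is a minimal sketch on synthetic stand-in data (the column name `credit_score` and the distribution parameters are assumptions; the real values come from the private customer dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real customer data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"credit_score": rng.normal(620, 60, 5000).clip(300, 850)})

# A near-normal field has mean ~= median and skew ~= 0.
summary = {
    "mean": df["credit_score"].mean(),
    "median": df["credit_score"].median(),
    "skew": df["credit_score"].skew(),
}
print(summary)

# The histogram itself is one line with pandas:
# df["credit_score"].hist(bins=40)
```

For a heavily skewed field like income or price, the mean and median diverge and the skew moves well away from zero, which is a cue to consider a transform later in modeling.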
In other areas, there were clear effects of policy, like in down payment:
You can see clear spikes in down payment at regular intervals, the result of internal policy: a standard minimum down payment across most of the inventory, plus increased down payment requirements based on credit worthiness or vehicle price.
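Policy-driven spikes like these also show up without a plot: a handful of exact dollar amounts carry a disproportionate share of the rows. A sketch on synthetic data (the specific amounts and proportions are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical down payments: policy minimums at round amounts,
# plus a smaller population of off-policy values.
rng = np.random.default_rng(1)
policy = rng.choice([500, 1000, 1500, 2000], size=4000, p=[0.4, 0.3, 0.2, 0.1])
other = rng.integers(250, 750, size=1000)
down = pd.Series(np.concatenate([policy, other]))

# The most frequent exact amounts reveal the policy spikes.
top = down.value_counts(normalize=True).head(5)
print(top)
```

When a few exact values dominate like this, it is worth treating the field as partly categorical (on-policy vs. off-policy) rather than purely continuous.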
I have also spent time exploring the relationships between predictor variables and the response variable, default and repossession within 12 months. One of the most surprising examples was vehicle price. Here you can see the same histogram as earlier, with default rate overlaid:
Default rates appear to decline as vehicle price increases. Does having a nicer car make someone more likely to pay, despite a higher monthly payment, or is something else going on here? Here is the same chart, except broken up by our internal credit ratings (increasing in credit worthiness from top-left to bottom-right):
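The faceted chart above is essentially a two-way aggregation: default rate by price bin within each credit tier. Here is a hedged sketch of that aggregation on simulated data (the tier structure, price ranges, and default probabilities are all invented; only the technique of `pd.cut` plus `groupby` matches what the chart shows):

```python
import numpy as np
import pandas as pd

# Simulated deals: higher credit tiers buy pricier cars and default less.
rng = np.random.default_rng(2)
n = 10_000
tier = rng.integers(1, 5, n)  # 1 = lowest credit worthiness
price = rng.normal(8000 + 1000 * tier, 2000, n).clip(2000, None)
p_def = (0.25 - 0.04 * tier - 0.000005 * price).clip(0.01, 0.9)
default = rng.random(n) < p_def

df = pd.DataFrame({"tier": tier, "price": price, "repo_in_12": default})
df["price_bin"] = pd.cut(df["price"], bins=5)

# Default rate per price bin, faceted by credit tier.
rates = df.groupby(["tier", "price_bin"], observed=True)["repo_in_12"].mean()
print(rates)
```

Comparing rates across bins *within* a tier, rather than overall, is what separates a real price effect from the confounding effect of better customers buying more expensive cars.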
It looks like the relationship exists, especially at lower credit worthiness, but may be exaggerated by better customers defaulting at lower rates and purchasing more expensive cars. From what we know about our down payment policy, it is also possible car price has a neutral effect but increased down payment has a negative relationship with likelihood to default. We’ll need something more sophisticated to tease out the relationship between these variables and default rates.
I am using sklearn to create the prediction model. After importing the appropriate libraries, my next step is to create and assign the predictor and response variables using dmatrices from patsy and separate the training and testing data:
y, X = dmatrices('repo_in_12 ~ brand + poi + mo_pmt + amt_fin + call + auto_hist + emp_length + annual_inc + ltv + down + def_down + acv + vdc + trade_roll + pack + gross_profit + purch_type + credit_score', df, return_type='dataframe')
X_train = X.loc['train']
X_test = X.loc['test']
y_train = y.loc['train']
y_test = y.loc['test']
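Selecting rows with `.loc['train']` works because the dataframe's index holds duplicate `'train'`/`'test'` labels, which patsy preserves through `dmatrices`. The source doesn't show how those labels were assigned, so here is one hypothetical way to do it (the 80/20 random split rule is an assumption):

```python
import numpy as np
import pandas as pd

# Stand-in dataframe; the real df holds the customer deal data.
rng = np.random.default_rng(3)
df = pd.DataFrame({"credit_score": rng.normal(620, 60, 100),
                   "repo_in_12": rng.integers(0, 2, 100)})

# Label each row 'train' or 'test'; .loc on a duplicated label
# then returns every row carrying that label.
is_train = rng.random(len(df)) < 0.8
df.index = np.where(is_train, "train", "test")

X_train = df.loc["train"]
X_test = df.loc["test"]
print(len(X_train), len(X_test))
```

Labeling the index before calling `dmatrices` means the design matrices inherit the split for free, with no need to carry a separate mask around.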
I scaled the data using StandardScaler from sklearn and ran the model on the training data. The model assigned the following coefficients:
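The scale-then-fit step looks roughly like the following. This is a minimal sketch on synthetic stand-in data; the real `X_train`/`y_train` come from the dmatrices split above, and the choice of `LogisticRegression` as the classifier is my assumption since the post doesn't name the estimator:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 3 features, the first one carries the signal.
rng = np.random.default_rng(4)
X_train = rng.normal(size=(500, 3))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Fit the scaler on training data only, then transform.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

model = LogisticRegression()
model.fit(X_scaled, y_train)

# After scaling, coefficient magnitudes are comparable across features.
print(dict(zip(["f1", "f2", "f3"], model.coef_[0].round(3))))
```

Scaling matters here because it puts all predictors on a common footing, so a large coefficient genuinely signals a strong feature rather than a feature measured in small units.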
The next step will be to incorporate feature evaluation, selection, and combination to prune this list down to only the most significant and predictive features.