Credit Card Default Prediction with Logistic Regression

Intro: The goal is to predict the probability of credit default based on credit card owners’ characteristics and payment history.

About the data:

The dataset uses a binary variable, default on payment (Yes = 1, No = 0), in column 24 as the response variable. There are 23 features in this set:

  • 1 Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
  • 2 Gender (1 = male; 2 = female).
  • 3 Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
  • 4 Marital status (1 = married; 2 = single; 3 = others).
  • 5 Age (year).
  • 6 = the repayment status in September, 2005
  • 7:11 = the repayment status in August, 2005; . . .; 11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
  • 12 = amount of bill statement in September, 2005
  • 13:17 = amount of bill statement in August, 2005; . . .; 17 = amount of bill statement in April, 2005.
  • 18 = amount paid in September, 2005
  • 19 = amount paid in August, 2005
  • 20 = amount paid in July, 2005
  • 21 = amount paid in June, 2005
  • 22 = amount paid in May, 2005
  • 23 = amount paid in April, 2005


  1. Data cleaning: the dataset is very neat, so little modification was needed.
  2. EDA: several columns have very similar names, which suggests potential multicollinearity. I plotted the features with similar names against each other, and the plots showed strong correlations among them. Since the model I intended to use is a regression, that indicates feature selection is needed. Here is one plot I created using Seaborn:

3. Feature engineering: before fitting my model, there are two things I need to do: standardize my numerical features and create dummies for my categorical features. Here is my code for doing the standardization manually:

col_to_norm = ['limit_bal', 'age', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6',
'pay_amt1', 'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6']

df[col_to_norm]=df[col_to_norm].apply(lambda x: (x-np.mean(x))/np.std(x))
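The dummy-variable half of this step is not shown above. Here is a minimal sketch of how it might look with pandas. The column names sex, education and marriage are assumptions (chosen to match the lowercase style of col_to_norm), and the tiny frame is made up to stand in for the real data:

```python
import pandas as pd

# Toy stand-in for the real data; column names and values are assumptions.
toy = pd.DataFrame({
    "sex": [1, 2, 2, 1],
    "education": [1, 2, 3, 2],
    "marriage": [1, 2, 2, 3],
    "limit_bal": [20000, 120000, 90000, 50000],
})

categorical = ["sex", "education", "marriage"]

# One indicator column per category level. drop_first=True drops one level
# per feature so the dummies are not perfectly collinear with the intercept.
encoded = pd.get_dummies(toy, columns=categorical, drop_first=True)

print(encoded.columns.tolist())
```

Dropping the first level is a design choice for an unregularized regression; with a penalty such as L1 or L2, keeping all levels would also work.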

4. Now I’m ready to fit my models. I defined three functions: one applies grid search to optimize the model parameters and make predictions, one plots the confusion matrix, and one plots the ROC curve.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, roc_curve, auc

# X_train, X_test, y_train and y_test are assumed to be defined already

def gridsearch(model, params):
    # tune the parameters by cross-validated ROC AUC on the training set
    gs = GridSearchCV(model, params, scoring='roc_auc', n_jobs=-1)
    gs.fit(X_train, y_train)
    print('Best params: ', gs.best_params_)
    print('Best auc on training set: ', gs.best_score_)
    print('Best auc on test set: ', gs.score(X_test, y_test))
    return gs.predict(X_test), gs.decision_function(X_test)

def plot_confusion(prediction):
    # labels=[1, 0] puts the default class first, matching the row and column names
    conmat = np.array(confusion_matrix(y_test, prediction, labels=[1, 0]))
    confusion = pd.DataFrame(conmat, index=['default', 'not default'],
                             columns=['predicted default', 'predicted not default'])
    print(confusion)

def plot_roc(prob):
    y_score = prob
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    fpr[1], tpr[1], _ = roc_curve(y_test, y_score)
    roc_auc[1] = auc(fpr[1], tpr[1])

    plt.plot(fpr[1], tpr[1],
             label='Roc curve (area=%0.2f)' % roc_auc[1], linewidth=4)
    # diagonal reference line for a random classifier
    plt.plot([1, 0], [1, 0], 'k--', linewidth=4)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('false positive rate', fontsize=18)
    plt.ylabel('true positive rate', fontsize=18)
    plt.title('ROC curve for credit default', fontsize=18)
    plt.legend(loc='lower right')

Note that I optimize my models on the roc_auc score rather than the accuracy score. With this kind of unbalanced data (only about 20% of the observations default), it makes less sense to optimize accuracy: we care more about how well the models predict true positives (defaults) than negatives.
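To make the point concrete, here is a small pure-Python sketch on a made-up sample with 2 defaults out of 10 accounts. It compares accuracy, recall and ROC AUC for a "model" that never predicts default. The helper functions are illustrative and written from the textbook definitions, not taken from the code above:

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of true defaults (label 1) that were predicted as default."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_for_positives) / len(preds_for_positives)

def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels: 2 defaults out of 10 accounts (20%).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A "model" that never predicts default and gives everyone the same score.
always_no = [0] * 10
constant_scores = [0.0] * 10

print(accuracy(y_true, always_no))       # 0.8
print(recall(y_true, always_no))         # 0.0
print(roc_auc(y_true, constant_scores))  # 0.5
```

The useless model scores 80% accuracy while catching no defaults, whereas its AUC of 0.5 shows it is no better than chance.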

5. I tried two approaches to fitting logistic regression: stochastic gradient descent with a logistic loss function, and ordinary logistic regression with manual feature selection. I was curious how those two approaches would perform on the same dataset.

For stochastic gradient descent, I set the penalty to Lasso (L1) regularization in order to drop some of the features and reduce multicollinearity to some degree.
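As a rough illustration of how an L1 penalty drops features, the sketch below fits SGDClassifier on synthetic data (not the credit data) and counts how many coefficients end up exactly zero as alpha grows. The data set, the alpha values and the version check are all assumptions made for this example:

```python
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in: 10 features, of which 3 are informative and 4 redundant.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=3, n_redundant=4,
                                     random_state=0)

# The logistic loss is named 'log_loss' from scikit-learn 1.1 on, 'log' before.
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

zero_counts = {}
for alpha in [0.0001, 0.01, 0.1]:
    clf = SGDClassifier(loss=loss_name, penalty="l1", alpha=alpha,
                        learning_rate="optimal", random_state=0)
    clf.fit(X_demo, y_demo)
    # Coefficients that are exactly zero correspond to dropped features.
    n_zero = int(np.sum(clf.coef_ == 0))
    zero_counts[alpha] = n_zero
    print("alpha=%g: %d of %d coefficients are exactly zero"
          % (alpha, n_zero, clf.coef_.size))
```

Which features survive depends on alpha, which is one reason to let grid search pick it. The model actually used for the credit data is set up as follows.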

# loss='log' is the logistic loss; scikit-learn 1.1 and later call it 'log_loss'
sgd = SGDClassifier(loss='log', penalty='l1', learning_rate='optimal')

# use grid search to optimize parameters
sgd_params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0, 5.0], 'class_weight': [None, 'balanced']}

sgd_pred, sgd_prob = gridsearch(sgd, sgd_params)

Judging from the confusion matrix, the model is not very good at capturing true positives:

Following is the ROC curve of this model:

The next approach I tried was selecting features with recursive feature elimination and fitting an ordinary logistic regression. Note that I set the solver to liblinear, since it’s a relatively small dataset and a binary problem. This time I let grid search decide which kind of penalty to use, since I will do feature selection anyway.

lr = LogisticRegression(solver='liblinear')
lr_params = {'C': [0.001, 0.01, 0.1, 1, 10], 'class_weight': [None, 'balanced'], 'penalty': ['l1', 'l2']}

lr_pred, lr_prob = gridsearch(lr, lr_params)

After grid search, I reset the model parameters and fit RFECV:

lr = LogisticRegression(penalty='l2', C=0.1, solver='liblinear', class_weight='balanced')
rfecv = RFECV(estimator=lr, scoring='roc_auc')
model = rfecv.fit(X_train, y_train)
lr_pred = model.predict(X_test)
lr_prob = model.decision_function(X_test)
print('Test score: ', model.score(X_test, y_test))

Ideally, I should have done feature selection before model optimization, but then I would have had to figure out which features were selected and reset my X. I was too lazy to do that.
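For reference, finding out which features were selected can be fairly short. The following is a sketch on synthetic data, assuming X is a pandas DataFrame; the feature names f0 to f7 and the choice of cv=3 are made up for the example:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X_train / y_train; feature names are hypothetical.
X_arr, y_demo = make_classification(n_samples=300, n_features=8,
                                    n_informative=3, n_redundant=2,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f%d" % i for i in range(8)])

selector = RFECV(estimator=LogisticRegression(solver="liblinear"),
                 scoring="roc_auc", cv=3)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns.
kept = X_demo.columns[selector.support_].tolist()
print("kept features:", kept)

# Reduced feature matrix that a later grid search could be run on.
X_reduced = X_demo.loc[:, selector.support_]
```

With the reduced matrix in hand, the grid search could then be run on the selected columns only.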

The AUC is the same as the one from SGD, and the ROC curve is also almost identical. So let’s just check the confusion matrix:

Surprisingly, with almost the same ROC AUC score, this model is much better at capturing true positives, aka defaults. So if I were the bank and focused on risk control, I would use this model. But if I were more aggressive about expanding my business and OK with some of my clients defaulting, I would choose the first model.
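One possible reading of this result (an assumption, not something verified against the two models) is that both classifiers rank clients similarly but sit at different operating points on nearly the same ROC curve, for instance because class_weight='balanced' shifts the decision boundary. AUC depends only on the ranking of the scores, whereas the confusion matrix depends on the cut-off. A toy illustration with made-up labels and scores:

```python
def confusion(y_true, scores, threshold):
    """Return (tp, fn, fp, tn), predicting default when score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    return tp, fn, fp, tn

# Made-up data: 4 defaults, 6 non-defaults, one shared set of scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.3, 0.6, 0.35, 0.2, 0.2, 0.1, 0.05]

# Same scores (hence same ranking and same AUC), two different cut-offs.
strict = confusion(y_true, scores, 0.5)
lenient = confusion(y_true, scores, 0.25)

for name, (tp, fn, fp, tn) in [("strict", strict), ("lenient", lenient)]:
    print(name, "tp=%d fn=%d fp=%d tn=%d recall=%.2f"
          % (tp, fn, fp, tn, tp / (tp + fn)))
```

The lenient cut-off catches every default at the cost of more false alarms, which is the trade-off between the risk-control and business-expansion views described above.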


  1. The manual feature selection may be redundant, as I also applied regularization. Doing both at the same time might weaken the power of the model.
  2. There might still be multicollinearity in the models, as I did not check whether the correlated features were dropped. However, as long as my goal is only to make predictions, not to identify the significance of a single feature, it should not be a big issue.
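A check for leftover correlated features could look roughly like the sketch below. It uses made-up columns named after the real ones (two near-duplicate bill amounts and an unrelated age column) and an arbitrary cut-off of 0.9; on the real data, the columns retained by the model would take their place:

```python
import numpy as np

rng = np.random.RandomState(0)
base = rng.normal(size=200)

# Made-up features: two near-duplicates and one unrelated column.
features = {
    "bill_amt1": base,
    "bill_amt2": base + 0.1 * rng.normal(size=200),
    "age": rng.normal(size=200),
}
names = list(features)

# Pairwise Pearson correlations between the columns.
corr = np.corrcoef(np.column_stack([features[n] for n in names]), rowvar=False)

threshold = 0.9  # arbitrary cut-off for "strongly correlated"
flagged = [(names[i], names[j], round(float(corr[i, j]), 2))
           for i in range(len(names))
           for j in range(i + 1, len(names))
           if abs(corr[i, j]) > threshold]
print(flagged)
```

Any pair that shows up in the flagged list is a candidate for dropping one of the two features.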

Here is the complete code.
