‘Give me some credit’ — machine learning approach

There are many supervised algorithms that can be used to train a classifier. The problem is that it is hard to say which one will be the best fit for your data: there is no golden algorithm that always wins against the others. To train a good predictive model you need to check as many algorithms as possible and select the one that fits your data best (based on the selected metric and validation scheme). In this tutorial I will show you how to easily check many algorithms on a credit scoring task.

Get data!

The data I will use is from a past Kaggle competition (link for data). We will download the training dataset (cs-training.csv file), which will be used for model training, and the test data (cs-test.csv file), which we will use to compute predictions and submit to Kaggle.

1. First, let’s create a project with the Binary Classification task.
Start new project at mljar.com

2. Upload a training dataset (cs-training.csv file).

Data upload in mljar.com service

3. Select the target column. In this analysis SeriousDlqin2yrs will be our target variable, so we will train a model to predict it. Please remember to set the first column, Unnamed: 0, as Don’t use it: it is an id column and is not needed for model training. After selecting the usage for each column, please accept the column usage (green button at the top).
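For readers who want to see what this column selection amounts to outside the web interface, here is a minimal sketch using only the Python standard library. The two data rows are made up for illustration; it assumes the target column is spelled SeriousDlqin2yrs, as in the Kaggle file, and that the first column is the unnamed id column. The function name split_columns is hypothetical.

```python
import csv
import io

# Made-up two-row sample that mimics the layout of cs-training.csv:
# an unnamed id column first, then the target, then feature columns.
SAMPLE = """,SeriousDlqin2yrs,age,MonthlyIncome
1,1,45,9120
2,0,40,NA
"""

def split_columns(csv_text, target="SeriousDlqin2yrs"):
    """Drop the first (id) column and separate the target from the features."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    target_idx = header.index(target)
    # Column 0 is the id column; every other non-target column is a feature.
    feature_idx = [i for i in range(1, len(header)) if i != target_idx]
    feature_names = [header[i] for i in feature_idx]
    X, y = [], []
    for row in reader:
        X.append([row[i] for i in feature_idx])
        y.append(int(row[target_idx]))
    return feature_names, X, y

names, X, y = split_columns(SAMPLE)
print(names)  # ['age', 'MonthlyIncome']
print(y)      # [1, 0]
```

The feature values are left as strings here (including the NA marker), since handling missing values is a separate step.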

4. We are now ready to run a Machine Learning Experiment! Please go to experiments and click Add new experiment. We will use 10-fold CV with shuffle and stratification for model validation (the classes in the dataset are unbalanced, so stratification keeps the class ratio similar in every fold). There are missing values in the dataset, which will be filled with median values. We will use all available algorithms. The metric that we will optimize is Area Under Curve (AUC). The AUC ranges from 0 to 1, where 0.5 corresponds to random guessing, and the higher the better. We set a training time limit of 20 minutes for each model (you can set it lower if you don’t have enough computational credits). The experiment setup is done, so we are ready to start: just click the Create & Start button at the bottom and all the machine learning magic will begin.

Setting machine learning experiment
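To make the three ingredients of this setup concrete (median filling, shuffled stratified folds, and the AUC metric), here is a small pure-Python sketch. It is an illustration under simplifying assumptions, not how MLJAR implements them; the function names are hypothetical and the AUC routine is a simple quadratic-time version meant only for small inputs.

```python
import random
from statistics import median

def fill_missing_with_median(column):
    """Replace None entries with the median of the observed values."""
    med = median(v for v in column if v is not None)
    return [med if v is None else v for v in column]

def stratified_folds(labels, k=10, seed=1):
    """Assign each sample to one of k folds, shuffling within each class
    so that every fold keeps roughly the same class ratio."""
    rng = random.Random(seed)
    folds = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[i] = pos % k
    return folds

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half). O(n^2), fine for a small illustration."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(fill_missing_with_median([3.0, None, 1.0, 5.0]))  # [3.0, 3.0, 1.0, 5.0]

labels = [0] * 90 + [1] * 10          # unbalanced toy labels
folds = stratified_folds(labels, k=10)
positives_in_fold_0 = sum(1 for f, y in zip(folds, labels) if f == 0 and y == 1)
print(positives_in_fold_0)            # 1

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

With 10 positives and 10 folds, stratification places exactly one positive in each fold, whereas a plain shuffle could leave some folds with none.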

5. After starting the training you will be redirected to the Results page. At the beginning all models are initialized. We selected the tuning mode Sport, which means that for each ML algorithm from 10 to 15 different hyper-parameter settings will be checked. (Hyper-parameters are values that control the training process of an ML algorithm.) You need to wait a while until all models are trained.
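As a rough picture of what checking 10 to 15 hyper-parameter settings per algorithm can look like, here is a sketch that draws distinct random settings from a small search space. The tutorial does not say how MLJAR chooses its settings, so random sampling is an assumption, and the parameter names and values below are purely illustrative.

```python
import math
import random

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "max_depth": [2, 3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "subsample": [0.6, 0.8, 1.0],
}

def sample_settings(space, n, seed=0):
    """Draw n distinct hyper-parameter settings at random from the space."""
    total = math.prod(len(values) for values in space.values())
    if n > total:
        raise ValueError(f"only {total} distinct settings are available")
    rng = random.Random(seed)
    seen, settings = set(), []
    while len(settings) < n:
        candidate = {name: rng.choice(values) for name, values in space.items()}
        key = tuple(sorted(candidate.items()))
        if key not in seen:  # skip settings that were already drawn
            seen.add(key)
            settings.append(candidate)
    return settings

settings = sample_settings(SEARCH_SPACE, n=12)
print(len(settings))  # 12
```

Each sampled setting would then be trained and scored with the validation scheme chosen for the experiment.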

6. After a while you should get results like below:

ML experiment results

You can click on each model and check its hyper-parameter values and learning curves. Below is an example result for the Extreme Gradient Boosting algorithm. From the learning curve you can see that it was trained with 250 trees. However, the model state that was saved and will be used for computing predictions has 150 trees. MLJAR detects that performance decreases on the test folds during cross-validation and stores only the best performing model (with 150 trees).

Result for selected model
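The idea of keeping the 150-tree state even though 250 trees were trained can be sketched as picking the point on the learning curve with the best validation score. The curve values below are made up to mirror the example, and best_tree_count is a hypothetical helper, not MLJAR's actual code.

```python
def best_tree_count(validation_auc):
    """Return the number of trees with the highest validation AUC.

    validation_auc maps a tree count to the mean AUC on the held-out folds.
    """
    return max(validation_auc, key=validation_auc.get)

# Made-up learning curve: the score peaks at 150 trees, then degrades.
curve = {50: 0.850, 100: 0.860, 150: 0.865, 200: 0.863, 250: 0.861}
print(best_tree_count(curve))  # 150
```

Trees added after the peak only fit the training folds more closely, so discarding them gives the model that generalizes best according to the validation folds.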

You can observe that each algorithm performs differently. Before running the analysis it is hard to say which one will be good for your data. That’s why it is good to check as many as possible.

Performance of classifiers trained by different algorithms

7. To compute predictions on the test data, first upload the cs-test.csv file and select the column usage in the same way as you did for the training data. Then go to Predict in the menu, select the test data and a model (the one with the highest score), and click Start Prediction. That’s all; now wait a while until the predictions are computed and they will appear at the bottom. Let’s download the predictions from the Ensemble model.

8. We will submit the downloaded predictions on the Kaggle competition page. Before doing this, we need to fix the header in the predictions file. The column names should be: Id, Probability.
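If you prefer to fix the header with a script rather than by hand, a minimal sketch could look like the one below. It assumes the downloaded file is a two-column CSV (an id and a predicted probability); the original column names shown here are made up, since they depend on what the service exports, and fix_header is a hypothetical helper.

```python
import csv
import io

def fix_header(src, dst, new_header=("Id", "Probability")):
    """Copy CSV rows from src to dst, replacing the first line with new_header."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    next(reader)  # discard the original header
    writer.writerow(new_header)
    writer.writerows(reader)

# Made-up stand-in for the downloaded file; the real column names may differ.
downloaded = io.StringIO("id,prediction\n1,0.08\n2,0.41\n")
fixed = io.StringIO()
fix_header(downloaded, fixed)
print(fixed.getvalue())
```

For real files, pass file objects opened with newline="" instead of the in-memory buffers, for example one opened for reading on the downloaded predictions and one opened for writing on the submission file.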

9. OK, we are ready to submit. Below is the score from the Kaggle system. When you compare it with the results on the Private Leaderboard, you will see that the result is very good: it is in the TOP 10! :)

In this analysis we used many different Machine Learning algorithms to train a classifier. As you can see, each algorithm has different performance (the AUC score). To find a good model you need to check many training algorithms, but usually people don’t do this because of a lack of time:

  • they don’t have time to write code for model training with many algorithms. Very often each algorithm has a different interface and a different data format, which makes writing code for each one very time-consuming. In mljar.com there is one interface for many algorithms.
  • they don’t have time to wait for many algorithms to train, because models are usually trained sequentially. Example: if training each model takes 30 minutes, then checking 100 models takes 50 hours (2 days, 2 hours). In mljar.com all computations are distributed in the cloud: to speed up model training we launch up to 12 machines for you, so training 100 models (30 minutes each) can be done on mljar in 4.5 hours!
  • one more thing: when you are checking 100 models sequentially, are you sure that you save all the results and will be able to go back to them? In mljar all your results are saved, so you can check them at any time.
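The timing figures in the second point can be checked with a few lines of arithmetic. This sketch assumes every model takes the same time and that machines work in rounds, each machine training one model per round; training_hours is a hypothetical helper.

```python
import math

def training_hours(n_models, minutes_per_model, n_machines=1):
    """Wall-clock hours when models are trained in rounds of n_machines at a time."""
    rounds = math.ceil(n_models / n_machines)
    return rounds * minutes_per_model / 60

print(training_hours(100, 30))      # 50.0 (sequential: 2 days, 2 hours)
print(training_hours(100, 30, 12))  # 4.5  (9 rounds on 12 machines)
```

With 12 machines, 100 models need 9 rounds (the last round is only partly filled), which gives 9 × 30 minutes = 4.5 hours.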

Don’t waste your time on model searching; use mljar.com to do it for you! :) If you are looking for more Machine Learning lessons, please take a look at MLJAR Academy.