Machine Learning with Python Part 2: Logistic Regression (Classification) and Decision Tree Classifier

judopro
4 min read · Jul 2, 2019


In Part 1 of this article series, we went over linear regression and a decision tree regressor to see how to predict a continuous value such as an interest rate. In this part, we will look at a classification example: given the attributes of a borrower, can we approve or deny the loan based on past data we have on other borrowers, rather than determining an interest rate for the loan?

We will implement two separate models, the first being logistic regression and the second a decision tree classifier. Each has its own advantages and disadvantages depending on your data, and at the end we will compare the accuracy of both approaches, as we did in Part 1.

DATASET

Again, our training data set is hypothetically from a lending institution. It has only a few columns: loan amount, loan title, debt-to-income ratio (dti), employment length of the borrower, and whether the loan was approved or not.

The test data set has the same columns except “approved”, which is what our model will predict.

Logistic Regression (Classification)

Let’s get into the implementation. First things first, let’s import what we need.
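A minimal set of imports for this walkthrough might look like the following (a sketch assuming pandas and scikit-learn, which the rest of the examples rely on):

```python
# Assumed stack: pandas for data handling, scikit-learn for the models
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
```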

Next we read the CSV file into a Pandas dataframe, take out the “approved” column on its own, and remove it from the training data.
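As a sketch of that split (the file and column names here are illustrative, not the actual data set; a tiny inline dataframe stands in for `pd.read_csv`):

```python
import pandas as pd

# Stand-in for pd.read_csv("train.csv"); column names are illustrative
train = pd.DataFrame({
    "loan_amnt": [5000, 12000, 7500],
    "loan_title": ["Debt consolidation", "Home improvement", "Debt consolidation"],
    "dti": [18.2, 9.5, 24.1],
    "emp_length": ["5 years", "10+ years", "< 1 year"],
    "approved": [1, 1, 0],
})

y = train["approved"]                  # the column we want to predict
X = train.drop(columns=["approved"])   # features only, target removed
```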

Then we convert our non-numerical values into numerical classes. (See Part 1 to read more about why we are doing this.)
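One common way to do this is scikit-learn’s LabelEncoder, sketched below (the column names and values are again illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

X = pd.DataFrame({
    "loan_title": ["Debt consolidation", "Home improvement", "Debt consolidation"],
    "emp_length": ["5 years", "10+ years", "< 1 year"],
})

# Map each distinct string in a text column to an integer class
for col in X.select_dtypes(include="object").columns:
    X[col] = LabelEncoder().fit_transform(X[col])
```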

Then we initialize and train our model with the training data set, giving it our expected results (whether those training loans were approved or not).
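In scikit-learn terms that boils down to a single `fit` call; the toy features and labels below are stand-ins for the encoded training data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in features (loan amount in $1000s, dti) and labels (1 = approved)
X_train = np.array([[5.0, 18.2], [12.0, 9.5], [7.5, 24.1], [3.0, 35.0]])
y_train = np.array([1, 1, 0, 0])

model = LogisticRegression()
model.fit(X_train, y_train)   # learn from past loans and their outcomes
```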

Now that our model is trained, we can give it the test data to predict the outcomes and write them to a file for performance measurement analysis later on.

Here is an important part: we are not using the predict method, which would return whether each loan would be approved or not. We are using predict_proba, which returns an array of probabilities. For example, for the 1st borrower it returns two values, one probability for each class (rejected and approved), and their total should always equal 1. If you have more than two classes, you get one probability for each class. The reason I wanted the calculated probabilities is to use my own cut-off to determine class membership later on.
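A sketch of the difference between the two methods, using the same toy setup as before (feature values and labels are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[5.0, 18.2], [12.0, 9.5], [7.5, 24.1], [3.0, 35.0]])
y_train = np.array([1, 1, 0, 0])
model = LogisticRegression().fit(X_train, y_train)

X_test = np.array([[10.0, 12.0], [4.0, 30.0]])

hard = model.predict(X_test)          # just the class label: 0 or 1
proba = model.predict_proba(X_test)   # one probability per class, per row

# Columns are ordered by model.classes_, and each row sums to 1
```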

Decision Tree Classifier

A decision tree classifier is not a whole lot different to implement from logistic regression.

We train our model with past data as before…
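The `fit` call has the same shape as before; only the model class changes (toy data again stands in for the real training set):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_train = np.array([[5.0, 18.2], [12.0, 9.5], [7.5, 24.1], [3.0, 35.0]])
y_train = np.array([1, 1, 0, 0])

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)   # same fit interface as LogisticRegression
```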

Finally, predict the outcomes for test dataset and write to csv file.

The notation res[:,1] keeps just one probability for each row instead of two probabilities for each class (approved or rejected): the probability of the loan being approved. That’s what we are interested in, and the probability of it being rejected is something we can calculate if we need to (by subtracting from 1).
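A sketch of that slicing step, assuming approved is encoded as class 1 (the output file name is hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_train = np.array([[5.0, 18.2], [12.0, 9.5], [7.5, 24.1], [3.0, 35.0]])
y_train = np.array([1, 1, 0, 0])
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

X_test = np.array([[10.0, 12.0], [4.0, 30.0]])
res = tree.predict_proba(X_test)   # two columns: P(rejected), P(approved)

approved_prob = res[:, 1]          # keep only P(approved) for each row
np.savetxt("results.csv", approved_prob, delimiter=",")
```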

RESULTS

We trained both the logistic regression and the decision tree classifier and saved the results (in the second and third columns respectively) along with the actual values (in the first column), as can be seen below.

Looking at the metrics, our logistic regression had about 81% accuracy, whereas the decision tree classifier did a much better job at almost 99% accuracy! These are calculated with a threshold value of 0.5, as you can see in the top left corner.
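The accuracy computation with a 0.5 cut-off amounts to the following (the probabilities and labels here are made up for illustration, not the actual results):

```python
import numpy as np

actual = np.array([1, 0, 1, 1, 0])                        # true outcomes
prob_approved = np.array([0.92, 0.31, 0.64, 0.45, 0.08])  # model output

predicted = (prob_approved >= 0.5).astype(int)   # apply the 0.5 threshold
accuracy = float((predicted == actual).mean())   # fraction predicted correctly
```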

Again, I wanted to compare my results from the Python implementation to those of “R” using the Radiant website.

As you can see, there is a slight difference in logistic regression accuracy even though it’s pretty close (1479 vs 1483 true predictions), and the decision tree classifier is at almost 99% for both platform implementations.

In the next part, we will implement neural networks to continue our series…

If you want to download the code and sample files for this exercise, please head over to my GitHub repo to download and play around with it.
