Machine Learning with Python Part 1: Linear Regression and Decision Tree Regressor

judopro
7 min read · Jul 2, 2019


I am enrolled in the Machine Learning @ MIT Data to Decisions Program, and although the program doesn't include "programming" assignments, some of us would like to get some more hands-on experience with Machine Learning. So I wanted to take some of the weekly assignments we ran on a custom UI tool (Radiant) and implement the models in Python.

Radiant is actually a great tool: it is implemented in R, one of the most popular ML languages, and it has a nice UI around it. I personally loved using it, and I will definitely keep it in my arsenal of tools. But I chose Python for this exercise because I wanted to get more involved with the underlying data structures and algorithms and how they work. Python is also a great choice because it has extensive support for Machine Learning algorithms and models from the open-source community, and you can find many code samples online. We will be using scikit-learn, one of the most popular libraries, which offers a lot of tools out of the box.

I am also not going to get into what Linear Regression is and so on; I am assuming you already know what it is and are interested in how to use it in Python. If so, keep reading. I found a lot of samples online, but each was missing one instruction or another needed to completely implement my model, so I decided to write up my own experience, along with the issues I encountered and how I circumvented them.

We will implement two separate models, the first being Linear Regression and the second a Decision Tree Regressor. Each has its own advantages and disadvantages depending on your data, and at the end we will compare the accuracy of both approaches.

DATASET

The dataset is hypothetically from a lending institution. It has data on borrowers such as loan amount, interest rate, loan title, income, etc. The purpose of the exercise is to use the training data to create a model that can predict what the interest rate for a loan should be, given all the other attributes of the borrower (i.e. loan title, income level, employment length, other utilities, etc.).

We have two files, Regression_Training.csv and Regression_Test.csv. We will use the first file to train our model, then we will use the second file to make predictions and test the accuracy of our model. The training file has a column, int_rate, which our model learns from; it is also the value we are trying to predict for the test file.

Linear Regression implementation:

The first 3 lines are really nothing but declarations of the libraries we will be using. If you don't have these libraries installed, please see the Readme file in my GitHub repo on how to install them.
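The original post shows the code as screenshots, so here is a minimal sketch of what those three lines presumably look like, given the libraries discussed below:

```python
# Libraries used throughout this exercise.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
```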

Pandas is a great library that allows us to load our data and do a lot of cool things with it easily. LinearRegression is what we are going to use, so that's that, and preprocessing will allow us to convert our categorical (non-numerical) columns, such as loan title, into numerical classes so that linear regression can work on them (as you already know, linear regression works only on numerical values). Assume we have two loan titles, "Home Mortgage" and "Auto Loan". If we use this data directly with LinearRegression, we will get

“ValueError: could not convert string to float: ‘Home Mortgage’ ”

But there is a workaround. We can simply assign numerical values to the categorical values: for example, Home Mortgage is assigned 0 and Auto Loan is assigned 1. So now our values are either 0 (representing home mortgage) or 1 (representing auto loan), or more if there are more values in the category, but you get the point… So preprocessing helps us do this categorical-to-numerical transformation and vice versa.
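To illustrate, here is how scikit-learn's LabelEncoder performs that mapping, using the two hypothetical loan titles from above (note that it assigns the integers in sorted order, so Auto Loan actually gets 0 and Home Mortgage gets 1):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
titles = ["Home Mortgage", "Auto Loan", "Home Mortgage"]

encoded = le.fit_transform(titles)     # classes are sorted alphabetically
print(encoded)                         # [1 0 1]
print(le.inverse_transform(encoded))   # ['Home Mortgage' 'Auto Loan' 'Home Mortgage']
```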

OK.. Now to the more interesting part…

The next 2 lines read our training and test files using pandas and return Pandas DataFrame objects, which are a very common structure to work with.
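Assuming the CSV files sit next to the script (the variable names here are my own), those two lines would look roughly like this:

```python
import pandas as pd

# Load the training and test data into pandas DataFrames.
train_data = pd.read_csv("Regression_Training.csv")
test_data = pd.read_csv("Regression_Test.csv")
```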

Line 4 extracts int_rate as a single column from the file (meaning it takes int_rate from every single row and creates a DataFrame with a single column in it, with all the corresponding values from the data as the rows). You can see below, for example, that the first row has int_rate 0.163684, the second row 0.118775, and so on…

Line 5 removes the int_rate column from the data so that our model won't see it. This is the parameter we are trying to learn to predict, therefore it cannot be part of the independent variable set. The same goes for the column 'Grade'. This is an internal column assigned by the institution, and the grade they assign very closely determines the interest rate, so given Grade you can guess the interest rate pretty accurately. BUT when a new customer walks in, there is no Grade assigned yet, so the idea is to ignore it in the past data and work with all the other attributes of the borrower himself, not with calculated fields. (There is more to this subject, namely what happens when independent parameters are actually correlated, resulting in multicollinearity.) So let's remove it and keep things as simple as they can be.
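Sticking with the variable names from the sketch above (and assuming the column is spelled 'Grade' in the file), lines 4 and 5 would read something like:

```python
# Line 4: pull out the dependent variable we are trying to predict.
y_train = train_data["int_rate"]

# Line 5: drop the target and the institution-assigned Grade so neither
# leaks into the independent variable set.
X_train = train_data.drop(columns=["int_rate", "Grade"])
```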

Next we will take care of transforming the categorical columns into numerical ones using preprocessing. encode_labels basically takes a DataFrame and checks each column's type; if it is not numerical, it transforms the column using the LabelEncoder so we can feed it to our model.
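The real encode_labels is in the repo; based on the description above, a minimal sketch of it might look like this (one caveat with this simplified version: fitting the encoder separately on the training and test files only works cleanly if both contain the same category values):

```python
def encode_labels(df):
    # Convert every non-numeric column into integer classes.
    le = preprocessing.LabelEncoder()
    for col in df.columns:
        if df[col].dtype == object:  # categorical / string column
            df[col] = le.fit_transform(df[col].astype(str))
    return df

X_train = encode_labels(X_train)
X_test = encode_labels(test_data)
```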

Up until now we were just setting up our data; the single line of code below is actually all it takes for our model to learn the relationship. Here we are giving it the training data set (excluding the int_rate column, which we dropped), and in the second parameter we are giving the int_rate values. Since linear regression is a supervised learning method, we provide the attributes of past customers along with the int_rate that was given to their loans, so our model learns the relationships between the parameters and how they end up affecting the int_rate.
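With the names from the earlier sketches, the training step comes down to this:

```python
# Learn the relationship: borrower attributes in, observed interest rates out.
model = LinearRegression()
model.fit(X_train, y_train)
```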

Once the model is trained, we want to get predictions of int_rate for our test data. Our test data doesn't have an int_rate column, so our model has no way of knowing it. All it can do is try to predict it using the other parameters. The last two lines really only write the calculated int_rate to a csv file for the performance analysis later on.
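Again with the assumed names (the output file name is a placeholder of mine), the prediction and export would look like:

```python
# Predict int_rate for the unseen test borrowers and save the results.
predictions = model.predict(X_test)
pd.DataFrame({"int_rate": predictions}).to_csv("lr_predictions.csv", index=False)
```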

Decision Tree Regressor implementation:

Once we have implemented linear regression, we can use pretty much the same code for the decision tree regressor, except for the following:

We will import DecisionTreeRegressor and instantiate it.

Here I am setting max_depth=16. If you omit it (which I first did as well!), the system basically overfits the data by creating as deep a tree as it possibly can to best fit the training data. The problem with that is that it won't be very accurate on the test data. So we want something that will give us a good bias-variance tradeoff. When I omitted it, the system created a tree of depth 37, so I decided to use half of that for my exercise. You are more than welcome to adjust this parameter and experiment yourself.

Next we call predict() on our model and write out the results to a csv file to compare performance.
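Putting those changes together (names assumed as before), the decision-tree variant might look like this:

```python
from sklearn.tree import DecisionTreeRegressor

# Cap the tree at depth 16 (roughly half the depth 37 that an unbounded
# tree grew to), trading a little training fit for better generalization.
tree_model = DecisionTreeRegressor(max_depth=16)
tree_model.fit(X_train, y_train)

tree_predictions = tree_model.predict(X_test)
pd.DataFrame({"int_rate": tree_predictions}).to_csv("dt_predictions.csv", index=False)
```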

RESULTS: The winner is….

Once I exported both models' output, I was able to compute their performance. We have a separate spreadsheet that provides the actual int_rate values for the test data. As you see below, the first column is the actual int_rate, the second column is the predicted int_rate from our Linear Regression model, and the third column is the predicted int_rate from our Decision Tree Regressor.

By looking at 5,000 rows we won't be able to determine which one did a better job, so quickly calculating some metrics helps us see the big picture. As you can see below, the Root Mean Square Error (RMSE) of Linear Regression is a little better; however, the Mean Absolute Error (MAE) is slightly better with our decision tree model. Depending on which error metric you want to go with, you can decide which model fits your data best and go with it.
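For reference, both metrics are one-liners with scikit-learn. Here is a sketch of the comparison (the actuals file name and the prediction file names are placeholders of mine):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error

def report(name, y_true, y_pred):
    # RMSE penalizes large misses more heavily than MAE does.
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{name}: RMSE={rmse:.6f}  MAE={mae:.6f}")

actuals = pd.read_csv("Regression_Actuals.csv")["int_rate"]
report("Linear Regression", actuals, pd.read_csv("lr_predictions.csv")["int_rate"])
report("Decision Tree", actuals, pd.read_csv("dt_predictions.csv")["int_rate"])
```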

SUMMARY

This is it! We implemented two models in Python, trained them on the training data, predicted values for the test data, and then compared their performance.

I also wanted to compare my results to those I got when I ran the same data set and models in Radiant. The first table below shows the calculated metrics from Radiant, and right below it are the metrics from the Python models.

Radiant (R) model metrics

Python model metrics

It looks like the results for Linear Regression were pretty similar for both the R and Python implementations. The Decision Tree Regressor did slightly better in R with respect to RMSE; however, the MAE was pretty much the same. There could be room for some more optimization of the decision tree's parameters (max depth, etc.) that we may be able to fine-tune if we want to dig further.

If you want to download the code and sample files for this exercise, please head over to my GitHub repo and play around with them. This is my first article on Medium, so feel free to comment below and let me know your feedback.
