Decision Tree & Random Forest Explained With Project

Prajeet Singh Kalchuri · Published in Geek Culture · Apr 27, 2021 · 5 min read

This article explains the Decision Tree and the Random Forest using a simple project!

Let's start with an easy example. Suppose you need to buy a new pen and have to choose a single pen from among various brands. You decide to use a “Decision Tree Algorithm” to help you: you say you want a pen, it probably picks the most-sold pen, and you are happy with your decision.

Your friend, on the other hand, chooses a “Random Forest Algorithm” and makes several decisions: he chooses among different types of pen (ball, gel, or fountain), among color options (black, blue, or any other color), and looks for the best value for money. He buys his pen after considering all these parameters and is now the happiest, while you are still regretting your decision.

What is a Decision Tree?

It can be defined as a supervised machine learning algorithm that uses a series of sequential decisions to reach a specific result.

Figure Source

Example: First, the tree checks whether the customer has a good credit history and, based on that, classifies the customer into two groups. Next, it checks the customer's income and again splits into two groups. Finally, it checks the loan amount requested by the customer. Based on the outcomes of these three features, the decision tree decides whether the customer's loan should be approved.

What is a Random Forest?

It is also a supervised machine learning algorithm, one that combines the outputs of multiple (randomly built) decision trees to generate the final prediction. You can think of it simply as a bunch of decision trees working together.

Figure Source

So now let's move on to our project, which will help us understand both the Decision Tree and Random Forest algorithms more clearly!

Importing Libraries | Importing the usual libraries for pandas and plotting.
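The original import cell isn't reproduced here, but a minimal sketch of the standard stack this project uses would be:

```python
# The usual libraries for data handling and plotting.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# In a Jupyter notebook you would also run the magic command:
# %matplotlib inline
```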

Getting the Data

Using pandas to read loan_data.csv as a data frame called loans.

Checking out the info(), head(), and describe() methods on loans.
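A sketch of this step. The snippet falls back to a tiny made-up stand-in frame (with the columns used later in the article) so it runs even if loan_data.csv is not in the working directory; with the real file only the `pd.read_csv` line matters:

```python
import pandas as pd

# Read loan_data.csv into a DataFrame called `loans`.
try:
    loans = pd.read_csv("loan_data.csv")
except FileNotFoundError:
    # Tiny stand-in with the columns this project uses.
    loans = pd.DataFrame({
        "credit.policy": [1, 1, 0, 0],
        "purpose": ["debt_consolidation", "credit_card", "all_other", "credit_card"],
        "int.rate": [0.1189, 0.1071, 0.1357, 0.1008],
        "fico": [737, 707, 682, 712],
        "not.fully.paid": [0, 0, 1, 0],
    })

loans.info()             # column names, dtypes, non-null counts
print(loans.head())      # first five rows
print(loans.describe())  # summary statistics for the numeric columns
```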

Data Analysis | Let’s do some data visualization!

We’ll use seaborn and pandas built-in plotting capabilities,

Create a histogram of two FICO distributions on top of each other, one for each “credit.policy” outcome.

Creating a similar figure, except this time we are selecting by the “not.fully.paid” column.

Creating a countplot using seaborn showing the counts of loans by “purpose”, with the color hue defined by “not.fully.paid”.

The trend between ‘FICO score’ and ‘interest rate’. (using jointplot)

Creating the following lmplots to see if the trend differs between “not.fully.paid” and “credit.policy”.

Note: Check the documentation for lmplot() if you want to learn more about it.

Setting up the Data | Categorical Features

Check loans.info() again. You will see that the “purpose” column is categorical,

which means that we have to transform it using dummy variables so sklearn will be able to understand it.

We are using pd.get_dummies to do it in one step.

Let's deal with this column using a method that can be expanded to multiple categorical features if necessary.

Create a list of 1 element containing the string ‘purpose’. Call this list cat_feats.

Now use pd.get_dummies(loans,columns=cat_feats,drop_first=True) to create a fixed larger dataframe that has new feature columns with dummy variables.

Set this dataframe as final_data.
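A sketch of those three steps, run here on a tiny made-up `loans` frame with the same column names:

```python
import pandas as pd

# Stand-in for the real loans frame, with the categorical `purpose` column.
loans = pd.DataFrame({
    "int.rate": [0.1189, 0.1071, 0.1357, 0.1008],
    "fico": [737, 707, 682, 712],
    "purpose": ["debt_consolidation", "credit_card", "all_other", "credit_card"],
    "not.fully.paid": [0, 0, 1, 0],
})

# One categorical feature for now; the list form extends to several at once.
cat_feats = ["purpose"]

# drop_first=True drops one dummy column per feature to avoid redundancy.
final_data = pd.get_dummies(loans, columns=cat_feats, drop_first=True)
print(final_data.columns.tolist())
```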

Train Test Split

Now it's time to split our data into a training set and a testing set!

We are going to use sklearn to split our data into a training set and a testing set.

Read this for a better understanding of the Train Test Split method.
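A sketch of the split. The 70/30 ratio and random_state=101 are conventional choices, not mandated by the article, and a small synthetic `final_data` stands in for the real one:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for final_data (numeric features + the target column).
rng = np.random.default_rng(101)
n = 500
final_data = pd.DataFrame({
    "int.rate": rng.uniform(0.06, 0.22, n),
    "fico": rng.integers(600, 851, n),
    "not.fully.paid": rng.integers(0, 2, n),
})

# Features = everything except the target; target = not.fully.paid.
X = final_data.drop("not.fully.paid", axis=1)
y = final_data["not.fully.paid"]

# 70/30 split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)
print(X_train.shape, X_test.shape)
```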

Training a Decision Tree Model

Let's train a single decision tree first:

Create an instance of DecisionTreeClassifier() called “dtree” and fit it to the training data.
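That step might look like this (synthetic stand-in data; in the real project you would reuse the X_train / y_train from the split above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project's training data.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "int.rate": rng.uniform(0.06, 0.22, 300),
    "fico": rng.integers(600, 851, 300),
})
y = pd.Series(rng.integers(0, 2, 300), name="not.fully.paid")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

# Fit a single decision tree with default hyperparameters.
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
```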

Predictions & Evaluation of Decision Tree

Create predictions from the test set, then create a classification report and a confusion matrix.

Classification Report
Confusion Matrix
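End to end, the evaluation step can be sketched like this on stand-in data (the metric values below will differ from the article's, which come from the real data set):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic stand-in for the project's train/test split.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "int.rate": rng.uniform(0.06, 0.22, 300),
    "fico": rng.integers(600, 851, 300),
})
y = pd.Series(rng.integers(0, 2, 300), name="not.fully.paid")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)

# Predict on the held-out test set, then evaluate.
predictions = dtree.predict(X_test)
report = classification_report(y_test, predictions)
cm = confusion_matrix(y_test, predictions)
print(report)
print(cm)
```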

Training the Random Forest model

Creating an instance of the RandomForestClassifier class and fitting it to our training data from the previous step.
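A sketch of this step; the instance name `rfc` and `n_estimators=100` are my choices, not fixed by the article, and the data is again a synthetic stand-in:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the project's training data.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "int.rate": rng.uniform(0.06, 0.22, 300),
    "fico": rng.integers(600, 851, 300),
})
y = pd.Series(rng.integers(0, 2, 300), name="not.fully.paid")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

# A forest of 100 randomly built decision trees.
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
```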

Predictions and Evaluation

We have to predict the y_test values and evaluate our model.

Predict the class of “not.fully.paid” for the “X_test” data.

Now we have to create a classification report from the results!

Classification Report

Show the Confusion Matrix for the predictions.

Confusion Matrix
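Putting the random forest evaluation together, again on stand-in data (so the printed metrics will not match the article's figures from the real data set):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic stand-in for the project's train/test split.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "int.rate": rng.uniform(0.06, 0.22, 300),
    "fico": rng.integers(600, 851, 300),
})
y = pd.Series(rng.integers(0, 2, 300), name="not.fully.paid")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

# Predict "not.fully.paid" for the X_test data, then evaluate.
rfc_pred = rfc.predict(X_test)
rfc_report = classification_report(y_test, rfc_pred)
rfc_cm = confusion_matrix(y_test, rfc_pred)
print(rfc_report)
print(rfc_cm)
```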

Which performed better, the random forest or the decision tree?

It clearly depends on which metric we are trying to optimize for. Neither model did very well here, so we need more feature engineering.

When we can use the Decision tree:

  1. When we want our model to be simple and explainable
  2. When we want a non-parametric model

When we can use random forest :

  1. When we don't care much about interpreting the model but want better accuracy.
  2. On unseen validation data, a random forest usually wins in terms of accuracy.
