Published in Analytics Vidhya

Decision Tree & Random Forests

Complete Implementation From Scratch

A Decision Tree is one of the most powerful and popular tools for both classification and regression. It is a flow-chart-like (tree-like) model in which every internal node tests a feature; the topmost node is called the root node.

Key points about Decision Trees

  • Simple tree-like structure; the model makes a decision at every node
  • Easy explainability, as it shows how each decision is taken
  • Recursive and greedy algorithm
  • Well-defined logic
  • Mimics human reasoning
  • Features are preferably categorical

Building a Decision Tree can use either of two splitting criteria:

  1. Gini Index (CART — Classification And Regression Trees)
  2. Entropy (ID3 — Iterative Dichotomiser 3)

The most crucial question in building a Decision Tree is how to decide which feature to split on at every node. The criteria above help us make that choice.

Gini Index

A Gini score measures how good a split is by how mixed the response classes are in the groups created by the split. The following formula gives the value of the Gini Index:

Gini Index = 1 − Σᵢ pᵢ²

where pᵢ is the proportion of samples belonging to class i (formula image source: Data-Stats).
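This formula can be sketched in a few lines of NumPy; note that `gini_index` is my own helper name for illustration, not a function from the article:

```python
import numpy as np

def gini_index(col):
    # Gini = 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(col, return_counts=True)
    p = counts / col.shape[0]
    return 1.0 - np.sum(p ** 2)

print(gini_index(np.array([1, 1, 1, 1])))  # pure column -> 0.0
print(gini_index(np.array([0, 0, 1, 1])))  # 50/50 split -> 0.5
```

A pure group scores 0 (no mixing), and a 50/50 two-class group scores the maximum of 0.5.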


Entropy is a measure of randomness in a system: Entropy = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the proportion of samples in class i.

We calculate Information Gain using entropy; the gain is to be maximized to pick the correct feature at the node.

We evaluate every feature under the chosen criterion, pick the one with maximum information gain, and split the data on a threshold value (here, the feature's average value).

Let's get some hands-on experience with the entropy code:

import numpy as np

def entropy(col):
    # Count each unique value and the total number of entries
    counts = np.unique(col, return_counts=True)
    N = float(col.shape[0])
    entropy = 0.0

    # Apply the formula: -sum(p * log2(p)) over the class proportions
    for i in counts[1]:
        p = i / N
        entropy += (-1.0 * p * np.log2(p))

    return entropy
  • We count the occurrences of each unique entry and the total number of entries.
  • Then we apply the formula for entropy.
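As a quick sanity check of the entropy values (the function is repeated here, in a vectorized form, so the snippet runs standalone):

```python
import numpy as np

def entropy(col):
    # Same computation as above, repeated so this snippet is self-contained.
    _, counts = np.unique(col, return_counts=True)
    p = counts / float(col.shape[0])
    return float(np.sum(-p * np.log2(p)))

# A 50/50 column carries 1 bit of uncertainty; a 75/25 split carries less.
print(entropy(np.array([0, 0, 1, 1])))            # → 1.0
print(round(entropy(np.array([0, 1, 1, 1])), 3))  # → 0.811
```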

Dividing Data

We divide our data into left and right subtrees: rows with values greater than the threshold go to the right side, and the rest go to the left.

fkey → Feature name

fval → Threshold Value

def divide_data(x_data, fkey, fval):
    # Work with a pandas DataFrame: rows whose feature value exceeds the
    # threshold go to the right subtree, the rest to the left.
    # (DataFrame.append was removed in pandas 2.0, so boolean masks are
    # used instead of row-by-row appends.)
    x_right = x_data[x_data[fkey] > fval]
    x_left = x_data[x_data[fkey] <= fval]

    return x_left, x_right
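A small sketch of how the split behaves, using a boolean-mask version of the same logic (the toy data is my own):

```python
import pandas as pd

def divide_data(x_data, fkey, fval):
    # Rows with feature value above the threshold go right, the rest left.
    x_right = x_data[x_data[fkey] > fval]
    x_left = x_data[x_data[fkey] <= fval]
    return x_left, x_right

df = pd.DataFrame({"Age": [10, 25, 40, 60], "Survived": [1, 1, 0, 0]})
left, right = divide_data(df, "Age", df["Age"].mean())  # mean = 33.75
print(left["Age"].tolist(), right["Age"].tolist())  # [10, 25] [40, 60]
```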

Information Gain

def information_gain(x_data, fkey, fval):
    # Split data
    left, right = divide_data(x_data, fkey, fval)

    # All examples may end up on one side -> useless split
    if left.shape[0] == 0 or right.shape[0] == 0:
        return -1000000  # minimal information gain

    # Fraction of rows on the left and on the right
    l = float(left.shape[0]) / x_data.shape[0]
    r = float(right.shape[0]) / x_data.shape[0]

    # Reduction in entropy achieved by this split
    i_gain = entropy(x_data.Survived) - (l * entropy(left.Survived)
                                         + r * entropy(right.Survived))
    return i_gain
  • We split our data into left and right parts and compute the entropy of each.
  • Then the information-gain formula is applied.
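Putting the pieces together on a toy frame (the column names mirror the Titanic data the article uses, but the numbers are invented; the helpers are repeated so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

def entropy(col):
    _, counts = np.unique(col, return_counts=True)
    p = counts / float(col.shape[0])
    return float(np.sum(-p * np.log2(p)))

def divide_data(x_data, fkey, fval):
    return x_data[x_data[fkey] <= fval], x_data[x_data[fkey] > fval]

def information_gain(x_data, fkey, fval):
    left, right = divide_data(x_data, fkey, fval)
    if left.shape[0] == 0 or right.shape[0] == 0:
        return -1000000  # degenerate split
    l = left.shape[0] / x_data.shape[0]
    r = right.shape[0] / x_data.shape[0]
    return entropy(x_data.Survived) - (l * entropy(left.Survived)
                                       + r * entropy(right.Survived))

df = pd.DataFrame({"Fare": [5, 8, 60, 80], "Survived": [0, 0, 1, 1]})
# Splitting on the mean Fare separates the classes perfectly, so the
# gain equals the full parent entropy of 1 bit.
print(information_gain(df, "Fare", df["Fare"].mean()))  # → 1.0
```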

Now we are all set to jump to the code of the Decision Tree!

class DecisionTree:

    # Constructor
    def __init__(self, depth=0, max_depth=5):
        self.left = None
        self.right = None
        self.fkey = None
        self.fval = None
        self.target = None
        self.max_depth = max_depth
        self.depth = depth

    def train(self, X_train):

        features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
        info_gain = []

        for f in features:
            i_gain = information_gain(X_train, f, X_train[f].mean())
            info_gain.append(i_gain)

        self.fkey = features[np.argmax(info_gain)]
        self.fval = X_train[self.fkey].mean()
        print("Making tree, feature is", self.fkey)

        # Split data
        data_left, data_right = divide_data(X_train, self.fkey, self.fval)
        data_left = data_left.reset_index(drop=True)
        data_right = data_right.reset_index(drop=True)

        # We have reached a leaf node
        if data_left.shape[0] == 0 or data_right.shape[0] == 0:
            if X_train.Survived.mean() >= 0.5:
                self.target = "Survived"
            else:
                self.target = "Dead"
            return

        # Stop early when depth >= max_depth (pre-pruning)
        if self.depth >= self.max_depth:
            if X_train.Survived.mean() >= 0.5:
                self.target = "Survived"
            else:
                self.target = "Dead"
            return

        # Recursive case
        self.left = DecisionTree(depth=self.depth + 1, max_depth=self.max_depth)
        self.left.train(data_left)

        self.right = DecisionTree(depth=self.depth + 1, max_depth=self.max_depth)
        self.right.train(data_right)

        # Setting the target at every node
        if X_train.Survived.mean() >= 0.5:
            self.target = "Survived"
        else:
            self.target = "Dead"

    def predict(self, test):
        if test[self.fkey] > self.fval:
            # go right
            if self.right is None:
                return self.target
            return self.right.predict(test)
        else:
            # go left
            if self.left is None:
                return self.target
            return self.left.predict(test)
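For completeness, here is a condensed, self-contained sketch of the whole pipeline trained end to end on a toy frame. The feature list is passed to the constructor and the data is invented for illustration, so the details deviate slightly from the article's Titanic setup:

```python
import numpy as np
import pandas as pd

def entropy(col):
    _, counts = np.unique(col, return_counts=True)
    p = counts / float(col.shape[0])
    return float(np.sum(-p * np.log2(p)))

def divide_data(x_data, fkey, fval):
    return x_data[x_data[fkey] <= fval], x_data[x_data[fkey] > fval]

def information_gain(x_data, fkey, fval):
    left, right = divide_data(x_data, fkey, fval)
    if left.shape[0] == 0 or right.shape[0] == 0:
        return -1000000
    l = left.shape[0] / x_data.shape[0]
    r = right.shape[0] / x_data.shape[0]
    return entropy(x_data.Survived) - (l * entropy(left.Survived)
                                       + r * entropy(right.Survived))

class DecisionTree:
    def __init__(self, features, depth=0, max_depth=5):
        self.left = self.right = self.fkey = self.fval = self.target = None
        self.features, self.depth, self.max_depth = features, depth, max_depth

    def train(self, X):
        gains = [information_gain(X, f, X[f].mean()) for f in self.features]
        self.fkey = self.features[int(np.argmax(gains))]
        self.fval = X[self.fkey].mean()
        # Majority class at this node, used when prediction stops here
        self.target = "Survived" if X.Survived.mean() >= 0.5 else "Dead"
        left, right = divide_data(X, self.fkey, self.fval)
        if (left.shape[0] == 0 or right.shape[0] == 0
                or self.depth >= self.max_depth):
            return  # leaf: degenerate split or pre-pruning limit reached
        self.left = DecisionTree(self.features, self.depth + 1, self.max_depth)
        self.left.train(left.reset_index(drop=True))
        self.right = DecisionTree(self.features, self.depth + 1, self.max_depth)
        self.right.train(right.reset_index(drop=True))

    def predict(self, row):
        child = self.right if row[self.fkey] > self.fval else self.left
        return self.target if child is None else child.predict(row)

# Toy data: low fares die, high fares survive.
df = pd.DataFrame({"Fare": [5, 8, 9, 60, 70, 80],
                   "Survived": [0, 0, 0, 1, 1, 1]})
tree = DecisionTree(features=["Fare"], max_depth=3)
tree.train(df)
print(tree.predict({"Fare": 7}), tree.predict({"Fare": 75}))  # Dead Survived
```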

Decision Trees face the problem of overfitting: up to a certain point accuracy increases on both the training and test sets, but beyond that point it starts decreasing on the test set. This implies that our model does not generalize.

To overcome this we can use two methods:

  • Post-pruning → we let the tree grow fully and then remove the nodes that are not useful.
  • Pre-pruning → we set the maximum height of the tree, which prevents it from growing beyond that limit.

We have used pre-pruning in our code.

Apart from this, Decision Trees also have high variance: even a small change in the dataset can produce a large change in the results.

To overcome this we use the idea of ensembling. Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree.

The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner.

  1. Bagging (Bootstrap Aggregation): here the main goal is to reduce the variance of the tree. The average of the predictions from the different trees is used, which is more robust than a single decision tree.

Random Forest is an extension of bagging. It takes one extra step: in addition to taking a random subset of the data, it also takes a random selection of features, rather than using all features, to grow each tree. Many such random trees together form a Random Forest.
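As a sketch, the same idea is available off the shelf in scikit-learn (assuming scikit-learn is installed; the toy data is my own):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: the class is 1 exactly when the first feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# Each tree sees a bootstrap sample of the rows and a random subset of
# the features at every split (max_features), which is what distinguishes
# a random forest from plain bagging.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))  # training accuracy
```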

2. Boosting is another ensemble technique for creating a collection of predictors. Learners are trained sequentially: early learners fit simple models to the data, and later ones analyze the remaining errors. In other words, we fit consecutive trees (on random samples), and at every step the goal is to reduce the net error of the prior tree.

Examples are AdaBoost and Gradient Boosting. In AdaBoost we assign weights to records. Initially all records get the same weight, and we create decision trees of height 1, called stumps; one stump is created for each feature.

  1. We select the stump with the minimum entropy or Gini coefficient.
  2. We calculate the total error.
  3. The performance of the stump is calculated.
  4. We increase the weights of the incorrectly predicted records: new sample weight = weight × e^(performance of stump).
  5. We decrease the weights of the correctly predicted records: new sample weight = weight × e^(−performance of stump).
  6. We divide by the sum of the new weights to get normalized weights.
  7. A new decision tree is created with these weights, and the cycle repeats.

At the end, we decide by majority vote over all the stumps.
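Steps 2–6 above can be sketched numerically with NumPy. The stump-performance formula ½·ln((1−TotalError)/TotalError) is the standard AdaBoost definition (the article names the step without giving the formula), and the sample outcomes below are invented:

```python
import numpy as np

n = 5
weights = np.full(n, 1.0 / n)                        # equal initial weights
correct = np.array([True, True, False, True, True])  # stump's predictions

total_error = weights[~correct].sum()                        # step 2
performance = 0.5 * np.log((1 - total_error) / total_error)  # step 3

# Steps 4-5: raise the weights of misclassified rows, lower the rest
weights = np.where(correct,
                   weights * np.exp(-performance),
                   weights * np.exp(performance))
weights /= weights.sum()          # step 6: normalize to sum to 1
print(weights.round(3))           # the misclassified row now dominates
```

With one of five rows wrong, total error is 0.2, and after normalization the misclassified row carries weight 0.5 while each correct row drops to 0.125.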



