# Decision Tree & Random Forests

Complete Implementation From Scratch

A decision tree is a popular and powerful tool for both classification and regression. It is a flow-chart-like (tree-like) model in which every internal node tests a feature; the topmost node is called the root node.

Key points about decision trees:

- Simple tree-like structure; the model makes a decision at every node
- Easy explainability, as it shows how each decision is taken
- Recursive and greedy algorithm
- Well-defined logic
- Mimics human reasoning
- Features are preferably categorical

## Building a Decision Tree Can Involve Either of Two Criteria

- Gini Index (CART — Classification And Regression Trees)
- Entropy (ID3 — Iterative Dichotomiser 3)

The most crucial question in a decision tree is which feature to pick at each node. The criteria above help us decide.

## Gini Index

A Gini score gives an idea of how good a split is by how mixed the response classes are in the groups created by the split. For class proportions pᵢ within a group, the Gini index is:

Gini = 1 − Σᵢ pᵢ²

A pure group (a single class) scores 0; a 50/50 binary group scores 0.5.
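As a sketch, the Gini index can be computed the same way we will shortly compute entropy; `gini_index` is a hypothetical helper name, not part of the article's code:

```python
import numpy as np

def gini_index(col):
    # Gini = 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(col, return_counts=True)
    p = counts / float(col.shape[0])
    return float(1.0 - np.sum(p ** 2))

print(gini_index(np.array([0, 0, 1, 1])))  # perfectly mixed -> 0.5
print(gini_index(np.array([1, 1, 1, 1])))  # pure group -> 0.0
```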

## Entropy

Entropy is a measure of randomness in a system. For class proportions pᵢ within a group:

Entropy = −Σᵢ pᵢ · log₂(pᵢ)

Using entropy we calculate information gain, which is to be maximized in order to pick the correct feature at each node.

We evaluate every feature under one of the criteria above, and for the feature with maximum information gain we split the data at a threshold (here, the feature's average value).

## Let's Get Hands-On With the Entropy Code

```python
import numpy as np

def entropy(col):
    # unique class labels and how often each occurs
    counts = np.unique(col, return_counts=True)
    N = float(col.shape[0])
    entropy = 0.0
    for i in counts[1]:
        p = i / N                          # proportion of this class
        entropy += (-1.0 * p * np.log2(p))
    return entropy
```

- We count the number of unique entries and the total number of entries.
- Then we apply the formula for entropy.
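A quick sanity check of those two points (the function is restated so the snippet runs on its own): a perfectly mixed binary column should give 1 bit of entropy, and a pure column 0:

```python
import numpy as np

def entropy(col):
    # same formula as above: -sum(p * log2(p)) over class proportions
    counts = np.unique(col, return_counts=True)
    N = float(col.shape[0])
    e = 0.0
    for i in counts[1]:
        p = i / N
        e += (-1.0 * p * np.log2(p))
    return e

print(entropy(np.array([0, 0, 1, 1])))  # → 1.0
print(entropy(np.array([1, 1, 1, 1])))  # pure column -> 0 bits
```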

## Dividing Data

We divide our data into left and right subtrees: rows whose feature value is greater than the threshold go to the right side, and the rest go to the left side.

fkey → feature name

fval → threshold value

```python
import pandas as pd

def divide_data(x_data, fkey, fval):
    # works with a pandas DataFrame
    # create two empty DataFrames with the same columns
    x_right = pd.DataFrame([], columns=x_data.columns)
    x_left = pd.DataFrame([], columns=x_data.columns)
    for i in range(x_data.shape[0]):
        val = x_data[fkey].loc[i]
        if val > fval:
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            # (x_data.loc[[i]] keeps the row as a one-row DataFrame)
            x_right = pd.concat([x_right, x_data.loc[[i]]])
        else:
            x_left = pd.concat([x_left, x_data.loc[[i]]])
    return x_left, x_right
```
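The same split can also be done without the loop, using boolean masks. Here is a standalone sketch on a toy frame (the `Age` values are made up for illustration):

```python
import pandas as pd

def divide_data(x_data, fkey, fval):
    # vectorized equivalent of the row-by-row loop above
    x_left = x_data[x_data[fkey] <= fval]
    x_right = x_data[x_data[fkey] > fval]
    return x_left, x_right

df = pd.DataFrame({"Age": [10, 40, 25, 60], "Survived": [1, 0, 1, 0]})
left, right = divide_data(df, "Age", df["Age"].mean())  # mean Age = 33.75
print(left.shape[0], right.shape[0])  # → 2 2
```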

## Information Gain

```python
def information_gain(x_data, fkey, fval):
    # split the data
    left, right = divide_data(x_data, fkey, fval)

    # all examples may end up on one side: treat this as a useless split
    if left.shape[0] == 0 or right.shape[0] == 0:
        return -1000000  # minimal information gain

    # fraction of rows that went to each side
    l = float(left.shape[0]) / x_data.shape[0]
    r = float(right.shape[0]) / x_data.shape[0]

    # reduction in entropy achieved by this split
    i_gain = entropy(x_data.Survived) - (l * entropy(left.Survived) + r * entropy(right.Survived))
    return i_gain
```

- We split our data into left and right parts and compute the entropy of each.
- Then the information-gain formula is applied.
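A tiny worked example of the formula, using a helper `H` (a hypothetical name) for the entropy: a split that separates the classes perfectly gains exactly the parent's entropy:

```python
import numpy as np

def H(labels):
    # entropy: -sum(p * log2(p)) over class proportions
    _, c = np.unique(labels, return_counts=True)
    p = c / float(len(labels))
    return float(-np.sum(p * np.log2(p)))

parent = np.array([1, 1, 0, 0])                    # 2 survived, 2 died -> 1 bit
left, right = np.array([1, 1]), np.array([0, 0])   # a perfect split
l = len(left) / len(parent)
r = len(right) / len(parent)
gain = H(parent) - (l * H(left) + r * H(right))
print(gain)  # → 1.0
```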

Now we are all set to jump to the code of the decision tree!

```python
class DecisionTree:
    # constructor
    def __init__(self, depth=0, max_depth=5):
        self.left = None
        self.right = None
        self.fkey = None
        self.fval = None
        self.max_depth = max_depth
        self.depth = depth
        self.target = None

    def train(self, X_train):
        features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
        info_gain = []
        for i in features:
            i_gain = information_gain(X_train, i, X_train[i].mean())
            info_gain.append(i_gain)
        self.fkey = features[np.argmax(info_gain)]
        self.fval = X_train[self.fkey].mean()
        print("Making tree, feature is", self.fkey)

        # split data
        data_left, data_right = divide_data(X_train, self.fkey, self.fval)
        data_left = data_left.reset_index(drop=True)
        data_right = data_right.reset_index(drop=True)

        # we have reached a leaf node
        if data_left.shape[0] == 0 or data_right.shape[0] == 0:
            if X_train.Survived.mean() >= 0.5:
                self.target = "Survived"
            else:
                self.target = "Dead"
            return

        # stop early when depth >= max_depth (pre-pruning)
        if self.depth >= self.max_depth:
            if X_train.Survived.mean() >= 0.5:
                self.target = "Survived"
            else:
                self.target = "Dead"
            return

        # recursive case
        self.left = DecisionTree(depth=self.depth + 1, max_depth=self.max_depth)
        self.left.train(data_left)
        self.right = DecisionTree(depth=self.depth + 1, max_depth=self.max_depth)
        self.right.train(data_right)

        # set the target at every node (used when prediction stops here)
        if X_train.Survived.mean() >= 0.5:
            self.target = "Survived"
        else:
            self.target = "Dead"
        return

    def predict(self, test):
        if test[self.fkey] > self.fval:
            # go right
            if self.right is None:
                return self.target
            return self.right.predict(test)
        else:
            # go left
            if self.left is None:
                return self.target
            return self.left.predict(test)
```

Decision trees face the problem of overfitting: up to a certain point accuracy increases on both the training and the test set, but beyond that point it starts decreasing on the test set. This means our model is not generalizing.

To overcome this we can use two methods:

- Post-pruning → let the tree grow fully, then remove the nodes which are not useful.
- Pre-pruning → set the maximum height of the tree, which prevents it from growing past a certain limit.

We have used pre-pruning in our code (the `max_depth` parameter).

Other than this, decision trees also have high variance: even a small change in the dataset can produce a large change in the results.

To overcome this we turn to the idea of ensembling. Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree.

The main principle behind an ensemble model is that a group of weak learners comes together to form a strong learner.

**Bagging** (Bootstrap Aggregation): here our main goal is to minimize the variance of the tree. The average of the predictions from the different trees is used, which is more robust than a single decision tree.
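A minimal numpy-only sketch of the bagging idea, using a trivial mean-threshold "stump" as the weak learner (the data and the learner are invented for illustration; real bagging would use full decision trees):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D dataset: the class is 1 exactly when the feature exceeds 0.5
X = rng.random(200)
y = (X > 0.5).astype(int)

def fit_stump(Xs, ys):
    # weak learner: split at the sample mean, predict the majority
    # class on each side of the split
    t = Xs.mean()
    left = int(ys[Xs <= t].mean() >= 0.5)
    right = int(ys[Xs > t].mean() >= 0.5)
    return lambda x: np.where(x > t, right, left)

# bagging: fit each stump on a bootstrap sample (drawn with replacement),
# then average the predictions of all stumps
models = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))
    models.append(fit_stump(X[idx], y[idx]))

votes = np.mean([m(X) for m in models], axis=0)
pred = (votes >= 0.5).astype(int)
print("ensemble accuracy:", (pred == y).mean())
```

Each individual stump is noisy (its threshold depends on its bootstrap sample), but averaging the votes smooths that noise out, which is exactly the variance reduction bagging is after.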

**Random Forest** is an extension of bagging. It takes one extra step: in addition to taking a random subset of the data, it also takes a random selection of features, rather than using all features to grow each tree. When you have many such random trees, it's called a Random Forest.
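That extra step can be sketched as follows: for every tree we draw both a bootstrap sample of rows and a random subset of the features (commonly about the square root of the number of features); the numbers here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
features = np.array(["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"])

n_rows = 100
n_sub = int(np.sqrt(len(features)))  # sqrt(6) -> 2 features per tree

for tree_id in range(3):
    rows = rng.integers(0, n_rows, n_rows)                  # bootstrap row indices
    cols = rng.choice(features, size=n_sub, replace=False)  # random feature subset
    print("tree", tree_id, "uses features", list(cols))
```

Because each tree sees different rows and different features, the trees are less correlated with each other, which makes their averaged prediction more stable than plain bagging.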

**Boosting** is another ensemble technique for creating a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data and later learners analysing the errors. In other words, we fit consecutive trees (on random samples), and at every step the goal is to correct the net error of the prior trees.

Examples include AdaBoost and gradient boosting. In AdaBoost we assign weights to the records. Initially all records get the same weight, and we create decision trees of height 1, called *stumps*; we create a stump for each feature. Then:

1. Select the stump with minimum entropy (or Gini index).
2. Calculate the total error.
3. Calculate the performance of the stump (the standard formula is ½ · ln((1 − Total Error) / Total Error)).
4. Increase the weights of incorrectly predicted records: *new sample weight = weight × e^(performance of stump)*.
5. Decrease the weights of correctly predicted records: *new sample weight = weight × e^(−performance of stump)*.
6. Divide by the sum of the new weights to get normalized weights.
7. Create a new decision tree and repeat the cycle.

At the end we decide by majority vote.
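The weight-update steps can be traced numerically. This sketch assumes the standard AdaBoost "amount of say", ½ · ln((1 − total error) / total error), for the performance of a stump; the 8 records and which two of them are misclassified are made up:

```python
import numpy as np

n = 8
w = np.full(n, 1.0 / n)                      # all records start with equal weight
wrong = np.array([False] * 6 + [True] * 2)   # suppose the stump misses 2 of 8

total_error = w[wrong].sum()                 # 2/8 = 0.25
perf = 0.5 * np.log((1 - total_error) / total_error)  # performance of the stump

w[wrong] *= np.exp(perf)                     # raise weights of wrong predictions
w[~wrong] *= np.exp(-perf)                   # lower weights of right predictions
w /= w.sum()                                 # normalize so the weights sum to 1

print("performance:", round(perf, 3))        # → 0.549
print("new weights:", w.round(4))
```

Note how, after normalization, the misclassified records carry as much total weight as all the correct ones: the next stump is forced to pay attention to the mistakes of the previous one.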