Decision Tree (CART) Algorithm in Machine Learning

Amir Ali
Jul 13, 2018 · 15 min read

This is one of the best introductions to the Decision Tree (CART) algorithm. The author introduces the algorithm with a clear explanation of how decision trees work.


In this third chapter we discuss the decision tree algorithm, also called CART (Classification and Regression Trees). It is a supervised machine learning algorithm used for both classification and regression problems.

This chapter spans 3 parts:

  1. What is a Decision Tree (CART)?

  2. How does a Decision Tree work?

  3. Practical implementation in scikit-learn

3.1: What is a Decision Tree (CART)?

A decision tree is a widely used, effective, non-parametric machine learning modeling technique for regression and classification problems. To find solutions, a decision tree makes sequential, hierarchical decisions about the outcome variable based on the predictor data.

A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

Decision trees are also among the easiest classification algorithms to understand.

In the decision tree algorithm, we represent the problem as a tree.

Each internal node of the tree corresponds to an attribute.

Each leaf node corresponds to a class label.

To predict a class label for a record, we start at the root of the tree and compare the value of the root attribute with the record's value for that attribute. On the basis of this comparison, we follow the branch corresponding to that value and jump to the next node. We keep comparing the record's attribute values with the internal nodes of the tree until we reach a leaf node.

In building a decision tree we use three quantities (the standard formulas are summarized after this list):

  1. Information Gain:

What are p and n here? To find p and n we look at the class attribute (the outcome), which is binary (0, 1). For p we count the records whose outcome is true (1), and for n we count the records whose outcome is false (0). We will go deeper into the mathematics later; this is just the introduction.

2. Entropy:

Entropy is what we use to grow the tree: it measures how mixed the class values are within each value of an attribute.

3. Gain:

Gain is computed for each attribute of the training set, one by one, and tells us how much that attribute reduces the uncertainty about the class.
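For reference, these three quantities are usually written as follows. This is the standard ID3-style formulation in my own notation, since the article's formula images are not reproduced here; v is the number of values of attribute A, and p_i and n_i are the positive and negative counts within its i-th value.

```latex
I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)

\mathrm{Gain}(A) = I(p, n) - E(A)
```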

4.2: How Decision tree Work?

4.2.1: For Classification:

This dataset contains information about companies and whether their profit goes up or down. It has four attributes: Age, Competition, Type, and Profit, where Profit is the target attribute. As data scientists, our goal is to build a tree using the decision tree approach and identify the decisions under which a company's profit is maximized and its chance of a loss is minimized. So let's start.

First of all, I find the information gain of the target attribute, whose values are up and down:

p: up = 5

n: down = 5

Information Gain

So I calculate the information gain of the overall dataset, which is equal to 1.
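Plugging p = 5 and n = 5 into the information formula gives (my worked arithmetic, consistent with the article's result):

```latex
I(5, 5) = -\tfrac{5}{10}\log_2\tfrac{5}{10} - \tfrac{5}{10}\log_2\tfrac{5}{10} = 0.5 + 0.5 = 1
```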

After finding the information gain, we calculate the entropy of each attribute.

Entropy (Age)

The Age attribute has three categories, old, mid, and new, and we find the information of each, as shown in the table above.

So Entropy of Age

After finding the information of each (old, mid, new), put the values into the entropy equation below to find the entropy of Age.

So the entropy of Age is 0.4.
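Using the per-category counts implied later in the article (3 old records, all down; 4 mid records, 2 up and 2 down; 3 new records, all up), the weighted entropy works out as follows; the counts themselves are my reconstruction from the text, not a table reproduced from it.

```latex
E(\mathrm{Age}) = \tfrac{3}{10}\,I(0,3) + \tfrac{4}{10}\,I(2,2) + \tfrac{3}{10}\,I(3,0) = 0.3 \cdot 0 + 0.4 \cdot 1 + 0.3 \cdot 0 = 0.4
```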

After finding the entropy we can easily calculate the gain.

Gain (Age):

So the gain of Age is 1 - 0.4 = 0.6. Similarly, we can calculate the gains of the remaining attributes.

Entropy (Competition)

The Competition attribute has two categories, yes and no, and we find the information of each, as shown in the table above.

So Entropy of competition:

After finding the information of each (yes, no), put the values into the entropy equation below to find the entropy of Competition.

So the entropy of Competition is approximately 0.8755.

After finding the entropy we can easily calculate the gain.

Gain (competition):

So the gain of Competition is 0.1245.
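As a consistency check on these numbers (the raw table is not reproduced here, so this only verifies that the entropy and gain agree with each other):

```latex
\mathrm{Gain}(\mathrm{Competition}) = I(5, 5) - E(\mathrm{Competition}) = 1 - 0.8755 \approx 0.1245
```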

Entropy (Type)

The Type attribute has two categories, software and hardware, and we find the information of each, as shown in the table above.

So Entropy of type

After finding the information of each (hardware, software), put the values into the entropy equation below to find the entropy of Type, which comes out to 1.

Gain (type):

So the gain of Type is 1 - 1 = 0.

Now we have calculated the gain of every attribute, and we use this information to build the tree.

Gain of (age, competition, type)

Gain of age = 0.6

Gain of competition = 0.12

Gain of type = 0

The next step is to check which attribute has the maximum gain and make it the root node. In our case Age has the maximum gain, 0.6, compared to the others, so Age becomes the root node.
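As an illustration, here is a small Python sketch that computes information, entropy, and gain for a dataset of this shape. The table itself is not reproduced in the article, so the rows below are a hypothetical reconstruction that merely matches the counts the text describes (5 up / 5 down; old always down; new always up; mid split 2/2), and the function names are mine.

```python
import math
from collections import Counter

def information(labels):
    """I(p, n): entropy of the class distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def entropy_of_attribute(rows, attr, target):
    """E(A): information of each value of `attr`, weighted by its frequency."""
    total = len(rows)
    e = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        e += (len(subset) / total) * information(subset)
    return e

def gain(rows, attr, target):
    """Gain(A) = I(p, n) - E(A)."""
    return information([r[target] for r in rows]) - entropy_of_attribute(rows, attr, target)

# Hypothetical rows, consistent with the counts described in the article.
rows = [
    {"Age": "old", "Competition": "yes", "Type": "software", "Profit": "down"},
    {"Age": "old", "Competition": "no",  "Type": "software", "Profit": "down"},
    {"Age": "old", "Competition": "no",  "Type": "hardware", "Profit": "down"},
    {"Age": "mid", "Competition": "yes", "Type": "software", "Profit": "down"},
    {"Age": "mid", "Competition": "yes", "Type": "hardware", "Profit": "down"},
    {"Age": "mid", "Competition": "no",  "Type": "hardware", "Profit": "up"},
    {"Age": "mid", "Competition": "no",  "Type": "software", "Profit": "up"},
    {"Age": "new", "Competition": "yes", "Type": "software", "Profit": "up"},
    {"Age": "new", "Competition": "no",  "Type": "hardware", "Profit": "up"},
    {"Age": "new", "Competition": "no",  "Type": "software", "Profit": "up"},
]

for attr in ("Age", "Competition", "Type"):
    print(attr, round(gain(rows, attr, "Profit"), 4))
# With this reconstruction: Age 0.6, Competition 0.1245, Type 0.0
```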

Age has three values: old, mid, and new.

For old age, the target attribute always goes down.

So we make a leaf node for old, which is down, on the basis of the information in our dataset.

For new age, the target attribute always goes up.

So we make a leaf node for new, which is up, on the basis of the information in our dataset.

So we have made leaf nodes for old and new; now what about mid?

Mid age has both cases, some records going up and some going down:

So which attribute becomes the child node of mid? To solve this, we take only the mid rows of the dataset, calculate the gain of Competition and Type on that subset, and make the winner the child node of mid.

To calculate the gain and entropy, we follow the same procedure as above.

p: up = 2

n: down = 2

Information Gain

So I calculate the information gain of this subset, which is equal to 1.

After finding the information gain, we calculate the entropy of each attribute.

Entropy (Competition)

In the Competition attribute we have two categories, yes and no, and we find the information of each, as shown in the table above.

So Entropy of competition

After finding the information of each (yes, no), put the values into the entropy equation below to find the entropy of Competition.

So the entropy of Competition is 0.

After finding the entropy we can easily calculate the gain.

Gain (competition):

So the gain of Competition is 1 - 0 = 1.

Entropy (Type):

In the Type attribute we have two categories, software and hardware, and we find the information of each, as shown in the table above.

So Entropy of type

After finding the information of each (hardware, software), put the values into the entropy equation below to find the entropy of Type, which is 1.

Gain (type):

So the gain of Type is 0.

Now, to decide which attribute becomes the child node of mid, check the gains of Competition and Type:

Gain of competition = 1

Gain of type = 0

The gain of Competition is greater than that of Type, so Competition becomes the child node of mid.

So we make leaf nodes under Competition: if Competition is yes, the profit goes down, and if Competition is no, the profit goes up.

So our tree is complete; we built it using the gain of each attribute.
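Written out as nested rules, the finished tree the article describes looks like this (a plain-text sketch, not a figure from the original):

```
Age?
|-- old -> Profit: down
|-- new -> Profit: up
`-- mid -> Competition?
            |-- yes -> Profit: down
            `-- no  -> Profit: up
```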

According to the tree, the company makes a profit in two situations:

1st: Profit goes up if Age is new.

2nd: Profit goes up if Age is mid and there is no competition.

So we use this important information to help the company make a profit.

3.2.2: For Regression:

Dataset:

A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast, and Rainy), each representing a value of the attribute being tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

Standard Deviation

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). We use the standard deviation to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.

a) Standard deviation for one attribute:

· Standard Deviation (S) is used for tree building (branching).

· Coefficient of Variation (CV) is used to decide when to stop branching. We can use the count (n) as well.

· Average (Avg) is the value placed in the leaf nodes.
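The usual formulas for these quantities (standard definitions, written out here because the article's formula images are not reproduced):

```latex
S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}, \qquad
CV = \frac{S}{\bar{x}} \times 100\%, \qquad
\mathrm{Avg} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
```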

b) Standard deviation for two attributes (target and predictor):
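When the sample is split by a predictor attribute X, the standard deviation of the target T is computed per branch and weighted by each branch's probability (again the standard formulation, not the article's image):

```latex
S(T, X) = \sum_{c \in X} P(c)\, S(c)
```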

Standard Deviation Reduction

The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).

Step 1: The standard deviation of the target is calculated.

Standard deviation (Hours Played) = 9.32

Step 2: The dataset is then split on the different attributes. The standard deviation for each branch is calculated. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction.
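In symbols (a standard formulation):

```latex
SDR(T, X) = S(T) - S(T, X)
```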

Similarly, we find the standard deviation for Temp, Humidity, and Windy, as shown in the table below:

Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.

Step 4a: The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until all data is processed.

In practice, we need some termination criteria: for example, when the coefficient of variation (CV) of a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch (e.g., 3).

Step 4b: “Overcast” subset does not need any further splitting because its CV (8%) is less than the threshold (10%). The related leaf node gets the average of the “Overcast” subset.

Step 4c: However, the “Sunny” branch has a CV (28%) greater than the threshold (10%), so it needs further splitting. We select “Windy” as the best node after “Outlook” because it has the largest SDR.

Because the number of data points in both branches (FALSE and TRUE) is equal to or less than 3, we stop further branching and assign the average of each branch to the related leaf node.

Step 4d: Moreover, the “Rainy” branch has a CV (22%), which is more than the threshold (10%), so this branch also needs further splitting. We select “Temp” as the best node here because it has the largest SDR.

Because the number of data points in all three branches (Cool, Hot, and Mild) is equal to or less than 3, we stop further branching and assign the average of each branch to the related leaf node.

When more than one instance remains at a leaf node, we calculate the average as the final value for the target.
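A compact Python sketch of this procedure; the thresholds follow the text (CV below 10%, 3 or fewer instances), while the helper names and data layout are mine, not code from the article.

```python
import statistics

CV_THRESHOLD = 10.0   # stop when the coefficient of variation falls below 10%
MIN_INSTANCES = 3     # stop when a branch has 3 or fewer instances

def std(values):
    # population standard deviation, as in the worked example
    return statistics.pstdev(values)

def cv(values):
    # coefficient of variation, in percent
    return 100.0 * std(values) / statistics.mean(values)

def sdr(rows, attr, target):
    """Standard deviation reduction of `target` when splitting on `attr`."""
    total = len(rows)
    before = std([r[target] for r in rows])
    after = 0.0
    for value in {r[attr] for r in rows}:
        branch = [r[target] for r in rows if r[attr] == value]
        after += (len(branch) / total) * std(branch)
    return before - after

def build(rows, attrs, target):
    """Recursively build the regression tree; leaves hold the branch average."""
    values = [r[target] for r in rows]
    if len(rows) <= MIN_INSTANCES or cv(values) < CV_THRESHOLD or not attrs:
        return round(statistics.mean(values), 1)            # leaf node
    best = max(attrs, key=lambda a: sdr(rows, a, target))    # largest SDR wins
    node = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node[best][value] = build(subset, [a for a in attrs if a != best], target)
    return node
```

Called as build(rows, ["Outlook", "Temp", "Humidity", "Windy"], "Hours Played") on a golf-style dataset like the one the article uses, this follows the splitting procedure described in steps 1 through 4.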

3.3: Practical Implementation in scikit-learn

3.3.1: Classification approach:

Dataset Description:

This dataset is about social network ads. It has 400 instances and 5 attributes: User_ID, Gender, Age, Estimate_Salary, and finally Purchased, which is the target attribute. On the basis of age and salary, a user either purchased via the social network ad or did not, and our challenge is to predict this from the data. To do this we use a decision tree classifier model and predict the target attribute; after prediction we build a confusion matrix, compare the original and predicted results, and then calculate the accuracy.

Part 1: Data Preprocessing

In this first part we work on data preprocessing.

First of all, we import the libraries: numpy for multidimensional arrays, pandas for importing the dataset, and matplotlib for visualizing the results.

After importing the libraries, we use pandas to import the dataset, which is in CSV format, and then define the predictor and predicted (target) attributes using iloc.

In the next lines of code we split our dataset into training and test sets; to do this we import train_test_split from sklearn.model_selection. Here 25% of the whole dataset is used for testing and 75% for training.

The Salary and Age attributes have values on very different scales, some very large and some very small, which can lead the model to underweight one of them. To avoid this problem we apply feature scaling so that both features are standardized.
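A minimal sketch of this preprocessing, assuming the file is named Social_Network_Ads.csv and that Age and Estimate_Salary sit in columns 2 and 3 with Purchased in the last column; the article's own code is shown as screenshots, so these names and positions are my assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Import the dataset (assumed file name)
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values   # Age, Estimate_Salary (assumed positions)
y = dataset.iloc[:, -1].values       # Purchased (target)

# 75% train / 25% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature scaling so Age and Salary are on comparable scales
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```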

Part 2: Building the decision tree classifier model:

In this part we build the decision tree classifier model. To do this, first we import the model from scikit-learn.

In this step we initialize our model. Note that I passed the criterion parameter; its purpose is to choose the function that measures the quality of a split. The supported criteria are “gini” for Gini impurity and “entropy” for information gain.

After initializing our decision tree classifier model, we train it using the fit method.
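Continuing the sketch, the model is built and trained like this; criterion='entropy' matches the information-gain discussion above, and random_state is my addition for reproducibility.

```python
from sklearn.tree import DecisionTreeClassifier

# Initialize the classifier with the entropy criterion (information gain)
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)

# Train the model on the training set
classifier.fit(X_train, y_train)
```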

Part 3: Making the Prediction and Visualizing the result:

In the final step we make the predictions and visualize the results.

For prediction we use the predict method and predict the target values for the test set.

After making the predictions, we build a confusion matrix comparing the predicted results with the original results.

From the confusion matrix we can calculate the accuracy: (62 + 29) / (62 + 29 + 6 + 3) = 0.91.

In the visualization step we plot the test set results; the plot makes it easy to check the predictions and understand the confusion matrix better.
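The last part of the sketch: predict on the test set, build the confusion matrix, compute the accuracy, and plot the test points. The article's screenshots show a decision-boundary plot; the scatter plot below is a simpler stand-in.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score

# Predict the test set results
y_pred = classifier.predict(X_test)

# Confusion matrix: predicted vs. original results
cm = confusion_matrix(y_test, y_pred)
print(cm)                               # in the article's run: 62 and 29 correct, 6 and 3 wrong
print(accuracy_score(y_test, y_pred))   # (62 + 29) / (62 + 29 + 6 + 3) = 0.91

# Visualize the test set, coloring each point by its true class
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='coolwarm')
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.title('Decision Tree Classification (test set)')
plt.show()
```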

If you want the dataset and code, you can also check my GitHub profile; click on the link.

3.3.2: Regression approach

Dataset Description:

This dataset contains information about position salaries. As the sample below shows, it has three attributes: position, level of work, and salary according to level. The dataset has 10 instances. This is a regression problem, and we use the decision tree regression model to predict the salary on the basis of level and position.

Part 1: Data Preprocessing:

In this first part we work on data preprocessing.

First of all, we import the libraries: numpy for multidimensional arrays, pandas for importing the dataset, and matplotlib for visualizing the results.

After importing the libraries, we use pandas to import the dataset, which is in CSV format, and then define the predictor and target attributes: X is Level and y is Salary, according to our dataset.

Note: This dataset is very small (only 10 instances), so we do not split it into training and test sets.
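A minimal sketch of this part, assuming the file is named Position_Salaries.csv with Level in the second column and Salary in the third; as before, the file and column layout are my assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the dataset (assumed file name)
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values   # Level, kept as a 2-D array
y = dataset.iloc[:, 2].values     # Salary

# Only 10 instances, so no train/test split here
```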

Part 2: Building the decision tree Regression model:

In this part we build the decision tree regression model. To do this, first we import the model from scikit-learn.

In this step we initialize our decision tree regressor model.

After initializing the regressor model, we fit the dataset to it.
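Continuing the sketch (random_state is again my addition for reproducibility):

```python
from sklearn.tree import DecisionTreeRegressor

# Initialize the decision tree regressor
regressor = DecisionTreeRegressor(random_state=0)

# Fit the dataset to the model
regressor.fit(X, y)
```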

Part 3: Making the Prediction and Visualizing the result:

In the final step we make the prediction and visualize the result. Note that our target is a real value, which makes this a regression problem, so in the predict method we enter the value 6.5 (a position level) and our model gives a prediction of 150,000.

In the visualization step we plot position level against salary together with the predicted output: the data points (in red) are plotted on the basis of level and salary, and the predicted salary is shown as the blue curve.
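And the prediction and plot, again as a sketch; the fine grid is the usual way to draw the stepwise curve a regression tree produces.

```python
# Predict the salary for position level 6.5
print(regressor.predict([[6.5]]))

# Visualize: red points for the data, blue stepwise curve for the model's predictions
X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)
plt.scatter(X, y, color='red')
plt.plot(X_grid, regressor.predict(X_grid), color='blue')
plt.title('Decision Tree Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
```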

If you want the dataset and code, you can also check my GitHub profile; click on the link.

End Notes:

If you liked this article, be sure to click ❤ below to recommend it and if you have any questions, leave a comment and I will do my best to answer.

For being more aware of the world of machine learning, follow me. It’s the best way to find out when I write more articles like this.

You can also follow me on GitHub for the code and dataset, follow me on Twitter, email me directly, or find me on LinkedIn. I'd love to hear from you.

That’s all folks, Have a nice day :)


Written by

Amir Ali

Co-Founder & Chief AI Research Scientist at Wavy AI Research Foundation

Wavy AI Research Foundation

This publication is completely about Machine Learning and Deep Learning articles from beginner to advanced.