Decision Tree Construction in ML

Yamini
Published in Geek Culture
8 min read · Apr 12, 2021

In our daily lives we make countless decisions, from the start of our day to the end of it, and many times we use decision trees without realizing it. Most of us have studied decision trees at school or college, so the basic idea is familiar: a tree-like structure that starts at a root node and ends at leaf nodes. Many other ML algorithms are built on top of decision trees.

Before getting into this, if you want to know about the Data Science methodology of problem-solving, visit Introduction to Data Science. Now, let's dive into the topic.

What makes a decision tree such an important and useful algorithm? The way it makes its decisions. Let's study it more closely and see how it arrives at the right prediction for regression or classification data.

A decision tree is one of the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Decision tree and its terms

Decision trees are also used by companies before launching a product, when maintaining standards, and when deciding whether to update or discontinue something. They help save time and money and support profitable decisions, much like break-even analysis, without running into major losses. If you want to know how companies use them, read the article on the Harvard Business Review page.

How does it make accurate predictions? How is each split made, and which feature ends up at the top? What is happening internally? Okay, let's find out how it does all of this, much the way a human does. Yes, people prioritize things so that their decision-making becomes clear. When you have to choose between attending a meeting, going to a function, or staying where you are, which one do you pick? Whatever matters most at that moment, based on your interests and the importance of the situation. That's it. The same idea applies here, so let's dig in, learn some terminology, and build some understanding in ML terms.

The first node (the root node) splits the data into different decision points; if a condition is satisfied, the data may be split further on another feature, or the required decision or interpretation is made. The same happens at the other decision nodes. The end nodes of a parent (decision) node are called terminal nodes or leaf nodes.

There are a few important metrics by which the splitting of a node happens. They are:

  1. Entropy: It is used to measure the impurity and randomness in the dataset. Suppose a ball is picked from a bunch of balls that are all the same color (say green), or a pearl is picked from a basket of pearls. In these examples you get the same outcome no matter how many times you pick, so the entropy is zero. When the balls come in many different colors, the probability of picking any particular ball drops below 1. If there are 20% yellow, 10% white, 30% blue, and 40% green balls, then the probability of picking a green ball is only 0.4. As the number of choices increases, the entropy gets higher, which means the probability of getting a particular ball goes down. Mathematically, Shannon's entropy is calculated using the formula below:
Entropy-The impurity measure
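To make this concrete, here is a minimal Python sketch of Shannon's entropy, H = -sum(p_i * log2(p_i)); the `entropy` helper and the example class proportions are my own illustrative choices, not code from the original post:

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy: -sum(p * log2(p)) over the non-zero class probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# A basket of only green balls: a single class, so zero impurity.
print(entropy([1.0]))                    # -0.0, i.e. zero entropy

# The mixed basket from the text: 40% green, 30% blue, 20% yellow, 10% white.
print(entropy([0.4, 0.3, 0.2, 0.1]))     # ~1.85 bits, i.e. high impurity
```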

After calculating entropy, information gain is calculated for each candidate feature, to know how much information that feature gives us about the target variable so that it can be predicted correctly. The feature with the highest information gain becomes the root node, the split is made there, and the data is divided further so that prediction accuracy increases. Information gain is computed as the difference between the entropy before the split and the weighted average entropy after splitting the dataset on the given attribute values. In other words, it is the total entropy of the dataset minus the entropy that remains after splitting on that feature.

Information gain with example
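As a rough sketch (the `information_gain` helper and the toy labels below are assumptions of mine, not the post's own example), information gain is the parent entropy minus the weighted average entropy of the children:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def information_gain(parent_labels, child_splits):
    """Entropy before the split minus the weighted average entropy after it."""
    total = len(parent_labels)
    weighted_child_entropy = sum(len(child) / total * entropy(child)
                                 for child in child_splits)
    return entropy(parent_labels) - weighted_child_entropy

# Toy example: a split that separates the classes fairly well.
parent = ['yes', 'yes', 'yes', 'no', 'no', 'no']
split  = [['yes', 'yes', 'yes', 'no'], ['no', 'no']]
print(information_gain(parent, split))   # ~0.46 bits of information gained
```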

Information gain is biased toward attributes with many outcomes; it prefers attributes with a large number of distinct values. For instance, consider an attribute with a unique identifier such as customer_ID: splitting on it creates one pure partition per record, so the information needed after the split, Info(D), is zero. This maximizes the information gain but creates useless partitioning.

C4.5, an improvement of ID3, uses an extension to information gain known as the gain ratio. Gain ratio handles the issue of bias by normalizing the information gain using Split Info.

Gain ratio calculation
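Continuing the same toy split, here is a hedged sketch of the gain ratio; `split_info` and `gain_ratio` are hypothetical helper names I chose for illustration:

```python
from math import log2

def split_info(child_splits, total):
    """Split information: the entropy of the partition sizes themselves."""
    return -sum((len(c) / total) * log2(len(c) / total)
                for c in child_splits if len(c) > 0)

def gain_ratio(info_gain, child_splits, total):
    """C4.5-style gain ratio: information gain normalized by split info."""
    si = split_info(child_splits, total)
    return info_gain / si if si > 0 else 0.0

# Reusing the toy split above: ~0.46 bits of gain over a 4-vs-2 partition.
print(gain_ratio(0.46, [['yes', 'yes', 'yes', 'no'], ['no', 'no']], 6))   # ~0.50
```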

2. Gini: Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. The Gini index is another splitting criterion, an alternative to entropy and information gain. It measures the impurity of a node and, in CART, is used to make binary splits.

Gini
Example for Gini
Gini impurity
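For illustration only, a minimal `gini_impurity` helper (my own naming, not the post's code) applied to the earlier ball examples:

```python
def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions in a node."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini_impurity(['green'] * 10))             # 0.0  -> a pure node
print(gini_impurity(['green', 'blue'] * 5))      # 0.5  -> maximally impure for 2 classes
print(gini_impurity(['yes'] * 4 + ['no'] * 2))   # ~0.444
```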

In the case of a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as the splitting subset. In the case of continuous-valued attributes, the strategy is to consider the midpoint of each pair of adjacent values as a possible split point, and the point with the smaller Gini index is chosen as the splitting point.
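The continuous-attribute strategy can be sketched roughly as follows; the helpers (`weighted_gini`, `best_split_point`) and the toy age/purchase data are assumptions of mine, and real implementations are far more optimized:

```python
def gini_impurity(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def weighted_gini(values, labels, threshold):
    """Weighted Gini impurity of the two groups produced by splitting at a threshold."""
    left  = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    total = len(labels)
    return sum(len(group) / total * gini_impurity(group)
               for group in (left, right) if group)

def best_split_point(values, labels):
    """Try the midpoint between every pair of adjacent sorted values."""
    ordered = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    return min(midpoints, key=lambda t: weighted_gini(values, labels, t))

ages = [22, 25, 30, 35, 40, 52]
buys = ['no', 'no', 'no', 'yes', 'yes', 'yes']
print(best_split_point(ages, buys))   # 32.5: a perfect separation in this toy data
```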

As an aside, in economics the Gini index is the Gini coefficient expressed as a percentage, i.e. the Gini coefficient multiplied by 100 (the Gini coefficient is equal to half of the relative mean difference), and it is often used to measure income inequality. It shares a name with, but is distinct from, the Gini impurity used in decision trees.

Range of gini and entropy

Note that we introduced a scaled version of the entropy (entropy/2) to emphasize that the Gini index is an intermediate measure between entropy and the classification error; a quick numeric check of this ordering is sketched after the two points below.

  • The Gini criterion is faster to compute because it avoids the logarithm in the entropy calculation.
  • The results obtained using the entropy criterion are sometimes slightly better.
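Here is that quick numeric check, at an arbitrary class proportion p = 0.25 chosen by me just for illustration:

```python
from math import log2

p = 0.25                                  # share of the positive class in a node
entropy     = -(p * log2(p) + (1 - p) * log2(1 - p))
gini        = 2 * p * (1 - p)             # equals 1 - p**2 - (1 - p)**2
class_error = 1 - max(p, 1 - p)

# The Gini index sits between the classification error and the scaled entropy.
print(class_error, gini, entropy / 2)     # 0.25 < 0.375 < ~0.406
```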

Let's look at some of the main decision tree algorithms, and then at how to use one in Python.

1. Iterative Dichotomiser 3 (ID3): This algorithm selects the splitting attribute by calculating information gain. Information gain is calculated recursively for each level of the tree.

2. C4.5: This algorithm is a modification of the ID3 algorithm. It uses the gain ratio (an extension of information gain) for selecting the best attribute, and it can handle both continuous attributes and missing attribute values.

3. CART (Classification and Regression Trees): This algorithm can produce classification as well as regression trees. In a classification tree the target variable is categorical, while in a regression tree the target variable is continuous and its value is predicted.

  • The decision tree implemented in Python through the sklearn library uses CART. In CART there are two criteria through which we get the information for the split: Gini and entropy. To use other decision tree algorithms you have to go to other libraries, such as chefboost; you can refer to this GitHub repository for more details on implementing them: https://github.com/serengil/chefboost. A minimal sklearn usage sketch follows below.
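Here is that minimal sklearn sketch; the Iris dataset, split ratio, and parameter values are illustrative choices of mine, not recommendations from the original post:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion can be "gini" (the default) or "entropy"; sklearn's tree is CART-based.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))   # mean accuracy on the held-out split
```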

There are pros and cons to this algorithm too:

Pros:

  1. It forces the algorithm to take into consideration all the possible outcomes of a decision and traces each path to a conclusion.
  2. Decision trees are easy to interpret and visualize.
  3. They can easily capture non-linear patterns.
  4. A lot of business problems can be solved using decision trees. They find applications in engineering, management, medicine, etc.; basically, any situation where data is available and a decision needs to be taken under uncertain conditions.
  5. They require little data preprocessing, and no normalization or standardization of the variables is needed. Decision trees are fairly resilient and can handle a reasonable amount of data abnormalities (such as outliers, missing values, and noise) without greatly altering the results.
  6. They can be used for feature engineering, such as predicting missing values, and are suitable for variable selection (see the short sketch after this list).
  7. They can be used for both regression and classification tasks.
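As a small illustration of point 6, sklearn's trees expose impurity-based feature importances that can guide variable selection; the dataset and settings here are again just illustrative choices of mine:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Impurity-based importances: higher values suggest more useful variables.
for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```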

Cons:

  1. Decision trees are very sensitive to hyperparameter tuning; the output of a decision tree can vary drastically if the hyperparameters are poorly tuned (a small tuning sketch follows after this list).
  2. Decision trees are prone to overfitting if the breadth and depth of the tree are set too high for a simple dataset.
  3. Decision trees are also prone to underfitting: if the breadth and depth of the model or the number of nodes are set too low, the model cannot fit the data properly and fails to learn.
  4. Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
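Here is that tuning sketch for points 1 to 3; the parameter grid and dataset are arbitrary choices of mine, meant only to show the idea of searching for settings that avoid both over- and underfitting:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Search a small grid of depth/leaf-size settings with cross-validation.
param_grid = {"max_depth": [2, 3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```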

To learn more about the parameters and their practical use, visit the official sklearn documentation.

Yes, there is a lot going on inside a decision tree, yet it is still called a white-box model because of its simplicity: its decisions can be easily traced and understood, unlike those of more complex models. Neural networks, for example, are called black-box models because of their complexity and the sheer volume of calculations, which cannot be easily inspected or interpreted.
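One way to see that white-box nature in practice is that sklearn can print the learned rules as plain if/else conditions; the dataset and depth below are illustrative choices of mine:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# The fitted tree rendered as human-readable decision rules.
print(export_text(tree, feature_names=data.feature_names))
```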

Success is not final, Failure is not fatal, It is the courage to continue that counts.

Make your life a masterpiece, imagine no limitations on what you can be, have or do.

If you have learned something about decision trees and decision-making, do like this article, show some support, and share it with someone who might find it useful. If you have any questions, suggestions, or anything else, leave them in the comments and I will do my best to answer. Stay safe and bring light to the world. 🤍
