Day 1 Learning: Decision Trees (a supervised machine learning algorithm)
Machine Learning (Arthur Samuel): The field of study that gives computers the ability to learn without being explicitly programmed. It is commonly classified into
a) Supervised Learning and b) Unsupervised Learning.
Decision trees come under supervised machine learning.
They map non-linear relationships quite well. Here, we split the population into two or more homogeneous sets based on the most significant splitter, meaning the input variable that separates the target best. This algorithm works for both categorical and continuous inputs.
- Regression trees are used when the dependent variable is continuous.
- Classification trees are used when the dependent variable is categorical.
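A decision tree is built by recursive splitting: a node is split, and the same procedure is then applied to each resulting subset. Splits are often binary (as in CART), although some methods such as CHAID can produce more than two branches. The following criteria are used to choose a split.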
- Gini Index: It is calculated by subtracting the sum of the squared class probabilities within a node from one. A pure node scores 0, so lower values are better. It works well for a categorical target variable.
- Chi-Square: It is calculated as the sum of the squared standardized differences between the observed and expected frequencies of the target variable. It can perform two or more splits. CHAID stands for Chi-squared Automatic Interaction Detector. For classification problems it relies on the Chi-square test to determine the best next split at each step; for regression problems it computes F-tests.
- Information Gain: The measure of purity (a pure node contains only one class) is called information, while the measure of impurity is called entropy. Information gain is the difference in entropy before and after the split. For the root node, we pick the attribute with the highest information gain.
Entropy = -a log2(a) - b log2(b), where a is p(success) and b is p(failure).
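The entropy formula and the information gain definition above can be tried out with a few lines of Python. This is only a minimal sketch: the function names and the toy labels are made up for illustration, and base-2 logarithms are assumed.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Classes that never occur are absent from the Counter, so log2(0) is avoided.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy labels, invented for illustration only.
parent = ["yes"] * 5 + ["no"] * 5
pure_split = [["yes"] * 5, ["no"] * 5]
mixed_split = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

print(entropy(parent))                        # evenly mixed node
print(information_gain(parent, pure_split))   # split into two pure nodes
print(information_gain(parent, mixed_split))  # split that barely helps
```

The split that produces two pure nodes recovers all of the parent's entropy, while the mixed split gains very little, which is why the first would be chosen.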
ID3 (Iterative Dichotomiser 3) algorithm:
- Calculate the entropy that results from splitting the data set on each attribute.
- Split the data set into subsets using the attribute for which the resulting entropy (after splitting) is minimum or information gain is maximum.
- Make a decision tree node containing that attribute.
- Recurse on subsets using remaining attributes.
Problems with this algorithm: it is expensive to train and prone to over-fitting.
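The ID3 steps above can be sketched roughly as follows. This is a simplified illustration for categorical attributes only, not a full implementation: the function names, the nested-dict tree format and the toy data set are all assumptions made for this example.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def split_entropy(rows, attr, target):
    """Size-weighted entropy of the subsets produced by splitting on attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def id3(rows, attrs, target):
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Minimum entropy after the split is the same as maximum information gain.
    best = min(attrs, key=lambda a: split_entropy(rows, a, target))
    remaining = [a for a in attrs if a != best]
    node = {best: {}}
    # Recurse on each subset using the remaining attributes.
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        node[best][value] = id3(subset, remaining, target)
    return node

# Hypothetical toy data set, invented for illustration only.
data = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
tree = id3(data, ["outlook", "windy"], "play")
print(tree)
```

On this toy data the sketch splits on "outlook" first and only needs "windy" under the "rainy" branch. Note that the tree keeps growing until every leaf is pure, which hints at why the algorithm over-fits easily.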
- Variance Reduction: This criterion is used when the target variable is continuous (regression). It uses variance to choose the best split: the split whose child nodes have the lowest weighted variance is preferred, which corresponds to minimizing the mean squared error.
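Two of the splitting criteria listed above, the Gini index and variance reduction, are simple enough to compute by hand. The sketch below is only illustrative: the function names and sample values are made up, and population variance is assumed.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: one minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def variance(values):
    """Population variance, i.e. the mean squared error around the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    """Parent variance minus the size-weighted variance of the child nodes."""
    weighted = sum(len(c) / len(parent) * variance(c) for c in children)
    return variance(parent) - weighted

print(gini(["a", "a", "b", "b"]))  # evenly mixed two-class node
print(gini(["a", "a", "a", "a"]))  # pure node

# Toy regression targets, invented for illustration only.
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good = [targets[:3], targets[3:]]  # separates the low values from the high ones
poor = [targets[:1], targets[1:]]  # cuts off a single point
print(variance_reduction(targets, good))
print(variance_reduction(targets, poor))
```

The split that separates the low targets from the high ones leaves much less variance in the child nodes, so it would be preferred over the split that isolates a single point.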
Over-fitting: Low bias & High variance
Under-fitting: High bias & Low variance.
Addressing over-fitting:
Pruning: This method is used to reduce misclassification errors on unseen data. Once the tree is fully grown, the nodes which give negative returns (removing them does not hurt, or even improves, accuracy on held-out data) are removed from the tree.
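One common way to do this is reduced-error pruning, sketched below. This is just one pruning variant, picked here for illustration: the tree format (nested dicts with "attr", "children" and "majority" keys), the function names and the validation rows are all assumptions made for this example.

```python
def predict(node, row):
    """Walk the tree until a leaf (a plain label string) is reached."""
    while isinstance(node, dict):
        # Unseen attribute values fall back to the node's majority class.
        node = node["children"].get(row[node["attr"]], node["majority"])
    return node

def errors(node, rows, target):
    """Number of rows the (sub)tree misclassifies."""
    return sum(predict(node, row) != row[target] for row in rows)

def prune(node, rows, target):
    """Reduced-error pruning using the validation rows that reach this node.

    Note: internal nodes are modified in place.
    """
    if not isinstance(node, dict):
        return node
    # Prune the children first (bottom-up), each on its own share of the rows.
    for value, child in list(node["children"].items()):
        subset = [r for r in rows if r[node["attr"]] == value]
        node["children"][value] = prune(child, subset, target)
    # Collapse to a leaf when that does not increase validation errors.
    if errors(node["majority"], rows, target) <= errors(node, rows, target):
        return node["majority"]
    return node

# Hypothetical over-grown tree and held-out rows, invented for illustration only.
tree = {
    "attr": "outlook", "majority": "yes",
    "children": {
        "sunny": "no",
        "rainy": {
            "attr": "windy", "majority": "yes",
            "children": {"no": "yes", "yes": "no"},
        },
    },
}
validation = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
pruned = prune(tree, validation, "play")
print(pruned)
```

Here the "windy" split under "rainy" makes a mistake on the held-out rows, so it is replaced by a leaf, while the "outlook" split is kept because removing it would add an error.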
Linear regression vs decision tree:
If the relationship between the dependent and independent variables is linear in nature, then linear regression generally outperforms the tree.
If the relationship between the dependent and independent variables is non-linear and complex in nature, then a tree model generally outperforms the regression method.
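A tiny experiment makes this comparison concrete. The sketch below fits a one-feature least-squares line and a depth-1 regression tree (a "stump", used here only to keep the code short) to two noiseless toy data sets and compares their training error. All function names and data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_stump(xs, ys):
    """Depth-1 regression tree: choose the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def mse(model, xs, ys):
    """Mean squared error of a predict function on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [float(i) for i in range(10)]
linear_ys = [2.0 * x + 1.0 for x in xs]         # straight-line relationship
step_ys = [0.0 if x < 5 else 10.0 for x in xs]  # abrupt jump, not linear

for name, ys in [("linear", linear_ys), ("step", step_ys)]:
    print(name, mse(fit_line(xs, ys), xs, ys), mse(fit_stump(xs, ys), xs, ys))
```

On the straight-line data the regression fits essentially perfectly and the stump does not; on the step data it is the other way round. A deeper tree would approximate the line better, but it would need many splits to do what the regression does with two parameters.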
Ensemble methods will be explained in the next post.