Machine Learning — 5 (Decision Trees & Ensemble)

Gaurav Madan
4 min read · Dec 17, 2019


Decision Trees are supervised learning algorithms, most commonly used to solve classification problems (CART, discussed below, also supports regression).

A decision tree branches the data in such a way that the value of Y is separated as much as possible between the leaf nodes (maximum dispersion between nodes, maximum purity within each node). It does this iteratively, based on a splitting criterion, until a stopping criterion is met.

The output is useful because we can then choose or discard specific segments of the population to get the desired outcome (value of Y).

We generally stick to a binary partition tree for the sake of interpretability, although technically a node can be split into more than 2 child nodes.

Stopping Criteria

  1. Pure Node — a node where all observations have the same value of Y. Further splitting is stopped because it cannot create any additional separation between the values of Y at the leaf nodes.
  2. Minimum Sample — if the number of observations at a node is ≤ the minimum sample size (often around 100), further splitting of the node is stopped. Insights from such small nodes would be statistically insignificant; they could be a random phenomenon that does not generalize.
  3. Depth of the tree — if the desired (user-specified) depth of the tree is reached, further splitting of the node is stopped (see how these criteria map to hyperparameters in the sketch after this list)
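
These stopping criteria map onto hyperparameters in common implementations. For example, a minimal sketch with scikit-learn's DecisionTreeClassifier, assuming hypothetical train_X/train_Y development data:

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    min_samples_split=100,  # minimum sample: do not split nodes with fewer observations than this
    max_depth=5,            # depth of the tree: stop once the desired depth is reached
)                           # pure nodes are never split further, regardless of these settings
dt.fit(train_X, train_Y)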

Types of Decision Trees

  • CHAID — Chi-square Automatic Interaction Detection | the oldest decision tree methodology
  • CART — Classification and Regression Trees

CHAID Algorithm

High Level Steps

  1. It will consider one X variable at a time
  2. X Variable can be Categorical or Continuous
  3. Consider all possible splits
  4. Repeat the process for all X variables
  5. Consider the variable and the split that have the minimum p-value

2.1) For a continuous variable: it splits by percentiles or deciles

For each decile cut point, it calculates a p-value (from a chi-square test of independence between the split and Y). The decile with the lowest p-value is the optimum split point.

Repeat this for each X variable, e.g.:

X1 (age): 0.01

X2 (years of experience): 0.03

X3 (salary): 0.008

Compare the best p-values across variables: the X variable with the overall lowest p-value, together with its corresponding decile, is used to split the node (a sketch of this search follows).
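
A minimal sketch of how this decile search could look, using a hypothetical helper best_decile_split, with x a pandas Series holding one continuous predictor and y the target:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def best_decile_split(x, y):
    """Try each decile of x as a cut point; return the cut with the lowest chi-square p-value."""
    best_cut, best_p = None, 1.0
    for q in np.arange(0.1, 1.0, 0.1):          # candidate cut points: the deciles of x
        cut = x.quantile(q)
        table = pd.crosstab(x <= cut, y)        # 2 x (number of classes) contingency table
        if table.shape[0] < 2:                  # degenerate cut: all observations on one side
            continue
        p_value = chi2_contingency(table)[1]    # p-value of the chi-square test of independence
        if p_value < best_p:
            best_cut, best_p = cut, p_value
    return best_cut, best_p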

2.2) For a categorical variable: it splits by grouping categories

E.g. If there are 3 categories: A, B, C

Create binary segments

  • A, (B,C)
  • (A,B), C
  • (A,C), B

Whichever grouping gives the lowest p-value is used to create the split (see the sketch below).
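
Similarly, a minimal sketch for the categorical case, enumerating every two-way grouping of the categories (hypothetical helper best_binary_partition; x is a pandas Series of categories, y the target):

from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def best_binary_partition(x, y):
    """Try every two-way grouping of x's categories; return the grouping with the lowest p-value."""
    cats = list(x.unique())
    best_group, best_p = None, 1.0
    for r in range(1, len(cats) // 2 + 1):           # size of the left-hand group
        for left in combinations(cats, r):           # e.g. for A, B, C: (A), (B), (C) vs. the rest
            table = pd.crosstab(x.isin(list(left)), y)
            p_value = chi2_contingency(table)[1]
            if p_value < best_p:
                best_group, best_p = set(left), p_value
    return best_group, best_p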

CART Algorithm

There are 2 other measures of dispersion:

  • Gini impurity (choose the split with the highest Gini gain, i.e. the largest reduction in impurity)
  • Entropy (choose the split with the highest information gain)

The algorithm remains the same as CHAID, but we use a different measure

  1. It will consider one X variable at a time
  2. X Variable can be Categorical or Continuous
  3. Consider all possible splits
  4. Repeat the process for all X variables
  5. Consider the variable and the split that have the maximum Gini gain / highest information gain

Instead of calculating a p-value, we calculate the Gini gain / information gain of every candidate split; the split with the highest value is the optimal split (a sketch of the Gini calculation follows).
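
For reference, a small sketch of how Gini impurity and the Gini gain of a candidate split could be computed (hypothetical helpers; the y arguments are NumPy arrays of class labels):

import numpy as np

def gini(y):
    """Gini impurity of a node: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(y_parent, y_left, y_right):
    """Reduction in Gini impurity achieved by splitting the parent node into left/right children."""
    n = len(y_parent)
    weighted_child = (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)
    return gini(y_parent) - weighted_child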

Ensemble Models

There are 3 key models in this family. They are primarily used for classification, although each has variants that can be used for regression.

  • Bagging Models
  • Random Forest
  • Boosting

Bagging Models (Bootstrap Aggregating)

Key Steps

  1. Draw k bootstrap samples (simple random sampling with replacement) from the development data
  2. Build a decision tree on each sample
  3. Final output = average of the outputs from all the trees (for classification, the majority vote / averaged probability)

To use it for regression, each tree predicts the average value of Y of the development observations in its leaf node, and these predictions are averaged across the trees (see the sketch below).
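
A minimal sketch of the bagging idea, assuming train_X/train_Y/test_X are NumPy arrays (the same hypothetical data used elsewhere in this article), with a regression tree so the averaging of Y is explicit:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = len(train_X)
trees = []
for _ in range(100):                         # k = 100 bootstrap samples
    idx = rng.integers(0, n, n)              # simple random sampling with replacement
    tree = DecisionTreeRegressor()
    tree.fit(train_X[idx], train_Y[idx])     # one tree per bootstrap sample
    trees.append(tree)

# Final output = average of the individual trees' predictions
avg_prediction = np.mean([t.predict(test_X) for t in trees], axis=0)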

Random Forest

Key Steps

  1. Draw k random bootstrap samples from the data
  2. Build a fully grown decision tree on each sample (no max depth; splitting stops only at a pure node or when the minimum sample size criterion is met)
  • At each split, only a random fraction of the variables is considered.
  • Trying, say, only 10% of 100 variables at each split makes the processing faster.
  • Selecting variables at random also prevents the forest from over-relying on (overfitting to) any single strong variable.

3. The final output = average from all trees

OUT OF BAG error

The error a tree makes on observations that are present in the development data but were not drawn into that tree's bootstrap sample. The Random Forest algorithm generates this error metric automatically.

Because each tree never saw its out-of-bag observations during training, these behave like test errors and cannot be optimized away; they give an honest estimate of out-of-sample performance.
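
For example, in scikit-learn the out-of-bag estimate is available on the fitted model when oob_score=True is passed (a sketch, reusing the hypothetical train_X/train_Y data):

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, oob_score=True, max_features="sqrt")
rfc.fit(train_X, train_Y)      # max_features="sqrt": only a random subset of variables per split
print(rfc.oob_score_)          # accuracy measured only on out-of-bag observations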

Variable Importance

We cannot visualize a random forest the way we can a single tree, as it contains thousands of fully grown trees.

Random forests are black-box models, so businesses usually do not use them for predictive classification where explainability is important. Even in such cases, the forest provides a variable importance score for each variable, which can be used as a variable selection mechanism while building other, more interpretable models.

Variable Importance = the average reduction in error (impurity) achieved whenever the variable was used to split a node.

This kind of importance-based variable selection is usually applied to classification problems rather than to linear regression.
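
A short sketch of how these importances can be read from the forest fitted in the previous snippet (assuming train_X is a pandas DataFrame, so its columns supply the variable names):

import pandas as pd

importances = pd.Series(rfc.feature_importances_, index=train_X.columns)
print(importances.sort_values(ascending=False).head(10))   # the 10 most important variables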

Building Models in Python

Bagging Models

import pandas as pd
from sklearn import metrics
from sklearn.ensemble import BaggingClassifier

bc = BaggingClassifier(oob_score=True, n_estimators=100)
bc.fit(train_X, train_Y)

# Compare actual vs. predicted labels on the test set
y_pred = pd.DataFrame({"actual": test_Y, "predicted": bc.predict(test_X)})
metrics.accuracy_score(y_pred.actual, y_pred.predicted)
tree_bg = metrics.confusion_matrix(y_pred.actual, y_pred.predicted, labels=[1, 0])

Random Forest

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(oob_score=True, n_estimators=100)
rfc.fit(train_X, train_Y)
y_pred = pd.DataFrame({"actual": test_Y, "predicted": rfc.predict(test_X)})

Boosting Algorithm

Key Steps

  1. Build a simple (weak) learner, e.g. a decision stump (a tree with a depth of 1)
  2. Predict the value of Y
  3. Calculate the Error
  4. Identify which observations were incorrectly classified
  5. Give higher weight to observations that were mis-classified
  6. Go to Step 1
  7. When learners have been built for all the difficult sub-groups within the population, combine them by taking a weighted sum of all the previously prepared decision stumps (see the sketch below)
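
The reweighting scheme described above is essentially AdaBoost. Below is a minimal sketch with scikit-learn, whose AdaBoostClassifier uses a depth-1 decision tree (a decision stump) as its default weak learner; GradientBoostingClassifier, a related boosting variant, is shown alongside it (train_X/train_Y/test_X as in the earlier code blocks):

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# AdaBoost follows the reweighting scheme above; its default weak learner is a decision stump
abc = AdaBoostClassifier(n_estimators=100)
abc.fit(train_X, train_Y)
stump_vote = abc.predict(test_X)             # weighted vote of all 100 stumps

# Gradient boosting is a related variant: each new tree is fit to the errors of the current ensemble
gbc = GradientBoostingClassifier(n_estimators=100)
gbc.fit(train_X, train_Y)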
