Binary Decision Trees
Regression Trees
Introduction
Binary decision trees are a supervised
machine-learning technique that operates by subjecting attributes to a series of binary (yes/no) decisions. Each decision leads to one of two possibilities: either another decision or a prediction. An example of a trained tree will help cement the idea; you will learn how training works after understanding the result of training. Decision Trees
are used for both Regression
and Classification
machine-learning problems. Classification And Regression Tree
(CART
) analysis is an umbrella term that refers to both processes.
Read dataset.
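A minimal sketch of this step, assuming the red wine quality data is saved locally as winequality-red.csv; the file name and the ";" separator are assumptions (they match the UCI distribution of the dataset), not something stated in the original.

```python
import pandas as pd

# Load the red wine quality dataset (assumed local file; the UCI copy is ';'-separated)
df = pd.read_csv("winequality-red.csv", sep=";")

# Inspect columns, dtypes, and non-null counts (produces the summary shown below)
df.info()
```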
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
Tip: We are not going to evaluate the algorithm; we only aim to illustrate how it works. So we will use the whole dataset without splitting it into training and testing sets.
Train the model
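A sketch of the training step using scikit-learn's DecisionTreeRegressor; the shallow max_depth used here is an assumed value chosen only to keep the plotted tree readable, not a setting taken from the original.

```python
from sklearn.tree import DecisionTreeRegressor

# Features: all physicochemical columns; target: the 'quality' score
X = df.drop(columns=["quality"])
y = df["quality"]

# Fit a regression tree on the full dataset (no train/test split, per the tip above);
# max_depth=2 is an assumed value used purely to keep the diagram small
model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)
```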
Visualize the tree.
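One way to draw the trained tree, using scikit-learn's plot_tree together with matplotlib; the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the trained tree; each box is a node showing its split question,
# the MSE, the number of samples, and the predicted value
plt.figure(figsize=(12, 6))
plot_tree(model, feature_names=X.columns, filled=True)
plt.show()
```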
The figure above shows the series of decisions produced as an outcome of training on the wine quality data. The block diagram of the trained tree contains a number of boxes, which are called nodes in decision-tree parlance. There are two types of nodes: a node can either pose a yes/no question of the data
, or it can be a terminal
node that assigns a prediction
to the examples that end up in it. Terminal
nodes are often referred to as leaf nodes
. They are the nodes at the bottom of the figure that have no branches or further decision nodes
below them.
How Does a Binary Decision Tree Generate Predictions?
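To make the mechanics concrete, here is a small example reusing the model and X defined above: each row is routed through the tree's yes/no questions until it lands in a leaf, and the leaf's value is returned as the prediction.

```python
# Take the first example and route it through the tree
sample = X.iloc[[0]]

# The prediction is the mean target value of the training examples
# that ended up in the same leaf as this sample
print(model.predict(sample))
```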
How to Determine the Split Point
Using Variance Reduction (Mean Squared Error)
How is the split point determined? The process is to try every possible split point and keep the best one, as follows (see the code sketch after this list).
- As per the tree plot above, column no. 11 (index 10, "alcohol") has the most significant influence on our target. Why? This will be illustrated below.
- The value (10.525) in the root node of the graph above represents the threshold at which the minimum
Mean Squared Error (MSE)
is achieved, as follows:
1. Sort the feature values in ascending order.
2. Take the average of the first two points of the feature and define it as the candidate threshold.
3. Compute the corresponding average output values: one for the target values below the threshold and one for the target values above it. These two averages are the two predicted values.
4. Calculate the Mean Squared Error of each side with reference to its average (mean) value.
5. Repeat steps 2-4, taking the new candidate threshold as the average of the 2nd and 3rd feature points, and so on until the last two points of the feature are reached.
6. The split point is the threshold in the feature that achieves the
least MSE
for our target.
- So, divide the dataset around the
split
value as below to get the d1
and d2
datasets:
- Notice that the
max
value of X[10] "alcohol" in d1 is 10.50
and the min
value of "alcohol" in d2 is 10.55.
- Calculating the average of these two values gives
10.525
, which represents the split point of the root
node.
- Repeat for the rest of the non-terminal nodes until a leaf node is reached.
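A minimal sketch of the split-point search described above, written with NumPy and pandas; best_split is a hypothetical helper name used only for illustration, not part of any library.

```python
import numpy as np

def best_split(feature, target):
    """Return (threshold, mse) of the split that minimises MSE for one feature."""
    order = np.argsort(feature)                 # step 1: sort feature values ascending
    x, y = feature[order], target[order]
    best_thr, best_mse = None, np.inf
    for i in range(len(x) - 1):
        thr = (x[i] + x[i + 1]) / 2.0           # step 2: candidate threshold = midpoint
        left, right = y[x <= thr], y[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        # steps 3-4: predict each side by its mean and measure the squared error
        mse = (np.sum((left - left.mean()) ** 2) +
               np.sum((right - right.mean()) ** 2)) / len(y)
        if mse < best_mse:                      # step 6: keep the threshold with least MSE
            best_thr, best_mse = thr, mse
    return best_thr, best_mse

# Applied to the 'alcohol' column, this search should land at the 10.525 root split
thr, mse = best_split(df["alcohol"].to_numpy(), df["quality"].to_numpy())
print(thr, mse)
```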
How to Determine the Most Significant Feature (Attribute) to Split On?
- The algorithm repeats the above steps for each feature.
- It gets the best
Mean Squared Error
and split
point for each feature.
- The most significant feature is then the one with the lowest MSE (a sketch follows below).
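Continuing the same sketch, the feature chosen for the root is simply the one whose best split yields the lowest MSE; this reuses the hypothetical best_split helper defined above.

```python
# Evaluate every candidate feature and keep the one with the lowest MSE at its best split
results = {col: best_split(df[col].to_numpy(), df["quality"].to_numpy())
           for col in df.columns if col != "quality"}
best_feature = min(results, key=lambda col: results[col][1])
print(best_feature, results[best_feature])  # expected to be 'alcohol' with threshold ~10.525
```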
Determination of Split-Point Graphically
Now let's split the dataset with reference to the split point.
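As a small illustration, reusing df and the 10.525 threshold found above, splitting the data around the root split point gives the d1 and d2 subsets.

```python
# d1: examples answering "yes" to alcohol <= 10.525; d2: the rest
d1 = df[df["alcohol"] <= 10.525]
d2 = df[df["alcohol"] > 10.525]
print(len(d1), len(d2))
```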
Tip: When the model fits the training data perfectly, it probably means it is overfit and will not
perform well on new data.
How to Prevent Overfitting
- The simplest way is to split a node only when there is enough data to justify the split.
- Typically, the minimum number of observations required to allow a split is
20
(see the sketch below).
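In scikit-learn this rule of thumb corresponds to the min_samples_split parameter; a brief sketch, reusing X and y from above and the value 20 suggested by the tip.

```python
from sklearn.tree import DecisionTreeRegressor

# Refuse to split any node holding fewer than 20 observations,
# which limits how finely the tree can carve up the training data
pruned_model = DecisionTreeRegressor(min_samples_split=20, random_state=0)
pruned_model.fit(X, y)
```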