A binary decision tree is a supervised machine-learning technique that operates by subjecting attributes to a series of binary (yes/no) decisions. Each decision has two possible outcomes, and each outcome leads either to another decision or to a prediction. An example of a trained tree will help cement the idea; you’ll learn how training works after understanding its result. Decision trees are used for both regression and classification problems. Classification And Regression Tree (CART) analysis is an umbrella term that covers both.

Read the dataset.

Tip We are not testing the algorithm here; the aim is only to illustrate how it works. So we use the whole dataset without splitting it into training and test sets.

Train the model.

Visualize the tree.

The figure above shows the series of decisions produced by training on the wine quality data. The block diagram of the trained tree shows a number of boxes, which are called nodes in decision tree parlance. There are two types of nodes: decision nodes, which pose a yes/no question of the data, and terminal nodes, which assign a prediction to the examples that end up in them. Terminal nodes are often referred to as leaf nodes; they are the nodes at the bottom of the figure that have no branches or further decision nodes below them.

How a Binary Decision Tree Generates Predictions
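
As a minimal sketch of how a trained tree turns one row of attributes into a prediction, the code below walks from the root to a leaf, answering one yes/no question per decision node. The `Node` class, its field names, the feature indices, and the leaf values are hypothetical illustrations, not the tree trained on the wine data; only the root threshold 10.525 is taken from the text. Sending rows with a value less than or equal to the threshold to the left branch is an assumed convention.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    # Decision node: feature and threshold are set, prediction is None.
    # Leaf (terminal) node: prediction is set, the other fields stay None.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None     # taken when value <= threshold (assumed convention)
    right: Optional["Node"] = None    # taken when value > threshold
    prediction: Optional[float] = None


def predict(node: Node, row: Sequence[float]) -> float:
    """Walk from the root to a leaf, answering one yes/no question per node."""
    while node.prediction is None:
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.prediction


# Hypothetical two-level tree; the leaf values are made up for illustration.
tree = Node(
    feature=0, threshold=10.525,
    left=Node(prediction=5.4),
    right=Node(feature=1, threshold=0.6,
               left=Node(prediction=6.3), right=Node(prediction=5.8)),
)
```

With this toy tree, a row whose first attribute is below 10.525 stops at the left leaf immediately, while any other row is asked a second question before it reaches a leaf.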

How to Determine the Split Point

Using Variance Reduction (Mean Squared Error)

This raises the question of how the split point is determined. The process is to try every possible split point and keep the best one, as follows.

  • As per the tree plot above, column no. 11 has the most significant influence on the target. The reason is illustrated below.
  • The value 10.525 in the root node of the plot above is the threshold at which the minimum Mean Squared Error (MSE) is achieved. It is found as follows:
  1. Sort the feature values in descending order.
  2. Take the average of the first two points in the feature and define it as the threshold.
  3. Compute two average target values: one for the targets on one side of the threshold and one for the targets on the other side. These two averages are the two predicted values.
  4. Calculate the Mean Squared Error of the targets on each side with respect to that side’s average.
  5. Repeat steps 2–4 with a new threshold, the average of the 2nd and 3rd feature points, and so on until the last two points in the feature are reached.
  6. Select the threshold that achieves the lowest MSE for the target.
  • Dividing the dataset around the split value gives two datasets, d1 and d2:
  1. Notice that the max value of [X10] “alcohol” in d1 is 10.50 and the min value of “alcohol” in d2 is 10.55.
  2. The average of these two values is 10.525, which is the split point of the root node.
  • Repeat for the rest of the non-terminal nodes until the leaf nodes are reached.
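
The search described above can be sketched for a single feature as follows. This is an illustration under stated assumptions, not the implementation behind the tree in the figure: the function names are hypothetical, the MSE of a split is taken as the total squared error of both sides divided by the number of rows, and the values are sorted in ascending order, which yields the same candidate thresholds as descending order.

```python
from typing import List, Optional, Tuple


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def best_split(x: List[float], y: List[float]) -> Tuple[Optional[float], float]:
    """Return (threshold, mse) for the candidate split with the lowest MSE."""
    pairs = sorted(zip(x, y))                     # step 1: sort by feature value
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best_threshold, best_mse = None, float("inf")
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                              # equal values cannot be separated
        threshold = (xs[i - 1] + xs[i]) / 2       # step 2: midpoint of neighbours
        left, right = ys[:i], ys[i:]
        left_mean, right_mean = mean(left), mean(right)   # step 3: two predictions
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        mse = sse / len(ys)                       # step 4 (assumed definition)
        if mse < best_mse:                        # step 6: keep the lowest MSE
            best_threshold, best_mse = threshold, mse
    return best_threshold, best_mse


# Made-up data; only the neighbouring values 10.50 and 10.55 come from the text.
alcohol = [9.4, 9.8, 10.5, 10.55, 11.2, 12.0]
quality = [5.0, 5.0, 5.0, 7.0, 7.0, 7.0]
threshold, mse = best_split(alcohol, quality)
```

On this toy data the lowest MSE is reached at the midpoint between 10.50 and 10.55, which mirrors how the value 10.525 arises in the root node.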

How to Determine the Most Significant Feature (Attribute) to Split On

  • The algorithm repeats the steps above for each feature.
  • It records the best Mean Squared Error and the corresponding split point for each feature.
  • The most significant feature is the one with the lowest MSE.
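
A rough sketch of this feature comparison is shown below. The helper names `split_mse` and `best_feature`, the column layout (one list per feature), and the toy data are all assumptions made for illustration; the MSE of a split is again taken as the total squared error of both sides divided by the number of rows.

```python
from typing import List, Tuple


def split_mse(x: List[float], y: List[float], threshold: float) -> float:
    """MSE when each side of the threshold is predicted by its own mean."""
    sse = 0.0
    for side in ([t for v, t in zip(x, y) if v <= threshold],
                 [t for v, t in zip(x, y) if v > threshold]):
        if side:
            m = sum(side) / len(side)
            sse += sum((t - m) ** 2 for t in side)
    return sse / len(y)


def best_feature(columns: List[List[float]], y: List[float]) -> Tuple[int, float, float]:
    """Return (feature index, threshold, mse) with the lowest MSE over all features."""
    best = (-1, float("nan"), float("inf"))
    for j, x in enumerate(columns):
        values = sorted(set(x))
        # Candidate thresholds are midpoints between neighbouring distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            mse = split_mse(x, y, threshold)
            if mse < best[2]:
                best = (j, threshold, mse)
    return best


# Made-up data: feature 1 separates the targets cleanly, feature 0 does not.
columns = [[3.0, 1.0, 4.0, 2.0],
           [1.0, 2.0, 3.0, 4.0]]
target = [5.0, 5.0, 7.0, 7.0]
index, threshold, mse = best_feature(columns, target)
```

Here the second feature wins because one of its thresholds splits the targets with zero error, while every threshold on the first feature leaves some error behind.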

Determination of Split-Point Graphically

Now let’s split the dataset at the split point.
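
A minimal sketch of that split, assuming each row is a hypothetical `(alcohol, quality)` pair and that rows at or below the split point go to d1. The rows are made up; only the values 10.50, 10.55, and 10.525 come from the text.

```python
from typing import List, Tuple

Row = Tuple[float, float]  # (alcohol, quality); hypothetical row layout


def split_dataset(rows: List[Row], split_point: float) -> Tuple[List[Row], List[Row]]:
    """Return (d1, d2): rows at or below the split point, and rows above it."""
    d1 = [r for r in rows if r[0] <= split_point]
    d2 = [r for r in rows if r[0] > split_point]
    return d1, d2


# Made-up rows for illustration.
rows = [(9.4, 5.0), (10.5, 5.0), (10.55, 6.0), (9.8, 5.0), (11.2, 7.0), (12.0, 6.0)]
d1, d2 = split_dataset(rows, 10.525)

# Each side's mean target is the value predicted for rows that fall on that side.
pred_d1 = sum(q for _, q in d1) / len(d1)
pred_d2 = sum(q for _, q in d2) / len(d2)
```

The same function can then be applied to d1 and d2 in turn, each with its own split point, to grow the lower levels of the tree.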

Tip When the model fits the training data perfectly, it is probably overfit and will not perform well on new data.

How to Prevent Overfitting

  • The simplest way is to split a node only when it contains enough observations.
  • A typical minimum number of observations required to allow a split is 20.
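
The stopping rule can be sketched with a small one-feature tree builder. Everything here is a hypothetical illustration: the function names, the nested-dict representation of the tree, and the toy data are assumptions, and the only value taken from the text is the minimum of 20 observations.

```python
from typing import List, Optional

MIN_SAMPLES_SPLIT = 20  # minimum observations needed to split, from the text


def find_split(x: List[float], y: List[float]) -> Optional[float]:
    """Midpoint threshold with the lowest squared error, or None if x is constant."""
    best_t, best_sse = None, float("inf")
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        sse = 0.0
        for side in ([b for a, b in zip(x, y) if a <= t],
                     [b for a, b in zip(x, y) if a > t]):
            m = sum(side) / len(side)
            sse += sum((b - m) ** 2 for b in side)
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t


def build(x: List[float], y: List[float], min_samples_split: int = MIN_SAMPLES_SPLIT) -> dict:
    """Grow a one-feature regression tree as nested dicts."""
    # Only look for a split when the node holds enough observations.
    threshold = find_split(x, y) if len(y) >= min_samples_split else None
    if threshold is None:
        return {"prediction": sum(y) / len(y)}    # leaf node
    left = [(a, b) for a, b in zip(x, y) if a <= threshold]
    right = [(a, b) for a, b in zip(x, y) if a > threshold]
    return {
        "threshold": threshold,
        "left": build([a for a, _ in left], [b for _, b in left], min_samples_split),
        "right": build([a for a, _ in right], [b for _, b in right], min_samples_split),
    }


def count_leaves(tree: dict) -> int:
    if "prediction" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


# Made-up data: 40 observations with distinct feature values.
x = [float(i) for i in range(40)]
y = [float(i % 7) for i in range(40)]
unrestricted = build(x, y, min_samples_split=2)
restricted = build(x, y)
```

With a minimum of 2 the tree keeps splitting until every observation sits in its own leaf, which is the perfect fit the tip above warns about; with a minimum of 20 it stops earlier and ends up with fewer leaves.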

