How the C4.5 Algorithm Handles Continuous Features

Dyah Ayu Sekar Kinasih
Aug 27, 2023

--

We cannot split nodes directly on continuous features, since such features can take infinitely many values. Discretization techniques come in handy in such cases. The most straightforward discretization strategy is bi-partition, which is the one used by the C4.5 decision tree.

Given a data set D and a continuous feature a, suppose n values of a are observed in D, and we sort these values in ascending order, denoted by {a¹, a², …, aⁿ}. With a split point t, D is partitioned into the subsets Dₜ⁻ and Dₜ⁺, where Dₜ⁻ includes the samples with the value of a not greater than t, and Dₜ⁺ includes the samples with the value of a greater than t. For adjacent feature values aⁱ and aⁱ⁺¹, the partition is identical for any choice of t in the interval [aⁱ, aⁱ⁺¹). As a result, for continuous feature a, there are n − 1 elements in the following set of candidate split points:

T_a = { (aⁱ + aⁱ⁺¹) / 2 : 1 ≤ i ≤ n − 1 },

where the midpoint of the interval [aⁱ, aⁱ⁺¹) is used as the candidate split point. Then, the candidate split points are examined in the same way as discrete feature values, and the optimal split point is selected for splitting the node. We can modify the information gain equation from this

Gain(D, a) = Ent(D) − Σ_v (|Dᵛ| / |D|) · Ent(Dᵛ)

to this

Gain(D, a) = max_{t ∈ T_a} Gain(D, a, t) = max_{t ∈ T_a} [ Ent(D) − (|Dₜ⁻| / |D|) · Ent(Dₜ⁻) − (|Dₜ⁺| / |D|) · Ent(Dₜ⁺) ],

where Gain(D, a, t) is the information gain of bi-partitioning D by t, and the split point with the largest Gain(D, a, t) is selected.
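To make this concrete, here is a minimal sketch in Python (assuming NumPy; the helper names entropy, candidate_splits, and best_split are my own, not part of any library). It enumerates the midpoint candidates and picks the split point with the largest Gain(D, a, t); the last lines run it on a made-up toy example, not on the watermelon data.

```python
import numpy as np

def entropy(y):
    """Shannon entropy Ent(D) of a label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def candidate_splits(values):
    """Midpoints between adjacent sorted distinct values of a continuous feature."""
    v = np.unique(values)              # sorted distinct values a^1 < ... < a^n
    return (v[:-1] + v[1:]) / 2.0      # the n - 1 candidate split points T_a

def best_split(values, y):
    """Return (best_gain, best_t) for bi-partitioning by a continuous feature."""
    base = entropy(y)
    best_gain, best_t = -np.inf, None
    for t in candidate_splits(values):
        left, right = y[values <= t], y[values > t]          # D_t^- and D_t^+
        w_left, w_right = len(left) / len(y), len(right) / len(y)
        gain = base - (w_left * entropy(left) + w_right * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# Toy illustration (made-up numbers, not the watermelon data):
x = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
y = np.array([0, 0, 1, 1, 1])
print(best_split(x, y))   # -> gain ≈ 0.971 at t = 0.45 (split between 0.4 and 0.5)
```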

For illustration, we build a decision tree using the watermelon data set.

At the beginning, all 17 training samples have distinct density values. According to the split-point set above, the candidate split point set of density includes 16 values: T_density = {0.244, 0.294, 0.351, 0.381, 0.420, 0.459, 0.518, 0.574, 0.600, 0.621, 0.636, 0.648, 0.661, 0.681, 0.708, 0.746}. According to the modified information gain equation above, the information gain of density is 0.262, and the corresponding split point is 0.381. For the feature sugar, the candidate split point set also includes 16 values: T_sugar = {0.049, 0.074, 0.095, 0.101, 0.126, 0.155, 0.179, 0.204, 0.213, 0.226, 0.250, 0.265, 0.292, 0.344, 0.373, 0.418}. Similarly, the information gain of sugar is 0.349, and the corresponding split point is 0.126. Combining these results, we can compare the information gains of all features in the data set.

Since splitting by texture yields the largest information gain, texture is selected as the splitting feature for the root node. The splitting process then proceeds recursively on the child nodes until the final decision tree is obtained.

Unlike discrete features, a continuous feature can be used as a splitting feature more than once along a decision path. For example, using density ≤ 0.381 in a parent node does not forbid using density ≤ 0.294 in a child node.
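To see why, here is a minimal recursive builder continuing the sketch above (it reuses best_split and NumPy from that snippet; max_depth is an assumed stopping rule of mine, not part of C4.5 itself). The candidate feature set is passed down unchanged, so the same continuous feature can be chosen again with a different threshold deeper in the tree.

```python
def build_tree(X, y, features, depth=0, max_depth=4):
    """Grow a tree on continuous features only (a sketch, not full C4.5)."""
    # Stop when the node is pure or the depth limit is reached: majority leaf.
    if len(np.unique(y)) == 1 or depth == max_depth:
        labels, counts = np.unique(y, return_counts=True)
        return {"leaf": labels[np.argmax(counts)]}

    # Pick the (feature, split point) pair with the largest information gain.
    f, gain, t = max(
        ((f,) + best_split(X[:, f], y) for f in features),
        key=lambda item: item[1],
    )
    if t is None or gain <= 0:
        labels, counts = np.unique(y, return_counts=True)
        return {"leaf": labels[np.argmax(counts)]}

    # Note: `features` is NOT reduced here, so a continuous feature may be
    # reused in the subtrees, e.g. density <= 0.381 at the root and
    # density <= 0.294 again in a child node.
    mask = X[:, f] <= t
    return {
        "feature": f, "threshold": t,
        "<= t": build_tree(X[mask], y[mask], features, depth + 1, max_depth),
        "> t": build_tree(X[~mask], y[~mask], features, depth + 1, max_depth),
    }
```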
