J48 Classification (C4.5 Algorithm) in a Nutshell

Nilima Khanna
10 min read · Aug 18, 2021

(Under the guidance of Professor Amrinder Arora at George Washington University)

Introduction

We live in a world of technology: the internet has put vast knowledge and research at everyone's fingertips. Continuous innovation and advancement have opened many doors, and as a result we generate enormous amounts of useful, meaningful data in every aspect of our lives. The question is what we do with this data, how we use it to our advantage, and how we dig deeper into it to see what it is telling us. This is where the power of machine learning and artificial intelligence comes into play. These techniques provide smart alternatives for analyzing large volumes of data. By developing fast, efficient algorithms and data-driven models for real-time processing of data, machine learning can produce accurate results and analysis.

The J48 algorithm is one of the most widely used machine learning algorithms for examining data with both categorical and continuous attributes. The C4.5 algorithm (J48) is used across many fields to classify data, for example to interpret clinical data for the diagnosis of coronary heart disease, to classify e-governance data, and many more.

Classification

Classification is a data mining task that maps data into predefined groups and classes; because the classes are known in advance, it is also known as supervised learning. The machine learning process has two main phases: a learning phase, in which the classification algorithm is trained, and a classification phase, in which the trained algorithm labels new data.

J48 Classification and its Decision Tree

C4.5 algorithm/J48

The C4.5 algorithm is a classification algorithm that produces decision trees based on information theory. It is an extension of Ross Quinlan's earlier ID3 algorithm, and its Weka implementation is known as J48, the J standing for Java. The decision trees generated by C4.5 are used for classification, and for this reason C4.5 is often referred to as a statistical classifier.

The J48 implementation of the C4.5 algorithm has many additional features, including handling of missing values, decision tree pruning, continuous attribute value ranges, and derivation of rules. In the WEKA data mining tool, J48 is an open-source Java implementation of the C4.5 algorithm, and it allows classification via either decision trees or rules generated from them.
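
Since the article centers on Weka's J48, a minimal example of how the classifier is typically driven from the Weka Java API may be useful. The dataset name (students.arff) and the 10-fold cross-validation setup are illustrative assumptions, not details from the article; treat this as a sketch rather than a prescribed workflow.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset (file name is a placeholder) and mark the last
        // attribute as the class to predict.
        Instances data = new DataSource("students.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Build a pruned C4.5 tree with Weka's default settings.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);   // prints the induced decision tree

        // Estimate accuracy with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```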

This algorithm builds decision trees from a set of training data in the same way the ID3 algorithm does, using the concept of information entropy. The training data is a set S = {s1, s2, …} of already classified samples. Each sample si consists of a p-dimensional vector (x1,i, x2,i, …, xp,i), where each xj,i is the value of the j-th attribute (feature) for that sample, together with the class to which the sample belongs. To gain the highest classification accuracy, the best attribute to split on is the one with the greatest information gain.

At each node of the tree, the C4.5 algorithm chooses the attribute that most effectively splits the set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (gain ratio), which is calculated from the difference in entropy, and the attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the partitioned sublists in a divide-and-conquer fashion, building the decision tree greedily.
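
To make the splitting criterion concrete, here is a small self-contained sketch of how entropy, information gain, and the normalized gain (gain ratio) used by C4.5 can be computed from class counts. The class counts in the example are made up for illustration; the code is a didactic sketch, not Weka's implementation.

```java
public final class GainRatio {

    /** Shannon entropy (in bits) of a class distribution given as counts. */
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /**
     * Gain ratio of a candidate split: the information gain normalized by the
     * split's own entropy ("split info"), as C4.5 does.
     * partitions[i] holds the class counts of the i-th subset created by the split.
     */
    static double gainRatio(int[] parentCounts, int[][] partitions) {
        int total = 0;
        for (int c : parentCounts) total += c;

        double weightedChildEntropy = 0.0;
        double splitInfo = 0.0;
        for (int[] part : partitions) {
            int size = 0;
            for (int c : part) size += c;
            if (size == 0) continue;
            double w = (double) size / total;
            weightedChildEntropy += w * entropy(part);
            splitInfo -= w * (Math.log(w) / Math.log(2));
        }
        double infoGain = entropy(parentCounts) - weightedChildEntropy;
        return splitInfo == 0.0 ? 0.0 : infoGain / splitInfo;
    }

    public static void main(String[] args) {
        // Toy example: 14 samples (9 of one class, 5 of the other) split into three subsets.
        int[] parent = {9, 5};
        int[][] split = {{2, 3}, {4, 0}, {3, 2}};
        System.out.printf("gain ratio = %.3f%n", gainRatio(parent, split)); // roughly 0.156
    }
}
```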

Figure 1. A simple decision tree for predicting a student's route through high school based on the classes they are taking.

As a more concrete, high-level example, the algorithm behind the decision tree in Figure 1 looks at data from students who have already taken these courses and uses it to build a model that predicts what an upcoming student will take based on their choice of school as a freshman. We take our list of students, split them randomly into data sets, generate from each data set a set of weights predicting the path of a student, and then choose the data set that most accurately predicts what a student will take.

The Decision Tree

The internal nodes of the decision tree denote the different attributes; the branches between the nodes represent the possible values those attributes can take in the observed samples; and the terminal (leaf) nodes give the final value, or classification, of the dependent variable. The predicted attribute is the dependent variable, and the other attributes in the tree are the independent variables of the dataset.

When building a tree, J48 ignores the missing values; that item’s value can be predicted from the attribute values for the other records. The basic idea is to divide the data into ranges based on the attribute values found in the training sample.
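
One way to picture how a numeric attribute is divided into ranges is to try the midpoint between each pair of adjacent distinct sorted values and keep the cut with the highest information gain, which is how C4.5-style binary splits on continuous attributes are usually described. The sketch below is an illustrative reconstruction of that idea with invented data and helper names, not Weka's actual code.

```java
import java.util.Arrays;
import java.util.Comparator;

public final class NumericSplit {

    /** Entropy (in bits) of a two-class distribution given as counts. */
    static double entropy(int yes, int no) {
        int total = yes + no;
        double h = 0.0;
        for (int c : new int[] {yes, no}) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Returns the midpoint threshold with the highest information gain (labels are 0/1). */
    static double bestThreshold(double[] values, int[] labels) {
        int n = values.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> values[i]));

        int totalYes = 0;
        for (int y : labels) totalYes += y;
        double parentEntropy = entropy(totalYes, n - totalYes);

        double bestGain = -1.0, bestCut = Double.NaN;
        int leftYes = 0, leftNo = 0;
        for (int k = 0; k < n - 1; k++) {
            int i = idx[k];
            if (labels[i] == 1) leftYes++; else leftNo++;
            double v = values[i], next = values[idx[k + 1]];
            if (v == next) continue;            // only cut between distinct values
            int left = k + 1, right = n - left;
            double gain = parentEntropy
                    - (double) left / n * entropy(leftYes, leftNo)
                    - (double) right / n * entropy(totalYes - leftYes, (n - totalYes) - leftNo);
            if (gain > bestGain) {
                bestGain = gain;
                bestCut = (v + next) / 2.0;     // midpoint between adjacent values
            }
        }
        return bestCut;
    }

    public static void main(String[] args) {
        double[] hoursStudied = {1, 2, 3, 8, 9, 10};
        int[] passed          = {0, 0, 0, 1, 1, 1};
        System.out.println("split at " + bestThreshold(hoursStudied, passed)); // 5.5
    }
}
```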

The base cases of this algorithm (illustrated in the sketch after this list) are:

  • If all the samples in the current list belong to the same class, the algorithm simply creates a leaf node for that class.
  • If no feature provides any information gain, the algorithm creates a decision node higher up in the tree using the expected value of the class.
  • If the algorithm encounters an instance of a class it has not seen so far, it likewise creates a decision node higher up in the tree using the expected value.
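
These base cases translate directly into a recursive, divide-and-conquer procedure. The compressed sketch below, with an invented toy dataset and an arbitrary attribute choice standing in for the real gain-ratio selection, is only meant to show that control flow; it is not Weka's J48 implementation.

```java
import java.util.*;
import java.util.stream.Collectors;

public final class TinyTreeSketch {

    // One training sample: nominal attribute values plus its class label.
    record Sample(Map<String, String> attrs, String label) {}

    static String majorityLabel(List<Sample> data) {
        return data.stream()
                .collect(Collectors.groupingBy(Sample::label, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    static void build(List<Sample> data, Set<String> attrs, String indent) {
        Set<String> labels = data.stream().map(Sample::label).collect(Collectors.toSet());

        // Base case 1: every sample in this subset has the same class -> leaf.
        if (labels.size() == 1) {
            System.out.println(indent + "-> " + labels.iterator().next());
            return;
        }
        // Base case 2: nothing left to split on -> leaf with the expected (majority) class.
        if (attrs.isEmpty()) {
            System.out.println(indent + "-> " + majorityLabel(data));
            return;
        }

        // Recursive case: split on an attribute and recurse on each partition.
        // (A real implementation would pick the attribute with the best gain ratio.)
        String attr = attrs.iterator().next();
        Set<String> remaining = new HashSet<>(attrs);
        remaining.remove(attr);
        Map<String, List<Sample>> partitions =
                data.stream().collect(Collectors.groupingBy(s -> s.attrs().get(attr)));
        for (Map.Entry<String, List<Sample>> part : partitions.entrySet()) {
            System.out.println(indent + attr + " = " + part.getKey());
            build(part.getValue(), remaining, indent + "  ");
        }
    }

    public static void main(String[] args) {
        List<Sample> data = List.of(
                new Sample(Map.of("track", "STEM"), "Calculus"),
                new Sample(Map.of("track", "STEM"), "Calculus"),
                new Sample(Map.of("track", "Arts"), "Statistics"));
        build(data, Set.of("track"), "");
    }
}
```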

Decision trees are used to model decision-making processes. A decision tree classifier is a flowchart-like tree structure that depicts the classification model. Decision trees categorize instances by sorting them down the tree from the root to some leaf node. Each internal node specifies a test of an attribute of the instance, and each branch corresponds to one of the possible values of that attribute.

The tree is developed incrementally by dividing the dataset into smaller and smaller subsets. The final result is a tree with decision nodes and leaf nodes. Each decision node has two or more branches, and each leaf node represents a classification or decision. The topmost decision node in the tree, which corresponds to the best predictor, is called the root node.

Tree Pruning

Overfitting is a prime risk with decision trees, so it is vital to cross-validate the model before deploying it to the test set. As a decision tree grows, it naturally tends to overfit the data. Pruning is the process of selecting the largest tree that still generalizes well and removing all the branches below that level, which significantly improves performance on new data.

The WEKA implementation of J48 provides several options related to tree pruning. Where there is potential over-fitting, pruning can be used as a tool to simplify the tree. Without pruning, classification is performed recursively until every single leaf is pure, i.e. until the training data is classified as perfectly as possible. The algorithm generates the rules by which the class of an instance is identified; the objective is to generalize the decision tree progressively until it reaches an equilibrium between flexibility and accuracy.
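
As a concrete illustration of those pruning options, the snippet below sets the J48 parameters that control post-pruning through the Weka Java API. The values shown (a 0.25 confidence factor and a minimum of two instances per leaf) are Weka's documented defaults; the whole block is a sketch of the available knobs rather than a recommended configuration.

```java
import weka.classifiers.trees.J48;

public class J48PruningOptions {
    public static void main(String[] args) {
        J48 tree = new J48();

        // Standard C4.5 error-based post-pruning (the default behaviour).
        tree.setUnpruned(false);            // -U would disable pruning entirely
        tree.setConfidenceFactor(0.25f);    // -C: smaller values prune more aggressively
        tree.setMinNumObj(2);               // -M: minimum number of instances per leaf
        tree.setSubtreeRaising(true);       // allow subtrees to be raised during pruning

        // Alternatively, reduced-error pruning on a held-out portion of the training data:
        // tree.setReducedErrorPruning(true);  // -R
        // tree.setNumFolds(3);                // -N: folds reserved for pruning

        System.out.println(java.util.Arrays.toString(tree.getOptions()));
    }
}
```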

Pre-Pruning and Terminating Conditions

The issue with post-pruning is that it is rather inefficient, because it happens only after the decision tree has been built, by which point the tree may already have been overfitted. The next improvement to the algorithm is therefore to prune the tree as it is being built. Pre-pruning terminates the algorithm before the decision tree becomes too large or fully grown, which reduces the risk of overfitting and keeps the decision tree from becoming too complex. The simplest pre-pruning rules are to stop the algorithm if all the samples belong to the same class or if all the attribute values are the same. However, those two situations are not common and do not address all aspects of overfitting. Three terminating conditions should therefore be applied while building a decision tree:

  • stop the algorithm if the sample size at a node is too small;
  • limit the depth of the tree to prevent it from becoming too large; and
  • do not split a node into branches if the split does not produce a sufficient decrease in the classification error.

With these three terminating conditions, the J48 algorithm becomes much more efficient and accurate, as they prevent the decision tree from overfitting. A small illustration of these checks is sketched below.
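
The helper below checks the three terminating conditions before a node is split. The constant names and threshold values are invented for the example (of these, J48 itself only exposes a minimum-instances setting, -M), so treat this as a sketch of the idea rather than actual J48 code.

```java
public final class PrePruningCheck {

    // Illustrative thresholds only -- these names and values are not J48 options.
    static final int    MIN_SAMPLES         = 5;     // condition 1: subset too small
    static final int    MAX_DEPTH           = 10;    // condition 2: tree deep enough
    static final double MIN_ERROR_REDUCTION = 0.01;  // condition 3: split barely helps

    /** Returns true if the node should become a leaf instead of being split further. */
    static boolean shouldStop(int sampleCount, int depth,
                              double errorBefore, double errorAfterBestSplit) {
        if (sampleCount < MIN_SAMPLES) return true;
        if (depth >= MAX_DEPTH) return true;
        return (errorBefore - errorAfterBestSplit) < MIN_ERROR_REDUCTION;
    }

    public static void main(String[] args) {
        // A node with 40 samples at depth 3 whose best split cuts the error from 0.30 to 0.12:
        System.out.println(shouldStop(40, 3, 0.30, 0.12));  // false -> keep splitting
        // The same split evaluated on a subset of only 4 samples:
        System.out.println(shouldStop(4, 3, 0.30, 0.12));   // true -> make it a leaf
    }
}
```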

Limitations

Limitations of Decision Trees

Even though decision trees are very helpful for analyzing and classifying data, they also have some disadvantages, several of which are associated with pruning. Pruning decreases the complexity of the final classifier and therefore improves predictive accuracy by reducing over-fitting, but it comes at a cost.

Some of the associated issues of these decision trees are:

  • Building the tree is computationally costly: at every node, each candidate split of the dataset must be sorted before its best split point can be found.
  • Pruning algorithms can also be expensive, since many candidate subtrees have to be generated and compared.
  • The entropy measurement is resource intensive.
  • The decision tree algorithm can produce trees of colossal size.
  • Small training sets pose a major challenge to decision trees, as the number of available training instances drops exponentially as the tree splits out.
  • The decision criteria are rigid in the sense that, in the end, only one node can be selected.
  • Deep decision trees spanning many levels may also involve many attributes, which leads to sensitivity to outlier attribute values.

Limitations of J48 Algorithm

The J48 algorithm revolves around decision trees, so it suffers from some of the problems associated with decision trees. Here are a few shortcomings of this algorithm:

  • Empty branches: constructing trees with meaningful values at every branch is an important step in rule generation by the J48 algorithm, but the algorithm can produce branches whose values do not contribute to any class in the classification task; instead, they only make the tree wider and more complicated.
  • Insignificant branches: each chosen discrete attribute produces a corresponding number of potential branches for the decision tree, but not all of them are significant for the classification task. These unimportant branches not only decrease the usability of the decision tree but also bring on the problem of overfitting.
  • Overfitting: overfitting happens when the model picks up data with exceptional (atypical) attribute values. This causes many fragmentations, i.e. statistically insignificant nodes, in the resulting distribution. Usually the J48 algorithm grows its branches “just deep enough to perfectly classify the training examples”; this approach performs well with noise-free data, but most of the time it overfits the training examples when the data is noisy. Two strategies are widely used to avoid this overfitting in decision tree learning: the first is to stop the tree from growing before it reaches the point of perfectly classifying the training data; the second is to let the tree overfit the training data and then post-prune it.

Optimizations

Improvements of J48 Using Better Entropy Scheme

The key point in the construction of decision trees is the choice of the best attribute for the test at the current node. For each candidate split of the training set, the entropy-based gain is calculated to determine the node placement. Training sets have a few limitations in general: a training subset may contain samples that all belong to the same class, it may contain no samples, or it may contain samples belonging to a mix of classes. Because of these limitations, the resulting decision tree may not generalize accurately, so different ways of calculating the entropy gain more accurately have been proposed. The J48 algorithm can be improved with a better entropy gain calculation.

Improvements of J48 by Using Nonlinear Hybrid Classifications

The most common way to build decision trees is by top-down partitioning. The training set is used recursively to find a linear split that maximizes entropy gain. The tree obtained by this process is usually too big and overfits the data, so it is pruned by examining each intermediate node and evaluating the utility of replacing it with a leaf.

After pruning, the local set of training cases at a leaf node can become quite large, and simply taking the majority class may over-generalize the tree. The J48 logic can be extended beyond the basic decision tree learning algorithm so that, instead of the majority rule, a different kind of model can be used at any of the leaves. The decision of whether to replace a simple leaf with an alternative model is made during the post-pruning phase, by comparing error estimates before replacing the leaf node. The resulting hybrid algorithms combine the biases of top-down, entropy-based decision tree induction with those of the respective alternative leaf models.

Improvements of J48 By Meta Learning

In data mining, it is generally necessary to set the parameters used by the algorithm in order to achieve the best possible model and results. Experiments show a substantial increase in accuracy when the right parameters are used. However, there is an associated problem in adjusting the parameters of most data mining algorithms. This task may involve a high computational cost for finding the optimal parameters or else risk relying on assumptions that may bias the results.

Most current data mining algorithms and tools need to be configured before they are executed. In other words, users must provide appropriate values for the parameters in advance in order to obtain good results or models; therefore, the user must possess a certain amount of expertise in order to find the right settings. To resolve this problem, data mining can be used to learn from past executions of the algorithms in order to improve the future selection of parameters according to the past behavior of the algorithm.

Meta-learning is the study of principled methods that exploit meta-knowledge to obtain efficient models and solutions by adapting machine learning and the data mining process. The decision tree model has some parameters that influence the amount of pruning. By trimming trees, the computational efficiency and classification accuracy of the model can be optimized.
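
The article does not name a specific tool for this kind of parameter tuning, but Weka ships a meta-classifier, CVParameterSelection, that searches parameter ranges by cross-validation; the sketch below tunes J48's confidence factor and minimum leaf size on a dataset loaded from a placeholder file, as one simple stand-in for the meta-learning idea described above. The ranges and step counts are arbitrary choices for illustration.

```java
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TuneJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("students.arff").getDataSet();  // placeholder file
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection search = new CVParameterSelection();
        search.setClassifier(new J48());
        // Each entry is "<flag> <min> <max> <steps>": try 10 confidence factors
        // between 0.05 and 0.5, and 5 minimum-leaf sizes between 2 and 10.
        search.addCVParameter("C 0.05 0.5 10");
        search.addCVParameter("M 2 10 5");

        search.buildClassifier(data);   // runs the cross-validated parameter search
        System.out.println("Best options found: "
                + String.join(" ", search.getBestClassifierOptions()));
    }
}
```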

Conclusion

In conclusion, the J48 algorithm is a useful tool for classifying data, and despite its limitations it can be improved into a very accurate classification algorithm. The C4.5 algorithm takes a greedy, divide-and-conquer approach to classifying data, creating a decision tree from the training data it is given. However, the algorithm may often overfit the data or produce empty or insignificant branches in the tree, and one of the main ways to fix these issues is to prune the tree. Another major improvement to this algorithm is meta-learning, which uses previous data and past executions of the algorithm to better classify the data it is given. To make the algorithm even more reliable and accurate, further improvements will need to be developed.


Nilima Khanna

Undergraduate Student at the University of Maryland College Park