Discover structure behind data with decision trees
Note that this project is also available on GitHub as a Python notebook: https://github.com/Vooban/Decision-Trees-For-Knowledge-Discovery
Let’s understand and model the hidden structure behind data with Decision Trees. In this tutorial, we’ll explore and inspect how a model can do its decisions on a car evaluation data set. Decision trees work with simple “if” clauses dichotomically chained together, splitting the data flow recursively on those “if”s until they reach a leaf where we can categorize the data. Such data inspection could be used to reverse engineer the behavior of any function.
Since decision trees are good algorithms for discovering the structure hidden behind data, we’ll use and model the car evaluation data set, for which the prediction problem is a (deterministic) surjective function. This means that the inputs of the examples in the data set cover all the possibilities, and that for each possible input value, there is only one answer to predict (thus, two examples with the same input values would never have a different expected prediction). On the point of view of Data Science, because of the properties of our dataset, we won’t need to use a test set nor to use cross validation. Thus, the error we will obtain below at modelizing our dataset would be equal to the true test error if we had a test set.
The attribute to predict in the data set could have been, for example, created from a programmatic function and we will basically reverse engineer the logic mapping the inputs to the outputs to recreate the function and to be able to explain it visually.
About the Car Evaluation Data Set
For more information: http://archive.ics.uci.edu/ml/datasets/Car+Evaluation
The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX, M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145–157, 1990.). The model evaluates cars according to the following concept structure:
CAR car acceptability
PRICE overall price:
- buying buying price
- maint price of the maintenance
TECH technical characteristics:
- safety estimated safety of the car
- comfort comfort (doors number of doors, persons capacity in terms of persons to carry, lug_boot the size of luggage boot)
Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, COMFORT. Every concept is in the original model related to its lower level descendants by a set of examples.
The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.
Because of known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods.
We’ll now load the car evaluation data set in Python and then train decision trees with Scikit-Learn
Define the features and preprocess the car evaluation data set
We’ll preprocess the attributes into redundant features, such as using an integer index (linear) to represent a value for an attribute, as well as also using a one-hot encoding for each attribute’s possible values as new features. Despite the fact that this is redundant, this will help to make the tree smaller since it has more choice on how to split data on each branch.
Train a simple decision tree to fit the data set:
First, let’s define some hyperparameters, such as the depth of the tree.
Decision tree trained!
Errors: 6.53935185185 %
Accuracy: 93.4606481481 %
Plot and save the tree
Plot the importance of each input features of the simple tree:
Note here that it is the feature importance according to our simple, shallow tree. A fully complex trees would surely include more of the features/attributes, and with different proportions.
Let’s now generate a fully perfect (complex) tree
Let’s go deeper. Let’s build a deeper tree. At least, a simple tree like the one above is interesting for having a simplfied view of the true logic behind our data.
Decision tree trained!
Errors: 0.0 %
Accuracy: 100.0 %
A plot of the full tree
Note that the tree below can also be viewed here online. It would also be possible to extract the tree as true code and create a function.
Finally, the full feature importance:
To sum up, we managed to get good classification results and to be able to explain those results visually and automatically. Note that it would have been possible to solve a regression problem with the same algorithm, such as predicting a price rather than a category.
Such a technique can be useful in reverse engineering an existing system, such as an old one that has been coded in a peculiar programming language and for which the employees who coded it have left. This technique can also be used for data mining, gaining business intelligence, and insights from data.
In the case that your data does not represent a pure function like we have here, such as if for two of your input examples it is possible to have two possible different predictions, then a tree cannot model the data set with 100% accuracy. Hopefully, if you are in that situation where the logic behind the data is not perfect, it is possible to repeat the experiment by using XGBoost, which can help by incrementally training many trees to reduce the error and training them in an optimized fashion. The only disadvantage of that is that those boosted forests would be harder to explain due to the fact that you would have many trees.