Using Decision Tree Method for Car Selection Problem

Hafidz Jazuli
Machine Learning Guy
Mar 16, 2018

In this tutorial, I will show you how to use the C5.0 algorithm in R. If you are new to the topic, it is a good idea to read my previous article about Decision Trees before going ahead with this tutorial. I designed this tutorial with a practical statistics approach, so you can think about practical implementations after working through it. In general, though, anyone (at least at high-school level) should be able to follow this tutorial without difficulty.

In order to build a high-quality tutorial, I crafted this tutorial based on proven research papers and a high-quality data set. If you want to get more out of this tutorial, it is a good idea to complete all the prerequisites below before reading on.

Prerequisites

  1. Understand the basic concept of Decision Trees.
  2. In this tutorial, I use the C5.0 algorithm, the latest improved version of the ID3 algorithm. If you are not familiar with this algorithm, it is a good idea to read resource [4] or go to this link.
  3. Skim research papers [1] and [2], because this tutorial is based on them. They are genuinely easy-to-read papers, but if you hate math, just skip the math; I will provide easy-to-follow problem solving in this tutorial (to a broad audience, math is either the art of problem solving or a tool for engineering). For reference, the first paper takes a decision-theory approach, while the second takes a linear-programming approach.
  4. Go to the UCI data set repository to get more insight into the data set used in this tutorial. By studying a high-quality, post-processed data set, you can learn how to build your own data sets in the future.

Yes, that’s all. No prior programming experience is required for this easy tutorial. Enjoy!

Getting to Know the Data Set

This data set contains 1728 records describing cars by their criteria. Every record has been labeled, so we use a supervised learning method to infer from the data. The values are categorical, so we perform a classification task to acquire knowledge from the data. This data set has been used by two research papers: [1] and [2].

This data set, also known as “The Car Evaluation Database”, contains examples with the structural information removed. The knowledge representation of this data set is illustrated in the image below.

Criteria tree

According to the criteria tree above, the quality of a car is measured by two main groups of criteria: price (PRICE) and technical characteristics (TECH). The price is determined by the buying price and the maintenance price. The technical characteristics are decomposed into safety and comfort, where comfort further depends on the number of doors, the size of the car (measured as the number of persons that fit in it), and the size of the luggage boot.

The concept structure of the data is described below:

CAR                car acceptability
. PRICE            overall price
. . buying         buying price
. . maint          price of the maintenance
. TECH             technical characteristics
. . COMFORT        comfort
. . . doors        number of doors
. . . persons      capacity in terms of persons to carry
. . . lug_boot     the size of luggage boot
. . safety         estimated safety of the car

Each attribute is described below:

  • buying (buying price): vhigh, high, med, low
  • maint (maintenance price): vhigh, high, med, low
  • doors (number of doors): 2, 3, 4, 5more
  • persons (number of passengers that fit in the car): 2, 4, more
  • lug_boot (size of the luggage boot): small, med, big
  • safety (estimated safety): low, med, high

The quality of a car is labeled as unaccepted (unacc), accepted (acc), good (good), or very good (vgood). The class distribution of the labels is given below:

Attribute class distributions

Based on the class distribution, it looks like we have skewed (imbalanced) data. A classification model remains a reasonable choice for inferring from this data set. Although more advanced methods such as ensemble learning could be used, in this tutorial we focus only on a basic implementation of a Decision Tree, for learning purposes.
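
For concreteness, here is a minimal sketch of loading the data in R. The download URL and the column names follow the UCI repository documentation and are assumptions on my part; verify them against the copy of the data you actually use.

# Load the Car Evaluation data set from the UCI repository.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
car <- read.csv(url, header = FALSE,
                col.names = c("buying", "maint", "doors", "persons",
                              "lug_boot", "safety", "class"),
                stringsAsFactors = TRUE)

nrow(car)         # 1728 examples
table(car$class)  # class distribution: unacc, acc, good, vgood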

R Programming Environment

R is a popular open-source statistical software environment which can be downloaded here. The main advantage of using R is its memory and storage efficiency. In my opinion, from a programming point of view, R is easy to use, has a syntax similar to Python’s, and is well optimized for handling large data sets. The main difference from general-purpose programming languages is that R provides convenient ways to slice large lists into segmented sublists.

Production Method Used in This Tutorial

Two methods are used in this tutorial: an implementation method and an evaluation method. The step-by-step implementation method is described below:

  1. Analyze the attribute distribution of each criterion.
  2. Apply a randomized split of the data set into a training set and a test set.
  3. Apply the C5.0 algorithm to the training set.
  4. Use the test set to measure classification accuracy.
  5. Infer the decision tree from the model.

The evaluation method is described below:

  1. Evaluate model fit using the model summary.
  2. Evaluate model prediction error using a confusion matrix.
  3. Evaluate model prediction error using precision, recall, and the F1 score.
  4. Evaluate the splitting rule using a learning curve.

Implementation

You can find the source code of this implementation in my GitHub repository. In this tutorial, I stress the analysis, not the source code. If you have any difficulties reading or running the source code, make some effort to search for your problem with a search engine, or access the help feature in RStudio by pressing F1 on a selected function name.

Analyzing the attribute distribution of each criterion

This is the first and most important step, because in this step we evaluate the quality of the data. If you work with statisticians, they may suggest that the quality of inferential deduction does not depend so much on the algorithm as on the quality of the data. Another advantage of evaluating the data is that we can form presumptions about what the result will be. In more advanced Decision Tree implementations, there is a technique known as pre-pruning, whose goal is to remove unreliable features and thus improve learning performance. However, pre-pruning needs some experience or expertise, so we are not concerned with it at this step.

To keep our assumptions as small as possible, we first look at each criterion’s distribution independently. Viewed that way, each distribution can be described as follows:

What the buying histogram tells us is that the distribution tends to be uniform, while a very high or high buying price will likely cause a car to be unaccepted.

Buying histogram

What the maintenance histogram tells us is that the distribution tends to be uniform, while a very high or high maintenance price will likely cause a car to be unaccepted.

Maintenance histogram

What the doors histogram tells us is that the distribution tends to be uniform, while having only 2 doors will likely cause a car to be unaccepted.

Door histogram

What the persons histogram tells us is that the distribution tends to be positively skewed, while it is almost certain that a car with a capacity of 2 persons will be unaccepted.

Persons histogram

What the luggage boot histogram tells us is that the distribution tends to be negatively skewed, while a small luggage boot will likely cause a car to be unaccepted.

Luggage boots histogram

What the safety histogram tells us is that the distribution tends to be normally distributed, while low safety will likely cause a car to be unaccepted. Since, from a statistical point of view, the normal distribution arises naturally in many real-world cases, we can expect safety to be the most significant feature in our decision tree.

Safety histogram
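
The charts above can be reproduced with a few lines of base R. Below is a minimal sketch, assuming the car data frame loaded earlier; it stacks the class counts within each attribute value:

# For each attribute, plot the class counts within each attribute value.
for (attr in c("buying", "maint", "doors", "persons", "lug_boot", "safety")) {
  counts <- table(car$class, car[[attr]])   # rows: class, columns: value
  barplot(counts, main = paste("Distribution of", attr),
          xlab = attr, ylab = "count", legend.text = rownames(counts))
}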

Applying a randomized split of the data set into training and test sets

In this tutorial, I used a 75%/25% training/test split, applied after randomizing the order of the data. In statistics we call this technique random sampling. Because of the randomization, each execution will produce a different error rate, so the results in this tutorial may differ slightly from your own runs (the estimates come with a 95% confidence interval). If this is confusing, or you want the underlying theory of random sampling and confidence intervals, read about the central limit theorem and confidence intervals [6].
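
A minimal sketch of such a randomized 75/25 split, assuming the car data frame loaded earlier (the fixed seed only makes a single run reproducible):

set.seed(123)  # fix the random stream for a reproducible example
train_idx <- sample(nrow(car), size = floor(0.75 * nrow(car)))
train <- car[train_idx, ]
test  <- car[-train_idx, ]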

In the evaluation section, we will use an incrementally changing split percentage to produce a learning curve. From the learning curve, we will know which splitting rule works better.

Applying the C5.0 algorithm to the training set

This is the easiest step, because in R we only need to type one line of code to apply the C5.0 algorithm. Calling the resulting object, decTree, directly tells us the number of decision nodes (the tree size).
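
Here is what that line looks like with the C50 package (installed once with install.packages("C50")); the object name decTree matches the rest of this tutorial, and the formula assumes the column names from the loading sketch above:

library(C50)

# Fit a C5.0 classification tree: class as a function of all other attributes.
decTree <- C5.0(class ~ ., data = train)
decTree  # printing the object reports the tree size, among other details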

Using the test set to measure classification accuracy

This is also an easy step, because in R we only need one line of code to make predictions with the fitted model. The prediction produces a factor vector, y_pred, which holds the model’s predicted class for each car in the test set.
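
A sketch of the prediction step, assuming the decTree model and the test set from the previous steps:

# Predict the class of each car in the held-out test set.
y_pred <- predict(decTree, newdata = test)

# Fraction of correct predictions, i.e. the test-set accuracy.
mean(y_pred == test$class)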

Inferring the decision tree from the model

Since our model produces a decision tree with roughly 40–55 leaves, the tree is likely to be complicated. For example, the image below shows a decision tree with 83 nodes, of which 53 are leaves.

Car decision tree
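
For reference, both the tree plot and the textual summary discussed below come from the C50 package’s standard plot and summary methods; a minimal sketch, assuming the decTree model fitted earlier:

plot(decTree)     # graphical tree; can be very wide for ~50 leaves
summary(decTree)  # text rules plus a confusion matrix and attribute usage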

A better way to inspect the decision tree is to read the model’s summary. Here is a sample of the decision tree summary used in this tutorial:

# Format: [attribute] = [value]: [class] (cases covered/cases misclassified)
# The :... markers are the connections (edges) between decision nodes
safety = low: unacc (435)
safety in {high,med}:
:...persons = 2: unacc (279)
    persons in {4,more}:
    :...buying in {high,vhigh}:
        :...maint = vhigh: unacc (74)
        :   maint = high:
        :   :...buying = vhigh: unacc (41)
        :   :   buying = high:
        :   :   :...safety = high: acc (20/1)
        :   :       safety = med:
        :   :       :...lug_boots in {big,med}: acc (10/3)
        :   :           lug_boots = small: unacc (6)
        :   maint in {low,med}:
        :   :...safety = high: acc (68/2)
        :       safety = med:
        :       :...lug_boots = big: acc (24)
        :           lug_boots = small: unacc (24)
        :           lug_boots = med:
        :           :...doors = 2: unacc (6)
        :               doors in {4,5more}: acc (12)
        :               doors = 3:
        :               :...persons = 4: unacc (3)
        :                   persons = more: acc (4)
        buying in {low,med}:
        :...maint in {high,vhigh}:
            :...safety = med:
            :   :...lug_boots in {big,med}: acc (50/6)
            :   :   lug_boots = small:
            :   :   :...maint = vhigh: unacc (14)
            :   :       maint = high:
            :   :       :...buying = low: acc (5/1)
            :   :           buying = med: unacc (5)
            :   safety = high:
            :   :...buying = med: acc (38/1)
            :       buying = low:
            :       :...maint = vhigh: acc (19)
            :           maint = high:
            :           :...lug_boots = big: vgood (4)
            :               lug_boots = small: acc (7/1)
            :               lug_boots = med:
            :               :...doors = 2: acc (2)
            :                   doors in {3,4,5more}: vgood (4)
            maint in {low,med}:
            :...safety = high:
                :...lug_boots in {big,med}:
                :   :...lug_boots = big: vgood (25)
                :   :   lug_boots = med:
                :   :   :...doors in {4,5more}: vgood (12)
                :   :       doors = 2:
                :   :       :...buying = low: good (2)
                :   :       :   buying = med: acc (3/1)
                :   :       doors = 3:
                :   :       :...persons = 4: good (3/1)
                :   :           persons = more: vgood (4)
                :   lug_boots = small:
                :   :...doors = 2:
                :       :...persons = 4: good (2)
                :       :   persons = more: unacc (4)
                :       doors in {3,4,5more}:
                :       :...maint = low: good (11)
                :           maint = med:
                :           :...buying = low: good (4)
                :               buying = med: acc (2)
                safety = med:
                :...lug_boots = small:
                    :...persons = 4: acc (12)
                    :   persons = more:
                    :   :...doors = 2: unacc (2)
                    :       doors in {3,4,5more}: acc (7)
                    lug_boots in {big,med}:
                    :...maint = med:
                        :...buying = low: good (13/3)
                        :   buying = med: acc (13)
                        maint = low:
                        :...lug_boots = big: good (13)
                            lug_boots = med:
                            :...doors in {2,3}: acc (5/1)
                                doors in {4,5more}: good (6)

Evaluation

Basically, we could use the model’s summary alone to evaluate prediction error. But in order to provide more comprehensive evidence, we need additional validation, such as a confusion matrix, to compute true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn).

Evaluate model fit using the model summary

The image below shows a sample model summary. It looks like our model does a good job: the classification error is only about 1.2%, in other words only 16 of the training cases are misclassified. In detail, the summary tells us that 8 unacceptable cases were classified as acceptable, 1 good case as acceptable, 1 unacceptable case as good, 1 acceptable case as good, 4 acceptable cases as very good, and 1 good case as very good.

C50 model summary of car

Another piece of information provided by this summary is the attribute usage, which tells us the distribution of attributes used to produce decision nodes. As shown in the image above, safety is the most-used attribute, followed by persons and buying.

Evaluate model prediction error using a confusion matrix

The image below shows a sample confusion matrix for our model. It looks like our model fits the test set well enough, with only 14 misclassified cases.

Confusion matrix
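
Such a confusion matrix can be built by cross-tabulating actual against predicted labels; a sketch, assuming test and y_pred from the earlier steps:

# Rows are the actual classes, columns the predicted classes.
conf <- table(actual = test$class, predicted = y_pred)
conf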

Evaluate model prediction error using precision, recall, and the F1 score

To get better insight when measuring prediction error, we use additional measurements: precision, recall, and the F1 score. The image below shows a sample of these measurements for our model.

Precision, recall, and F1 score

Since the F1 score combines precision and recall into a single measurement, we can use it to summarize our model’s performance. The image above shows that our model has a high F1 score for the unacc, acc, and vgood labels.
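
These per-class measurements can be computed directly from the confusion matrix; a sketch, assuming the conf table built above (actual classes in rows, predicted classes in columns):

# For each class: precision = tp / predicted-as-class,
#                 recall    = tp / actually-in-class.
precision <- diag(conf) / colSums(conf)
recall    <- diag(conf) / rowSums(conf)
f1        <- 2 * precision * recall / (precision + recall)
round(rbind(precision, recall, f1), 3)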

Evaluate the splitting rule using a learning curve

In this tutorial, we used a cross-validation style procedure with accuracy as the measurement to produce the learning curve. The image below shows a learning curve produced using five iterations at each split ratio.

Learning curve

The learning curve above tells us that splitting rules in the range 60:40 to 80:20 all give solid performance.
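
One way to produce such a curve is sketched below: for each candidate training fraction, repeat a random split five times, refit the tree, and average the test accuracy. This uses repeated random subsampling as a stand-in for the cross-validation described above:

fractions <- seq(0.55, 0.85, by = 0.05)  # candidate training fractions
acc <- sapply(fractions, function(p) {
  mean(replicate(5, {
    idx  <- sample(nrow(car), size = floor(p * nrow(car)))
    fit  <- C5.0(class ~ ., data = car[idx, ])
    pred <- predict(fit, newdata = car[-idx, ])
    mean(pred == car$class[-idx])
  }))
})
plot(fractions, acc, type = "b",
     xlab = "training fraction", ylab = "mean test accuracy")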

Conclusion

As we saw in the model’s summary, the distribution of the criteria contributes substantially to the construction of the decision tree. The safety criterion, which is roughly normally distributed and has 100% attribute usage, splits the tree in two at the root, while the persons criterion, which is positively skewed and has 66.31% usage, becomes the next decision node.

Our 75:25 splitting rule seems to perform well enough, but we may still tune the splitting rule within the range 60:40 to 80:20 if necessary.

[*] Assume that we work at a car factory and want to produce a new car. Using the decision tree produced by the C5.0 algorithm, we want to know which car criteria are likely to pass the evaluation. After some time analyzing the decision tree, we decide to pick the seven most significant decision nodes:

# Format: ([class] [cases covered/misclassified]) -> statement
(vgood 24) -> the car has a capacity of 4 or more persons, a low or medium buying price, a low or medium maintenance price, high safety, and a big luggage boot.
(vgood 14/3) -> the car has a capacity of 4 or more persons, high safety, a low buying price, a high maintenance price, and a medium or big luggage boot.
(good 14) -> the car has a capacity of 4 or more persons, a low or medium buying price, medium safety, a low maintenance price, and a big luggage boot.
(good 12/3) -> the car has a capacity of 4 or more persons, medium safety, a medium or big luggage boot, a medium maintenance price, and a low buying price.
(acc 51) -> the car has a capacity of 4 or more persons, medium or high safety, a high or very high buying price, a low or medium maintenance price, and a big luggage boot.
(unacc 77) -> the car has a capacity of 2 persons, medium or high safety, and a very high maintenance price.
(unacc 35) -> the car has a capacity of 2 persons, a high or very high buying price, a low, medium, or high maintenance price, a small luggage boot, and medium safety.

Then, using the decision nodes picked above, we want to test this statement:

persons = 4
buying = med
maint = low
lug_boots = med
safety = med
doors = 4

[*] The model will classify this statement as ‘good’.
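
A sketch of checking this with the fitted model. The data frame must use the same column names and factor levels as the training data; here I use the UCI names from the loading sketch (lug_boot rather than lug_boots), so adjust to your own column names:

# Hypothetical new car matching the statement above.
new_car <- data.frame(buying = "med", maint = "low", doors = "4",
                      persons = "4", lug_boot = "med", safety = "med")

# Align factor levels with the training data before predicting.
for (col in names(new_car)) {
  new_car[[col]] <- factor(new_car[[col]], levels = levels(train[[col]]))
}

predict(decTree, newdata = new_car)  # expected: good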

Suggestion

If you are looking for a recent application of Decision Tree learning, try reading resource [3]. There you will find that a Decision Tree model performs on par with a Neural Network when predicting building energy demand in Japan.

Resources

  1. Knowledge acquisition and explanation for multi-attribute decision making
  2. Machine learning by function decomposition
  3. A decision tree method for building energy demand modeling
  4. A comparative study of decision tree ID3 and C4.5
  5. Induction of decision trees
  6. Foundations for statistical inference — Confidence intervals
  7. Package C50
  8. Car evaluation data set
  9. Substitute good
  10. Utility
  11. Unsupervised learning method
