Decision Trees on mark! || Need — Why — When || Quick Hands-On || Conclude

Atul Mishra
Published in Analytics Vidhya
12 min read · Dec 23, 2020

Intuition for Decision Trees!

My younger sister always says I'm bad at lying! When I ask her how she can conclude this, she quotes the points below:

  • Well, when you are lying, you develop a small smile while saying the sentence, which means you are actively thinking about the story you are building.
  • Then you suddenly deepen your voice, which is one big sign! ...and many more.

My objective is not to tell you how I lie 😅, but using this example, with two features such as smile_while_lying (true or false) and heavy_voice (true or false), she is able to conclude the target variable (lying or not lying).

So, a Decision Tree is a tree-based algorithm that works on "if-else" logical conditions. It can be used for both classification and regression problems, and its input features can be continuous as well as categorical.
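As a tiny illustration (my own sketch, using the hypothetical features from the example above, not code from this article), the same logic written as plain if-else conditions looks like this:

def is_lying(smile_while_lying: bool, heavy_voice: bool) -> str:
    if smile_while_lying:            # first question: the root node
        if heavy_voice:              # second question: an internal node
            return "lying"           # leaf
        return "probably lying"      # leaf
    return "not lying"               # leaf

print(is_lying(smile_while_lying=True, heavy_voice=True))   # -> lying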

Components Involved in DT:

  • Root Node: This is the base node of any decision tree. This attribute is finalized using a STATISTICAL APPROACH, commonly known as a splitting technique, to lower the impurity in the dataset for a particular feature.
  • Internal Node: A sub-node that is still in the process of, or still has the possibility of, being split further.
  • Leaf Node: A final node with no more children, such that the maximum impurity has been removed by the model.
Decision Tree Nomenclature

Splitting Technique in DT:

  • ID3 -> (Iterative Dichotomiser 3)
  • CART -> (Classification and Regression Tree)

Well, these are the two main criteria for deciding which feature to split on, and it all boils down to the impurity obtained while splitting. If, after a split, we get more HOMOGENEOUS subsets, then the chosen feature is the desired one for the ROOT NODE.

A general algorithm for a decision tree can be described as follows:

  1. Pick the best attribute/feature. The best attribute is one which best splits or separates the data.
  2. Ask the relevant question.
  3. Follow the answer path.
  4. Go to step 1 until you arrive at the answer (a minimal recursive sketch of this loop follows below).
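To make the loop concrete, here is a small, self-contained sketch of the recursion (my own illustration, not production code): rows are plain dicts of categorical features plus a "label" key, and the "best" attribute is simply the one whose split misclassifies the fewest rows, standing in as a placeholder for the ID3/CART criteria covered below.

from collections import Counter

def misclassified_after_split(rows, attr):
    # Rows that would be wrongly labelled if each branch predicted its majority class
    errors = 0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        majority = Counter(r["label"] for r in subset).most_common(1)[0][1]
        errors += len(subset) - majority
    return errors

def build_tree(rows, attrs):
    labels = Counter(r["label"] for r in rows)
    if len(labels) == 1 or not attrs:                      # pure node, or no question left to ask
        return labels.most_common(1)[0][0]                 # leaf: majority label
    best = min(attrs, key=lambda a: misclassified_after_split(rows, a))     # step 1: best attribute
    children = {}
    for value in {r[best] for r in rows}:                  # steps 2-3: ask and follow each answer
        subset = [r for r in rows if r[best] == value]
        children[value] = build_tree(subset, attrs - {best})                # step 4: repeat
    return {best: children}

rows = [
    {"smile_while_lying": True,  "heavy_voice": True,  "label": "lying"},
    {"smile_while_lying": True,  "heavy_voice": False, "label": "lying"},
    {"smile_while_lying": False, "heavy_voice": False, "label": "not lying"},
]
print(build_tree(rows, {"smile_while_lying", "heavy_voice"}))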

The best split is one which separates two different labels into two sets. Something like this:

Visual Rep. for a DT steps

Now, how do we choose the best splitting feature? How do I decide that WEATHER is the feature that leads to the lowest impurity? Well, we have two metrics to calculate exactly that.

Assumptions for Decision Trees:

  • Initially, the whole training set is considered as the root.
  • Features are preferred to be categorical; if continuous, they are discretized.
  • Records are distributed recursively on the basis of attribute values.
  • The attribute placed at the root node or an internal node is chosen using a STATISTICAL approach (ID3 or CART).

Iterative Dichotomiser 3:

ID3 is a greedy algorithm that grows the tree top-down, at each node selecting the attribute that best classifies the local training examples. This process continues until the tree perfectly classifies the training examples or until all attributes have been used.

Briefly, the steps to the algorithm are:

  1. Select the best attribute → A.
  2. Assign A as the decision attribute (test case) for the NODE.
  3. For each value of A, create a new descendant of the NODE.
  4. Sort the training examples into the appropriate descendant leaf nodes. If the examples are perfectly classified, STOP; otherwise, iterate over the new leaf nodes.

ID3 uses Information Gain as its evaluation metric for impurity. To define information gain precisely, we first need a measure commonly used in information theory, called entropy, which measures the level of impurity in a group of examples. Mathematically, it is defined as:

Entropy to calculate impurity
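In standard notation, with c classes and p_i the proportion of examples in S belonging to class i, this is:

Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i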

Now, given entropy as a measure of the impurity in a sample of training examples, we can define information gain as a measure of the effectiveness of an attribute in classifying the training data. The information gain Gain(S, A) of an attribute A, relative to a sample of examples S, is defined as:
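Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)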

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.

Mathematical Example: We want to figure out which attribute, A1 or A2, gives us the better split, i.e. the HIGHEST INFORMATION GAIN!

So, Gain(S, A1) is higher than Gain(S, A2), which indicates that more homogeneity is obtained when splitting on attribute A1.
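The exact numbers live in the figure above, but here is how the same comparison can be computed in Python (a quick sketch of mine; the class counts below are hypothetical, not the ones from the figure):

from math import log2

def entropy(pos, neg):
    # Entropy of a node containing `pos` positive and `neg` negative examples
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            result -= p * log2(p)
    return result

def info_gain(parent, subsets):
    # parent and each subset are (pos, neg) count pairs
    n = sum(parent)
    weighted = sum((p + q) / n * entropy(p, q) for p, q in subsets)
    return entropy(*parent) - weighted

S = (9, 5)                                    # hypothetical sample: 9 positive, 5 negative
print(info_gain(S, [(6, 1), (3, 4)]))         # split on A1 -> ~0.15
print(info_gain(S, [(6, 2), (3, 3)]))         # split on A2 -> ~0.05, so A1 wins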

CART (Classification and Regression Trees):

CART is another technique for splitting nodes into root/internal nodes. It uses the concept of the GINI INDEX, which is in fact the cost function used to reduce impurity. Personally, I find the GINI INDEX/SCORE friendlier and much easier to grasp than ID3 or C4.5/C5.0.

It is given as:
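Gini = 1 - \sum_{i=1}^{c} p(i)^2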

p(i) is the probability of picking a data point of class i.

The algorithm works as below:

  • First we calculate the Gini impurity of the sub-nodes using 1 - (p² + q²), where p and q are the class probabilities within that sub-node.
  • Then we calculate the Gini Index for the split as the weighted Gini score of the sub-nodes produced by that split.

Quick Example:

So, taking the same example as above, I want to show how the split is calculated by hand using the GINI INDEX method.

By-hand experimentation for GINI SCORE

I made the GINI calculations by hand so that the reader can follow exactly what I am trying to explain and showcase.
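If you prefer code to hand calculation, here is a minimal sketch of the same two steps (my own illustration; the counts are hypothetical, not the ones from the worked example above):

def gini_impurity(counts):
    # Gini impurity of a sub-node given its per-class counts, i.e. 1 - sum(p_i^2)
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_for_split(subsets):
    # Weighted Gini score of a split; `subsets` is a list of per-class count tuples, one per sub-node
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * gini_impurity(s) for s in subsets)

# Hypothetical split of 14 examples (9 positive, 5 negative) into two sub-nodes
print(gini_for_split([(6, 1), (3, 4)]))    # ~0.37
print(gini_for_split([(6, 2), (3, 3)]))    # ~0.43, so the first split is purer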

So, a quick recap of the topics we covered:

  • Decision Trees Introduction and Intuition
  • Components of Decision Trees
  • Assumptions of Decision Trees
  • Splitting Methodologies (ID3 and CART)
  • Explanation and Mathematical calculation for Splitting Techniques

Now, comes my favourite part.

Hands-On:

For the hands-on experiment, we will use the Balance Scale dataset, where the target variable is a multiclass label (L -> Left, B -> Balanced, R -> Right).

Balance Scale

Here, the objective is to predict one of these directions in which the balance tip will move based on features such as left-weight, left-distance, right-weight, right-distance.

Link to the notebook is attached. This dataset comes from UCI Machine Learning Repository.

How we approach the problem is as follows:

  1. Initially we will try to see if any data imputation has to be performed/checking missing values/dtypes/info/etc.
  2. Then we will see the distribution of the Target Variable Levels as an inference followed by some visualizations!
  3. Followed by Modelling and then checking the F1-Score as our Evaluation Metric!
  4. Then we will perform a slight improvement using GridSearchCV on our Decision Tree model as part of Hyper-parameter Tuning in search of better results.
  5. Then we conclude with results and comments.

Loading/Analyzing and Preparing the Data!:

  • While loading the data, we found that the column names were missing, which can be a problem for any kind of analysis. Hence, in pandas, we can pass a list of column names if we know their sequence.
import pandas as pd

balance_data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data',
    sep=',', header=None,
    names=['class_name', 'left_weight', 'left_distance', 'right_weight', 'right_distance'])
  • Our data looks something like this:
balance_data.head()
  • Moving on, we viewed the shape and info of the dataset loaded!
print("Dataset Length: ", len(balance_data))
print("Dataset Shape: ", balance_data.shape)
print(balance_data.info())
  • So, the shape is (625, 5), with no null values by the looks of it. Also, since this is quite a standard dataset, our predictor variables are already numerical and the target variable is a MULTICLASS VARIABLE!
  • Let's see how many levels our target variable class_name has, and how its distribution is spread across those 625 rows!
# Exploring the target variable
target_var = balance_data.iloc[:, 0]
print(target_var.value_counts())

# Let's check the class name distribution
print('% Class L', round(target_var.value_counts().iloc[0] / len(target_var) * 100, 2))
print('% Class R', round(target_var.value_counts().iloc[1] / len(target_var) * 100, 2))
print('% Class B', round(target_var.value_counts().iloc[2] / len(target_var) * 100, 2))
  • Result:
  • Number of Instances: 625 (49 balanced, 288 left, 288 right)
  • Class Distribution:
  1. 46.08% are L
  2. 46.08% are R
  3. 7.84% are B
  • So, we can see that the B (balanced) class appears far fewer times than the others. Should we consider this an imbalanced scenario? Let me know in the comment section, along with a remedy for it!
  • Since everything looks good, we will try a pair plot to understand the dependency of the features on each other!
# Pairplot
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(balance_data);
plt.title('Pair Plot!');
  • Clearly, we can see no dependency among the features! Similarly, I would like to share a snippet of code with which you can find the number of missing values in your data frame, along with the percentage of missing values and the dtype of each feature.
# Checking for missing values
def check_missing_values_function(df):
    missing_values = df.isnull().sum()
    missing_values_percent = round(100 * (df.isnull().sum() / len(df)), 2)
    dtype = df.dtypes
    info_df = pd.DataFrame({'Missing Value Count': missing_values,
                            '% of Missing': missing_values_percent,
                            'DTYPE': dtype})
    info_df.sort_values(by='% of Missing', inplace=True, ascending=False)
    return info_df

check_missing_values_function(balance_data)  # Calling the function on your DF
  • Result:
Looks nice na!!
  • Now, to back up the independence we saw in the pair plot, we also draw the correlation matrix as a heatmap plot!
# numeric_only=True excludes the categorical class_name column from the correlation
sns.heatmap(balance_data.corr(numeric_only=True), annot=True);
plt.xticks(rotation=30)
plt.yticks(rotation=30)
plt.title('Correlation Plot!');

Modelling:

  • Before modelling, you might be wondering why I have not scaled the data and am still going straight to modelling! Well, to answer this: Decision Trees and Random Forests are immune to feature magnitude, so scaling is not required.
  • Why? A node of a tree partitions your data into 2 sets by comparing a feature (the one that splits the dataset best) against a threshold value. There is no penalty on the threshold value itself, so it simply adapts to the feature's own scale and is not affected by different scales.
  • Now, let's split our data, in a 70:30 format.
# Separating the target variable
from sklearn.model_selection import train_test_split

predictor_variables = balance_data.values[:, 1:5]
target_variables = balance_data.values[:, 0]

# Splitting the dataset into train and test (70:30)
X_train, X_test, y_train, y_test = train_test_split(
    predictor_variables, target_variables, test_size=0.3, random_state=100)
  • Let's start building our model, beginning with the CART/GINI INDEX method.
# Modelling using the GINI INDEX methodology
from sklearn.tree import DecisionTreeClassifier

clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)

# Performing training
clf_gini.fit(X_train, y_train)
DT Classifier for GINI INDEX
  • Now, how do we evaluate the results, and what should our metric be? Yup, you got it right: we are talking about Accuracy and the F1-Score as our evaluation metrics. So, let's look at them along with a Confusion Matrix to get a gist of the final results!
# Making predictions on the TEST SET
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

y_pred = clf_gini.predict(X_test)

# Since this is a multiclass classification problem, we need the full CLASSIFICATION REPORT
print("Confusion Matrix: ")
print(confusion_matrix(y_test, y_pred))
print()
print("Accuracy : ", accuracy_score(y_test, y_pred) * 100)
print()
print("Report : ", classification_report(y_test, y_pred))
  • Result:
Confusion Matrix
  • So, we get an Accuracy of around 73% and an F1-Score of around 51%, which is quite low!
  • Let's also try the ENTROPY/ID3 methodology to see whether we get better results!
# Decision tree with entropy
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=5)

# Performing training
clf_entropy.fit(X_train, y_train)

# Making predictions on the TEST SET
y_pred = clf_entropy.predict(X_test)
DT Classifier for ENTROPY Split
  • Evaluation Results!
  • Oh! Oh! The scores DECREASED across the board. Now, can you think of anything else we could do to increase our ACCURACY? 🤔💭

Hyperparameter Tuning!

  • Now, before we move on to building a better model, which SPLITTING METHODOLOGY should we choose? It is the GINI method, because that is where our model splits better and we achieve the higher result!
  • There are multiple ways to perform this tuning. One of the shortest and best ways is to use GridSearchCV and try a range of values for certain parameters.
Visual working of CV Methodology!
  • Here our evaluation metric will be the F1-Score, so when we look at the best score we will be referring to the F1-Score and not accuracy!
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [2, 5, 8, 11, 14, 17, 20],
    'min_samples_leaf': [5, 10, 15, 20],
    'min_samples_split': [10, 20, 30],
    'max_features': [1, 4]
}

dt_hyper = DecisionTreeClassifier(class_weight='balanced', criterion="entropy")

# Scoring on macro F1 so that best_score_ reports the F1-Score rather than accuracy
grid_search = GridSearchCV(estimator=dt_hyper, param_grid=param_grid,
                           scoring='f1_macro', refit=True,
                           cv=5, n_jobs=-1, return_train_score=True, verbose=1)
grid_search.fit(X_train, y_train)
  • Result:
print('We can get a best macro F1-Score of', round(grid_search.best_score_, 2), 'using', grid_search.best_params_)
  • Let’s build our final Model then:
# Final model using the GINI INDEX methodology with the tuned hyperparameters
dt_final_model = DecisionTreeClassifier(criterion="gini", random_state=100,
                                        max_depth=11, min_samples_leaf=5,
                                        min_samples_split=10, max_features=4)

# Performing training
dt_final_model.fit(X_train, y_train)

# Making predictions on the TEST SET
y_pred = dt_final_model.predict(X_test)

# Since this is a multiclass classification problem, we need the full CLASSIFICATION REPORT
from sklearn.metrics import f1_score

print("Confusion Matrix: ")
print(confusion_matrix(y_test, y_pred))
print()
print("Accuracy : ", round(accuracy_score(y_test, y_pred) * 100, 2))
print()
print("Report : ", classification_report(y_test, y_pred))
print()
print('F1 Score: ', round(f1_score(y_test, y_pred, average='micro') * 100, 2))
# 'micro': calculate metrics globally by counting the total true positives,
# false negatives and false positives.
  • Final Confusion Matrix with Results!

GREAT! We actually got a much better working model, with over 81% accuracy. And if we had more rows of data, this more complex model could learn even more!

  • This kind of hyperparameter tuning can be performed on any kind of model, be it Random Forest, Logistic Regression, XGBoost, etc.

Conclusion

  • Finally, after all the EDA/modelling/hyper-tuning we did, what about the features?
  • How do we know which feature matters more to the model? Can we somehow see that feature importance?
  • Yes, using model.feature_importances_
# Creating a feature name vs feature importance dataframe
feature_names = ['left_weight', 'left_distance', 'right_weight', 'right_distance']
feature_df = pd.DataFrame({'feature_names': feature_names,
                           'feature_imp_value': dt_final_model.feature_importances_})
feature_df = feature_df.sort_values(by='feature_imp_value').reset_index(drop=True)
feature_df
  • We can see that the most important feature for the model turned out to be right_distance, and the least important is right_weight.
  • Let’s plot it out!
plt.figure(figsize=(12,8))
sns.barplot(x = feature_df.feature_names,y = feature_df.feature_imp_value);
plt.rcParams.update({'font.size': 22})
Feature Importance Graph!
  • You can see that there is about a 3% accuracy difference when we use the Entropy method; however, the per-class F1-Scores, which are the trade-off we should really be looking at, seem more desirable!
  • We then tried a Grid Search, using the GINI method for the final model, to find the best hyperparameters!
  • Here, our hyperparameter tuning was focused on getting a better macro F1-Score: around 72% overall.

Pros of DT:

  1. Decision trees require less effort for data preparation.
  2. A decision tree does not require normalization/scaling of the data, as explained above.
  3. Missing values in the data also do NOT affect the process of building a decision tree to any considerable extent.
  4. They are not strongly affected by outliers, which is a great relief.
  5. A decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.
  6. One can get feature importances as a result, to explain the driving factors behind the model and draw conclusions.

Cons of DT:

  1. A small change in the data can cause a large change in the structure of the decision tree, causing instability. High variance!
  2. Decision tree training can be relatively expensive, as the complexity and training time grow quickly.
  3. A single decision tree is often inadequate for regression problems that need smooth, continuous predictions, since its output is piecewise constant.
  4. Decision trees tend to overfit sometimes, and hence we need to average the results of multiple DTs, which leads to the concept of RANDOM FORESTS!
  • Most interview questions are drawn from the points mentioned in the pros and cons; technical questions such as how nodes are split under CART and ID3 are also commonly asked.

I hope you enjoyed this journal, with its brief explanation and quick hands-on, and got the gist of the Decision Tree algorithm!

Give a thumbs up/clap if you found it useful, and till then stay tuned, stay safe, keep smiling!

Connect with me at:

Happy Learning🌟
