Machine Learning Crash Course: Decision Trees and Random Forests

Code Primer
4 min read · Jan 14, 2023


In this part of the “Machine Learning Crash Course” series, we will dive into the world of decision trees and random forests, two popular and powerful supervised learning algorithms used for both classification and regression problems. These algorithms are simple to understand, interpret, and use, and they are widely applied in industry for applications such as image recognition, natural language processing, and bioinformatics.

https://blog.mindmanager.com/wp-content/uploads/2022/03/Decision-Tree-Diagram-Example-MindManager-Blog.png

A decision tree is a flowchart-like tree structure used to represent a decision process. It is a simple, easy-to-interpret algorithm that can be used for both classification and regression problems. A decision tree is built by recursively partitioning the data into smaller subsets based on the values of the input features. Each internal node of the tree tests a feature, and each leaf node holds a class or value. To make a prediction, you start at the root node, test the feature at that node, and follow the left or right branch depending on the feature value; this process repeats until a leaf node is reached.
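To make the flowchart analogy concrete, here is a minimal sketch of what a small trained tree effectively computes. The features and threshold values below are illustrative assumptions, not the result of actually training on data:

# A hand-written illustration of the kind of rule a small decision tree learns.
# The thresholds (2.45, 1.75) are made up for illustration; a real tree learns
# its own features and thresholds from the training data.
def predict_iris(petal_length_cm, petal_width_cm):
    if petal_length_cm <= 2.45:      # root node: test one feature
        return "setosa"              # leaf node: return a class
    elif petal_width_cm <= 1.75:     # internal node: test another feature
        return "versicolor"
    else:
        return "virginica"

print(predict_iris(1.4, 0.2))  # follows the left branch -> "setosa"
print(predict_iris(5.1, 2.0))  # follows the right branches -> "virginica"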

A random forest is an ensemble of decision trees. It works by training multiple decision trees on different random subsets of the data and then aggregating their predictions (majority voting for classification, averaging for regression). The main idea behind random forests is that combining the predictions of many decision trees reduces overfitting and improves the overall performance of the model, which is why a random forest is generally more robust and less prone to overfitting than a single decision tree.
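As a rough sketch of the ensemble idea (bootstrap sampling plus majority voting; a full random forest additionally samples a random subset of features at each split), something like the following illustrates it, assuming the labels are non-negative integer class indices:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_predict(X_train, y_train, X_test, n_trees=10, seed=0):
    """Train n_trees trees on bootstrap samples and majority-vote their predictions."""
    rng = np.random.default_rng(seed)
    all_preds = []
    for _ in range(n_trees):
        # Bootstrap sample: draw rows with replacement
        idx = rng.integers(0, len(X_train), size=len(X_train))
        tree = DecisionTreeClassifier()
        tree.fit(X_train[idx], y_train[idx])
        all_preds.append(tree.predict(X_test))
    all_preds = np.array(all_preds)  # shape: (n_trees, n_test_samples)
    # Majority vote across trees for each test sample
    return np.array([np.bincount(col).argmax() for col in all_preds.T])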

An example of when we might use decision trees and random forests is a bank trying to predict whether a customer will default on a loan. The bank could gather data on customer demographics, credit history, and loan terms, and use a decision tree or random forest to predict which customers are most likely to default. Both decision trees and random forests are a good choice for this problem because they can handle a large number of input variables and non-linear decision boundaries.
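As a sketch of how that loan-default setup might look in code (the file name and column names here are hypothetical placeholders, not a real dataset):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical loan data: the CSV file and its columns are placeholders.
loans = pd.read_csv("loans.csv")  # assumed columns: age, income, credit_score, loan_amount, defaulted
X = loans[["age", "income", "credit_score", "loan_amount"]]
y = loans["defaulted"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
print("Held-out accuracy:", model.score(X_te, y_te))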

In terms of how decision trees are trained, the algorithm is based on recursive partitioning of the data. At each step, the goal is to find a feature and a threshold value that split the data into the most homogeneous subsets, i.e. the subsets with the least impurity (a measure of how mixed the labels are within a subset). The most commonly used measure of impurity is the Gini impurity: the probability that a randomly chosen element from the subset would be incorrectly labeled if it were labeled at random according to the distribution of labels in that subset. The algorithm keeps splitting the subsets on the feature and threshold that yield the lowest impurity until a stopping criterion is met, such as reaching a maximum depth or a minimum number of samples in a leaf node.
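Gini impurity can be computed directly from the label counts in a subset; this short snippet simply follows the definition above (1 minus the sum of the squared class proportions):

import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0 -> perfectly pure subset
print(gini_impurity([0, 0, 1, 1]))  # 0.5 -> maximally mixed two-class subset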

Now, let’s take a look at how to visualize a decision tree and a random forest in Python. We will use the graphviz library to visualize the decision tree and matplotlib to visualize the feature importances of the random forest.

# Load the data
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

# Import the required libraries
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import graphviz
import matplotlib.pyplot as plt

# Create an instance of the DecisionTreeClassifier class
clf = DecisionTreeClassifier(criterion='gini')

# Fit the model using the training data
clf.fit(X_train, y_train)

# Visualize the decision tree
dot_data = export_graphviz(clf, out_file=None, feature_names=iris.feature_names, class_names=list(iris.target_names))
graph = graphviz.Source(dot_data)
graph.render("decision_tree")

# Create an instance of the RandomForestClassifier class
clf = RandomForestClassifier(n_estimators=100)

# Fit the model using the training data
clf.fit(X_train, y_train)

# Visualize the feature importance
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
plt.barh(range(X_train.shape[1]), importances[indices])
plt.yticks(range(X_train.shape[1]), [iris.feature_names[i] for i in indices])
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.show()
(The resulting bar chart shows the importance of each iris feature for the random forest’s predictions.)

As you can see, we used the graphviz library to visualize the decision tree and matplotlib to visualize the feature importances of the random forest. The decision tree visualization shows how the tree makes its decisions, and the feature importance plot shows which features contribute most to the final prediction.
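Beyond visualization, a quick way to compare the two models is to score them on the held-out test set. A minimal sketch, reusing the same iris split from above:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

tree = DecisionTreeClassifier(criterion='gini', random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Decision tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))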

In conclusion, decision trees and random forests are two powerful supervised learning algorithms that can be used for both classification and regression problems. They are easy to interpret, train, and use, and they are widely applied in industry for applications such as image recognition, natural language processing, and bioinformatics. They also provide useful estimates of feature importance. Try applying these algorithms to some of your own datasets and see how they perform.
