Expressing Neural Networks as Decision Trees

Explaining how any Neural Network with an Activation can be Expressed as an Equivalent Decision Tree

Dev Shah
deMISTify

--

Deep learning and machine learning are two of the most heavily researched topics in tech right now. As a result of this boom, many large corporations have been looking to bring machine learning models into their workplaces. However, machine learning is a wide domain, so let's break it down :)

Machine learning breaks down into many different subsets, and one of the most commonly used is the neural network. The big idea behind neural nets is that they help machines learn, and their design has been heavily influenced by the way our brain works. More specifically, neural networks loosely model how the brain's neurons send signals to one another. As such, neural networks have gained lots of traction and success over the years.

If you want to learn more about neural networks, check out my other article here.

However, neural networks have a very 'black-box' nature: it's possible to approximate nearly any function with them, but analyzing their structure gives you no insight into the composition of the function being approximated. In other words, we don't know how the individual neurons work together to arrive at the final output. This opacity has held neural networks back from being adopted in domains where decisions must be explained.

To grasp how neural networks make decisions, it helps to categorize the existing approaches to interpreting them: saliency maps, approximation by interpretable models, and joint models.

Saliency Maps

Saliency maps highlight the areas of the input that the neural network relies on most when making a prediction or generating an output. Intuitively, they mark the region of the input that a person's eyes would focus on first; for example, when looking at a picture of a person, our eyes tend to go to their face. The concept stems from saliency in images: the unique features that form the foundation of an image. A saliency map is a topographical representation of those features. Research in this field has shown promise, but many papers have concluded that the resulting saliency maps are noisy, which prevents analysts from understanding the entire decision-making process.

Interpretable Models

Research has shown a direct link between interpretable-by-design models (e.g., decision trees) and neural networks. In 2018, a paper was published (Deep neural network initialization with decision trees) that proposed a method to initialize neural networks from decision trees. However, the networks studied had a very specific structure and architecture, so the findings cannot be extrapolated to a general model just yet. Many other papers have shown some promise, but the results have been limited.

Joint Models

Lastly, joint neural network and decision tree models use deep learning to build networks whose structure resembles a tree. Recent papers such as NBDT: Neural-Backed Decision Trees have explored replacing the final fully-connected layer of a neural network with a decision tree. A common trend among these methods, however, is that the core principles of neural networks still serve as the backbone, so a human still has to verify and validate each decision, whether the result is positive or negative.

However, a recent paper has shown that any neural network with any activation can be expressed as an equivalent decision tree. The tree representation neither limits nor requires modifying the architecture of the neural network in any way. As mentioned earlier, one of the biggest problems with neural networks is their black-box nature; with decision trees, this can be tackled by 'analyzing the category that a test sample belongs to, which can be extracted by the node rules that a sample is categorized.'

Decision Tree Analysis of Neural Networks

In this part of the article, we’re going to dive into feed-forward neural networks, primarily with piece-wise linear activation functions such as ReLU (rectified linear unit), Leaky ReLU, etc.

Fully Connected Networks

To understand how these neural networks can be expressed as decision trees, it is vital to understand the math behind them. We start by writing the network as a composition of the weight matrices, the piece-wise linear activation function, and the input:

NN(x₀) = Wᵗₙ σ(Wᵗₙ₋₁ σ( … σ(Wᵗ₀ x₀))) (Equation 1)

Wᵢ: weight matrix of layer i, σ: piece-wise linear activation function, x₀: input of the neural network

In this equation, the activation function σ acts as an element-wise scalar multiplication. Element-wise multiplication is a binary operation that takes two matrices (or vectors) of the same size and produces a new one whose elements are the products of the corresponding elements. For a piece-wise linear activation, the equation can therefore be rewritten as follows:

σ(Wᵗᵢ₋₁ xᵢ₋₁) = aᵢ₋₁ ⊙ (Wᵗᵢ₋₁ xᵢ₋₁) (Equation 2)
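To see this element-wise view in code, here is a minimal NumPy sketch (my own illustration, with made-up shapes, not code from the paper): for ReLU, the slope vector a has a 1 wherever the pre-activation is non-negative and a 0 elsewhere, so σ(Wᵗx) is exactly a ⊙ (Wᵗx).

```python
import numpy as np

# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # weight matrix W_{i-1}
x = rng.standard_normal(4)        # input x_{i-1}

z = W.T @ x                       # pre-activation W^T x
a = np.where(z >= 0, 1.0, 0.0)    # slope vector for ReLU (0.01 below 0 would give leaky-ReLU)
relu = np.maximum(z, 0.0)         # the usual ReLU

# sigma(W^T x) == a ⊙ (W^T x), element by element
assert np.allclose(relu, a * z)
```

The same trick works for any piece-wise linear activation; only the entries of a change.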

In this equation, aᵢ₋₁ is a vector containing the slopes of the activation in the linear regions where the entries of Wᵗᵢ₋₁ xᵢ₋₁ fall. The ⊙ operator before Wᵗᵢ₋₁ xᵢ₋₁ denotes the element-wise multiplication discussed previously. Another interesting thing to note is that aᵢ₋₁ can be interpreted as a categorization result, since it records which linear region of the activation function each unit landed in. As such, we can rewrite this equation as follows:

Wᵗᵢ (aᵢ₋₁ ⊙ (Wᵗᵢ₋₁ xᵢ₋₁)) = (Wᵢ ⊙ aᵢ₋₁)ᵗ (Wᵗᵢ₋₁ xᵢ₋₁) (Equation 3)

In this next equation, the ⊙ operator on the right-hand side represents a column-wise element-wise multiplication on Wᵢ, which is directly related to the element-wise multiplication we discussed earlier: by repeating the aᵢ₋₁ vector, we obtain a matrix that matches the size of Wᵢ. Applying Equations 3 and 1 repeatedly, layer by layer, yields the following equation:

NN(x₀) = (Wₙ ⊙ aₙ₋₁)ᵗ (Wₙ₋₁ ⊙ aₙ₋₂)ᵗ … (W₁ ⊙ a₀)ᵗ Wᵗ₀ x₀ (Equation 4)

From this equation, we can see that the network applies an effective weight matrix directly to the input x₀. Thus, we can obtain the following:

NN(x₀) = Ŵᵗₙ x₀, where Ŵᵗᵢ = (Wᵢ ⊙ aᵢ₋₁)ᵗ (Wᵢ₋₁ ⊙ aᵢ₋₂)ᵗ … Wᵗ₀ (Equation 5)
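As a sanity check on this effective-matrix idea, here is a hedged NumPy sketch (toy layer sizes of my own choosing, not the paper's): a 2-hidden-layer ReLU network is run normally, then collapsed into a single input-dependent matrix applied directly to x₀.

```python
import numpy as np

rng = np.random.default_rng(1)
W0 = rng.standard_normal((2, 3))  # hypothetical layer weights
W1 = rng.standard_normal((3, 3))
W2 = rng.standard_normal((3, 1))
x0 = rng.standard_normal(2)       # hypothetical network input

relu_slope = lambda z: np.where(z >= 0, 1.0, 0.0)

# Ordinary forward pass.
z0 = W0.T @ x0
z1 = W1.T @ np.maximum(z0, 0.0)
out = W2.T @ np.maximum(z1, 0.0)

# Fold each slope vector a_i into the next layer's weights:
# (W * a[:, None]) scales row j of W by a_j, which is the
# column-wise element-wise multiplication W ⊙ a of Equation 3.
a0, a1 = relu_slope(z0), relu_slope(z1)
W_hat = W0 @ (W1 * a0[:, None]) @ (W2 * a1[:, None])

# One matrix applied directly to the input reproduces the network.
assert np.allclose(out, W_hat.T @ x0)
```

Note that W_hat itself depends on x₀ through a₀ and a₁: different inputs select different effective matrices, which is precisely the branching behaviour.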

From Equation 5, the effective matrix depends on which linear regions were selected at every preceding layer. We can collect these selections into a categorization vector:

cᵢ₋₁ = a₀ || a₁ || … || aᵢ₋₁ (Equation 6)

where || refers to the concatenation operator.

From equation 5, we can see that the effective matrix of layer i depends on the categorization vectors derived from the previous layers. In other words, each new layer applies a new filter to the network input, and that filter is selected based on the decisions made before layer i. This is exactly how decision trees work: the decision taken at each branch determines which branch is evaluated next.

This gives a rough picture of the equivalent decision tree. We can illustrate it with more clarity by turning equation 5 into an algorithm. With the ReLU activation, each iteration of the loop corresponds to a node in the decision tree where a simple yes/no decision is made: is the pre-activation positive or not? The paper presents the resulting algorithm as follows:
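The paper's exact pseudocode isn't reproduced here, but the idea can be sketched in a minimal Python version of my own (assuming plain ReLU layers): the recorded yes/no pattern is the root-to-leaf path, and every input that produces the same pattern shares one effective linear model.

```python
import numpy as np

def tree_forward(weights, x0):
    """Forward pass that also returns the root-to-leaf path.

    Each hidden ReLU unit is one yes/no node ('is its pre-activation
    positive?'). The tuple of answers identifies the leaf.
    """
    x, path = x0, []
    for W in weights[:-1]:
        z = W.T @ x
        a = (z >= 0).astype(float)        # branch decisions, one per unit
        path.extend(int(b) for b in a)
        x = a * z                         # ReLU written as slope * z
    return weights[-1].T @ x, tuple(path)

rng = np.random.default_rng(2)
weights = [rng.standard_normal((2, 2)), rng.standard_normal((2, 1))]
x0 = rng.standard_normal(2)
out, leaf = tree_forward(weights, x0)

# Same result as the ordinary forward pass.
plain = weights[-1].T @ np.maximum(weights[0].T @ x0, 0.0)
assert np.allclose(out, plain)
```

With k hidden units in total there are at most 2ᵏ distinct leaves, which is why the raw trees in the paper grow large before pruning.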

Using this algorithm, a decision tree equivalent can be constructed for a concrete network: a neural network with 3 layers, having 2, 1, and 1 filters in layers 1, 2, and 3 respectively. As mentioned previously, we're working with ReLU activations; this network has ReLU activations between the layers and no activation after the 3rd/final layer. The paper gives the algorithm alongside a visualization of the decision tree.

The final part of visualizing and expressing a neural network as a decision tree is to ensure that there exists a decision tree equivalent for skip connections and normalizations.

Skip Connections

The analysis of skip connections uses a similar breakdown to what we did above with the feed-forward neural network. We first start off with a residual neural network, expressed by the following equation:

With this equation, we can do something similar to what we did in the previous section and we can express ᵣxᵢ with the aᵢ₋₁ vector. Thus, it can be rewritten as follows:

Earlier in the article, an effective matrix was used; similarly, effective matrices can be used here on residual neural networks to rewrite the equation in this form:

This equation is very similar to equation 5: the residual effective matrix is defined entirely by the activations of the previous layers. As such, we can once again express this portion of the neural network as a decision tree equivalent.
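As an illustration, here is a toy residual block of my own (assuming the simple form ᵣxᵢ = ᵣxᵢ₋₁ + σ(Wᵗ ᵣxᵢ₋₁), not necessarily the paper's exact formulation): the skip connection merely adds the identity to the effective matrix, so the block still collapses to one matrix chosen by the activation pattern.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((3, 3))   # hypothetical square residual weights
x = rng.standard_normal(3)

z = W.T @ x
a = np.where(z >= 0, 1.0, 0.0)            # ReLU slope vector
res_out = x + np.maximum(z, 0.0)          # ordinary residual forward pass

# Effective residual matrix: identity (skip path) plus W with each
# column j scaled by a_j (the W ⊙ a of the main text).
W_eff = np.eye(3) + W * a[None, :]
assert np.allclose(res_out, W_eff.T @ x)
```

Stacking such blocks multiplies their effective matrices, just as in the plain feed-forward case.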

Normalization Layers

This part is relatively simple to understand, as most commonly employed normalization layers are linear. A normalization layer normalizes each input across its features, and at inference time this reduces to an affine scale and shift applied to whatever comes out of the previous layer. Since the operation is linear, it folds cleanly into the decision tree analysis, whether the normalization is applied pre-activation or post-activation.
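To see why a frozen normalization layer doesn't break the tree view, here is a sketch using the standard inference-time batch-norm identity (my own illustration, not code from the paper): the scale and shift fold into the preceding linear layer, keeping the pre-activation affine in the input.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((4, 3))                 # hypothetical linear layer
x = rng.standard_normal(4)
gamma, beta = rng.standard_normal(3), rng.standard_normal(3)
mean, var, eps = rng.standard_normal(3), rng.random(3) + 0.1, 1e-5

# Batch norm with frozen statistics: y = gamma * (z - mean)/sqrt(var+eps) + beta
s = gamma / np.sqrt(var + eps)
bn = s * (W.T @ x - mean) + beta                # linear layer -> batch norm

# Fold the scale into the weights and the shift into a bias.
W_fold = W * s[None, :]                         # column j scaled by s_j
b_fold = beta - s * mean
assert np.allclose(bn, W_fold.T @ x + b_fold)
```

After folding, the layer looks like any other linear layer, so the decision tree construction goes through unchanged.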

The last part of the paper discusses the outcomes of the experiments the authors ran. The results showed a lot of promise:

Experimental Results

In this experiment, a neural network was fit to the function y = x². The network had 3 dense layers, each with 2 filters, except the last layer, which had 1 filter. Leaky-ReLU was used after every fully connected layer except the last one. It was chosen because it has a small non-zero slope for negative values, whereas the regular ReLU function is flat (zero slope) there. The network was trained on 5000 (x, y) pairs, with x limited to the range [-2.5, 2.5]. Here is the visualization of the decision tree:

Every black box in this decision tree represents a rule. For some context, a rule in a decision tree is a conditional statement that can be read and understood by a human; databases employ such rules to identify sets of records. Each rule node has two children: the left child is taken when the rule doesn't hold, and the right child when it does.

In this network, each leaf applies a linear function determined by the decisions made along its path; these are indicated by the red rectangular boxes. One problem stands out: the tree representation is quite large and involves many categorizations, some of which are redundant. For example, the tree checks whether x < -1.16 and later checks whether x < 0.32; once the first condition holds, the second holds automatically, so its 'no' child can never be reached. These unreachable child nodes are invalid and can be pruned, condensing the entire tree. Doing so yields a simplified decision tree with only 5 categories, whereas the previous tree had 16. The 5 categories can be visualized as follows:
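The pruning step can be sketched in plain Python (a hypothetical helper of my own, with the thresholds borrowed from the example above): tracking the feasible interval along a path exposes children that no input can ever reach.

```python
def reachable(path):
    """path: list of (threshold, holds) rules of the form 'x < t'.

    Returns True if some x satisfies every rule on the path, found by
    intersecting the implied intervals.
    """
    lo, hi = float("-inf"), float("inf")
    for t, holds in path:
        if holds:
            hi = min(hi, t)   # rule holds:  x < t
        else:
            lo = max(lo, t)   # rule fails:  x >= t
    return lo < hi

# After taking the branch x < -1.16, re-testing x < 0.32 is redundant:
assert reachable([(-1.16, True), (0.32, True)])       # 'yes' child: fine
assert not reachable([(-1.16, True), (0.32, False)])  # 'no' child: dead
```

Deleting every dead child (and splicing out its now-trivial parent) is what condenses the 16-leaf tree into 5 regions.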

From this point forward, the interpretation of the neural network is much clearer. Each region whose boundaries are determined by the decision tree can be approximated by a linear equation, as seen in figure 3b above. This was a fairly simple example, so a more involved one was explored: half-moon classification.

Half-Moon Classification as Decision Trees 🌓

In this example, a fully connected neural network with 3 layers is trained with leaky-ReLU activations, except for the last layer, which uses a sigmoid activation. The sigmoid squashes its input so that the output is always between 0 and 1. Each layer has 2 filters, except the last one, which has 1. After cleaning the redundant parts of the decision tree, here is what the condensed tree looks like:

Earlier in the article, we talked about defining specific boundaries with the decision trees. In this example, the decision tree is able to find many categories where the boundaries are determined by the rules throughout the tree. These categories can be illustrated with different colors:

Overall, some portions of the decision tree are well-defined and tightly bounded; their classifications line up with the training data, so those regions are reliable. The opposite also exists: some portions are not well-defined and fail to provide a compact representation of the training data, which leads to inaccurate decisions by the network in those regions.

The decision tree also, surprisingly, brings computational advantages to the table. In the following table, the number of parameters, floating-point comparisons, and multiplication/addition operations are compared.

All these values are 'expected values', since the category depth varies across different paths of the tree. It can be seen that the number of parameters in the tree representation is larger than that of the neural network: in the tree, the filters are applied directly to the input, whereas in the neural network they are applied to the previous layer's features, and the unrolled filters are much larger than those features in the feature dimension. Despite the extra parameters, the tree performs fewer operations per inference. Thus, a neural network can not only be represented as a decision tree; the tree representation can even hold a computational advantage. Extrapolating these findings will be crucial in tackling the black-box nature of neural networks.

If you enjoyed this article, feel free to give it claps and share it! You can catch me on LinkedIn and if you want to check out some of my other work, here’s my personal website :)

References

  1. “Decision Tree.” Oracle Data Mining Concepts, https://docs.oracle.com/cd/E18283_01/datamine.112/e16808/algo_decisiontree.htm.
  2. Aytekin, Caglar. “Neural Networks Are Decision Trees.” arXiv, 2022, https://arxiv.org/pdf/2210.05189.pdf.
  3. Sharma, Abhishek. “What Are Saliency Maps in Deep Learning?” Analytics India Magazine, 10 Oct. 2020, https://analyticsindiamag.com/what-are-saliency-maps-in-deep-learning/.
  4. Vijayrania, Nilesh. “Different Normalization Layers in Deep Learning.” Towards Data Science, 14 Mar. 2021, https://towardsdatascience.com/different-normalization-layers-in-deep-learning-1a7214ff71d6.
