Trees applied to Reduced Order Modelling and Operator Learning

Guglielmo Padula
SISSA mathLab
Mar 17, 2024

Introduction

On the 4th of February 2024 I took part, as a staff member, in a hackathon sponsored by a SISSA startup, a private company, two universities and a volunteer organization to which I belong. The hackathon consisted in building a robust predictive model on some synthetic data. It was fun. The interesting part was that some participants tried both tree-based and deep learning models and found that the trees performed significantly better. This inspired me to try trees, that same evening, on a Lid Cavity simulation dataset on which I was benchmarking non-intrusive models. The results were extremely promising. After that I decided to study the literature on tree-based models applied to reduced order modelling, and I was surprised to find that there are very few papers about it. For this reason I decided to write this Medium post: to show that trees can perform well on reduced order modelling and operator learning tasks. If you are not familiar with reduced order modelling, please check my previous Medium post. For the operator learning part, a basic knowledge of probability and measure theory is required.

The baseline: the decision tree

From a scikit-learn description [2]: “Decision trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.” We will mainly focus on regression.

Fig 1. Example of a regression tree of depth 2.

A regression decision tree, at every node, selects a feature and a threshold value and splits the data into two parts: one where the feature is lower than the threshold and one where it is greater. The process is repeated recursively on each part.
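For concreteness, here is a minimal scikit-learn sketch of a regression tree on a toy dataset (the data and hyperparameters are placeholders, not the ones used later in this post):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy dataset: a noisy sine, standing in for a real snapshot matrix.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

# A shallow tree: at every node it picks a feature and a threshold,
# splitting the data in two (the prediction is piecewise constant).
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, y)

# The learned rules can be printed, which is what makes the model white-box.
print(export_text(tree, feature_names=["x"]))
print(tree.predict([[np.pi / 2]]))
```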

Let me remark that decision trees, like neural networks, have the universal approximation property: they can approximate any continuous function arbitrarily well if they are deep enough. Unlike neural networks, which can have arbitrary width and depth, the width of a tree is at most 2^(depth). The advantages of trees are:

  • They use simple rules and can be visualized, so they are white-box models and are more interpretable than neural networks.
  • Their prediction cost is logarithmic in the dataset size (evaluating a balanced tree only traverses one root-to-leaf path), while neural networks have linear cost.
  • Their training is extremely fast compared to neural networks: a large network often requires hours to train, while a large tree requires minutes.
  • They can easily be trained in parallel, while neural networks cannot.

Disadvantages:

  • It is much easier to overfit with a tree than with a neural network.
  • Training is in general not stable, i.e. small variations in the data can change the structure of the resulting tree.
  • Exact physics-informed training with trees is impossible, as they are inherently non-differentiable.

We will now see some tree extensions which address some of the disadvantages.

Extensions: ensembles and pruning

Model ensembling is the practice of making predictions with multiple models and taking the average of all their predictions as the final prediction. For tree-based models, we have:

  • Random forests: each tree is constructed on a random subset of all the features, and the simple rules are chosen by exhaustive search over the candidate splits.
  • Extremely randomized trees: like random forests, but the simple rules are not chosen by an exhaustive discriminative criterion. Instead, some simple rules are sampled randomly at every node and the best one is kept.
  • Bagging: like random forests, but every tree is constructed using all the features.
  • Boosting: a sequence of trees, in which each tree is trained on the residual of the previous one. It is inspired by gradient descent (which cannot be applied directly, as trees are not even continuous) and it requires a learning rate, so it is similar in spirit to deep neural networks.

Consequently, ensemble models are more stable with respect to data variations, and they also reduce overfitting.
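To make the list above concrete, here is a minimal scikit-learn sketch of the four ensemble flavours on toy data (hyperparameters are illustrative placeholders):

```python
import numpy as np
from sklearn.ensemble import (
    RandomForestRegressor,
    ExtraTreesRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
)

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(500, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])

models = {
    # Random forest: each split considers a random subset of the features.
    "random_forest": RandomForestRegressor(n_estimators=100, max_features="sqrt"),
    # Extremely randomized trees: candidate thresholds are drawn at random.
    "extra_trees": ExtraTreesRegressor(n_estimators=100),
    # Bagging: like a forest, but every tree sees all the features.
    "bagging": BaggingRegressor(n_estimators=100),
    # Boosting: each tree fits the residual of the previous ones (needs a learning rate).
    "boosting": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))
```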

Another methodology that can be applied to reduce overfitting is pruning, i.e. removing a subtree whenever it does not influence the prediction too much.
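In scikit-learn, cost-complexity pruning is exposed through the ccp_alpha parameter; a rough sketch (the value of ccp_alpha is a placeholder and would normally be chosen by cross-validation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

# Grow an unconstrained tree, then a cost-complexity-pruned one:
# subtrees whose removal barely changes the fit (less than ccp_alpha) are cut.
unpruned = DecisionTreeRegressor().fit(X, y)
pruned = DecisionTreeRegressor(ccp_alpha=1e-3).fit(X, y)
print(unpruned.tree_.node_count, "->", pruned.tree_.node_count)
```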

Operator Learning and approximation

An operator is a map from an infinite dimensional vector space to another infinite dimensional vector space. We will consider the infinite dimensional vector spaces of continuous functions defined on a compact, connected subset of R^n. In formula:
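A plausible way to write this (the notation is mine):

```latex
G : C(\Omega_1) \longrightarrow C(\Omega_2),
\qquad \Omega_1 \subset \mathbb{R}^{n_1},\ \Omega_2 \subset \mathbb{R}^{n_2}
\ \text{compact and connected.}
```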

An example of an operator is the convolutional operator:
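For a kernel k, a plausible way to write it (notation mine) is:

```latex
(G(u))(y) = \int_{\Omega_1} k(y - x)\, u(x)\, dx .
```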

The problem of operator learning is the following:
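Loosely speaking, given pairs of input and output functions sampled from a probability measure μ on the input space, we seek a parametric operator that is close to G on average; one plausible formulation (notation mine) is:

```latex
\min_{\theta}\; \mathbb{E}_{u \sim \mu}
\left[ \left\| G(u) - G_{\theta}(u) \right\|^{2}_{L^2(\Omega_2)} \right],
\qquad \text{given samples } \{(u_i,\, G(u_i))\}_{i=1}^{N}.
```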

A problem naturally arises. When solving PDE problems, we usually have only evaluations of functions at certain points, typically:
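That is, each input function is observed only at a finite set of "sensor" points and each output function only at a finite set of evaluation points; with A input points and B output points (the same A and B that appear in the memory estimates below), a plausible way to write this is:

```latex
u_i(x_1), \dots, u_i(x_A), \qquad
G(u_i)(y_1), \dots, G(u_i)(y_B), \qquad i = 1, \dots, N .
```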

Furthermore, supervised learning models typically depend on finite dimensional parameters. You may ask yourself whether this is enough to perform inference on an infinite dimensional space. The answer is affirmative, as the following theorem ([3]) shows:
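Loosely paraphrased (the precise statement in [3] gives quantitative error estimates; the notation here is mine), the operator admits an arbitrarily accurate approximation of branch–trunk form

```latex
G(u)(y) \;\approx\; \sum_{k=1}^{p} f_k\big(u(x_1), \dots, u(x_A)\big)\, g_k(y),
```

where the f_k depend only on the sampled input values, the g_k depend only on the output location y, and the error can be made as small as desired by increasing p and the number of sensors A.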

So, thanks to Monte Carlo Integration, the problem can be rewritten as
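A plausible form of the resulting discrete least-squares loss (notation mine):

```latex
\min_{f_1,\dots,f_p,\; g_1,\dots,g_p}\;
\frac{1}{N B}\sum_{i=1}^{N}\sum_{l=1}^{B}
\Big( G(u_i)(y_l) \;-\; \sum_{k=1}^{p} f_k\big(u_i(x_1),\dots,u_i(x_A)\big)\, g_k(y_l) \Big)^{2}.
```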

So we have arrived at a classical supervised problem: the fitting of functions on finite dimensional spaces by minimizing the loss above. The f_k and the g_k can be approximated using universal approximators. There are two ways to approach this problem:

  • The first one is to learn the f_k and the g_k together. This is the approach usually taken when the universal approximators are neural networks. The problem is that in this case the memory scales with N*B*A, so cubically. For neural networks this is not an issue, because mini-batching exists. However, trees (and other classical techniques such as GPR and RBF) are usually trained in a full-batch way, so this kind of training becomes infeasible already for medium-size datasets, and we need another approach.
  • The second one is to learn the f_k and the g_k separately. In this case the memory scales with N*(A+B), so quadratically. The problem is that we need to compute some intermediate outputs on which to train the f_k and the g_k. I will now focus on how to do it.

Note that the minimization problem of the loss can be written equivalently as:
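In a plausible matrix notation (mine): let V ∈ ℝ^{N×B} with V_{il} = G(u_i)(y_l), F ∈ ℝ^{N×p} with F_{ik} = f_k(u_i(x_1), …, u_i(x_A)), and H ∈ ℝ^{B×p} with H_{lk} = g_k(y_l); then

```latex
\min_{F,\,H}\; \left\| V - F H^{\top} \right\|_{F}^{2},
\qquad W := F H^{\top} \in \mathbb{R}^{N \times B}.
```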

Note that W has rank at most p. So by the Eckart–Young–Mirsky theorem ([5]) we have that
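that is, in my notation, the best rank-p approximation of V in the Frobenius norm is its truncated SVD:

```latex
\min_{\operatorname{rank}(W) \le p} \| V - W \|_{F}
\;=\; \| V - V^{*} \|_{F},
\qquad
V^{*} := U_p\, \Sigma_p\, Z_p^{\top},
```

where U_p, Σ_p and Z_p collect the first p left singular vectors, singular values and right singular vectors of V.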

An approximation of the Singular Value Decomposition (SVD) ([4]) of V can be computed efficiently, even for very large matrices, using libraries like EZyRB or PINA, developed here at SISSA mathLab.

As a consequence, it is natural to use the coefficients of V* as intermediate outputs and to solve the minimization problems:
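One plausible choice (notation mine) is to use the rows of U_p Σ_p as targets for f = (f_1, …, f_p), whose inputs are the sampled values of u_i, and the rows of Z_p as targets for g = (g_1, …, g_p), whose input is the location y_l:

```latex
\min_{f} \sum_{i=1}^{N}
\big\| f\big(u_i(x_1),\dots,u_i(x_A)\big) - (U_p \Sigma_p)_{i,:} \big\|^{2},
\qquad
\min_{g} \sum_{l=1}^{B}
\big\| g(y_l) - (Z_p)_{l,:} \big\|^{2}.
```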

All of these minimization problems can be solved using classic supervised machine learning algorithms.
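Putting everything together, a minimal sketch of this two-stage approach with trees might look as follows (the snapshot matrix V, the sensor values U_sensors and the evaluation points Y are random placeholders standing in for data from a PDE solver; the truncation rank p and the regressors are illustrative choices):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Placeholder data: N functions observed at A sensor points (inputs) and
# evaluated at B output points. In practice these come from a PDE solver.
N, A, B, p = 100, 50, 200, 5
rng = np.random.default_rng(0)
U_sensors = rng.standard_normal((N, A))   # u_i(x_1), ..., u_i(x_A)
Y = rng.uniform(0, 1, size=(B, 1))        # evaluation points y_l
V = rng.standard_normal((N, B))           # V[i, l] = G(u_i)(y_l)

# Stage 1: truncated SVD of the output matrix (Eckart-Young-Mirsky).
Umat, S, Zt = np.linalg.svd(V, full_matrices=False)
coeffs = Umat[:, :p] * S[:p]              # targets for f: rows of U_p Sigma_p, shape (N, p)
modes = Zt[:p, :].T                       # targets for g: rows of Z_p, shape (B, p)

# Stage 2: fit one tree-based regressor for f and one for g.
f_model = ExtraTreesRegressor(n_estimators=200).fit(U_sensors, coeffs)
g_model = ExtraTreesRegressor(n_estimators=200).fit(Y, modes)

# Online phase: predict G(u)(y_l) for a new input function.
u_new = rng.standard_normal((1, A))
f_pred = f_model.predict(u_new)           # shape (1, p)
g_pred = g_model.predict(Y)               # shape (B, p)
V_pred = f_pred @ g_pred.T                # approximate G(u_new)(y_l)
print(V_pred.shape)
```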

Let’s now see two test cases.

First Test case: parametric elliptic equation

Let’s now test these models on the parametric elliptic equation of the previous post. The tree-based models are compared with a Feed Forward Neural Network and with a POD+Neural Network.

Fig 2. Training error with increasing size of the neural network and of the tree-based models. The decision tree and the extremely randomized trees outperform the neural network.

As you can see, the decision tree model and the extremely randomized tree model outperform the neural network.

Test Case 2: Lid Cavity Dataset

Our second test case is a Lid Cavity simulation dataset (the one from the introduction), composed of solutions of the incompressible unsteady Navier–Stokes equations for different low values of the Reynolds number. In this scenario a simple decision tree, after two hours of fine-tuning, outperformed every neural-network-based model that I had fine-tuned over approximately 20 days. In the video below you can see the actual performance: the tree faithfully reconstructs the data.

Fig 3. On the left side the original simulation. On the right side the online prediction using trees. Upscaled using matplotlib.

Test Case 3: Navier Stokes Operator Learning

Our last test case is the simulation of the vorticity of an unsteady Navier–Stokes flow with periodic boundary conditions, with initial conditions sampled from different realizations of a Gaussian random field (which is a distribution over functions) and a Reynolds number of 1000. Also in this case the tree shows good performance. Solving all the Navier–Stokes simulations took approximately 30 hours; training the tree took 3 minutes and the reconstruction 30 seconds. This is a considerable speed-up.

Fig 4. On the left we have the original simulation. On the right the reconstruction using trees (test set). Animation speed has been halved manually.

Conclusion

I showed that tree-based models can be applied to reduced order modelling and operator learning, and that they can also bring considerable speed-ups. Try them.

References

  1. The Elements of Statistical Learning. Gareth J. et al.
  2. Scikit-Learn Tree Module documentation.
  3. Error estimates for DeepONets: a deep learning framework in infinite dimensions. Lanthaler S. et al.
  4. Numerical Mathematics. Quarteroni A. et al.
  5. Symmetric Gauge Functions and Unitarily Invariant Norms. Mirsky L.
  6. EZyRB: Easy Reduced Basis method
  7. PINA
