A new Transformer Model for Symbolic Regression towards Scientific Discovery

Ryo Igarashi
OMRON SINIC X
Mar 7, 2024

This article presents our work, available on OpenReview, which has been accepted at the NeurIPS 2023 Workshop on AI for Scientific Discovery. We introduce a new Transformer model tailored for Symbolic Regression in the context of automated / assisted Scientific Discovery. To try our models, head over to our GitHub repository.

What is Symbolic Regression?

Symbolic Regression is a complicated task. Usually, regression is done by first assuming a particular equation form. For instance:

  • Linear regression assumes a linear relationship:
$y=ax+b$
  • Polynomial regression assumes that:
$y = \sum_{n=0}^{N} a_n x^n$
  • Logistic regression (for binary variables) assumes that:
$p(y) = \frac{1}{1+e^{-(ax+b)}}$
  • Power law regression assumes that:
$y = x^{\alpha}$
  • etc…

Instead, Symbolic Regression does not impose any functional form. Given a numerical dataset, the goal is to find an analytical expression that explains the dataset without any assumption. With Symbolic Regression, the function skeleton is first estimated, e.g.

$y = a_1 x^2 + \ln(x+a_2)$,

and then numerical constants are optimized.
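
As a toy illustration of this two-step process, the sketch below fits the numerical constants of such a skeleton on synthetic data with SciPy's curve_fit; the skeleton, the generated data, and the choice of optimizer are assumptions made for the example, not the exact pipeline of our paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical skeleton produced by a Symbolic Regression method:
# y = a1 * x**2 + ln(x + a2), with the constants a1 and a2 left to fit.
def skeleton(x, a1, a2):
    return a1 * x**2 + np.log(x + a2)

# Toy dataset generated from the "true" constants a1 = 0.5, a2 = 2.0.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=200)
y = 0.5 * x**2 + np.log(x + 2.0) + rng.normal(scale=0.01, size=x.shape)

# Second stage: optimize the numerical constants of the fixed skeleton.
# Bounds keep a2 positive so that the logarithm stays defined.
popt, _ = curve_fit(skeleton, x, y, p0=[1.0, 1.0], bounds=([-10, 0.01], [10, 10]))
print(popt)  # approximately [0.5, 2.0]
```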

How to do Symbolic Regression with a Transformer model?

Transformers are the architecture behind Large Language Models (LLMs) and were originally designed for sequence-to-sequence tasks, i.e. they take a sequence of tokens as input and output another sequence of tokens. Therefore, one might wonder: how can Symbolic Regression be done with a Transformer model?

For Symbolic Regression, the input of our Transformer model now consists of a tabular dataset of numerical values (see Figure 1). The task is to predict an analytical expression that describes the proposed dataset.

Figure 1: Schematic diagram of Symbolic Regression

We introduce a fixed vocabulary of tokens available to represent mathematical expressions and functions. Our chosen vocabulary includes the following tokens: [add, mul, sin, cos, log, exp, neg, inv, sq, cb, sqrt, C, x1, x2, x3, x4, x5, x6].

With this vocabulary of tokens, arbitrary mathematical equations can be represented by first turning the equation into a tree. Next, the equation tree is read in pre-order traversal (from top to bottom, visiting left children first), so that the ground-truth equation is represented as a unique sequence of tokens. An example of a pre-order traversal sequence is given in Figure 2.

Figure 2: An example of a pre-order traversal sequence for an equation tree
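
As a concrete (and purely illustrative) sketch of this serialization, the snippet below builds the equation tree for the skeleton $y = a_1 x_1^2 + \ln(x_1 + a_2)$ using the vocabulary above and reads it in pre-order; the Node class and helper function are assumptions for the example, not our actual implementation.

```python
# Minimal sketch: serialize an equation tree into its pre-order token sequence.
class Node:
    def __init__(self, token, *children):
        self.token = token          # e.g. "add", "mul", "C", "x1"
        self.children = children    # leaves (constants, variables) have no children

def preorder(node):
    """Visit the node first, then its children from left to right."""
    return [node.token] + [t for child in node.children for t in preorder(child)]

# y = a1 * x1**2 + log(x1 + a2), with C standing for a numerical constant
# and sq for the squaring operator.
tree = Node("add",
            Node("mul", Node("C"), Node("sq", Node("x1"))),
            Node("log", Node("add", Node("x1"), Node("C"))))

print(preorder(tree))
# ['add', 'mul', 'C', 'sq', 'x1', 'log', 'add', 'x1', 'C']
```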

Architecture of our Transformer Model

The architecture of our model's Encoder differs from that of traditional Transformer encoders. Instead of systematically using multi-head self-attention, we propose several architectures focused on building meaningful features from the input numerical dataset (more on that in our paper). We find that more flexible architectures do not preserve column-permutation equivariance, an interesting property for tabular datasets, but allow for better generalization and subsequent performance.

As for the Decoder of our model, its architecture is similar to the traditional Transformer model proposed by Vaswani et al. in their original paper _Attention is all you need_ (2017). The last MLP layer of the Decoder outputs probabilities for the next token. Figure 3 below is taken from our paper and presents the general architecture of our Transformer model.

Figure 3: General architecture of our Transformer model

Training

To train our model, we generate a large number of correctly labeled training examples. We allocate sampling probabilities to each token in our vocabulary to match naturally occurring frequencies, e.g. mul is more common than sin. For each training equation, we begin by sampling a random token from the vocabulary and continue sampling until the equation tree is complete. We use the SymPy Python library to simplify our ground-truth equations in a consistent way.

We next remove equation duplicates, i.e. equations that share the same skeleton. We then sample each variable and constant several times, to allow for diversity within the same equation skeleton. In the end, we have 1,494,588 tabular datasets available for training.
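
The sketch below captures the spirit of this generation loop: sample random equation skeletons with weighted token probabilities, simplify them consistently with SymPy, and deduplicate by skeleton. The specific probabilities, depth limit, and helper functions are made up for the illustration and differ from our actual generation code.

```python
import random
import sympy as sp

x1, C = sp.Symbol("x1"), sp.Symbol("C")  # one variable, one constant placeholder

BINARY = {"add": sp.Add, "mul": sp.Mul}
UNARY = {"sin": sp.sin, "cos": sp.cos, "log": sp.log, "exp": sp.exp}
LEAVES = [x1, C]

def random_skeleton(depth=0, max_depth=3):
    # Leaves become more likely as the tree grows deeper.
    kind = random.choices(["binary", "unary", "leaf"],
                          weights=[3, 1, 1 + 2 * depth])[0]
    if depth >= max_depth or kind == "leaf":
        return random.choice(LEAVES)
    if kind == "binary":
        op = BINARY[random.choice(list(BINARY))]
        return op(random_skeleton(depth + 1, max_depth),
                  random_skeleton(depth + 1, max_depth))
    return UNARY[random.choice(list(UNARY))](random_skeleton(depth + 1, max_depth))

skeletons = set()
while len(skeletons) < 100:
    expr = sp.simplify(random_skeleton())  # simplify in a consistent way
    skeletons.add(sp.srepr(expr))          # duplicated skeletons collapse here
```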

Training the Transformer model is done by minimizing the categorical cross-entropy loss function over the $v$ possible tokens in the vocabulary:

$\mathcal{L} = - \sum_{i=1}^{v} y_i \log(\widehat{y_i})$

Since the target is one-hot encoded, only one term of this sum is non-zero. Minimizing $\mathcal{L}$ essentially consists in pushing $\widehat{y_i}$ closer to 1 when $i$ is the correct class and closer to 0 for all other classes. We trained with and without label smoothing, a common regularization technique, and compared the resulting performances in our paper.
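
For concreteness, the snippet below computes this loss over the 18 tokens of our vocabulary; the article does not prescribe a framework, so PyTorch and its built-in label_smoothing option are assumed here purely for illustration.

```python
import torch
import torch.nn as nn

v = 18                                    # vocabulary size used in this work
logits = torch.randn(4, v)                # decoder outputs for 4 token positions
targets = torch.tensor([0, 5, 11, 17])    # indices of the ground-truth tokens

plain = nn.CrossEntropyLoss()                         # standard cross-entropy
smoothed = nn.CrossEntropyLoss(label_smoothing=0.1)   # with label smoothing

print(plain(logits, targets).item(), smoothed(logits, targets).item())
```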

Results compared to state-of-the-art Symbolic Regression methods on the SRSD datasets

We test our best Transformer model on the SRSD (Symbolic Regression for Scientific Discovery) datasets proposed by Matsubara et al. (2023). The SRSD datasets have been meticulously curated to represent the diversity of scientific equations already available. They can be seen as "real world", complicated datasets representing real physical equations, unlike the synthetic datasets generated to train our model. The SRSD benchmark comprises 120 datasets, divided into three categories: 30 easy, 40 medium, and 50 hard datasets.

We assess Symbolic Regression performance by computing the normalized tree-edit distance, as proposed by Matsubara et al.:

$\mathrm{NED}(f_p, f_t) = \min\left(1, \frac{d(f_p, f_t)}{|f_t|}\right)$

where $d(f_p, f_t)$ is the tree-edit distance (i.e. how many nodes should be added, deleted, or modified to transform one tree into the other), and $|f_t|$ is the number of nodes in the ground-truth equation tree. The normalized tree-edit distance provides a way of assessing how structurally close the estimated equation is to the ground truth, and has been shown to align better with human intuition.
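
A small sketch of how such a score can be computed, using the zss package (an implementation of the Zhang-Shasha tree edit distance); the library choice and the toy trees are illustrative assumptions, not the evaluation code used for our benchmark.

```python
from zss import Node, simple_distance

# Ground truth f_t: add(mul(C, x1), sin(x1));  prediction f_p: add(mul(C, x1), cos(x1))
f_t = (Node("add")
       .addkid(Node("mul").addkid(Node("C")).addkid(Node("x1")))
       .addkid(Node("sin").addkid(Node("x1"))))
f_p = (Node("add")
       .addkid(Node("mul").addkid(Node("C")).addkid(Node("x1")))
       .addkid(Node("cos").addkid(Node("x1"))))

def size(node):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in Node.get_children(node))

d = simple_distance(f_p, f_t)       # nodes to add, delete, or relabel
ned = min(1.0, d / size(f_t))       # normalized tree-edit distance
print(d, ned)                       # 1 edit out of 6 nodes -> about 0.17
```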

Figure 4 below shows the normalized tree-edit distance scores (the lower, the better) for our best Transformer model (Best m.) against other state-of-the-art methods for Symbolic Regression.

Figure 4: Normalized tree-edit distance scores on the SRSD datasets (lower is better) for our best model and other state-of-the-art methods

For the medium and hard SRSD datasets, the equations estimated by our Transformer model are (on average) structurally closer to the ground truth than those estimated by other SR methods. For the easy SRSD datasets, DSR and AI-Feynman, considered to be strong SR algorithms, remain state-of-the-art.

Conclusion, Challenges, and Open Questions

Compared to traditional Genetic Programming approaches, the main strength of our model is its inference time: once trained, our Transformer model provides almost instantaneous predictions. We also showed that it achieves the overall best predictions on the SRSD datasets.

A major challenge when training Transformer models for Symbolic Regression lies in generating a good training dataset. The training dataset determines which equations the Transformer model can correctly predict. Therefore, it should be as diverse and inclusive as possible, while respecting the expected natural frequencies: for example, the token mul is expected to be more frequent than cos.

The chosen vocabulary of tokens also plays a significant role. For example, our fixed vocabulary does not allow for the tangent function tan. Even if tangent can be represented as tan(x) = sin(x)/cos(x), this corresponds to a sequence of six tokens [mul, sin, x, inv, cos, x], which is much harder to predict than a single token. However, adding a tan token to our vocabulary could also lead to worse results (the more tokens, the harder the prediction).

Another difficulty involves the systematic treatment of variables and constants. During the generation of the training datasets, we allowed for a single constant, which we later sample to create tabular datasets. But ground-truth equations can have more than one constant, potentially nested inside different functions.

Also, the representation of the ground-truth equations has to be consistent. For example, which one of $y = (x_1 + x_2)^2$, $y = x_1^2 + x_2^2 + 2 x_1 x_2$, or $y = x_1^2 + 2 x_1 x_2 + x_2^2$ should our Transformer model predict, even though all three expressions are equivalent?
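
One possible convention, sketched below with SymPy, is to expand every ground-truth expression so that equivalent forms collapse to a single representative; this particular normalization is shown only as an illustration of the problem, not necessarily the one used in our pipeline.

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")

# Three equivalent ways of writing the same ground truth.
forms = [(x1 + x2)**2,
         x1**2 + x2**2 + 2*x1*x2,
         x1**2 + 2*x1*x2 + x2**2]

canonical = {sp.expand(f) for f in forms}
print(canonical)  # {x1**2 + 2*x1*x2 + x2**2}: a single canonical representative
```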

Besides, we showed that the Encoder architecture plays an important role in Symbolic Regression performance (see our paper). Designing an Encoder that is as flexible as possible and capable of learning meaningful features from each tabular dataset is crucial to prevent overfitting and thus obtain a model that generalizes.

Finally, the gap between in-domain data (training datasets) and out-of-domain data (unseen ground-truth equations) is probably the most complicated issue with Transformer models for Symbolic Regression. Typically, our model can correctly predict equations seen during training (including the validation and test splits used at training time), but generalizing to the SRSD datasets is much harder.

Ryo Igarashi is a Project Researcher at OMRON SINIC X, connecting computational condensed matter physics, high performance computing, and machine learning. He holds a Ph.D. from UTokyo.