Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery (DMLR)

Yoshitomo Matsubara
OMRON SINIC X
Mar 7, 2024

We are pleased to announce that our paper on symbolic regression datasets and benchmarks for scientific discovery was accepted at the Journal of Data-centric Machine Learning Research (DMLR), a new sister journal of the JMLR.

TL;DR

The paper is about symbolic regression for scientific discovery (SRSD). Symbolic regression (SR) is the task of producing a human-understandable mathematical (symbolic) expression that fits a given dataset. We pointed out a number of issues in existing symbolic regression datasets and benchmarks for data-driven scientific discovery, and proposed new datasets and a new benchmark.
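To make the task concrete, here is a minimal SR example using gplearn (one of the baselines discussed below) on toy data; the formula, data, and settings here are illustrative and not taken from the paper.

```python
# Toy symbolic regression with gplearn (illustrative; not an SRSD problem).
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, 0] ** 2 - X[:, 1]  # hidden true model: f(x0, x1) = x0^2 - x1

est = SymbolicRegressor(population_size=1000, generations=20,
                        function_set=('add', 'sub', 'mul'),
                        random_state=0)
est.fit(X, y)
# The recovered program is a human-readable expression, e.g.
# sub(mul(X0, X0), X1), which corresponds to x0^2 - x1.
print(est._program)
```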

Issues in Existing SR Datasets & Our Approach

  1. No physical meaning: E.g., f(x, y) = xy + sin((x - 1) (y - 1))
  2. Oversimplified sampling process: very narrow sampling ranges, e.g., U(1, 5)
  3. Duplicate SR problems: the same symbolic expression + the same sampling ranges -> the same SR problem. E.g., F = µNₙ and F = qE in FSRD are duplicates, since both are products of two input variables sampled from U(1, 5)
  4. Incorrect/Inappropriate formulas: E.g., the difference in the number of phases in Bragg’s law must be an integer, but it is sampled as a real number in FSRD
  5. Ignoring feature selection: All the provided input variables are expected to be used in the true model, but SR methods should be able to detect input variables irrelevant to the true model

We proposed new SRSD datasets (120 new SRSD problems, each provided without and with dummy variables) that address all the issues above. The following table shows the 30 SRSD problems in our SRSD-Feynman Easy set.

SRSD-Feynman Easy (30 SRSD problems)
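As a sketch of how issues 2 and 5 are handled, the snippet below samples variables log-uniformly over wide ranges and appends a dummy variable that the true model does not use. The ranges and variable names are illustrative assumptions, not the exact ones in the datasets.

```python
# Sketch of the sampling idea behind the SRSD datasets (illustrative only).
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

def log_uniform(low, high, size):
    # Sample uniformly in log space so values span several orders of magnitude,
    # unlike the narrow U(1, 5) ranges in FSRD.
    return 10.0 ** rng.uniform(np.log10(low), np.log10(high), size)

q = log_uniform(1e-21, 1e-19, n)   # charge [C], hypothetical range
E = log_uniform(1e2, 1e6, n)       # electric field [V/m], hypothetical range
F = q * E                          # true model: F = qE

dummy = log_uniform(1e-3, 1e3, n)  # dummy variable, irrelevant to the true model
X = np.stack([q, E, dummy], axis=1)
```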

Existing SR metrics

  1. R²-based accuracy: percentage of solutions that achieve R² > 0.999
  2. Solution rate: percentage of solutions that symbolically match true models
  3. Simplicity: e.g., number of mathematical operators in a solution

None of these SR metrics alone considers both 1) interpretability and 2) structural similarity between a solution and the true model.
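For reference, the first two metrics reduce to simple binary checks per problem. Here is a rough sketch (the helper names are ours), using sympy for the symbolic-match test:

```python
# Sketch of R²-based accuracy and solution rate (function names are ours).
import sympy
from sklearn.metrics import r2_score

def r2_accuracy(y_true_list, y_pred_list, threshold=0.999):
    # Fraction of problems whose solution exceeds the R² threshold.
    hits = [r2_score(t, p) > threshold for t, p in zip(y_true_list, y_pred_list)]
    return sum(hits) / len(hits)

def is_solved(pred_expr: str, true_expr: str) -> bool:
    # sympy.simplify(a - b) == 0 is a practical (though not complete)
    # symbolic-equivalence test between a solution and the true model.
    diff = sympy.simplify(sympy.sympify(pred_expr) - sympy.sympify(true_expr))
    return diff == 0

print(is_solved('x*y + sin((x - 1)*(y - 1))',
                'sin((x - 1)*(y - 1)) + y*x'))  # True
```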

NED: Normalized Edit Distance

We propose the use of the normalized edit distance (NED) between the skeleton equation trees of a solution and the true model. NED is a non-binary evaluation metric that assesses how structurally close a solution is to the true model, and it takes into account both 1) interpretability and 2) structural similarity. It can also be used alongside existing SR metrics as an additional evaluation metric.

Preprocessing and converting an equation to a skeleton equation tree
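Below is a minimal sketch of the NED idea, assuming sympy for parsing and the zss package (Zhang-Shasha tree edit distance); the paper's actual preprocessing of equations is more thorough than this.

```python
# Minimal NED sketch (not the paper's exact implementation): convert each
# equation to a skeleton tree by masking numeric constants, then compute a
# tree edit distance normalized by the larger tree size.
import sympy
from zss import Node, simple_distance  # Zhang-Shasha tree edit distance

def to_skeleton_tree(expr) -> Node:
    if expr.is_Number:
        return Node('C')  # mask constants so 0.5 and 2.0 look the same
    node = Node(type(expr).__name__ if expr.args else str(expr))
    for arg in expr.args:
        node.addkid(to_skeleton_tree(arg))
    return node

def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def ned(pred: str, true: str) -> float:
    t_pred = to_skeleton_tree(sympy.sympify(pred))
    t_true = to_skeleton_tree(sympy.sympify(true))
    d = simple_distance(t_pred, t_true)
    return d / max(tree_size(t_pred), tree_size(t_true))

print(ned('q*E', 'E*q'))      # 0.0: identical skeletons
print(ned('q*E', 'q*E + q'))  # > 0: structurally farther from the true model
```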

Key Findings from Benchmark Results & User Study

We used gplearn, AFP, AFP-FE, AI Feynman, DSR, E2E, uDSR, and PySR as baseline methods for our 240 SRSD-Feynman datasets.

  1. uDSR and PySR performed the best on our SRSD-Feynman datasets
  2. None of the baseline methods is robust against dummy variables
  3. R²-based accuracy is vulnerable to dummy variables
  4. NED provides a more fine-grained analysis than solution rate does
  5. NED is more aligned with human judges than the R² score

Read our paper for more details!

Reference

Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, Yoshitaka Ushiku: “Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery,” Journal of Data-centric Machine Learning Research (DMLR)

