Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery (DMLR)

Yoshitomo Matsubara
OMRON SINIC X
Mar 7, 2024

We are pleased to announce that our paper on symbolic regression datasets and benchmarks for scientific discovery was accepted at the Journal of Data-centric Machine Learning Research (DMLR), a new sister journal of the JMLR.

TL;DR

The paper is about symbolic regression for scientific discovery (SRSD). Symbolic regression (SR) is the task of producing a human-understandable mathematical (symbolic) expression that fits a given dataset. We pointed out a number of issues in existing symbolic regression datasets and benchmarks for data-driven scientific discovery, and proposed new datasets and a new benchmark.
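To make the task concrete, here is a minimal SR example using gplearn (one of the baselines discussed below) on toy data; the formula, data, and settings here are illustrative and not taken from the paper.

```python
# Toy symbolic regression with gplearn (illustrative; not an SRSD problem).
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = X[:, 0] ** 2 - X[:, 1]  # hidden true model: f(x0, x1) = x0^2 - x1

est = SymbolicRegressor(population_size=1000, generations=20,
                        function_set=('add', 'sub', 'mul'),
                        random_state=0)
est.fit(X, y)
# The recovered program is a human-readable expression, e.g.
# sub(mul(X0, X0), X1), which corresponds to x0^2 - x1.
print(est._program)
```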

Issues in Existing SR Datasets & Our Approach

  1. No physical meaning: E.g., f(x, y) = xy + sin((x - 1) (y - 1))
  2. Oversimplified sampling process: very narrow sampling ranges, e.g., U(1, 5)
  3. Duplicate SR problems: the same symbolic expression + the same sampling ranges -> the same SR problem. E.g., F = µNₙ and F = qE in FSRD are duplicates, since both are products of two input variables sampled from U(1, 5)
  4. Incorrect/Inappropriate formulas: E.g., the difference in the number of phases in Bragg’s law must be an integer, but it is sampled as a real number in FSRD
  5. Ignoring feature selection: All the provided input variables are expected to be used in the true model, but SR methods should be able to detect input variables irrelevant to the true model

We proposed new SRSD datasets (120 new SRSD problems, each provided without and with dummy variables) that address all the issues above. The following table shows the 30 SRSD problems in our SRSD-Feynman Easy set.

SRSD-Feynman Easy (30 SRSD problems)
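As a sketch of how issues 2 and 5 are handled, the snippet below samples variables log-uniformly over wide ranges and appends a dummy variable that the true model does not use. The ranges and variable names are illustrative assumptions, not the exact ones in the datasets.

```python
# Sketch of the sampling idea behind the SRSD datasets (illustrative only).
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

def log_uniform(low, high, size):
    # Sample uniformly in log space so values span several orders of magnitude,
    # unlike the narrow U(1, 5) ranges in FSRD.
    return 10.0 ** rng.uniform(np.log10(low), np.log10(high), size)

q = log_uniform(1e-21, 1e-19, n)   # charge [C], hypothetical range
E = log_uniform(1e2, 1e6, n)       # electric field [V/m], hypothetical range
F = q * E                          # true model: F = qE

dummy = log_uniform(1e-3, 1e3, n)  # dummy variable, irrelevant to the true model
X = np.stack([q, E, dummy], axis=1)
```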

Existing SR metrics

  1. R²-based accuracy: percentage of solutions that achieve R² > 0.999
  2. Solution rate: percentage of solutions that symbolically match true models
  3. Simplicity: e.g., number of mathematical operators in a solution

None of these SR metrics alone considers both 1) interpretability and 2) structural similarity between a solution and the true model.
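For reference, the first two metrics reduce to simple binary checks per problem. Here is a rough sketch (the helper names are ours), using sympy for the symbolic-match test:

```python
# Sketch of R²-based accuracy and solution rate (function names are ours).
import sympy
from sklearn.metrics import r2_score

def r2_accuracy(y_true_list, y_pred_list, threshold=0.999):
    # Fraction of problems whose solution exceeds the R² threshold.
    hits = [r2_score(t, p) > threshold for t, p in zip(y_true_list, y_pred_list)]
    return sum(hits) / len(hits)

def is_solved(pred_expr: str, true_expr: str) -> bool:
    # sympy.simplify(a - b) == 0 is a practical (though not complete)
    # symbolic-equivalence test between a solution and the true model.
    diff = sympy.simplify(sympy.sympify(pred_expr) - sympy.sympify(true_expr))
    return diff == 0

print(is_solved('x*y + sin((x - 1)*(y - 1))',
                'sin((x - 1)*(y - 1)) + y*x'))  # True
```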

NED: Normalized Edit Distance

We propose the use of the normalized edit distance (NED) between the skeleton equation trees of a solution and the true model. NED is a non-binary evaluation metric that assesses how structurally close a solution is to the true model, and it takes into account both 1) interpretability and 2) structural similarity. It can also be used alongside existing SR metrics as an additional evaluation metric.

Preprocessing and converting an equation to a skeleton equation tree
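Below is a minimal sketch of the NED idea, assuming sympy for parsing and the zss package (Zhang-Shasha tree edit distance); the paper's actual preprocessing of equations is more thorough than this.

```python
# Minimal NED sketch (not the paper's exact implementation): convert each
# equation to a skeleton tree by masking numeric constants, then compute a
# tree edit distance normalized by the larger tree size.
import sympy
from zss import Node, simple_distance  # Zhang-Shasha tree edit distance

def to_skeleton_tree(expr) -> Node:
    if expr.is_Number:
        return Node('C')  # mask constants so 0.5 and 2.0 look the same
    node = Node(type(expr).__name__ if expr.args else str(expr))
    for arg in expr.args:
        node.addkid(to_skeleton_tree(arg))
    return node

def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def ned(pred: str, true: str) -> float:
    t_pred = to_skeleton_tree(sympy.sympify(pred))
    t_true = to_skeleton_tree(sympy.sympify(true))
    d = simple_distance(t_pred, t_true)
    return d / max(tree_size(t_pred), tree_size(t_true))

print(ned('q*E', 'E*q'))      # 0.0: identical skeletons
print(ned('q*E', 'q*E + q'))  # > 0: structurally farther from the true model
```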

Key Findings from Benchmark Results & User Study

We used gplearn, AFP, AFP-FE, AI Feynman, DSR, E2E, uDSR, and PySR as baseline methods for our 240 SRSD-Feynman datasets.

  1. uDSR and PySR performed the best on our SRSD-Feynman datasets
  2. None of the baseline methods is robust against dummy variables
  3. R²-based accuracy is vulnerable to dummy variables
  4. NED provides a more fine-grained analysis than solution rate does
  5. NED is more aligned with human judges than the R² score

Read our paper for more details!

Reference

Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, Yoshitaka Ushiku: “Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery,” Journal of Data-centric Machine Learning Research (DMLR)

