Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery (DMLR)
We are pleased to announce that our paper on symbolic regression datasets and benchmarks for scientific discovery was accepted at the Journal of Data-centric Machine Learning Research (DMLR), a new sister journal of the JMLR.
TL;DR
The paper focuses on symbolic regression for scientific discovery (SRSD). Symbolic regression (SR) is the task of producing a mathematical (symbolic) expression, in a human-understandable form, that fits a given dataset. We identified a number of issues in existing symbolic regression datasets and benchmarks for data-driven scientific discovery, and proposed new datasets and a new benchmark.
Issues in Existing SR Datasets & Our approach
- No physical meaning: E.g., f(x, y) = xy + sin((x - 1) (y - 1))
- Oversimplified sampling process: very narrow sampling ranges, e.g., U(1, 5)
- Duplicate SR problems: Same symbolic expressions + same sampling ranges -> same SR problems. E.g., F = µNₙ and F = q₂E in FSRD are duplicates because all the input variables are sampled from U(1, 5)
- Incorrect/Inappropriate formulas: E.g., the difference in the number of phases in Bragg's law should be an integer, but it is sampled as a real number in FSRD
- Ignoring feature selection: In existing datasets, all the provided input variables are expected to appear in the true model, but SR methods should also be able to detect input variables that are irrelevant to it
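The duplicate-problem issue is easy to see with a quick sketch (a minimal illustration, assuming NumPy; the variable names are ours): when every input variable is drawn from U(1, 5), two physically different formulas can produce statistically identical datasets.

```python
import numpy as np

rng = np.random.default_rng(42)

# Friction force: F = mu * N_n, with mu, N_n ~ U(1, 5)
mu, N_n = rng.uniform(1, 5, size=(2, 10_000))
friction = mu * N_n

# Electrostatic force: F = q2 * E, with q2, E ~ U(1, 5)
q2, E = rng.uniform(1, 5, size=(2, 10_000))
electrostatic = q2 * E

# Both datasets are draws of x * y with x, y ~ U(1, 5):
# to an SR method, these are two copies of the same problem.
print(friction.mean(), electrostatic.mean())  # both ≈ 9.0
```

Our SRSD datasets instead choose sampling ranges based on the physical meaning of each variable, so such collisions do not occur.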
We proposed new SRSD datasets (120 new SRSD problems, each provided without and with dummy variables) that address all the above issues. The following table shows the 30 SRSD problems in our SRSD-Feynman Easy set.
Existing SR metrics
- R² score-driven accuracy: the percentage of solutions that achieve R² > 0.999
- Solution rate: the percentage of solutions that symbolically match the true models
- Simplicity: e.g., number of mathematical operators in a solution
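For concreteness, R²-driven accuracy can be sketched as follows (an illustrative example with hypothetical predictions, not our benchmark code; R² is computed from its textbook definition):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Hypothetical (y_true, y_pred) pairs from an SR method on three problems
problems = [
    (np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])),      # exact fit
    (np.array([1.0, 2.0, 3.0]), np.array([1.001, 1.999, 3.0])),  # near-exact
    (np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.5, 3.5])),      # poor fit
]

# Accuracy: fraction of problems whose solution clears R^2 > 0.999
accuracy = np.mean([r2(yt, yp) > 0.999 for yt, yp in problems])
print(accuracy)  # 2 of 3 problems pass
```

Note that this metric says nothing about whether the discovered expression resembles the true model structurally, which motivates the metric below.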
None of these metrics alone considers both 1) interpretability and 2) structural similarity between a solution and the true model.
NED: Normalized Edit Distance
We propose using the normalized edit distance (NED) between the skeleton equation trees of a solution and the true model. NED is a non-binary evaluation metric that assesses how structurally close a solution is to the true model, taking into account both 1) interpretability and 2) structural similarity. It can also be used alongside existing SR metrics as an additional evaluation metric.
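The idea can be sketched roughly as follows (a simplified stand-in, assuming SymPy: we extract a skeleton by replacing numeric constants with a placeholder, and compute a sequence-level edit distance over a preorder traversal of the tree; the actual metric in the paper operates on equation trees, so treat this only as an intuition-builder):

```python
import sympy as sp

def skeleton(expr):
    """Replace all numeric constants with a placeholder symbol C."""
    C = sp.Symbol('C')
    return expr.replace(lambda e: e.is_Number, lambda e: C)

def preorder_labels(expr):
    """Serialize an expression tree as a list of node labels (preorder)."""
    return [type(node).__name__ if node.args else str(node)
            for node in sp.preorder_traversal(expr)]

def levenshtein(a, b):
    """Standard edit distance via dynamic programming (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def ned(pred, true):
    """Normalized edit distance between skeletons: 0 = match, up to 1."""
    a = preorder_labels(skeleton(pred))
    b = preorder_labels(skeleton(true))
    return levenshtein(a, b) / max(len(a), len(b))

x, y = sp.symbols('x y')
print(ned(2.5 * x * y, x * y))    # small: same structure up to constants
print(ned(sp.sin(x) + y, x * y))  # larger: structurally different
```

Unlike the binary solution rate, a near-miss expression (right structure, wrong constant) scores close to 0 rather than simply counting as a failure.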
Key Findings from Benchmark Results & User Study
We used gplearn, AFP, AFP-FE, AI Feynman, DSR, E2E, uDSR, and PySR as baseline methods for our 240 SRSD-Feynman datasets.
- uDSR and PySR performed the best on our SRSD-Feynman datasets
- None of the baseline methods is robust against dummy variables
- R² -based accuracy is vulnerable to dummy variables
- NED enables more fine-grained analysis than the solution rate
- NED is better aligned with human judgment than the R² score
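The vulnerability of R²-based accuracy to dummy variables can be illustrated with a small sketch (assuming NumPy; the candidate expression and coefficient are hypothetical): a solution that wrongly uses an irrelevant variable can still clear the R² > 0.999 threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# True model uses only x1; x2 is a dummy (irrelevant) variable.
x1 = rng.uniform(1, 5, 1_000)
x2 = rng.uniform(1, 5, 1_000)  # dummy variable
y_true = x1 ** 2

# A candidate solution that wrongly includes the dummy variable with a
# tiny coefficient still fits the data almost perfectly.
y_pred = x1 ** 2 + 1e-4 * x2

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

print(r2(y_true, y_pred))  # > 0.999 despite using the dummy variable x2
```

A structure-aware metric such as NED, by contrast, penalizes the spurious x2 term because it changes the equation tree.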
Read our paper for more details!
Reference
Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, Yoshitaka Ushiku: "Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery," Journal of Data-centric Machine Learning Research (DMLR)
Datasets:
- SRSD-Feynman Easy
- SRSD-Feynman Medium
- SRSD-Feynman Hard
- SRSD-Feynman Easy w/ Dummy Variables
- SRSD-Feynman Medium w/ Dummy Variables
- SRSD-Feynman Hard w/ Dummy Variables
Relevant Studies at OMRON SINIC X
- Florian Lalande, Yoshitomo Matsubara, Naoya Chiba, Tatsunori Taniai, Ryo Igarashi, Yoshitaka Ushiku: “A Transformer Model for Symbolic Regression towards Scientific Discovery” @ NeurIPS 2023 AI for Science Workshop [Medium]
- Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, Yoshitaka Ushiku: “SRSD: Rethinking Datasets of Symbolic Regression for Scientific Discovery” @ NeurIPS 2022 AI for Science Workshop