> Was data set split into training and test sets or a crossval design used?
All of the data was used for training because there was no need for cross-validation: the labels are noise-free (the ground truth is always the same), so training accuracy would be identical to validation accuracy.
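To illustrate that claim with a minimal sketch (not the original experiment — the deterministic rule `x > 0.5` and the stump fitter are my own stand-ins): when the labeling rule is deterministic, there is no label noise to overfit, so a model fit on one sample scores essentially the same on a fresh sample drawn from the same rule.

```python
import random

def fit_stump(xs, ys):
    """Exhaustively pick the split threshold that maximizes training accuracy
    (a depth-1 decision tree on one feature)."""
    best_t, best_acc = 0.0, 0.0
    for t in xs:
        acc = sum((x > t) == y for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def accuracy(t, xs, ys):
    return sum((x > t) == y for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
label = lambda x: x > 0.5                      # deterministic truth, zero noise
train_x = [random.random() for _ in range(1000)]
test_x  = [random.random() for _ in range(1000)]

t = fit_stump(train_x, [label(x) for x in train_x])
print(accuracy(t, train_x, [label(x) for x in train_x]))  # 1.0 on training data
print(accuracy(t, test_x,  [label(x) for x in test_x]))   # ~1.0 on held-out data
```

With noisy labels this equivalence breaks down, which is exactly why the held-out question becomes interesting in that setting.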
> If all the data was used for training, I’d be interested to see what would happen with generalization to new data.
The experimental setup was designed to test how trees fit the data; generalization could be the subject of a separate topic.
> If accuracy is the evaluation metric, this might be expected by chance, right? Would balanced accuracy or ROC/AUC score be a better evaluation metric?
Balanced accuracy and ROC/AUC were not used because of the experimental setup.
ROC/AUC (which measures the ordering of predicted probabilities) would matter much more than (balanced) accuracy when studying how trees fit noisy data, but it is slightly harder to interpret than plain accuracy. How trees fit noisy data is currently the subject of other research I’m doing.
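A small sketch of why probability ordering matters (the scores below are made up for illustration): two models with identical thresholded accuracy can have very different AUC, because AUC is the probability that a random positive is ranked above a random negative.

```python
def auc(scores, labels):
    """Pairwise formulation of ROC/AUC: fraction of (positive, negative)
    pairs where the positive gets the higher score (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def acc(scores, labels, t=0.5):
    """Plain accuracy after thresholding the scores at t."""
    return sum((s > t) == y for s, y in zip(scores, labels)) / len(labels)

labels  = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]   # misranks one positive below one negative
model_b = [0.9, 0.8, 0.1, 0.6, 0.3, 0.2]   # same threshold errors, worse ranking

print(acc(model_a, labels), acc(model_b, labels))  # identical: 4/6 for both
print(auc(model_a, labels), auc(model_b, labels))  # different: ~0.89 vs ~0.67
```

Accuracy collapses the scores at a single threshold, so it cannot see the ranking difference that AUC captures.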
You nailed exactly what I’m looking to research in the near future =)