As larger models require pretraining on trillions of tokens, it is unclear how scalable it is to curate…