About versioning datasets… this is probably self-evident to more experienced people like you. But here’s an error I’ve often made: thinking that I’m just going to clean up this dataset by hand with a couple of commands, and then I can just keep using it. It’s just a small project anyway.
No! Put those commands in a script and put the script under version control! Although you want to use the post-processed dataset as often as possible for efficiency reasons, it’s essential that you have a script for reproducing how you got there. Script the pipeline as far back toward the raw data as you can, as long as each step is deterministic. There will be errors in your preprocessing that you don’t notice (at least if you’re me!), and there may be more or better raw data to feed through the pipeline later. Don’t get stuck trying to remember those “short, easy” commands you used, even if they were just for scaling images or converting character sets.
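To make that concrete, here is a minimal sketch of what such a script could look like for the character-set case. Everything in it is an assumption for illustration: the function name `preprocess`, the `raw_dir`/`out_dir` layout, the `.txt` extension, and Latin-1 as the source encoding are all made up, so swap in whatever your data actually needs.

```python
import hashlib
from pathlib import Path


def preprocess(raw_dir, out_dir, source_encoding="latin-1"):
    """Re-encode every .txt file in raw_dir as UTF-8 and write it to out_dir.

    Returns a dict mapping file name to the SHA-256 of the written bytes,
    so that two runs can be compared. (Hypothetical helper, not a library API.)
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    digests = {}
    # Sort so the processing order does not depend on the file system.
    for src in sorted(raw_dir.glob("*.txt")):
        text = src.read_bytes().decode(source_encoding)
        # Stand-in for the "short, easy" cleanup: normalize line endings.
        text = text.replace("\r\n", "\n")
        data = text.encode("utf-8")
        (out_dir / src.name).write_bytes(data)
        digests[src.name] = hashlib.sha256(data).hexdigest()
    return digests
```

Calling something like `preprocess("raw", "processed")` twice on the same raw files should give identical digests. The raw data never gets modified, so when you find a bug in the cleanup you fix the script, commit it, and rerun.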