Start Version Controlling your Machine Learning Datasets
Make your machine learning and data science projects reproducible with open source tools.
In many of the machine learning projects I’ve worked on this past year, the dataset changed several times over the lifetime of the experiment. When I work with biomedical data, for example, new data arrive almost weekly. And in some experiments I’ve been doing with image segmentation, I’ve been using a technique called active learning to gradually grow my training dataset.
All of this has left me uneasy: while my code is version controlled with git, my datasets and trained models often are not (because of file size constraints and because git is not ideal for binaries). Every time I retrain my models with new data, I end up either overwriting my existing model or filling my project directory with many slightly different versions of the trained model, each with a slightly different file name. Like this:
Model_v1.h5
Model_v2morepics.h5
Model_v3evenmorepics.h5
OK, I do try to pick better filenames. But still, this approach is unsystematic and hard to reproduce exactly. It feels unscientific. I recently learned about DVC, an open-source version control system that works alongside git to track changes to large datasets and model files (and spoiler: I am joining DVC in January 2020!). I tried it out and really like the way it adds structure to dynamic ML…