Managing your machine learning experiments and making them repeatable in TensorFlow, PyTorch, Scikit-Learn, or any framework in Python
I participate in Kaggle (machine learning competitions) for fun, and I have literally run thousands of experiments across those competitions.
Getting your models to perform really well, especially better than the hundreds (or thousands!) of other teams you are competing with, is no easy task. The process is very iterative. You run a bunch of experiments, changing the underlying models, trying different optimizers and tuning hyperparameters, changing your sampling methods, changing the way you create splits, trying different libraries or different versions of libraries, changing your random seeds... You may also be running multiple experiments on multiple GPUs or multiple hosts to parallelize the process, and each host might have a different hardware spec or OS.
If you don’t have a system for managing your experiments, it quickly gets out of control. Sometimes you cannot reproduce the great local CV (cross-validation) or public LB (leaderboard) score that you achieved, because you don’t know what you have changed since then! Arrrrggghhh!!! You forget what worked and why, and equally important, what didn’t work and why.
Hopefully you are using version control, such as Git, and that helps you keep track of code changes. But are you really disciplined enough to commit every single change you make? A lot of the time you just want to change your hardcoded hyperparameters and re-run the experiment. Or you pass the hyperparameters to your script as arguments, in which case version control does not tell you which values were used at runtime unless you explicitly captured them in some other way. Other times, you use a different version of some library because you need new features or bug fixes. Version control does not help there either, unless you are disciplined about also checking in all the dependencies and their versions. You also dump a whole bunch of useful information to the console as the models are training and evaluating. Are you keeping track of all of that for every single run?
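To make that bookkeeping concrete, here is a minimal sketch of what capturing some of this by hand might look like, using only the Python standard library. The function name `snapshot_run`, the parameter values, and the package list are all hypothetical and only there for illustration.

```python
import json
import platform
import sys
import time
from importlib import metadata


def snapshot_run(params, packages):
    """Collect runtime facts that a Git commit alone does not record."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "argv": sys.argv,  # the command line actually used for this run
        "params": params,  # hyperparameters as they were at runtime
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


record = snapshot_run(
    params={"learning_rate": 0.01, "seed": 42},  # hypothetical values
    packages=["numpy", "scikit-learn"],  # hypothetical dependency list
)
print(json.dumps(record, indent=2))
```

You would still have to store this record somewhere for every run, together with the console output and the exact source code, and that is where doing it by hand gets tedious.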
If you are running your machine learning experiments in Python, there’s something that lets you do all of that very easily, with minimal changes to your code. It is appropriately called… SACRED!
Definitely check it out. It is super simple to integrate (just pip install it and add a few lines to your code). The documentation is awesome. How does Sacred help you? Once you attach its MongoDB observer, Sacred dumps all your experiment results into MongoDB, a scalable document database that stores JSON-like records. It literally dumps and saves everything… including:
- Metrics (training / validation loss, accuracy, or anything else you want to track)
- Console output (stdout)
- All the source code you executed (basically dumps a copy of your source code into MongoDB)
- All the imported libraries and their versions used at runtime
- All your config parameters that you expose through the command line
- Hardware spec of your host / GPUs
- Random seeds
and much more. Everything you need to revisit and examine or repeat your experiments later as needed.
Also, it has nice integrations with Slack / Telegram, etc. I have it set up so that I get a Telegram instant message when my experiments complete / fail.
MongoDB does not come with Sacred, but you can easily run MongoDB in a Docker container while storing the data on the host. To visualize the captured data from your web browser, you can use a Sacred viewer, such as sacredboard. Or you can view and manipulate the stored data via mongo-express.
Sacredboard works fine, but it has some major limitations as of this writing. For example, the columns that show up in the data table of your experiments are hardcoded and cannot be modified unless you change multiple files in the source code. (For each Kaggle competition, I had been creating a custom version of sacredboard to expose the hyperparameters that I wanted to see in the table.) It also lacks the ability to add notes about your experiments, and it does not let you tag and filter them. It does not show the best epoch or best validation loss in the table unless you expand the row to view the experiment details. Due to these limitations, a friend and I created our own Sacred viewer, which overcomes these limitations and more (basically, we built what we needed).
UPDATE on September 5, 2018: We are excited that we just released Omniboard, a new frontend for Sacred. It was inspired by Sacredboard with enhancements mentioned above. Special thanks to Vivek Ratnavel who built it from scratch using React + Express. Omniboard can be easily installed via “npm” or be run as a Docker container. Instructions (and code) are here: https://github.com/vivekratnavel/omniboard