How do you manage your Machine Learning Experiments?

‘Every experiment is sacred
Every experiment is great
If an experiment is wasted
God gets quite irate’ ~
Sacred

Here I come clean: for a long time I have been a caveman. I have been using spreadsheets to log my ML experiments. It all started well and I was happy; then a deadline came and all of a sudden it got messy, very messy... I trusted my self-discipline to keep things consistent, and it failed me. I am a waste of GPUs.


But don’t cast your stones yet; you have all done it. I saw it when I tried to reproduce your experiments. I saw it in your papers, in the design of your experiments. So let’s pretend it never happened and find solutions.

This blog post aims to raise discussion within the ML research community and to find natural, effective and easy-to-stick-to solutions for managing and reproducing research experiments.

This blog post is intended for junior and senior researchers in ML, NLP, vision, AI etc., but also for practitioners and hobbyists. I have had several discussions with researchers from several other fields to form a neutral point of view; however, it might not be free from bias towards the NLP/deep-learning communities doing empirically motivated research.

This post is NOT about workload managers such as “Slurm”, and NOT about platforms for hyperparameter optimization such as “SIGOPT”.

Managing Evolving Research Experiments

Cheer up, there’s nothing wrong with you. This method was meant to fail.


At the beginning of each research experiment you are still not sure what your contribution will be, what might go wrong and how to debug it, and sometimes even what your evaluation metrics are.

The Blind stares
of a million pairs of eyes
lookin’ hard but won’t realize
That they will never see
the P.

~George Clinton & 2Pac Shakur
P → (potential future modifications to your experimental setup)

This is what makes experiments evolve: you have lots of knobs to change and lots of things to look at. That’s why a fixed setup for logging your experiments won’t work, and will be hard to stick to once things change.

Knobs and watchlists

I tried to abstract the inputs and outputs of each research experiment into knobs (things you continuously change, or that get changed automatically) and watchlists (things you observe, show to others, plot and visualize).

Knobs:

  • Code: model architecture, bug fixes, evaluation code, adding or fixing a hyperparameter.
  • Datasets: changes of datasets, preprocessing, manually fixing some examples.
  • Debugging: those minor changes you always make to debug a certain model behaviour.
  • Training: hyperparameter tuning, either manually or automatically using hyperparameter-optimization systems.
  • Meta: experiment name, tag, time, what you were doing back then.

Watchlists:

  • Evaluation metrics: accuracy, ROC, BLEU, ROUGE, etc.; not only which metrics you use but also which implementation of them.
  • Debugging and intermediate metrics: training and dev loss and accuracy, gradients per layer per epoch, and system info such as hostname, GPU memory % and GPU utilization %.

An Ideal Solution

From this abstraction, the ideal solution we are seeking is one that does all of the following: code and data versioning, automatically saving the metadata of each run (creating unique names / log folders), and saving each item in your watchlist.

In a way that is:
1) Easy to set up and natural: nobody wants to change the way they run experiments, e.g. wrapping models in Docker containers might put some people off using a specific solution. Additionally, server setup and dealing with proxy issues are boring and time-wasting.
2) Minimal code updates: I don’t want to fill my code with extra lines for reading/writing/visualizing logged metrics, or to structure my code in a completely different way.
3) Automatic and easy to stick to: no manual logging or visualization needed, and hence no self-discipline required to stick to a specific logging manifesto.
4) Robust: suitable for the messy, clueless nature of research, i.e. it doesn’t break, and still provides comparability between runs, when a new evaluation metric, hyperparameter, dataset or architecture is added.
5) Supports whatever infrastructure you have: local machine, single server, Slurm.
6) Free / in-house / open source.
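To make the “automatically save metadata / create unique log folders” requirement concrete, here is a minimal sketch using only the Python standard library. The `create_run_dir` function and its file layout are my own invention for illustration, not taken from any of the tools discussed below:

```python
import json
import time
import uuid
from pathlib import Path

def create_run_dir(base="runs", tag="baseline"):
    """Create a uniquely named log folder and save basic run metadata."""
    run_id = time.strftime("%Y%m%d-%H%M%S") + "-" + uuid.uuid4().hex[:8]
    run_dir = Path(base) / f"{tag}-{run_id}"
    run_dir.mkdir(parents=True)  # unique suffix makes collisions unlikely
    meta = {"tag": tag, "run_id": run_id, "started": time.time()}
    (run_dir / "meta.json").write_text(json.dumps(meta, indent=2))
    return run_dir

run_dir = create_run_dir(base="/tmp/ml_runs_demo", tag="baseline")
```

Every platform below automates some richer version of exactly this bookkeeping (plus code and data versioning), so that no self-discipline is needed to keep it up.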

Does this ideal solution exist?

It seems that this problem is not new in the community: there have been previous Reddit posts discussing it, and it was mentioned briefly in the AllenNLP EMNLP 2018 tutorial.

However, at this moment there isn’t any standard solution or best practice that solves this problem completely. You can see that from the number of interactions/comments on the post I created on Reddit to re-discuss this issue.

See the full Reddit discussion: http://bit.ly/2OBE7tu

I surveyed the current solutions and best practices in the community, and collected those 20 existing solutions, categorized below into four types:

Sacred, Studio, Datmo, Lore, FORGE, Sumatra, RandOpt, Pachyderm, Feature Forge, ModelChimp, Polyaxon, Kubeflow, Weights and Biases, Optuna, MLflow, Comet.ml, Valohai, Neptune, Spell


Showcase

Here I’ll show 3 examples picked from the table above. I will avoid talking about the first two columns: “the caveman way” for obvious reasons, and “all on the cloud” because those solutions bind you to their infrastructure, which might not be suitable for large computations, unless you host your own version locally (as with CodaLab), which isn’t a smooth solution either.

1) Weights and Biases

IMO one of the best existing solutions: it ticks many of the boxes discussed above while providing unlimited free private projects (you pay for collaborator accounts).
Here are some of its features in brief:

Easy install: WandB is installed as a Python module.

No drastic change to your code:
WandB provides wrappers for PyTorch, TensorFlow and Keras. By calling the “wandb.watch(model)” function, wandb will automatically pull all layer dimensions, gradients and model params, and log them to their online platform.

import wandb
wandb.init(project="my-project")  # initializes the run; needed once before logging

...
# inside the training loop
wandb.log(log_dict)               # log your watchlist metrics

if __name__ == '__main__':
    ...
    wandb.watch(model)            # auto-log gradients and model parameters
    wandb.config.update(args)     # record hyperparameters for this run

Automatic logging:
The params you log via “wandb.log()” are automatically displayed in nice graphs on the wandb website.


Additionally the “wandb.watch()” function will pull all the model specifics


and a nice display of a histogram of gradients per layer for each epoch (this is extremely nice for debugging)


and system info


Drawbacks (concerns about code and data versioning):

“WandB” automatically creates a new git branch with the name of each experiment and restores the code to the state it was in when `run $RUN_ID` was executed.


I see two issues with this. Firstly, this is suitable only if you can access the filesystem of the machine where you ran the experiments and check out the branch; if your infrastructure instantiates cloud instances that terminate after each experiment, it would be impossible to retrieve this branch.

Secondly, WandB doesn’t specify a clear way of versioning your datasets.

Source: Carey Phelps from WandB

2) Comet.ml

Comet offers pretty much the same setup and features as WandB with slight differences.

A richer visual interface: with more features, such as exporting graphs to SVG and JPEG.


Better code and data versioning: for code versioning, Comet logs the Git commit hash plus a diff file with the changes relative to that commit. For data versioning: “If you wish you can use the experiment.log_dataset_hash() method to compute a hash of your dataset so if something changes you would know. The hash computation is performed locally and we only store the hash on Comet.”
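The local-hashing idea can be sketched in a few lines of plain Python. `dataset_hash` here is my own illustrative stand-in, not Comet’s actual implementation:

```python
import hashlib

def dataset_hash(examples):
    """Hash a dataset so that any change, however small, is detectable."""
    h = hashlib.sha256()
    for ex in examples:
        h.update(repr(ex).encode("utf-8"))
    return h.hexdigest()

train = [("the cat sat", 0), ("the dog ran", 1)]
h_before = dataset_hash(train)
train[0] = ("the cat sat.", 0)   # a one-character "manual fix"
h_after = dataset_hash(train)
assert h_before != h_after       # the logged hash would flag the change
```

Logging only the hash keeps your data private while still letting you tell which runs used which version of the dataset.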

Downsides:
Comet lacks some automatic-logging features; for example, it doesn’t have wrappers for PyTorch and TensorFlow models that automatically log gradients, layer sizes and architecture.

The free tier allows only one private project and unlimited public projects. However, academics are eligible for free access to the paid tiers.

3) Sacred + Omniboard


Both WandB and Comet look neat; however, there are many scenarios where they are not suitable, for example when you need to run your experiments on internet-isolated systems, or when you don’t want to share code/data/server specifics for security reasons.
For those cases, Sacred and Omniboard might be a good alternative that you can install and run locally. Sacred is a Python module that saves the configuration of each run in a MongoDB database, while Omniboard is a NodeJS server that reads from that database and provides visualization. Both are open-source, free-of-charge solutions, though using them comes at a cost. Still, Sacred is quite well known in the community, and while asking around I found that many people are using it already.

A slightly more complicated setup:
Although Sacred is a Python module that can be easily installed, Omniboard is a NodeJS server that needs to be installed, kept up and running, and connected to the correct MongoDB. This might need some infrastructure work, especially when you are not running and visualizing on the same machine.

A slight change to your running code:
To register arguments, the model and watchlists, you have to modify your run file to contain specific functions decorated with Python’s decorator (“pie”) syntax, as follows:


Fewer logging features, a less pleasant UX and little (or even no) technical support:
This is to be expected, since no one is paid full-time to support Sacred or Omniboard; there are still a lot of missing features compared to what is already in WandB or Comet.
Additionally, I find the UX of Omniboard slightly confusing (though that can be a matter of taste).
Finally, technical support is going to be limited; you can see that already from the number of open issues on both projects (Sacred issues & Omniboard issues).

Epilogue:

Don’t be a caveman: when managing your ML experiments, specifying your own manifesto for logging them is not the easiest way to go.

Solutions already exist: experiment-management platforms can save time and effort, are useful for debugging, and provide instant visualizations for communicating your ideas to the team.

Half-online/half-offline platforms like W&B and Comet.ml are the easiest and fastest way to get started, yet they need internet access and raise security/privacy concerns.

All-in-house solutions like Sacred are the best offline, free-of-charge choice, yet they are less straightforward and lack some features.

Finally, be one step ahead: platforms for managing ML experiments are already used by big groups, are going to be the de facto method soon, and will be a required skill to have. Remember, 11 years ago version control wasn’t an obvious choice for everybody.

https://stackoverflow.com/questions/1408450/why-should-i-use-version-control

Are you a caveman? Do you see any of the above solutions fit you? Let me know in the comments below or on Twitter: @hadyelsahar

Disclaimer: This blog post is not an endorsement/opposition to any specific solution and does only represent my own point of view and not necessarily my employer’s views or practices.

Written by

Research Scientist at NAVER LABS Europe. Interested in NLP and Machine Learning.
