The Underdog of Machine Learning: Experiment Tracking

Archel Taneka
Published in tiket.com · 7 min read · Mar 8, 2023
Photo by Disney, taken from https://disney.fandom.com/wiki/Underdog_(character)

Whether you are a student, a Data Scientist, or a Machine Learning Engineer at a company, running experiments is something that slowly gets carved into your DNA 🧬. Trying different learning algorithms, hyperparameters, data preprocessing steps, or different training and test data: these are all part of your experiments. Now the important question is, “how do we keep track of them?” If you already have a tracking mechanism that records all of your experiments, then your experiments are well documented! If you don’t, you should start thinking about one, as it can boost your team’s productivity and efficiency.

“It’s better to be late than never.” -My favorite procrastination quote

For me, tracking my own experiments went through multiple stages:

  1. Relying on my brain's memory, memorizing every experiment including its hyperparameters and results.
  2. After running 5 experiments, my memory starts fading, so I switch to a text editor (e.g. Word, Google Docs) or anything similar.
  3. Since I am too lazy to write and most of it is repetitive anyway, I move to spreadsheets: a Google Spreadsheet document with all of the results in the form of a table.
  4. The spreadsheet grows bigger and longer, and I have a hard time understanding it. I then start looking for modern experiment tracking tools, and I have been using one of them ever since.

Even if you are only writing things down on a sheet of A4 paper, you are already doing better than anyone who does nothing to track their experiments.

What Is Experiment Tracking?

Well, simply put, experiment tracking is the practice of logging all of the experiment information you care about. What might that include? In the simplest form (a minimal sketch follows this list), it covers things like:

  • Hyperparameters (learning rate, optimizer, etc.)
  • Model (you might try different learning algorithms or different model architectures)
  • Metric results
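
To make that concrete, here is a minimal, hypothetical sketch of what one tracked run could look like if you logged it by hand in plain Python. The field names and values are purely illustrative, not the schema of any particular tool:

import json
from datetime import datetime, timezone

# A hypothetical record for a single experiment run; fields are illustrative.
run_record = {
    "run_id": "exp-001",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "hyperparameters": {"learning_rate": 1e-3, "optimizer": "adam", "batch_size": 32},
    "model": "resnet18",  # learning algorithm or architecture
    "metrics": {"val_accuracy": 0.82, "val_loss": 0.41},
}

# Append the record to a JSON Lines file so every run stays on its own line.
with open("experiments.jsonl", "a") as f:
    f.write(json.dumps(run_record) + "\n")

Even a simple append-only log like this already answers the question "which settings produced which results?" months later.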

One thing I want to emphasize about experiment tracking is that it is useful even if the models never get deployed to a production environment. In academic settings, most research papers do not end up deployed as services; in a professional working environment, the opposite is true and we constantly improve and deploy new versions of our models. Either way, experiment tracking sits inside the model development stage of the MLOps lifecycle and does not interrupt the development flow.

MLOps lifecycle. Image is taken from https://ml-ops.org/content/mlops-principles

Do We Really Need It?

Within a project, you might work with multiple data scientists and machine learning engineers. Sometimes, some of them just want to do a quick run in a Jupyter Notebook. With that in mind, we might stumble upon these issues:

  1. All experiment results are spread across multiple machines. Perhaps one team member runs experiments on their own laptop, another on a PC, a cloud instance, Google Colab, or even a Kaggle Kernel. It becomes hard to manage metadata scattered across so many places.
  2. Analyzing the results becomes much more challenging, if not impossible. The way you and your team members track things may vary, and you might forget to include something when presenting your experiments.
  3. Less collaboration and synergy within the team. Since we work as a team, sharing what is working for you and what is working for everyone else is essential. You might also want to try the settings your teammates are experimenting with and discover why they work.
  4. Unable to observe the experiments live. While experimenting, you and your team might want to see how the loss and accuracy evolve while the model is training (and perhaps stop the run immediately if the model is not converging, to avoid wasting GPU hours). You may also want to monitor current memory, CPU, and GPU consumption (if you are using one) directly.

These are the issues that are commonly faced when we don’t have any experiment tracking mechanism. There are probably more problems along the way if you haven’t started tracking your experiments.

Experiment Tracking Common Practice

I started working as a Data Scientist in a professional environment just a year ago, and I have already put my hands on various projects related to recommendation systems, learning to rank, computer vision, NLP, etc. I have noticed that each field has its own unique components as well as common ones. In general, these artifacts are worth keeping track of (a rough sketch of capturing some of them by hand follows the list):

  • Scripts & codes (preprocessing scripts, training and evaluation scripts, notebooks, etc.)
  • Data (both training and test data since you might use different data in between the runs)
  • Hyperparameters
  • Metrics
  • And don’t forget the environment that you’re using (e.g. a Dockerfile, requirements.txt, config.yaml)
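
As a rough illustration, here is a small, hand-rolled sketch of how you might capture a few of these artifacts (data fingerprints and the environment) alongside a run. The file names and fields are assumptions made for the example, not a prescribed layout:

import hashlib
import json
import subprocess

def file_sha256(path):
    # Fingerprint a data file so you can tell later exactly which version was used.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

metadata = {
    "train_data_sha256": file_sha256("data/train.csv"),  # assumed path
    "test_data_sha256": file_sha256("data/test.csv"),    # assumed path
    # Freeze the exact package versions used for this run.
    "pip_freeze": subprocess.run(
        ["pip", "freeze"], capture_output=True, text=True
    ).stdout.splitlines(),
}

with open("run_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

Modern tracking tools automate most of this for you, but the idea is the same: anything that can change the result deserves a recorded fingerprint.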

Now you might ask, “what else can we track?” Depending on the type of project that you are working on, you might add multiple things to track:

  • Deep learning: model checkpoints, best model weights, hardware resources
  • Structured data: SHAP plots, feature importance
  • Hyperparameter optimization: best hyperparameter combinations
  • Computer vision: CAM (Class Activation Maps)
  • NLP: Prediction explanations

Setting Up Experiment Tracking

Not surprisingly, many people have probably done experiment tracking with these tools without even realizing it. Maybe even you!

Spreadsheets

A very common workflow when working with spreadsheets:

  1. We start with a big spreadsheet.
  2. Invite collaborators.
  3. Copy-paste all of the information from our experiments into the spreadsheet (a small sketch of doing this programmatically follows the list).
  4. Rinse & repeat.
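
If your "spreadsheet" is just a CSV file, the whole workflow boils down to appending one row per run. A minimal sketch, assuming a results.csv file and illustrative column names:

import csv
import os

# Illustrative columns; in practice you and your team have to agree on these.
FIELDS = ["run_id", "model", "learning_rate", "optimizer", "val_accuracy"]

def append_run(row, path="results.csv"):
    # Create the file with a header on first use, then append one row per run.
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

append_run({
    "run_id": "exp-002",
    "model": "resnet18",
    "learning_rate": 1e-3,
    "optimizer": "sgd",
    "val_accuracy": 0.82,
})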

Pros:

  • Effortless to set up.
  • It works in some cases.

Cons:

  • Difficult to version (even though Google Spreadsheets does offer a version history feature).
  • No artifact backup mechanism.
  • When it grows bigger (more experiments and more team members), it becomes a hassle.
  • You and your team need to come up with a naming convention for how to fill in the table. For example, you come up with a model name like: resnet16_freeze_linearlayer3_sgd_momentum_beta_0.98_lr001_acc082.pt. Great, now do that again with 5 more hyperparameters and 3 more metrics. Good luck with that 🤣
  • People can accidentally modify something in the table and they won’t even know!

GitHub

We usually create a repo, then a develop branch, and then create other branches from there. Perhaps we create one branch specifically to play around with the learning rate, another branch for the preprocessing code, and so on.

GitHub common workflow. Image by author
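
Git also gives you a cheap way to tie each experiment back to the exact code that produced it: record the current commit hash next to your results. A minimal sketch, assuming the script runs inside a git repository:

import subprocess

def current_git_commit():
    # Return the hash of the commit the working tree is currently on.
    return subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

# Store the commit alongside the rest of your run metadata.
run_metadata = {"git_commit": current_git_commit(), "learning_rate": 1e-3}
print(run_metadata)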

Pros:

  • Now we can version our metadata
  • More organized compared to spreadsheets

Cons:

  • Although we can version the metadata, comparing more than two experiments is going to be painful.
  • GitHub is not designed for experiment tracking.
  • Unable to see the experiments live.

In-house Experiment Tracking Tools

If your team is big enough, you might want to develop your own experiment tracking tool, customized for your projects (a toy sketch of the idea follows the pros and cons below).

Pros:

  • Custom-tailored to the features your organization needs.
  • More organized compared to both spreadsheets and GitHub.

Cons:

  • More demanding, as new features will always be requested along the way.
  • Needs a dedicated team for continuous maintenance.
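
To give a feel for what "in-house" means at its smallest, here is a toy sketch of a tracker that appends runs to a local SQLite database. A real internal tool would add a UI, access control, artifact storage, and so on, which is exactly where the maintenance burden comes from:

import json
import sqlite3
import time

class TinyTracker:
    # A toy in-house tracker: one SQLite table, one row per run.
    def __init__(self, db_path="experiments.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS runs "
            "(id INTEGER PRIMARY KEY, started_at REAL, params TEXT, metrics TEXT)"
        )

    def log_run(self, params, metrics):
        self.conn.execute(
            "INSERT INTO runs (started_at, params, metrics) VALUES (?, ?, ?)",
            (time.time(), json.dumps(params), json.dumps(metrics)),
        )
        self.conn.commit()

tracker = TinyTracker()
tracker.log_run({"learning_rate": 1e-3, "optimizer": "adam"}, {"val_accuracy": 0.82})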

Modern Experiment Tracking Tools

Or, if you and your team don’t want to waste time developing experiment tracking tools from the ground up, you can always resort to the ones that are already available and have been used by many people.

Pros:

  • Organized and centralized repository.
  • More flexible.
  • Comparing between experiments is easier.
  • Save all of your metadata.

Cons:

  • Pricing (although it is arguably worth it if you really need it to organize all of your experiments, given all of the problems mentioned earlier).

Show Time!

Enough theory! Now I will show you how to use one of the modern experiment tracking tools available (since, as we have seen, they are probably the best option for tracking our experiments). The tool I will be using is neptune.ai, but you can try whichever tool suits you best. I believe most modern tracking tools follow roughly these steps:

  1. Of course, you need to register an account (in this case, with neptune.ai). You get 200 free monitoring hours each month on the individual workspace (at the time of writing this article). If you need a team workspace, you get 14 days of free access (again, at the time of writing), after which you’ll need to pay for a subscription plan.

2. Next, install the dependencies:

pip install neptune-client

3. Initialize a neptune.ai run with your own project name and the API token from your neptune.ai account:

import neptune.new as neptune

run = neptune.init_run(
    project="YOUR-PROJECT",
    api_token="YOUR-TOKEN",
)

4. Specify what you want to track:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# example data (any regression dataset will do; here, a scikit-learn toy dataset)
X, y = load_diabetes(return_X_y=True)

# log important hyperparameters
parameters = {"n_estimators": 10, "criterion": "squared_error", "max_depth": 10, "min_samples_leaf": 5}
run["parameters"] = parameters

clf = RandomForestRegressor(**parameters)
clf.fit(X, y)

# log the important metrics (negated scoring so we can report a positive MSE)
scores = cross_val_score(clf, X, y, cv=5, scoring="neg_mean_squared_error")
run["val/mse"] = -scores.mean()

run.stop()

5. Run your training script as usual and the job will be submitted to your dashboard. Sit back and enjoy a cup of coffee! ☕

python your_training_script.py

neptune.ai example run. Image by author
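
One more thing worth showing, since observing experiments live was one of the pain points above: most modern tools can log a series of values instead of a single number. With the neptune.new client used in this article, that is typically done by calling .log() on a field inside your training loop. The loop below is just a stand-in for your real training code, with placeholder values:

# inside your training loop (illustrative values stand in for real training)
for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)      # placeholder for your real loss
    val_accuracy = 0.5 + 0.04 * epoch   # placeholder for your real metric

    # each .log() call appends a point to the series plotted live on the dashboard
    run["train/loss"].log(train_loss)
    run["val/accuracy"].log(val_accuracy)

Note that these calls need to happen before run.stop(); the dashboard then updates the charts as the values arrive.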

Summary

In this article, we discussed how experiment tracking tools can improve our workflow. There are numerous ways to start tracking our experiments. However, weighing the pros and cons, both in-house tools (if you are a data scientist or machine learning engineer working in a big team) and modern experiment tracking tools seem to be the way to go. Hopefully, you will have fun and start your own experiment tracking journey after reading this article. Happy experimenting! 😁
