Managing experiment metadata "the MLOps way" — Neptune

Adam Kozłowski
Published in ResponsibleML · Aug 12, 2022 · 5 min read


One day the great news arrives: the data is available and it is time to start training. We have the goals we want to meet and the first model architectures. Then the question arises: how do we compare results? That's why we decided to use neptune.ai.

This post is the fifth in our xLungs series about the Responsible Artificial Intelligence for Lung Diseases project. You can check out our previous posts here.

Use case

The xLungs project, carried out at MI2DataLab at Warsaw University of Technology, aims to develop a universal tool that helps radiologists with medical image diagnosis. We use state-of-the-art architectures with one of the biggest chest CT datasets. Our goal is to develop robust and reliable models for tasks such as tissue segmentation and patient diagnosis via classification based on carefully defined ontologies. Fulfilling these tasks requires a lot of experimentation, and we quickly found it hard to compare our results.

A common approach is to use TensorBoard for experiment tracking. However, when experiments are spread over time, it is easy to get lost in the produced logs. After a couple of weeks, it can be a nightmare to find the best model and compare its results with others. That's why we needed a tool that helps us log experiments, lets us easily track changes between different tasks, and allows easy comparison of results from the same task.

We came across Neptune. At first it seemed like a simple logger, but we soon discovered how powerful it is. Why did we decide to use it? Read on!

Runs table and logging

It sounds very simple, but it can cause a lot of trouble. So far we had mostly used TensorBoard as a vanilla tracker, without defining a common way to store our experiments or setting up any rules for it. You can probably imagine how much work it would take to organize the log files so that we could find the experiments we want to compare. We would also have to keep a separate database linking logs to experiment metadata. I could go on and on, because the list of issues is long, and we don't have time for it: we are here to develop models, not tracking tools.

Neptune solves these problems. All results are stored on a server (by default in the cloud, but in our case we host it locally). Experiments are split into different tasks (called projects, like CT segmentation) and listed in the runs table. Each run is automatically added to the list when it starts. There we can see all metadata and metrics, and even sample images uploaded during training.

Runs table from example project
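If you need the same view programmatically, the runs table can also be fetched with the client. Below is a minimal sketch, assuming a recent neptune client and a made-up project name (argument names can differ slightly between client versions):

```python
import neptune

# Connect to an existing project in read-only mode
# ("mi2datalab/ct-segmentation" is a made-up name).
project = neptune.init_project(
    project="mi2datalab/ct-segmentation",
    mode="read-only",
)

# Fetch the runs table as a pandas DataFrame and peek at the logged metadata.
runs_df = project.fetch_runs_table().to_pandas()
print(runs_df.head())
```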

Comparing runs

OK, but so far it doesn't look like we have solved our problem. How do we compare runs and find the best model for a given task? Having all metadata in one place is helpful, but how do we track run metrics over the course of training? Usually we would dig up the TensorBoard logs and try to plot them, but with a growing number of experiments and people in the project, comparing them later becomes tedious.

With Neptune we can easily compare all of the runs. It allows live tracking during a run, and we can also plot experiments that share the same metrics. For filtering, we can use metrics (such as the lowest AUC) as well as another simple yet powerful feature: tags. Tagging is a real game changer because it makes separating experiments easy. It is very helpful when there are many different approaches to the same task, since we can quickly filter our results. Later, at the final stages, we can compare our approaches with a custom dashboard.

Comparing selected runs
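Tagging itself takes very little code. A minimal sketch, again with a made-up project name and metric value:

```python
import neptune

# Open a run in a made-up project and attach tags we can filter by later.
run = neptune.init_run(project="mi2datalab/ct-segmentation")
run["sys/tags"].add(["segmentation", "unet-baseline"])

# A single value logged like this shows up as a column in the runs table,
# so we can sort or filter by it (e.g. lowest AUC).
run["val/auc"] = 0.91

run.stop()
```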

API

Well, it seems like a very nice tool for tracking experiments, but how do we use it? TensorBoard comes with a very simple API that we can use across different libraries: you either create a callback that logs everything automatically, or you create a writer object and manually decide where to log data.
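For reference, this is roughly what the manual-writer route looks like with PyTorch's SummaryWriter (the log directory and values below are made up):

```python
from torch.utils.tensorboard import SummaryWriter

# Manually created writer: we decide where and what to log.
writer = SummaryWriter(log_dir="runs/ct-segmentation-exp1")

for step in range(100):
    loss = 1.0 / (step + 1)  # dummy value standing in for the real training loss
    writer.add_scalar("train/loss", loss, global_step=step)

writer.close()
```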

In the case of Neptune it is, well… the same. The API is very intuitive and simple. To start logging you create a run that is connected to a specific project. Most of the parameters are then passed to this object as dictionaries, which is very helpful in organizing them. During training, you can either create a Neptune callback or manually decide where and what to log. Since the API is so simple and intuitive, integration is not very demanding.
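Here is a minimal sketch of manual logging with the neptune client. The project name, token handling and hyperparameters are placeholders, and older client versions use .log() instead of .append():

```python
import neptune

# Create a run connected to a specific project (made-up name); the API token
# is expected in the NEPTUNE_API_TOKEN environment variable.
run = neptune.init_run(project="mi2datalab/ct-segmentation")

# Hyperparameters go in as a dictionary, which keeps them organized.
run["parameters"] = {"lr": 1e-4, "batch_size": 8, "architecture": "unet"}

# During training we decide where and what to log.
for step in range(100):
    loss = 1.0 / (step + 1)  # dummy value standing in for the real training loss
    run["train/loss"].append(loss)

run.stop()
```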

Future tools

However, Neptune keeps moving forward! The developers keep announcing new features. One of the most important is the model registry: it lets you track changes across different architectures and keeps track of the best model. It is very simple to use in your project, and it also becomes vital at later stages, when selecting models for production.
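As a rough sketch of how this can look with the client's model registry calls (the model key, project name, file path and metric below are all placeholders, and the exact API may differ between client versions):

```python
import neptune

# Register a model in the project (made-up key and project name).
model = neptune.init_model(key="SEG", project="mi2datalab/ct-segmentation")
model["task"] = "tissue segmentation"

# Create a version of that model and attach its weights and validation score.
model_version = neptune.init_model_version(model="CTSEG-SEG")  # made-up model ID
model_version["model/weights"].upload("checkpoints/best.pt")
model_version["validation/dice"] = 0.87

# Promote the version so it is easy to find when selecting models for production.
model_version.change_stage("staging")
```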

Another feature is dataset versioning. As our project is still gathering new data sources, it is important to track when they change. This matters a lot because new data introduces a shift that can strongly influence the model, so even the same architecture may differ in performance. Dataset versioning helps you reliably compare experiments.
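A small sketch of how this can look with the client's file-tracking helper (the paths and project name are placeholders):

```python
import neptune

run = neptune.init_run(project="mi2datalab/ct-segmentation")  # made-up project name

# Record hash-based snapshots of the data locations used in this run, so runs
# trained on different dataset versions can be told apart and compared later.
run["datasets/train"].track_files("s3://xlungs-data/train")  # placeholder remote path
run["datasets/val"].track_files("data/val")                  # placeholder local path

run.stop()
```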

Conclusion

At first, Neptune may look like a simple tracking tool, but it gives you and your team a stable basis for experiment tracking. In our case, we want to create reliable deep learning models. To do so, we need a tool that helps us track our progress, compare experiment results and, most importantly, does not distract us from our main task. Neptune provides all of that at once, and we are glad that we can use it.

If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.
