Manage ML experiments using ClearML

5 min read · Sep 23, 2023

Hi, everyone! In this article I explained what an experiment manager in machine learning is, why we need one, and what it should contain, using the VertaAI manager as an example. In this post we’ll discuss one of the most powerful tools for storing, tracking and reproducing ML experiments: ClearML. It has excellent documentation, good video guides, many useful functions, and a convenient API.

In the first section we will cover the basic features with Python examples. In the second section I will show how to put experiments in a queue and run them directly from the UI instead of the console. The full code for this post is available in my repo.

Base methods

I use a ClearML server that runs locally. First, open the UI and sign in (pic. 1). Then get credentials for the Python API: you can create new credentials in Settings. Copy the generated code (pic. 2) so you can paste it later.

Next, open your Python environment and run the commands below.

pip install clearml
clearml-init

If you get an error after running the last command, check your home directory and try again. If that doesn’t help, download this clearml.conf, update the credentials, and put it in your home directory.

You will see something like this.

It means that from now on you can use ClearML from Python. So, let’s do this!

Create your first experiment “train-experiment-001” in your project “Car Classification” and add some initial info like tags, config, hyperparameters.

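from clearml import Task, Logger, TaskTypes

# Credentials generated in the UI (Settings); replace them with your own.
Task.set_credentials(
    key='XY77TG37D416FBTF282R',
    secret='dYFLBRj7hK3sOkEFp2ivL7Bf1gAML9MLX0uKclMaVlcDWAUc89'
)
# Create the experiment; automatic framework logging is turned off here.
task = Task.init(project_name='Car Classification',
                 task_name='train-experiment-001',
                 task_type=TaskTypes.training,
                 auto_connect_frameworks=False)
logger = Logger.current_logger()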

task.add_tags(['example'])

hparams = {
    'learning_rate': 0.001
}
task.connect(hparams, name='hparams')

During training we periodically compute the loss and a set of metrics. ClearML can combine several lines in one plot. In my case there are two plots: the Loss plot has a single CrossEntropy line, and the Metrics plot contains Precision, Recall, and Accuracy.
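A minimal sketch of how these two plots could be produced with Logger.report_scalar: reports that share a title land in the same plot, and each series becomes one line in it. The helper name log_epoch, the demo task name, and all numbers are placeholders for illustration, not necessarily the code from the repo.

```python
def log_epoch(logger, epoch, loss, metrics):
    """Report one epoch: a 'Loss' plot and a combined 'Metrics' plot."""
    # Same title -> same plot; every distinct series is a separate line.
    logger.report_scalar(title="Loss", series="CrossEntropy",
                         value=loss, iteration=epoch)
    for name, value in metrics.items():
        logger.report_scalar(title="Metrics", series=name,
                             value=value, iteration=epoch)


def demo():
    """Needs a configured ClearML server, so it is not called automatically."""
    from clearml import Task

    task = Task.init(project_name="Car Classification",
                     task_name="scalar-logging-demo")  # hypothetical task name
    logger = task.get_logger()
    # Placeholder numbers standing in for a real training loop.
    history = [
        (0.90, {"Precision": 0.60, "Recall": 0.55, "Accuracy": 0.58}),
        (0.70, {"Precision": 0.68, "Recall": 0.63, "Accuracy": 0.66}),
    ]
    for epoch, (loss, metrics) in enumerate(history):
        log_epoch(logger, epoch, loss, metrics)
    task.close()
```

Passing the logger in as an argument keeps the helper independent of how the task was created, so the same function works in a training script and in a quick offline check.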

Please don’t pay attention to how the plots look: this is just an example.

At the end of training you can log the final metrics as the result of your experiment. They are displayed in the table at the top (pic. 3).
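One way to record such final values is Logger.report_single_value, which stores one number per metric name instead of a per-iteration series. Whether the table in pic. 3 was produced exactly this way is an assumption; the helper name log_final_metrics is made up for this sketch.

```python
def log_final_metrics(logger, metrics):
    """Report each final metric once, as a single value (not a time series)."""
    for name, value in metrics.items():
        logger.report_single_value(name=name, value=float(value))
    # Return the reported names, which is handy for a quick sanity check.
    return sorted(metrics)
```

With a live task this could be called as log_final_metrics(Logger.current_logger(), final_metrics), where final_metrics is a dict of metric names to numbers computed on your test set.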

Additionally, for our classification task ClearML lets us store Precision-Recall and ROC curves as well as a confusion matrix (pic. 4). It can also be useful to save bad samples (images in our case) that the model predicts incorrectly during testing. These are called debug samples. I strongly recommend not saving many large media files: over the course of your research there may be hundreds of experiments, and storage use will grow quickly.
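A hedged sketch of the confusion matrix and debug-sample part, using Logger.report_confusion_matrix and Logger.report_image. The helper names, the plot titles, and the limit of 10 images are assumptions made for illustration; the Precision-Recall and ROC curves are not covered here.

```python
import os


def confusion_counts(y_true, y_pred, num_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def misclassified(paths, y_true, y_pred, limit=10):
    """Paths of wrongly predicted samples, capped to keep storage small."""
    wrong = [p for p, a, b in zip(paths, y_true, y_pred) if a != b]
    return wrong[:limit]


def report_evaluation(logger, paths, y_true, y_pred, class_names, iteration=0):
    """Send the confusion matrix and a few failed images to ClearML."""
    import numpy as np  # imported here so the helpers above stay dependency-free

    matrix = np.asarray(confusion_counts(y_true, y_pred, len(class_names)))
    logger.report_confusion_matrix(
        title="Confusion matrix", series="test", matrix=matrix,
        iteration=iteration, xaxis="Predicted", yaxis="Actual",
        xlabels=class_names, ylabels=class_names,
    )
    for path in misclassified(paths, y_true, y_pred):
        # A separate series per file, so one image does not replace another.
        logger.report_image(title="Debug samples",
                            series=os.path.basename(path),
                            iteration=iteration, local_path=path)
```

The cap in misclassified follows the advice above: only a handful of failed images per experiment is uploaded.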

One of the big advantages of ClearML is that much of the essential information is logged automatically: args, stdout, model attributes, resource consumption, etc. Some of this can be disabled in Task.init().
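As a rough illustration, the automatic logging can be switched on or off per feature through Task.init() arguments such as auto_connect_arg_parser, auto_connect_frameworks, auto_connect_streams, and auto_resource_monitoring. The particular on/off choices and the helper name start_task below are assumptions for this sketch; pick whatever suits your project.

```python
# Illustrative choice of switches, not a recommendation.
INIT_OPTIONS = {
    # Keep capturing argparse arguments.
    "auto_connect_arg_parser": True,
    # Per-framework control: log PyTorch models, skip matplotlib/TensorBoard.
    "auto_connect_frameworks": {
        "pytorch": True,
        "matplotlib": False,
        "tensorboard": False,
    },
    # Capture stdout and stderr, but not the logging module.
    "auto_connect_streams": {"stdout": True, "stderr": True, "logging": False},
    # Turn off CPU/GPU/memory monitoring.
    "auto_resource_monitoring": False,
}


def start_task(task_cls, project_name, task_name, options=INIT_OPTIONS):
    """Call Task.init with the switches above; task_cls is clearml.Task."""
    return task_cls.init(project_name=project_name,
                         task_name=task_name, **options)
```

In a real script this would be called as start_task(Task, 'Car Classification', 'train-experiment-001') after from clearml import Task.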

Finally, we have all the information about our experiment, so we can reproduce it and compare it with others. A single table shows all experiments with their status, metrics, and hyperparameters.

Workers and Queues

Using ClearML Agent we can run experiments from a queue. A worker is assigned to a queue and runs the experiments in it automatically, one by one. A worker can use one or more GPUs. A Docker image can be specified so that the train/test script runs inside a container. Please see this doc for details on how to create workers and queues with ClearML Agent.

Next, we need an experiment in Draft status. One way to get one is to clone an existing experiment.

After that you can edit many things, such as args, hyperparameters, or config, right in the UI. For example, if we want to reproduce training with another version of the data, we can change it in the args.

The next step is to put the experiment described above into a particular queue.

If an active worker is listening to this queue, it will run the experiment automatically.
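The clone, edit, and enqueue steps can also be scripted instead of clicked through in the UI. This is a sketch under assumptions: the queue name 'default', the new task name, and the helper clone_and_enqueue are illustrative, and the parameter key 'hparams/learning_rate' assumes the hyperparameters were connected with task.connect(hparams, name='hparams') as earlier in the post.

```python
def clone_and_enqueue(task_cls, project, source_name, new_name,
                      overrides, queue):
    """Clone an existing experiment, edit the draft, and put it in a queue.

    task_cls is clearml.Task; it is passed in so the flow is easy to check
    without a server.
    """
    source = task_cls.get_task(project_name=project, task_name=source_name)
    draft = task_cls.clone(source_task=source, name=new_name)  # Draft status
    for name, value in overrides.items():
        # Keys look like '<section>/<parameter>'.
        draft.set_parameter(name, value)
    task_cls.enqueue(draft, queue_name=queue)
    return draft


def demo():
    """Needs a ClearML server and a worker on the queue; not called here."""
    from clearml import Task

    clone_and_enqueue(Task, "Car Classification", "train-experiment-001",
                      "train-experiment-002",
                      {"hparams/learning_rate": 0.01}, "default")
```

When the worker runs the clone, task.connect(hparams, name='hparams') in the training script picks up the edited values, which is why the key has to match the section name used there.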

Why do we need this feature?

Imagine that we have multiple hyperparameters, data versions, and configs. We can change them and run each experiment manually (for example, from a Python script). Another way is to create drafts, change the settings of each draft right in the UI, and put the drafts in a queue. The worker will run them sequentially. If you want experiments to run in parallel, create several queues with a worker for each of them.
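For many combinations, creating the drafts can be automated too. The sketch below builds every combination of a few settings and enqueues one clone per combination; the names grid and enqueue_grid, the 'sweep-NNN' task names, and the parameter keys in the usage note are hypothetical, and this is a plain loop rather than any built-in ClearML optimizer.

```python
from itertools import product


def grid(options):
    """All combinations of the given settings, as a list of override dicts."""
    names = sorted(options)
    value_lists = [options[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def enqueue_grid(task_cls, template_task_id, options, queue):
    """Clone the template once per combination and enqueue every draft.

    task_cls is clearml.Task; template_task_id is the ID of an existing
    experiment to use as the template.
    """
    drafts = []
    for index, overrides in enumerate(grid(options)):
        draft = task_cls.clone(source_task=template_task_id,
                               name=f"sweep-{index:03d}")
        for name, value in overrides.items():
            draft.set_parameter(name, value)
        task_cls.enqueue(draft, queue_name=queue)
        drafts.append(draft)
    return drafts
```

For example, options = {'hparams/learning_rate': [0.001, 0.01], 'Args/data_version': ['v1', 'v2']} would yield four drafts; a single worker runs them one after another, and spreading them over several queues lets them run in parallel.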

Conclusion

ClearML is a really powerful tool for managing ML experiments. Try integrating it into your project. I hope you’ll like it :)