Introduction to ClearML — Executing in Google Colab

9 min read · Dec 1, 2023

ClearML is an open-source MLOps platform for recording experiments and executing them remotely. This article demonstrates how to use ClearML in Google Colab; the example trains a model on the MNIST dataset with PyTorch.

Part of the code and files is also available on GitHub.

0. Basic Concepts

The structure of ClearML (the icons are from the ClearML Web UI)

This graph depicts the relationships between the main components of ClearML. One account can create multiple credentials, each of which consists of five values. Once a credential is available on the local machine, we can use it to create multiple experiments under the projects we assign. ClearML also lets us execute experiments remotely by creating agents. An agent provides the task's execution environment, either a virtual environment or a Docker container; one agent can run a specific experiment or a queue of experiments. ClearML records and visualizes eight aspects of each experiment.

1. Create an Account and Obtain Credentials

First, we sign up for a free ClearML account with an email address or a Google, Bitbucket, or GitHub account. After logging in, we click the icon in the upper-right corner of the Web UI and open the settings page.

Under the ‘Workspace’ option in the sidebar, we can create new credentials. The popup shows the credential's information: access key, secret key, web server, API server, and files server. It has two tabs, local Python and Jupyter Notebook; we copy the code from the tab that matches how we will execute experiments. Be sure to copy the secret key, since it is shown only once. (The web server, API server, and files server are fixed, and the access key remains visible in the Web UI.)

This article focuses on executing experiments in Google Colab, which is similar to executing them in Jupyter Notebook. In Google Colab or Jupyter Notebook, we can connect the experiment either by setting the environment variables shown in the lower right, or by passing the information to the clearml.Task object (explained in section 2.0).

Credential info for local Python (left) and Jupyter Notebook (right)

2. Executing Experiments in Google Colab — Pytorch MNIST

There are two main ways to execute experiments. One is to execute in a specific local environment that already has the required packages. This is the more intuitive way because only two more lines of code are needed in our current script; however, it requires that non-conflicting packages are already installed in the environment. The other way creates an agent, which builds a virtual environment or runs a Docker container based on the Docker image we specify.

The documentation provides sample code for different frameworks (ref). Here we take PyTorch with the MNIST dataset as the example (script ref).

2.0 Set up

From here on, we code in Google Colab. Remember to set the hardware accelerator first by selecting ‘Runtime’ > ‘Change runtime type’.

Sections 2.1 and 2.2 introduce the two ways of executing experiments; the complete code for each is collected at the end of its section. For both methods, the first step is to install the clearml package and import the main modules. We can run pip install directly in a code cell by adding an exclamation mark in front of the command.

!pip install clearml
from clearml import Task, Logger

Next, we need a credential to record this experiment. Besides setting the environment variables, we can pass the credential through the Task class.

# Method 1) Set the environmental variables

%env CLEARML_WEB_HOST=https://app.clear.ml
%env CLEARML_API_HOST=https://api.clear.ml
%env CLEARML_FILES_HOST=https://files.clear.ml
%env CLEARML_API_ACCESS_KEY= <your access key>
%env CLEARML_API_SECRET_KEY= <your secret key>

# Method 2) Specify in Task object

Task.set_credentials(
     api_host="https://api.clear.ml",
     web_host="https://app.clear.ml",
     files_host="https://files.clear.ml",
     key='<your access key>',
     secret='<your secret key>'
)
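
Pasting the secret key into a notebook cell makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the five values are already exported under the environment variable names used in Method 1, the helper below collects them into the keyword arguments that Task.set_credentials() takes. The names `credentials_from_env` and `apply_credentials` are hypothetical helpers for illustration, not part of ClearML.

```python
import os

# Environment variable name -> keyword argument of Task.set_credentials()
ENV_TO_KWARG = {
    "CLEARML_API_HOST": "api_host",
    "CLEARML_WEB_HOST": "web_host",
    "CLEARML_FILES_HOST": "files_host",
    "CLEARML_API_ACCESS_KEY": "key",
    "CLEARML_API_SECRET_KEY": "secret",
}


def credentials_from_env(env=None):
    """Return the set_credentials() kwargs, failing early if a value is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError("Missing ClearML settings: " + ", ".join(missing))
    # Strip stray whitespace, e.g. from a value typed after "=" with a space.
    return {kwarg: env[name].strip() for name, kwarg in ENV_TO_KWARG.items()}


def apply_credentials(env=None):
    """Pass the collected values to ClearML (requires the clearml package)."""
    from clearml import Task  # imported lazily so the helper above works without it

    Task.set_credentials(**credentials_from_env(env))
```

Calling `apply_credentials()` before `Task.init()` would then play the same role as Method 2, without the keys appearing in the notebook.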

2.1 Execution in local environment

For our Pytorch model, we need to install torch and torchvision.

!pip install torch torchvision

After we initialize a task, the code executed below this line is recorded by ClearML. The output includes the URL of the ClearML results page. We can check the results uploaded to ClearML from that link or open the results page from the Web UI (click ‘Projects’ on the sidebar > click your project > right-click on your experiment for details).

task = Task.init(project_name='<your project name>', task_name='<your task name>')
# Start the experiment execution here...

All outputs returned after we run ‘Task.init()’ will be recorded in the result page. For example, CONSOLE records the outputs of code cells, and PLOTS records your output plots.

Some information from experiments, however, has to be recorded manually.

Note that hyperparameters and evaluation scalars, such as training loss or test accuracy, are not recorded automatically. Hyperparameters can be recorded manually with the task.connect() method

# e.g. Save 'max_iter' and 'random_state' in CONFIGURATION > HYPERPARAMETERS > General.
task.connect({"max_iter":20,"random_state":102})

or make use of argparse.ArgumentParser(). As for the evaluation scalars, we need to manually add Logger.current_logger().report_scalar() at every point where we would like to report a value. The script below demonstrates how to report hyperparameters and evaluation scalars; the training loss is reported every log_interval batches and the test metrics once per epoch. The script is edited from the official script; I commented out ‘from clearml import Task’ and ‘task = Task.init(…)’ since I have already initialized the task outside the script. Here I name the script ‘torch_script.py’.

# -----------------file name: torch_script.py-------------------------

from __future__ import print_function

import argparse
import os
from tempfile import gettempdir

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

# from clearml import Task, Logger
from clearml import Logger

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4 * 4 * 50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4 * 4 * 50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            Logger.current_logger().report_scalar(
                "train", "loss", iteration=(epoch * len(train_loader) + batch_idx), value=loss.item())
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                       100. * batch_idx / len(train_loader), loss.item()))


def test(args, model, device, test_loader, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    Logger.current_logger().report_scalar(
        "test", "loss", iteration=epoch, value=test_loss)
    Logger.current_logger().report_scalar(
        "test", "accuracy", iteration=epoch, value=(correct / len(test_loader.dataset)))
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main():
    # task = Task.init(project_name='examples', task_name='PyTorch MNIST script')

    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')

    parser.add_argument('--save-model', action='store_true', default=True,
                        help='For Saving the current Model')
    args = parser.parse_known_args()[0]
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 2, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST(os.path.join('..', 'data'), train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST(os.path.join('..', 'data'), train=False, transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader, epoch)

    if args.save_model:
        torch.save(model.state_dict(), os.path.join(gettempdir(), "mnist_cnn.pt"))
        
if __name__ == '__main__':
    main()

After uploading torch_script.py to the current directory in Google Colab, we need one more line of code to execute it.

!python torch_script.py  
# We can use args to set hyperparameters. For example,
# !python torch_script.py --epochs 3 --batch-size 128
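
The flags above map onto the attributes the script reads: argparse turns `--batch-size` into `args.batch_size`, and `parse_known_args()` returns unrecognized arguments instead of rejecting them. The following is a trimmed-down sketch with only two of the script's options; the `build_parser` helper and the `--unknown-flag` argument are illustrative and do not appear in the script.

```python
import argparse


def build_parser():
    """Two of the options from the MNIST script, with the same defaults."""
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=10)
    return parser


# Equivalent to: python torch_script.py --epochs 3 --batch-size 128 --unknown-flag
args, unknown = build_parser().parse_known_args(
    ["--epochs", "3", "--batch-size", "128", "--unknown-flag"]
)
print(args.epochs, args.batch_size)  # dashes in flag names become underscores
print(unknown)  # unrecognized arguments are returned, not rejected
```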

Now the hyperparameters and evaluation scalars are shown on the result page, in the CONFIGURATION and SCALARS tabs respectively.
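
The reporting pattern used in train() can be looked at in isolation. The sketch below is not the script's code: `RecordingLogger` is a hypothetical stand-in that only mimics the `report_scalar(title, series, value, iteration)` call so the example runs without a ClearML server (in the real script the object comes from Logger.current_logger()), and the loss values are placeholders. It shows how the x-axis value is derived from the epoch and the batch index.

```python
class RecordingLogger:
    """Hypothetical stand-in for a ClearML Logger; stores calls instead of uploading."""

    def __init__(self):
        self.records = []

    def report_scalar(self, title, series, value, iteration):
        self.records.append((title, series, iteration, value))


def global_iteration(epoch, batches_per_epoch, batch_idx):
    """Same formula as the script: one increasing counter across epochs."""
    return epoch * batches_per_epoch + batch_idx


def report_losses(logger, losses_per_epoch, log_interval=10, first_epoch=1):
    """Report every `log_interval`-th batch loss under title 'train', series 'loss'."""
    for epoch, losses in enumerate(losses_per_epoch, start=first_epoch):
        for batch_idx, loss in enumerate(losses):
            if batch_idx % log_interval == 0:
                logger.report_scalar(
                    "train", "loss", value=loss,
                    iteration=global_iteration(epoch, len(losses), batch_idx),
                )


logger = RecordingLogger()
# Two made-up epochs of 20 batches each; the values are placeholders, not results.
report_losses(logger, [[0.9] * 20, [0.5] * 20])
print([iteration for _, _, iteration, _ in logger.records])  # [20, 30, 40, 50]
```

Because the script counts epochs from 1, the first reported point sits at iteration len(train_loader) rather than 0.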

Complete code for section 2.1 is collected below.

# package installation
!pip install clearml torch torchvision

# experiment setting
from clearml import Task, Logger
Task.set_credentials(
     api_host="https://api.clear.ml",
     web_host="https://app.clear.ml",
     files_host="https://files.clear.ml",
     key='<your access key>',
     secret='<your secret key>'
)
task = Task.init(project_name='<your project name>', task_name='<your task name>')

# Start the experiment execution here...
!python torch_script.py # (Make sure that torch_script.py is under current directory)

2.2 Remote execution

One of the vital concepts of ClearML is the status of an experiment, which matters for remote execution. Remote execution means using an agent to run a ‘Draft’ task or a queue of ‘Pending’ tasks.

Process of changing the status of an experiment

A draft task comes from (a) cloning, (b) resetting, or (c) Task.create(). Note that cloning creates a new task based on the parent task, while resetting reuses the same task. Both cloning and resetting can be done in the Web UI (right-click on the task) or with simple code (Task.clone() and Task.reset()). The following sections use Task.create() to demonstrate the flow from creating a draft to running the experiment.
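
For comparison with Task.create(), the cloning route could look like the sketch below in code. It assumes the ID of an existing task is known. `clone_and_enqueue` is a hypothetical helper, and the Task class is passed in as a parameter only so the function can be exercised without a server; in a notebook it would be `clearml.Task`.

```python
def clone_and_enqueue(task_cls, source_task_id, queue_name="default", new_name=None):
    """Clone an existing task into a new draft and put the draft in a queue.

    task_cls is expected to behave like clearml.Task (get_task, clone, enqueue).
    """
    source = task_cls.get_task(task_id=source_task_id)
    draft = task_cls.clone(source_task=source, name=new_name)  # new task in 'Draft'
    task_cls.enqueue(draft, queue_name=queue_name)  # status becomes 'Pending'
    return draft


# Intended use in a notebook (needs valid credentials, and an agent to pick it up):
#   from clearml import Task
#   clone_and_enqueue(Task, "<existing task id>", queue_name="default")
```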

2.2.1 Agent setup

Install and initialize an agent.

!pip install clearml-agent
!clearml-agent init # interactive mode

The command asks a series of questions, which we answer step by step. These questions verify the credentials and ask for some Git information, which can be skipped. The marked lines in the picture are the questions, and the notes next to them are the recommended answers (‘Enter’ means that leaving the answer blank is acceptable).

Initializing the ClearML agent stores a configuration file called clearml.conf in /root. If we set up the agent on a local machine, the path of clearml.conf depends on the operating system (ref). Some settings, such as the reuse of virtual environments and the name of the agent, can be modified directly in clearml.conf.
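
To check where the file ended up, a small sketch like the one below can help. It rests on two assumptions: that the CLEARML_CONFIG_FILE environment variable, when set, points at the configuration file, and that the default location is otherwise clearml.conf in the user's home directory (which is /root in Colab). `clearml_conf_path` is a hypothetical helper, not a ClearML function.

```python
import os
from pathlib import Path


def clearml_conf_path(env=None, home=None):
    """Best-guess location of clearml.conf under the assumptions stated above."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    override = env.get("CLEARML_CONFIG_FILE")
    return Path(override).expanduser() if override else home / "clearml.conf"


path = clearml_conf_path()
print(path, "(exists)" if path.exists() else "(not created yet)")
```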

2.2.2 Create and enqueue a task

For remote execution, a requirements file helps install the required packages. We continue to use the script ‘torch_script.py’ from section 2.1; with the script and the requirements file in the current directory, we run the code below to create and enqueue a task. When creating the task, the example overrides the default hyperparameters epochs and batch_size by passing ‘argparse_args’ a list of (name, value) tuples.

Files in current directory (left) and the content of requirements.txt (right)
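
The screenshot itself is not reproduced here. As a rough sketch, a requirements file for this script could list the three third-party packages it imports; the unpinned package list below is an assumption rather than the exact file from the screenshot, and `write_requirements` is a hypothetical helper.

```python
from pathlib import Path

# Assumed package list: the three third-party packages torch_script.py imports.
REQUIRED_PACKAGES = ["clearml", "torch", "torchvision"]


def write_requirements(path="requirements.txt", packages=REQUIRED_PACKAGES):
    """Write one package name per line and return the path written."""
    target = Path(path)
    target.write_text("\n".join(packages) + "\n")
    return target


# Intended use before creating the task:
#   write_requirements()  # creates requirements.txt in the current directory
```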

# Creation
task = Task.create(project_name='<your project name>',
                   task_name='<your task name>', 
                   script='torch_script.py',  # execution script
                   requirements_file='requirements.txt',  # required packages
                   argparse_args=[("epochs",2),("batch_size",128)]  # args modification
)

# Queuing
Task.enqueue(task,queue_name = "default") # designate the queue where the task is waiting

We can replace this code with a single command:

!clearml-task --project <your project name>\
              --name <your task name>\
              --script torch_script.py \
              --args epochs=2 batch_size=128\
              --requirements requirements.txt\
              --queue default

Now the task is in the pending status.

2.2.3 Run experiments by agent

An agent builds a virtual environment or runs a Docker container to execute the experiments. Since Docker mode has additional prerequisites and its settings can be overridden when creating the task (section 2.2.2), the code below demonstrates the virtual-environment mode.

# Activate the agent to execute the tasks waiting in the queue, default.
!clearml-agent daemon --queue default

Now the task is running.

Note that the agent keeps running even after the tasks are completed. That is, we need to manually interrupt the code cell that started the agent if we want to execute other cells in Colab.
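
One possible workaround, offered as a sketch rather than as part of the workflow above, is to launch the same command as a child process from Python, so the cell returns immediately and the agent can be stopped later. `agent_command` and `start_agent` are hypothetical helper names; the command they build is the one shown above.

```python
import subprocess


def agent_command(queue="default"):
    """Argument list equivalent to `clearml-agent daemon --queue <queue>`."""
    return ["clearml-agent", "daemon", "--queue", queue]


def start_agent(queue="default"):
    """Launch the agent as a child process so the notebook cell returns at once."""
    return subprocess.Popen(agent_command(queue))


# Intended use (requires clearml-agent to be installed and configured):
#   agent = start_agent("default")  # later cells can run while the agent polls the queue
#   agent.terminate()               # stop the agent once the tasks are done
print(" ".join(agent_command()))
```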

Complete code for section 2.2 is collected below.

# package installation
!pip install clearml clearml-agent

# experiment setting
from clearml import Task, Logger
Task.set_credentials(
     api_host="https://api.clear.ml",
     web_host="https://app.clear.ml",
     files_host="https://files.clear.ml",
     key='<your access key>',
     secret='<your secret key>'
)
task = Task.create(project_name='<your project name>',
                   task_name='<your task name>', 
                   script='<script path>',  
                   requirements_file='<requirements file path>',  
                   argparse_args= <the list of tuple pairs to edit hyperparameters>
)

# Queuing
Task.enqueue(task,queue_name = "<queue name>") 

# Initializing and activating the agent
!clearml-agent init
!clearml-agent daemon --queue <queue name>