Intelligent Cloud — Part 1: Introduction to Lunit’s Cloud Deep Learning Platform for Efficient Model Training

HyunJae Lee
Published in Lunit Team Blog · Apr 1, 2023

Introduction

At Lunit, we train a large number of deep learning models on a daily basis. This process is highly computationally intensive, requiring powerful hardware to keep up with the demands of model training. While we originally relied on on-premise servers to provide the necessary resources, these servers often encountered various issues. To better support our expanding research team, we decided to migrate our training infrastructure to the cloud. However, we soon realized that training models on the cloud required numerous manual processes, which were time-consuming and prone to errors.

To enable our team to fully leverage the advantages of cloud computing and efficiently train models at scale, we decided to develop a deep learning training platform, INtelligent CLoud (INCL). This platform has been a game-changer for our research process, providing the necessary resources and automating manual processes, allowing researchers to focus on developing new models and algorithms. In this blog post, we will delve into why we chose to migrate to the cloud and how INCL has enhanced our research capabilities.

Image source: eginnovations

Why did we migrate to the cloud?

There were several issues with on-premise servers, but two of the most important were scalability and hardware reliability.

Scalability Issues
One of the most significant issues with on-premise servers is that they can be challenging to scale. As the number of team members increases, so too does the need for more resources. Furthermore, we frequently face situations where there is a sudden surge in demand for training resources, particularly close to deadlines for product development or paper writing. In an on-premise environment, scaling up can be difficult and expensive. It often requires purchasing new hardware, which is costly and time-consuming. Additionally, it can be challenging to predict how much hardware future projects will need, which can lead to either under- or over-investing in resources.

Hardware Issues
Another issue with on-premise servers is that they often suffer from hardware problems as the servers age. This can be particularly problematic for deep learning applications, which require powerful hardware such as GPUs to perform computationally intensive tasks. Over time, the GPUs may become outdated or start to malfunction, leading to reduced performance or even hardware failure. This can result in significant downtime and maintenance costs for organizations.

Requesting server maintenance due to hardware issues

Cloud Computing to the Rescue!
Cloud migration solves the aforementioned issues by letting us take advantage of powerful computing resources on demand. Cloud providers offer access to powerful GPUs, and users only pay for what they use, eliminating the need for significant upfront hardware investments. Additionally, cloud computing provides nearly limitless scalability, making it easy to expand resources as needed.

New challenges in the cloud environment

While the cloud offers significant advantages over on-premise servers, training a model on the cloud still requires many manual processes. A typical process for training a deep learning model includes the following steps (a rough scripted version is sketched after the list):

  1. Creating a virtual machine
  2. Setting up environment for training (e.g. data, code, etc.)
  3. Training a deep learning model
  4. Saving outputs to cloud storage
  5. Deleting the virtual machine
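
For a sense of what this looks like in practice, below is a rough Python sketch of these steps scripted by hand against GCP. The instance name, zone, bucket, and image settings are hypothetical placeholders; the gcloud and gsutil commands are standard, but a real workflow needs far more error handling than shown here.

import subprocess

def sh(cmd):
    """Run a shell command and stop if it fails."""
    subprocess.run(cmd, shell=True, check=True)

INSTANCE = "train-vm-01"   # hypothetical instance name
ZONE = "us-central1-a"     # hypothetical zone

# 1. Create a virtual machine with GPUs attached.
sh(f"gcloud compute instances create {INSTANCE} --zone={ZONE} "
   "--machine-type=n1-standard-32 "
   "--accelerator=type=nvidia-tesla-t4,count=4 "
   "--image-family=pytorch-latest-gpu "
   "--image-project=deeplearning-platform-release "
   "--maintenance-policy=TERMINATE")

# 2. Set up the environment: copy code (and fetch data) onto the VM.
sh(f"gcloud compute scp --recurse ./my_experiment {INSTANCE}:~/ --zone={ZONE}")

# 3. Train the model.
sh(f"gcloud compute ssh {INSTANCE} --zone={ZONE} "
   "--command='cd my_experiment && python cifar.py --lr 0.1'")

# 4. Save outputs to cloud storage.
sh(f"gcloud compute ssh {INSTANCE} --zone={ZONE} "
   "--command='gsutil cp -r outputs gs://my-bucket/experiments/cifar'")

# 5. Delete the virtual machine so it stops billing.
sh(f"gcloud compute instances delete {INSTANCE} --zone={ZONE} --quiet")

Multiply this by hundreds of experiments and the appeal of automation becomes obvious.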

While training one model might be manageable, training hundreds of models this way involves a significant amount of manual work. In addition, new challenges arise when training deep learning models on the cloud, such as:

  • Managing the results of experiments
  • Monitoring the experiment logs
  • Handling errors
  • Scaling infrastructure for multi-node distributed training

As manual processes on the cloud are time-consuming and error-prone, we needed a more efficient solution that could automate these tedious tasks and provide additional features to enhance the training process.

While there exist open-source platforms that handle similar problems, such as Spotty, we found that they lacked some of the important features we required. For example, they typically only support cloud storage for data, which can be too slow for medical images and becomes a bottleneck during training. Additionally, many platforms did not support multi-node distributed learning. We also required features like GPU quota management, low-GPU-utilization detection, and support for team collaboration, which were not available on most platforms we tried. That's why we developed our own deep learning training platform, INtelligent CLoud (INCL).

INtelligent CLoud (INCL) Platform

INCL is designed to enable deep learning practitioners to make the most of the cloud computing environment and efficiently train models at scale. To achieve this goal, INCL offers the following features:

  • Automation of the manual processes involved in training on the cloud
  • Experiment tracking and management, making it easy to monitor and analyze results
  • Easy multi-node distributed learning, which speeds up training and enables the large batch sizes required for self-supervised learning
  • Automated hyperparameter optimization, reducing the need for manual tuning and increasing training efficiency

Let’s take a closer look at each feature!

Automating manual processes
INCL is designed to make training deep learning models on the cloud as effortless as possible. By automating the manual processes involved in training, researchers can focus on developing new models and algorithms. So how can we run an experiment through INCL? All you need to do is specify the script that runs your experiment, the Docker image, and the instance information; INCL takes care of the rest. This can be done easily, as in the example below.

cd ${PATH_TO_YOUR_EXPERIMENT_REPO}

# Arguments: run script, experiment name, Docker image, GPU type and
# count, instance type, and disk size (GB) required for the experiment.
incl run --script "python cifar.py --lr 0.1" \
  --name cifar100-lr-0.1 \
  --docker-image pytorch/pytorch:1.12.0 \
  --gpu-type t4 --gpu-num 4 \
  --machine-type n1-standard-32 \
  --disk-size 20

INCL uploads the code from the current working directory, spins up the instance, sets up the environment, and runs the experiment without requiring any manual intervention. Moreover, it handles the various errors that can occur during this process, freeing researchers to focus on more important tasks.

Experiments tracking and management
Experiment tracking and management are essential when training a model on the cloud, especially when working on complex deep learning projects. Without a proper system in place, it can be challenging to keep track of the experiments, manage them efficiently, and monitor their progress. INCL understands this need and provides a comprehensive experiment tracking and management system that is accessible through both the command-line interface (CLI) and web UI.

Detailed information, logs, metric graphs, and files of an experiment are easily manageable in INCL

INCL’s experiment management system provides users with a clear worklist of their experiments, including their status, configuration, and other important details. It also lets users monitor experiment logs, and it automatically saves the results of each experiment. Users can download the results at any time, making them easy to share with others. INCL also provides real-time updates on an experiment’s progress and sends notifications, including Slack messages, whenever its status changes.

The INCL worklist allows users to easily manage their experiments

INCL’s experiment tracking system includes visualization of live metrics, making it easy for researchers to track the model’s performance during the training process. This feature provides the ability to monitor various performance metrics, such as loss, accuracy, and other custom metrics, making it easy to assess the model’s performance and make necessary adjustments.

Various experiments can be easily compared using the INCL analyzer

Finally, INCL’s experiment tracking system includes a model registry feature, making it easy to manage models throughout their lifecycle. The system allows researchers to register models, add metadata, and keep track of versions, making it easy to manage and deploy models in production. All these features make INCL an ideal platform for deep learning researchers looking to streamline their work and improve their productivity.

Multi-node distributed learning
Multi-node distributed learning is a powerful technique for accelerating deep learning training times and is particularly useful when dealing with large datasets or complex models. By distributing the computation across multiple nodes, researchers can take advantage of parallel processing to speed up training times and tackle larger problems. Moreover, in recent years, self-supervised learning has emerged as a promising approach for training deep learning models. It requires training on large amounts of data and a large batch size is often necessary for good performance. Multi-node distributed learning is an essential tool for enabling this type of training, as it allows researchers to scale up the batch size and process large amounts of data efficiently.

Distributed learning accelerates deep learning training times (from arxiv)
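
To make the batch-size argument concrete: with a per-GPU batch of 64, going from one 8-GPU node to four such nodes raises the effective batch from 512 to 2048, and a common heuristic (the linear scaling rule of Goyal et al., 2017) scales the learning rate proportionally. The numbers below are purely illustrative, not INCL defaults.

# Illustrative arithmetic for multi-node batch sizes (hypothetical numbers).
num_nodes = 4
gpus_per_node = 8
batch_per_gpu = 64
effective_batch = num_nodes * gpus_per_node * batch_per_gpu  # 2048

base_lr = 0.1      # learning rate tuned at a reference batch size
base_batch = 256   # the reference batch size
# Linear scaling rule: grow the learning rate with the batch size.
scaled_lr = base_lr * effective_batch / base_batch  # 0.8
print(effective_batch, scaled_lr)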

INCL provides a seamless way to use multi-node distributed learning, making it easier for researchers to focus on developing new models and algorithms rather than worrying about the technical details of training on the cloud. All you need to do is specify the number of nodes you want to use to scale your experiments. INCL automatically provisions and sets up the environment required for distributed learning, and sets the necessary environment variables. You can simply pass these variables to your running script, and INCL will handle the rest. This means that you can easily run large-scale distributed learning experiments with a simple command like the one below.

# Single quotes keep the $VARIABLES from expanding locally;
# INCL sets them on each node before running the script.
incl run --node-num 4 \
  --name multi-node-experiment \
  --script 'python -m torch.distributed.launch \
    --nnodes=$NUM_NODES \
    --nproc_per_node=$NUM_GPU_PER_NODE \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py'

Automated hyperparameter optimization
Automated hyperparameter optimization (HPO) greatly reduces the manual work involved in training deep learning models. Traditionally, hyperparameters are tuned by hand, which is a time-consuming and labor-intensive task. With automated HPO, INCL searches through a range of hyperparameters to find the optimal configuration, freeing up researchers' time to focus on developing new models and algorithms. By automating HPO, INCL maximizes the efficiency of training deep learning models on the cloud, making it easier for researchers to achieve state-of-the-art results.

Multiple charts are provided to enable meaningful insights into hyperparameter optimization

Among various HPO algorithms, INCL’s HPO module is built on top of the Tree-structured Parzen Estimator (TPE) algorithm, which has proven highly effective. INCL further improves upon the original TPE algorithm to boost performance, which will be covered in a later post. INCL also allows hyperparameter optimization to be parallelized, reducing the time it takes to find the optimal set of hyperparameters: the evaluation of hyperparameter candidates is distributed across multiple nodes, which significantly speeds up the process compared to running the evaluations sequentially.
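
INCL’s TPE improvements are in-house, but the underlying idea is easy to try with an off-the-shelf implementation. The sketch below uses Optuna’s TPESampler on a toy objective purely to illustrate TPE-style search and parallel trials; this is Optuna’s API, not INCL’s.

# TPE-based search illustrated with Optuna (not INCL's implementation).
import optuna

def objective(trial):
    # Sample hyperparameters from the search space.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    optim = trial.suggest_categorical("optim", ["adam", "sgd"])
    # Toy stand-in for a training run that returns a validation loss.
    return (lr - 0.01) ** 2 + (0.1 if optim == "sgd" else 0.0)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler())
# n_jobs > 1 evaluates trials in parallel threads; INCL applies the same
# idea at larger scale by distributing trials across cloud nodes.
study.optimize(objective, n_trials=128, n_jobs=8)
print(study.best_params)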

INCL also allows researchers to easily specify the hyperparameter space, providing flexibility in the optimization process. All you need to do is define the search space, as in the example below; INCL will inject each suggested hyperparameter as an environment variable with the name you specified. To utilize INCL’s HPO feature, only a minor code modification is required: log the metric you want to optimize, under the name given in the configuration file, using the incl.log method. This records the metric value during the experiment, which INCL uses to search for the optimal hyperparameters. With this straightforward modification, researchers can leverage the power of automated HPO without writing complex code or setting up additional infrastructure.

hpo_meta:
  name: cifar-hpo
  num_total_trial: 128
  num_parallel: 8
  metric:
    name: loss
    goal: minimize
  search_space:
    - name: lr
      distribution: log_uniform
      min: 1e-5
      max: 1e-1
    - name: optim
      distribution: categorical
      values:
        - adam
        - sgd

job_meta:
  script: "python cifar.py --learning-rate=${lr} --optimizer=${optim}"
  gpu_type: t4
  gpu_num: 1
  machine_type: n1-standard-4
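
On the code side, the only change is reporting the optimization metric. A hypothetical sketch of the modified cifar.py is shown below; the incl.log call shape is assumed from the description above, and the training step is a toy stand-in so the example stays self-contained. The metric name must match hpo_meta.metric.name ("loss") in the config.

# cifar.py -- hypothetical sketch of the metric-logging change for HPO.
import argparse
import incl  # INCL's client library; incl.log is described in this post

def train_one_epoch(lr, optimizer_name):
    # Toy stand-in for real training; returns a fake validation loss.
    return (lr - 0.01) ** 2 + (0.1 if optimizer_name == "sgd" else 0.0)

parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, required=True)
parser.add_argument("--optimizer", choices=["adam", "sgd"], required=True)
args = parser.parse_args()

for epoch in range(10):
    val_loss = train_one_epoch(args.learning_rate, args.optimizer)
    # Log the metric named in hpo_meta.metric so INCL's search can
    # score this trial and propose the next set of hyperparameters.
    incl.log("loss", val_loss)  # call shape is an assumption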

How INCL Improved Our Research Process

INCL has revolutionized the research process for our team. With INCL’s streamlined process for running, tracking, and managing experiments, every researcher can quickly and easily set up and run experiments. This has greatly reduced the time and resources required for managing infrastructure and allowed researchers to focus more on developing new models and algorithms.

When approaching product development deadlines, INCL has let us scale up to hundreds of GPUs to accelerate training. The easy-to-use multi-node distributed learning feature has also been widely adopted; it has enabled us to train models on larger datasets and with larger batch sizes, leading to significant improvements in performance. Automated hyperparameter optimization with INCL has also been a game-changer for our team: using INCL’s HPO feature, we have achieved state-of-the-art performance on several products with minimal manual effort.

Overall, INCL has transformed the way our team conducts research by providing an easy-to-use and powerful platform for managing infrastructure, running experiments, and optimizing hyperparameters. It has enabled us to be more productive and achieve better results in less time.

Conclusions

Migrating to the cloud has been a critical decision for our research team, allowing us to address the challenges of scaling our training infrastructure. We have found that cloud computing provides us with the necessary computational resources and flexibility to support our expanding research team. Moreover, our development of INCL has further enhanced our capabilities, providing an easy-to-use and efficient platform for managing infrastructure, running experiments, and optimizing hyperparameters. There are still many areas where we could improve the deep learning process at Lunit. If you’re passionate about building and optimizing deep learning systems and want to work on cutting-edge technology that’s making a big impact in the industry, consider joining Lunit’s team!

Explore the full spectrum of our Intelligent Cloud (INCL) series. You can easily navigate through the entire series here.
