Intelligent Cloud — Part 2: A Deep Dive into the Architecture and Technical Details of INCL

HyunJae Lee
Published in Lunit Team Blog
Apr 2, 2023 · 10 min read

Introduction

In our previous blog post, we discussed how INCL has enhanced our research capabilities. In this post, we’ll dive into the architecture of INCL and explain how it works.

INCL has been designed with several key components to ensure scalability, high performance, and a user-friendly experience. It provides a distributed and fault-tolerant training environment that can handle a large number of experiments at the same time. Moreover, INCL automates tedious tasks with minimal code modification and configuration required. Our focus is to provide a flexible, efficient, and user-friendly platform for deep learning research on the cloud. In the following sections, we will explore the technical details of each component and how they work together to provide a seamless and scalable deep learning training environment.

Workflow of Running an Experiment in INCL

INCL consists of three main components: the INCL client, the INCL backend, and job instances. The client communicates with the backend to run, manage, and track experiments, while the backend handles requests from clients and manages the resources needed to run experiments. Each experiment runs on a job instance, a virtual machine provisioned by the backend.

Note that the term “experiment” is used interchangeably with “job” throughout this post. A job in INCL is not limited to training a model; it can also be running inference with a trained model, or any other script that could run on an on-premise machine. This flexibility allows researchers to easily adapt the platform to their specific needs.

Before diving into the technical details of INCL’s architecture, it’s important to understand the workflow of running an experiment. This high-level overview provides an insight into the underlying architecture and how the different components work together to provide a seamless and scalable environment. A general workflow of running an experiment in INCL is as follows:

A general workflow of running an experiment in INCL
  1. The client requests a job run from the INCL backend. A user requests a job run, specifying the script and the required resources (a hypothetical job specification is sketched after this list).
  2. Experiment code is uploaded to object storage. Code in the working directory is synced to object storage to be run on the cloud.
  3. The backend provisions a virtual machine and sets up the environment. Based on the resource requirements specified in the job request, the backend provisions a job instance and sets up the environment by installing the required software packages and dependencies.
  4. The job instance downloads the experiment code from the object storage. Once the job instance is ready, it syncs the experiment code from object storage to block storage, which offers higher performance.
  5. The model is trained on the job instance. With the experiment code and input data in place, the job instance starts training the deep learning model. The training and validation data are stored in file storage, which is mounted to the job instance as a network file system.
  6. After the training is finished, experiment results are synced to object storage. Once the training is completed, the output data (such as trained model checkpoints) are uploaded to the object storage for further analysis or deployment.
  7. The job instance reports termination to the backend. Finally, the virtual machine reports back to the backend that the job has completed, and the backend deallocates the resources and schedules a new job.
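To make step 1 a bit more concrete, here is a minimal sketch of what a job-run request might contain. The field names below are illustrative assumptions, not INCL's actual job schema:

import pprint

# A hypothetical job-run request. Field names are illustrative
# assumptions; the real INCL job schema is not shown in this post.
job_request = {
    'name': 'resnet50-cifar10-baseline',
    'script': 'python train.py --config configs/resnet50.yaml',
    'image': 'lunit/pytorch:1.13-cuda11.6',  # Docker image to run in
    'resources': {
        'gpus': 4,         # number of GPUs on the job instance
        'cpus': 16,        # vCPU cores
        'memory_gb': 64,
    },
    'spot': True,          # allow the scheduler to use a spot instance
}
pprint.pprint(job_request)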

The components of INCL work together seamlessly to create an efficient and effective cloud-based training environment. In the next section, we’ll examine each component in detail and explore how they interact with each other to provide a comprehensive view of INCL’s architecture.

Key components of the INCL architecture

INCL Client
The INCL client is the interface that users interact with to run, monitor, and track their experiments. It provides two interfaces: a command-line interface (CLI) and a web user interface (UI). The CLI is built using Click, a Python package for creating command-line interfaces. It allows users to interact with the INCL backend from the command line, for tasks such as submitting jobs, checking job status, and retrieving results.

Example command-line interface (CLI) usage in INCL
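As a rough illustration of how a Click-based CLI like this can be structured, the sketch below defines a hypothetical `incl run` command. The command and option names are assumptions made for illustration, not the actual INCL CLI:

# A minimal sketch of a Click-based CLI in the spirit of INCL's.
# The command and option names here are hypothetical.
import click

@click.group()
def cli():
    """Entry point for the (hypothetical) incl command."""

@cli.command()
@click.argument('script')
@click.option('--gpus', default=1, help='Number of GPUs to request.')
@click.option('--spot/--on-demand', default=True,
              help='Whether the job may run on a spot instance.')
def run(script, gpus, spot):
    """Submit SCRIPT as a new job to the INCL backend."""
    click.echo(f'Submitting {script} with {gpus} GPU(s), spot={spot}')
    # ... send the job-run request to the backend's REST API here ...

if __name__ == '__main__':
    cli()

Click handles argument parsing, help text, and subcommand dispatch, which is what makes it convenient for scripting and automation on top of a REST backend.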

On the other hand, the web UI is built using React, MobX, and Material UI, which provide a modern and intuitive graphical user interface. Users can access the INCL web UI through a web browser and monitor, track and manage their jobs in real-time through rich visualizations.

Analyzing experiments through rich visualizations in INCL web UI

Having both a CLI and a web UI in INCL offers users flexibility in how they interact with the platform. The web UI provides a user-friendly interface for visualizing training progress and results through metric graphs, allowing users to easily monitor their experiments. On the other hand, the CLI provides a command line interface that can be used for scripting and automation, allowing users to handle INCL on a terminal and easily integrate it into their existing workflows. This combination of a web UI and a CLI allows users to choose the most convenient way to interact with INCL based on their specific use case and preferences.

Interaction between INCL client and backend

Communication between the INCL client and the backend happens through a set of RESTful APIs exposed by two servers. The API server handles requests such as job submission, status queries, and model registry operations; it is backed by an object storage system for code and models and a relational database for metadata. The logging server handles log and metric data in real time and is backed by a NoSQL database that stores the logs and metrics generated over the course of an experiment. In the following section, we will delve into each backend component and its responsibilities.
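To give a feel for this interaction, here is a minimal client-side sketch using the requests library. The endpoint paths, payload shape, and response fields are assumptions for illustration; INCL's actual API is not documented in this post:

# A hypothetical client-side call to the backend's REST API.
# URL, token, endpoints, and response fields are placeholders.
import requests

BASE_URL = 'https://incl.example.com/api/v1'          # placeholder URL
headers = {'Authorization': 'Bearer <access-token>'}  # placeholder token

# Submit a job (payload mirrors the job_request sketch shown earlier).
resp = requests.post(
    f'{BASE_URL}/jobs',
    json={'script': 'python train.py', 'resources': {'gpus': 4}},
    headers=headers,
)
job_id = resp.json()['id']

# Poll its status.
status = requests.get(f'{BASE_URL}/jobs/{job_id}', headers=headers).json()['status']
print(job_id, status)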

INCL Backend

The INCL backend, which runs on the Kubernetes cluster, serves as the backbone of the platform, responsible for managing and executing the various tasks associated with deep learning experiments. It is comprised of several key components, such as the API server, scheduler, job runner, and logging server, that work together seamlessly to provide an efficient and effective training environment.

The API server handles requests from the client, including job submission, status queries, and the model registry. It is built with the Django REST framework and is connected to the object storage system that stores code and models and the relational database that stores metadata. Its responsibilities include authenticating and authorizing users, serving model artifacts for jobs, storing and retrieving job metadata, handling status queries from the client, and serving the client-facing APIs.

The scheduler is responsible for managing the job queue to ensure that jobs run in the correct order. In addition, it manages both global and per-user resource quotas to distribute resources fairly and prevent monopolization. The scheduler also decides whether to run a job on an on-demand or spot instance based on resource usage and preemption patterns. Maximizing the utilization of spot instances can lead to significant budget savings, which we'll discuss in more detail in an upcoming blog post.
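While the exact policy is internal to INCL, a heavily simplified sketch of the kind of quota-aware selection loop described above might look like this (all names and the selection policy itself are assumptions):

# A simplified, hypothetical sketch of quota-aware job selection.
# The real INCL scheduling policy is more involved than this.
from dataclasses import dataclass

@dataclass
class Job:
    owner: str             # user who submitted the job
    gpus: int              # GPUs requested
    spot_ok: bool = True   # whether the job tolerates preemption
    use_spot: bool = False

def select_next_job(queue, user_usage, user_quota, free_gpus):
    """Pick the first queued job whose owner is within quota and
    whose GPU request fits the remaining cluster capacity."""
    for job in queue:  # queue is ordered by submission time
        within_quota = user_usage[job.owner] + job.gpus <= user_quota[job.owner]
        if within_quota and job.gpus <= free_gpus:
            # Prefer spot instances when the job tolerates preemption,
            # since they are significantly cheaper.
            job.use_spot = job.spot_ok
            return job
    return None  # nothing runnable right now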

The job runner is responsible for executing the job on a virtual machine and ensuring that it runs smoothly to produce the expected results. It uses Celery workers to execute tasks asynchronously and handle a large number of jobs efficiently, and is designed to be highly scalable and fault-tolerant. The job runner performs various tasks, including provisioning a virtual machine, copying code and model artifacts, installing dependencies, and setting up the environment. Additionally, it monitors the progress of each job and handles any errors that occur during execution to ensure the job completes successfully.
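Since the job runner is built on Celery, a minimal sketch of what one of its asynchronous tasks could look like is shown below. The broker URL, task body, and error type are placeholders, not INCL's actual implementation:

# A minimal sketch of running a job as a Celery task.
# Broker URL and the task body are placeholders.
from celery import Celery

app = Celery('incl_job_runner', broker='redis://localhost:6379/0')

class TransientError(Exception):
    """Placeholder for recoverable failures such as cloud API errors."""

@app.task(bind=True, max_retries=3)
def run_job(self, job_id: str):
    """Provision a VM, prepare it, and launch the user's job.

    The steps below stand in for the real work the job runner does:
    VM provisioning, code/artifact sync, dependency installation,
    and launching the user's script.
    """
    try:
        print(f'[{job_id}] provisioning VM')
        print(f'[{job_id}] syncing code and installing dependencies')
        print(f'[{job_id}] launching job script')
    except TransientError as exc:
        # Retry transient failures with a short back-off, which is one
        # way a Celery-based runner can stay fault-tolerant.
        raise self.retry(exc=exc, countdown=30)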

Interaction between API server, scheduler, job runner and job instance

The API server, scheduler, and job runner interact as a tightly integrated system for managing jobs. When a user initiates a new job, the API server adds it to the job queue. The scheduler selects jobs from the queue based on resource usage and quotas and sends each selected job to the job runner, which runs it on a job instance. After the job finishes, the job instance reports its termination to the job runner, which then asks the scheduler to schedule a new job since the allocated resources have been freed. If a client requests to stop a running job, the API server directly asks the job runner to stop it; the job runner gracefully requests the job instance to stop and syncs the results. Throughout execution, the backend manages the job's lifecycle and informs the user of any status changes.

A lifecycle of an INCL experiment
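While the post does not enumerate the exact states, a lifecycle like the one described could be modeled with a simple state enum; the states below are illustrative assumptions:

# Hypothetical job states for the lifecycle described above.
from enum import Enum

class JobState(Enum):
    QUEUED = 'queued'                # waiting for the scheduler
    PROVISIONING = 'provisioning'    # job runner is setting up a VM
    RUNNING = 'running'              # training on the job instance
    SYNCING = 'syncing'              # uploading results to object storage
    COMPLETED = 'completed'
    STOPPED = 'stopped'              # stopped by a user request
    FAILED = 'failed'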

The logging server is responsible for handling log and metric data in real time. Training a deep learning model can generate heavy metric-write traffic, with a single model potentially requiring multiple metric writes per second. Given that nearly a hundred jobs can run simultaneously, this can amount to several hundred to a few thousand write requests per second. To handle this volume of log writes, we developed a dedicated logging server with Gin. By using a lightweight and highly efficient framework like Gin, we were able to create a logging server optimized for high concurrency and low latency, allowing it to handle large volumes of logs and metrics with ease. Job instances write logs and metrics to the logging server in real time during experiments, where they can then be accessed by users to track and analyze their models. By offloading logging from the main Django-based API server, we improved the performance, scalability, and maintainability of the application.
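On the job-instance side, a common way to keep such write volume manageable is to buffer metrics and flush them in batches. The sketch below illustrates this idea; the endpoint and payload format are assumptions, not INCL's actual logging protocol:

# A hypothetical batched metric writer. The endpoint and payload
# format are placeholders; INCL's logging protocol is not shown here.
import time
import requests

class MetricBuffer:
    def __init__(self, endpoint, flush_every=50):
        self.endpoint = endpoint
        self.flush_every = flush_every
        self.buffer = []

    def log(self, name, value):
        self.buffer.append({'name': name, 'value': value, 'ts': time.time()})
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.buffer:
            # One HTTP request per batch instead of one per metric.
            requests.post(self.endpoint, json={'metrics': self.buffer})
            self.buffer = []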

Job Instance
A job instance is a virtual machine that runs the job the user has requested. It includes a job executor, the INCL Python SDK, a metric collector, and a log collector. The job executor runs the job in the Docker image that the user has specified. The INCL Python SDK can be used within the job instance to log configuration details, experiment metrics, and model artifacts. With a few lines of code, it allows users to easily analyze and reproduce experiments later. For example:

import incl

# Register the experiment configuration so it can be tracked and
# reproduced later.
config = {
    'dataset': 'cifar10',
    'hyperparameter': {
        'lr': 0.001,
        'epochs': 20,
    },
}
incl.config.init(config)

# `trainloader` and `train` are the user's own data loader and
# training step, defined elsewhere in the experiment code.
for epoch in range(incl.config['hyperparameter']['epochs']):
    for step, data in enumerate(trainloader):
        loss = train(data)
        if (step + 1) % 1000 == 0:
            # Log the training loss to INCL every 1000 iterations.
            incl.log({'train/loss': loss})

The log collector plays an important role in providing users with visibility into the job's progress. It collects the output of the job executor, including any error messages or other diagnostic information generated by the job, allowing the user to quickly track progress and troubleshoot any issues that occur during execution. Meanwhile, the metric collector collects system-level performance data, such as CPU and GPU utilization. If the average GPU utilization falls below a specified threshold, the backend automatically requests the job instance to terminate. This prevents the instance from running indefinitely when a job hangs due to unexpected issues.
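As a rough sketch of how such a utilization watchdog could be implemented on the instance, here is an example using NVIDIA's NVML Python bindings. The threshold, sampling window, and termination reporting are assumptions:

# A hypothetical GPU-utilization watchdog using NVIDIA's NVML bindings
# (pip install nvidia-ml-py). Threshold and window are assumptions.
import time
from collections import deque
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

window = deque(maxlen=60)  # last 60 samples (~10 minutes at 10 s each)
THRESHOLD = 5              # percent average utilization

while True:
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    window.append(util)
    if len(window) == window.maxlen and sum(window) / len(window) < THRESHOLD:
        # In INCL, termination would be requested via the backend;
        # here we just stop the loop.
        print('GPU idle for too long; requesting termination')
        break
    time.sleep(10)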

Choosing the Proper Storage

Choosing the proper storage for each situation was an important consideration when designing the INCL architecture. There are three types of storage available: object storage, block storage, and file storage. Each type of storage has its own advantages and disadvantages, and it is important to understand the performance-cost trade-offs associated with each type of storage.

Image source: Google Cloud

Object storage is slow but cheap, and is therefore well suited for storing code and model artifacts that need to be kept long-term. While a job is running, the job instance uses block storage, which is faster but more expensive; this ensures that storage does not become a performance bottleneck during training. After training finishes, the job results, such as checkpoints, are synced from block storage back to object storage, and the block storage is deleted for cost efficiency.
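For example, syncing a checkpoint from block storage back to object storage could look like the sketch below, assuming Google Cloud Storage as the object store; the bucket and path names are placeholders:

# A hypothetical sync of results from the instance's block storage to
# object storage, assuming Google Cloud Storage. Names are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('incl-artifacts')  # placeholder bucket name

# Upload the final checkpoint from the job instance's block storage.
blob = bucket.blob('jobs/job-1234/checkpoints/last.ckpt')
blob.upload_from_filename('/workspace/checkpoints/last.ckpt')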

While block storage offers good performance, it limits the number of readers and writers that can access it simultaneously, which could constrain the scalability of the system. File storage, on the other hand, is more expensive but allows many readers and writers to access the storage concurrently. It is used for storing all the training data and is mounted to each job instance as a network file system. Users manage their training and validation data in file storage via the INCL client.

Various storages used in INCL architecture

Choosing the appropriate storage for each situation is important to ensure that the system operates efficiently and cost-effectively. INCL's storage choices were made carefully with this performance-cost trade-off in mind.

Conclusion

In conclusion, building a cloud-based deep learning platform is a complex task that requires careful consideration of various technical details and design choices. INCL has been designed with scalability, performance, and ease of use in mind. By addressing key technical challenges and providing a robust and scalable platform, INCL enables users to focus on their deep learning research and development without worrying about the underlying infrastructure. There are still many areas where we can improve the deep learning process at Lunit. If you're passionate about building and optimizing deep learning systems and want to work on cutting-edge technology that's making a big impact in the industry, consider joining Lunit's team!

Explore the full spectrum of our Intelligent Cloud (INCL) series. You can easily navigate through the entire series here:
