Determined — A Batteries-Included Deep Learning Training Platform

Focus on building models, not managing infrastructure

Angela Jiang
PyTorch
5 min read · Jan 28, 2021


Authors: Angela Jiang, Neil Conway

PyTorch is a great way to express the essential properties of a deep learning model: the model architecture, which optimizer to use, how to load data, and the training loop. But achieving success with deep learning requires tackling a host of additional problems, from scaling training to multiple machines, to running hyperparameter tuning, to tracking experiment results. It often falls to the model developer to stitch together and manage several tools to address these issues.

Determined AI provides an integrated deep learning training platform that offloads these common deep learning infrastructure problems, freeing up developers to focus on model development. By using Determined to train your model, you get easy access to core deep learning training functionality, including distributed training, hyperparameter tuning, GPU sharing, and experiment tracking. Determined integrates these features into an easy-to-use, high-performance deep learning environment and can be run on-premise or in the cloud on AWS and GCP, as well as on clusters running Kubernetes. Today, we'll walk through how Determined simplifies deep learning training and how to start training with Determined using PyTorch and AWS; the same approach applies to running on GCP or locally.

Determined’s integrated training platform makes building models fast and easy

Determined’s PyTorch Trial API: Separating model and training code

PyTorch is highly flexible, and that flexibility can lead to redundancy when expressing models that all need to do common tasks. Often, a model definition is mixed with lots of boilerplate code required to train the model. Determined helps solve this problem by providing the PyTorchTrial API, which lets users describe their model without the boilerplate. Determined eliminates this boilerplate by providing a state-of-the-art training loop with distributed training, hyperparameter search, automatic mixed precision, reproducibility, and many more features. This separation of model definition from training code has grown in popularity over the years, with other tools like PyTorch Ignite and PyTorch Lightning providing APIs similar to PyTorchTrial. And soon, you will be able to run PyTorch Lightning code directly on Determined!

Training using a Determined cluster

To use Determined, you can either reorganize your existing code using the PyTorchTrial API or start with one of our model examples. You'll then need to install Determined and deploy a Determined cluster locally or on the cloud. A Determined cluster consists of a master node, which dispatches tasks to agents and stores metadata, and a number of agent nodes that execute training; how many agents are launched depends on the cluster configuration and the training workload.

To train a model, users submit an experiment, consisting of a model definition and a configuration file, to the Determined master. The model definition uses our PyTorchTrial API to specify the model, the optimizer, what a single training and evaluation step looks like, and the training and evaluation DataLoaders. The configuration file is a YAML file with easily configurable options to pass to Determined, including training duration, data location, and hyperparameters.

Once an experiment is submitted, Determined handles the engineering challenges that arise for deep learning training. For instance, to allow easy scaling of training to multiple machines, Determined shards data across machines, runs Horovod and tracks experiment results, while ensuring fault tolerance and reproducibility. That means running distributed training is as easy as changing a line in a configuration file.

Getting started with Determined

Next, we’ll walk through the steps you’ll need to start training with Determined using PyTorch and AWS.

Remove the boilerplate from your PyTorch code

Here is an example of how to use our PyTorch Trial API to organize your code in a way that decouples the machine learning from boilerplate code and infrastructure management.

Make a configuration file

In a YAML configuration file, specify the location of your model definition, hyperparameters and configuration data.
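A minimal configuration file might look like the following. The experiment name, `entrypoint`, and hyperparameter names are illustrative, and the exact searcher fields can vary slightly between Determined versions.

```yaml
name: mnist_pytorch_const
entrypoint: model_def:MNISTTrial   # <file>:<class> of the trial definition
hyperparameters:
  global_batch_size: 64
  hidden_size: 128
  learning_rate: 0.001
searcher:
  name: single            # train a single trial with fixed hyperparameters
  metric: accuracy
  smaller_is_better: false
  max_length:
    batches: 1000         # how long to train
```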

Install and deploy Determined

Install Determined via pip. Determined can run locally or in the cloud (AWS, GCP). Here is an example of running on AWS. To run this example, you may need to increase your EC2 instance limits.
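For example (the package and command names below reflect the CLI at the time of writing, and the cluster ID and keypair name are placeholders):

```shell
# Install the Determined CLI and deployment tooling.
pip install determined-cli determined-deploy

# Launch a Determined cluster on AWS; assumes your AWS credentials
# are already configured.
det-deploy aws up --cluster-id my-determined-cluster --keypair my-aws-keypair
```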

Single GPU Training

That’s it! You’re ready to deploy an experiment using the CLI. Specify your configuration file and the directory where your model code is located.
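For instance, assuming your experiment configuration is saved as `const.yaml` and your model code lives in the current directory:

```shell
# Submit the experiment: the first argument is the config file, the
# second is the directory containing the model definition.
det experiment create const.yaml .
```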

Once your experiment is deployed, use the web UI to view TensorBoard outputs, run Jupyter notebooks, share your results with team members, or replicate an experiment. Code, checkpoints, models, logs, and configurations are automatically saved for you and available for download.

Determined experiment web UI

Distributed Training

Scaling training to multiple GPUs or machines is as simple as adding one configuration change. Determined automatically provisions machines and networking, efficiently distributes data loading, and provides fault tolerance. To train with 8 GPUs:
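Concretely, the change is a single stanza in the experiment configuration file:

```yaml
resources:
  slots_per_trial: 8   # train each trial across 8 GPUs
```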

Hyperparameter tuning

To optimize your model with hyperparameter tuning, specify the hyperparameters and the search algorithm in the configuration file. Determined parallelizes hyperparameter tuning experiments across machines in your cluster. To configure an HP tuning experiment with state-of-the-art ASHA:
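A sketch of such a configuration is below, replacing the fixed hyperparameters with search ranges and the `single` searcher with Determined's ASHA-based adaptive searcher. The hyperparameter names and bounds are illustrative, and field names may vary slightly across Determined versions.

```yaml
hyperparameters:
  global_batch_size: 64
  hidden_size:
    type: int
    minval: 64
    maxval: 512
  learning_rate:
    type: log            # search learning rates on a log scale
    base: 10
    minval: -4.0
    maxval: -1.0
searcher:
  name: adaptive_asha    # ASHA-based adaptive hyperparameter search
  metric: accuracy
  smaller_is_better: false
  max_trials: 16         # total hyperparameter configurations to explore
  max_length:
    batches: 1000
```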

Resource management

Determined also supports you as your team scales. Built-in resource management makes it easy to operate an on-premise cluster and improve your cluster's utilization. The cluster scheduler lets team members submit jobs and get scheduled onto GPUs using fair-share or priority scheduling. But unlike classical cluster schedulers like Slurm, Determined also offers first-class support for deep learning workloads, providing features like scheduling for hyperparameter search, pausing and restarting of long-running jobs, automatic fault tolerance, and seamless utilization of spot instances.
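As a rough sketch (the exact configuration keys vary by Determined version), the scheduling policy is chosen in the master configuration, and individual experiments can then request scheduling parameters:

```yaml
# In the master configuration (cluster-wide):
resource_manager:
  scheduler:
    type: priority       # or fair_share

# In an experiment configuration:
resources:
  priority: 10           # hypothetical value; semantics depend on version
```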

Next steps

We encourage you to give Determined a spin and try this tutorial out for yourself. Our quick start guide contains more details about this example. Also, look forward to new features that make training even easier, including PyTorch Lightning support and lightweight local training without a Determined cluster. If you have any questions or feedback, hop on our community Slack or visit our GitHub repository — we’d love to hear from you!
