Ray Tune at NERSC

Distributed Computing with Ray
Sep 11, 2020

by Mustafa Mustafa (NERSC), Brandon Wood (NERSC), Steven Farrell (NERSC) and Richard Liaw (Anyscale)

Displayed above are a few of the many materials and adsorbates being investigated at the National Energy Research Scientific Computing Center (NERSC) for catalysis applications.

Deep learning is posited to tackle many challenging problems in applied and fundamental sciences in the upcoming years, from accelerating cosmology simulations that help us understand the structure and content of the universe to designing new materials with desired properties. However, building such deep learning solutions requires a methodical approach to tuning the different free parameters associated with both the model and the learning algorithm — known as hyperparameters. Searching for the best hyperparameter configurations a) requires a lot of engineering of workflow pipelines to prepare, run, and evaluate each experiment, and b) can be computationally intensive (and mostly wasteful) and time-consuming if done naively, e.g. by doing a grid search over the space of hyperparameters. This process of selecting configurations and evaluating them to optimize performance on a task is known as hyperparameter optimization (HPO). It can be streamlined with good software engineering solutions and made more resource- and time-efficient with smarter search algorithms.

HPO frameworks are useful tools that automate the hyperparameter optimization process while making efficient use of available computing resources. The framework takes care of selecting hyperparameters to test and managing trials (a trial is the training of a model with one set of parameters): assigning resources and starting new trials, stopping unpromising ones, and evaluating finished ones. A good framework requires little workflow engineering from the user to do this trial management.

The National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab is preparing to receive its first GPU HPC system, Perlmutter, which will host over 6000 NVIDIA Ampere GPUs. The system is expected to enable a wide range of Department of Energy (DOE) Deep Learning for Science applications. A scalable HPO framework is imperative to deliver the maximum science capability of Perlmutter. To this end, Ray Tune has been deployed and tested on NERSC’s GPU development system, Cori-GPU. Below are some thoughts on the requirements we looked for in an HPO framework and our experience with Ray Tune in relation to meeting those requirements:

1. Slurm integration: it should be easy to integrate with Slurm to run parallel trials on multi-GPU nodes and to scale easily to any number of nodes.

2. Automatic scheduling: the framework should handle scheduling of trials without requiring the user to manage nodes and GPU bindings.

These two criteria are concerned with scheduling trials. Because Ray Tune is powered by Ray, it can manage any resources it is given with little boilerplate code from the user. We wrote minimal code to build a Ray cluster with Slurm. Our scripts are simple and generic enough that any of our users can use them to run Ray Tune in multi-node, multi-GPU mode.

3. Stop and resume: the framework should be able to pause and resume the HPO process.

This is very important for HPO campaigns that need to run longer than the Slurm job time limit allows. Ray Tune can seamlessly restart an HPO job and continue the training and search. Beyond enabling long campaigns, this also allows resuming the search with different resources; e.g., at the beginning of the search you might use many more nodes to try more experiments in parallel, then reduce the resources for later jobs.

4. Optimization: the framework should provide a wide variety of scheduling and parameter search algorithms, including the state-of-the-art and most popular ones.

Ray Tune allows users to choose among popular trial scheduling algorithms and natively supports many search libraries. Users can switch between state-of-the-art HPO techniques with a few lines of code.

5. Minimal code changes: the framework should have a clean API, requiring minimal boilerplate code to launch an optimization process while being expressive enough to define complex search spaces.

Ray Tune’s API allows users to quickly wrap their own training loop with a Ray Tune trainable class, and with a few lines of code one can start the optimization process. We particularly like the ability to nest search spaces.

In conclusion, Ray Tune is a promising, user-friendly, and scalable open-source solution for deep learning HPO. We are currently using it to optimize a graph CNN model for catalysis (Figure 1), a model that corrects under-resolved fluid dynamics simulations, and a model for emulating cosmological hydrodynamics. In the future, we envision Ray Tune will enable more deep learning for science on NERSC HPC systems.

Figure 1. Average time-to-solution comparing three HPO strategies: asynchronous successive halving, asynchronous HyperBand, and random search, all using the same compute resources (4 Cori GPU nodes for 8 hours). The number of completed trials for each strategy is shown in the upper right corner. For further details and discussion, please see this blog post.
