Developing novel deep learning models is an iterative, experimental process that can require thousands of training runs to find the right combination of neural network layer configurations and hyperparameters. This is a significant pain point for data science teams, because designing neural networks is hard enough on its own. Add the work of planning and managing training runs, and the result is an error-prone, tedious and time-consuming process.
Since a single training session (or experiment) can take hours to complete, data scientists typically spend much of their day setting up training runs for execution overnight, hoping to wake up to useful results the next morning. After each nightly batch of runs, the work begins again: the results of the experiment must be evaluated, the models and hyperparameters must be refined, and the whole training process must be repeated until it yields acceptable results. This may take weeks or even months of painstaking effort.
Deep Learning in Watson Machine Learning
Today, we are excited to announce the launch of Deep Learning as a Service in IBM Watson Studio.
With the new deep learning service, data scientists can design neural networks either manually, by coding with popular deep learning frameworks such as TensorFlow, Caffe, PyTorch and Keras; or visually, using Neural Network Modeler, a graphical interface that generates code from visually designed neural network structures.
Data scientists can use Watson Studio’s deep learning service to scale to hundreds or even thousands of training runs, while only paying for the resources they use. This eliminates the need to spend time and money on provisioning and managing machine instances, clusters and containers, which lets data science teams focus on the most interesting and valuable parts of the job.
Your Deep Learning Assistant
The deep learning additions to Watson Studio comprise a suite of tools called Experiment Assistant that manages and assists throughout your experimental workflow.
Experiment Assistant performs useful tasks like:
- Distributing your source code plus data across GPU-enabled containers to provide each training run with the selected NVIDIA® Tesla® GPU: K80, P100, or V100.
- Starting, tracking and then stopping your training runs so you only pay for the resources that you need to execute your jobs. No more starting machine instances and then being billed when you forget to shut them down.
- Tracking which hyperparameters were associated with each training run.
- Collecting the assets generated during training from each container and migrating them back to your Cloud Object Storage so all assets are in a single location for easy access.
- Extracting events from your training logs and visualizing them in Watson Studio.
You access these capabilities using your preferred tools, such as the command line tools and Python client, or the visual interfaces in Watson Studio (discussed later). This allows you to adapt the deep learning service to fit into your existing workflow.
Once you’ve defined and submitted your experiment, each training run is automatically started, monitored and stopped upon completion. Training history and assets are tracked, and results are automatically transferred to a designated Cloud Object Storage repository for quick access.
As a result, there’s no longer any need to stare at text logs to track training progress. Cross-model performance can be viewed in real time, and revisited later — providing both immediate insight and a full retrospective view of how models have evolved over time. This allows data scientists to focus on designing their neural networks, while the system handles the mechanics of the neural network training process.
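To make the log-extraction idea concrete, here is a minimal sketch of turning raw training log lines into per-metric series that could then be plotted. It is not Watson Studio code: the `key=value` log format, the `extract_events` helper and the sample log are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical log line format, assumed for illustration only:
#   "epoch=3 loss=0.41 accuracy=0.87"
METRIC_PATTERN = re.compile(r"(\w+)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def extract_events(log_lines):
    """Collect every numeric key=value pair into a per-metric series."""
    series = defaultdict(list)
    for line in log_lines:
        for name, value in METRIC_PATTERN.findall(line):
            series[name].append(float(value))
    return dict(series)


sample_log = [
    "INFO starting run",
    "epoch=1 loss=0.90 accuracy=0.61",
    "epoch=2 loss=0.55 accuracy=0.78",
    "epoch=3 loss=0.41 accuracy=0.87",
]

events = extract_events(sample_log)

# Index of the lowest loss, converted to a 1-based epoch number.
losses = events["loss"]
best_epoch = min(range(len(losses)), key=losses.__getitem__) + 1
```

Lines without numeric `key=value` pairs, such as the `INFO` line, are simply skipped, so the same function can be run over an unfiltered log.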
Neural Network Modeler
The Deep Learning service includes Neural Network Modeler, which provides an intuitive drag-and-drop, no-code interface for designing neural network structures. It speeds up the design process by avoiding the need to write and debug code by hand.
Neural networks can be exported as TensorFlow, Keras, PyTorch and Caffe code, as well as in JSON format for sharing in blogs and in code posted to GitHub.
A key challenge in deep learning is how to efficiently search a neural network’s hyperparameter space to achieve the best performance in the fewest training runs. Tuning hyperparameters manually typically results in long cycle times and sub-optimal results. IBM Watson Studio has a built-in Hyperparameter Optimization (HPO) feature, which automates the tuning process by programmatically training neural networks across a range of hyperparameters and using advanced optimization algorithms to select the best-performing models.
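As a rough illustration of what automated tuning means, the sketch below runs a random search, one of the simplest hyperparameter optimization strategies. It is not the algorithm used by Watson Studio's HPO feature, which the text describes only as "advanced optimization algorithms". The `validation_loss` function is a toy stand-in for a real training run, and the parameter names and ranges are assumptions.

```python
import random


def validation_loss(learning_rate, dropout):
    """Toy stand-in for a training run (illustrative only).

    Its minimum is at learning_rate=0.01, dropout=0.3.
    """
    return (learning_rate - 0.01) ** 2 * 1e4 + (dropout - 0.3) ** 2


def random_search(n_trials, seed=0):
    """Sample hyperparameters at random and rank the trials by loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {
            # Sample the learning rate log-uniformly between 1e-4 and 1e-1.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "dropout": rng.uniform(0.0, 0.6),
        }
        trials.append((validation_loss(**params), params))
    # Lowest loss first; sort on the loss only, since dicts are not orderable.
    trials.sort(key=lambda trial: trial[0])
    return trials


results = random_search(n_trials=50)
best_loss, best_params = results[0]
```

In a real experiment each call to `validation_loss` would be a full training run, which is why running trials in parallel and choosing them more cleverly than at random both matter.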
Distributed Deep Learning (Beta Release)
Today’s advanced neural networks have grown in complexity, and may require terabytes of data to train. Even on a powerful server with several GPUs, training cycles may take anywhere from hours to weeks. To accelerate experimentation rates, training must be distributed across multiple processors and multiple machines.
The distributed deep learning capability in Watson Studio is built upon IBM’s distributed deep learning technology and the latest open source framework technologies, such as TensorFlow’s native distributed training and Uber’s Horovod. This new beta feature coordinates computation across many servers, each with multiple GPUs. By distributing training runs across clusters that may contain hundreds of GPUs, this combination of technologies reduces model training times by orders of magnitude, and it also reduces the complexity of the code.
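The core idea behind data-parallel training can be sketched in a few lines: each worker computes a gradient on its own shard of the data, the gradients are averaged, and every worker applies the same update. The example below is a single-process simulation in plain Python, not Horovod, TensorFlow or IBM code; the one-parameter model, the data and all function names are assumptions chosen to keep the arithmetic easy to follow.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split(data, n_workers):
    """Split data into equal-size shards, one per simulated worker."""
    size = len(data) // n_workers
    return [data[i * size:(i + 1) * size] for i in range(n_workers)]


def distributed_step(w, shards, lr):
    """One synchronous data-parallel update."""
    # Each worker computes a gradient on its own shard...
    local_grads = [gradient(w, shard) for shard in shards]
    # ...then the gradients are averaged, the role an all-reduce plays.
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # generated with w = 3.0
shards = split(data, n_workers=4)

w = 0.0
for _ in range(200):
    w = distributed_step(w, shards, lr=0.01)
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, so the simulated workers follow the same path as single-machine training while each one touches only a quarter of the data per step.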
Take the next step