SUPERCOMPUTING FOR ARTIFICIAL INTELLIGENCE — 04
Train a Neural Network on multi-GPU with TensorFlow
Data parallelism approach using tf.distribute.MirroredStrategy
[This post will be used in the master course Supercomputers Architecture at UPC Barcelona Tech with the support of the BSC]
In the previous post we introduced the basic concepts of distributed and parallel training of Deep Learning models. We also presented the frameworks and infrastructure that we will use to scale the learning process of different neural networks for classifying the popular CIFAR10 dataset.
Now, we are going to explore how we can scale the training to multiple GPUs in one server with TensorFlow using tf.distribute.MirroredStrategy().
1. Parallel training with TensorFlow
tf.distribute.Strategy is a TensorFlow API for distributing training across multiple GPUs or TPUs with minimal code changes (relative to the sequential version presented in the previous post). This API can be used with a high-level API like Keras, and can also be used to distribute custom training loops.
tf.distribute.Strategy intends to cover a number of use cases along different axes. The official web page of…