
SUPERCOMPUTING FOR ARTIFICIAL INTELLIGENCE — 04

Train a Neural Network on multi-GPU with TensorFlow

Data parallelism approach using tf.distribute.MirroredStrategy

Jordi TORRES.AI · Published in TDS Archive · 8 min read · Nov 29, 2020


Marenostrum Supercomputer — Barcelona Supercomputing Center (image from BSC)

[This post will be used in the master course Supercomputers Architecture at UPC Barcelona Tech with the support of the BSC]

In the previous post we introduced the basic concepts of distributed and parallel training of Deep Learning models. We also presented the frameworks and infrastructure that we will use to scale the learning process of different neural networks for classifying the popular CIFAR10 dataset.

Now, we are going to explore how we can scale the training across multiple GPUs on a single server with TensorFlow using tf.distribute.MirroredStrategy().
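Before distributing anything, it is worth checking how many GPUs TensorFlow actually sees on the server. A minimal sketch (the device count printed will of course depend on your machine):

```python
import tensorflow as tf

# List the physical GPUs visible to TensorFlow on this server
gpus = tf.config.list_physical_devices('GPU')
print("Number of GPUs available:", len(gpus))
```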

1. Parallel training with TensorFlow

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs or TPUs with minimal code changes (with respect to the sequential version presented in the previous post). This API can be used with a high-level API like Keras, and can also be used to distribute custom training loops.
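To make this concrete, here is a minimal sketch of data parallelism with tf.distribute.MirroredStrategy and Keras on CIFAR10. The small convolutional network and the hyperparameters (base batch size of 64, 5 epochs) are only illustrative, not the exact model used later in this series: the point is that the model is built and compiled inside strategy.scope(), and model.fit() then takes care of splitting each batch across the replicas and all-reducing the gradients.

```python
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Load and normalize CIFAR10
(x_train, y_train), (x_test, y_test) = datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# MirroredStrategy replicates the model on every GPU visible on this server
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Scale the global batch size with the number of replicas (data parallelism)
batch_size = 64 * strategy.num_replicas_in_sync

# Model creation and compilation must happen inside the strategy scope
with strategy.scope():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Keras distributes the training transparently: each global batch is split
# across the replicas and the gradients are aggregated before each update
model.fit(x_train, y_train, batch_size=batch_size, epochs=5,
          validation_data=(x_test, y_test))
```

Note that, apart from creating the strategy and wrapping the model construction in its scope, this is the same Keras code as the sequential version.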

tf.distribute.Strategy intends to cover a number of distribution use cases along different axes. The official web page of…
