💥 Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups

Thomas Wolf
Oct 15, 2018 · 9 min read

How can you train your model on large batches when your GPU can’t hold more than a few samples?

⌛️ Large batches on one or several GPU(s)

Adam confirms your predicament! 😱 Oh no!
The 5-steps of a gradient descent optimization algorithm

😱 Pushing that to the extreme

A “Memory-poor” strategy that needs O(1) memory (but requires O(n²) computation steps) — From Yaroslav Bulatov’s nice post: https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9

🕰 Making the best of a multi-GPU machine

Forward and Backward passes with torch.nn.DataParallel
Number of elements in the output of a language model
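The `torch.nn.DataParallel` pattern from the figure takes one line to use — a hedged sketch, with a toy linear model and a batch of 32 standing in for a real setup:

```python
import torch

model = torch.nn.Linear(10, 5)
if torch.cuda.device_count() > 1:
    # Replicate the model on each GPU, scatter the batch along dim 0,
    # run the replicas' forward passes in parallel, then gather every
    # output tensor back on GPU 0 — which is why GPU 0 fills up first.
    model = torch.nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = torch.randn(32, 10, device=device)
outputs = model(inputs)  # shape (32, 5), gathered on the default device
```

The gather step is the catch: for a language model whose output is `batch * sequence_length * vocab_size` logits, collecting every replica's output on GPU 0 dominates its memory.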

⚖️ Balanced load on a multi-GPU machine

Using DataParallelModel and DataParallelCriterion
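`DataParallelModel` and `DataParallelCriterion` come from a third-party `parallel.py`, but a dependency-free way to get the same balancing — sketched below, assuming your criterion can be evaluated per replica — is to compute the loss inside the wrapped module, so each GPU reduces its own output shard and only a near-scalar loss travels back to GPU 0:

```python
import torch

class ModelWithLoss(torch.nn.Module):
    """Wrap model + criterion so DataParallel scatters the loss computation too."""
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, labels):
        outputs = self.model(inputs)
        # Each replica returns its own small loss instead of the full output
        # tensor, so GPU 0 no longer gathers every replica's logits.
        return self.criterion(outputs, labels)

# Hypothetical toy model and data.
model = ModelWithLoss(torch.nn.Linear(10, 2), torch.nn.CrossEntropyLoss())
parallel = torch.nn.DataParallel(model) if torch.cuda.device_count() > 1 else model
loss = parallel(torch.randn(32, 10), torch.randint(0, 2, (32,)))
# Under DataParallel, `loss` holds one value per GPU; average before backward.
loss.mean().backward()
```

The memory profile on GPU 0 then matches the other GPUs much more closely, which lets you raise the per-GPU batch size.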

⏰ Distributed training: training on several machines

The main server (server 1) has an accessible IP and an open port for communication.

🏃 Adapting our Python training script for distributed training
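The adaptation boils down to three additions: read the `--local_rank` that the launcher passes to each process, join the process group, and wrap the model in `DistributedDataParallel` with a `DistributedSampler` so every process trains on a disjoint shard. A hedged sketch (the toy model and random data are placeholders; the script only runs when started with the launch commands of the next section):

```python
import argparse
import torch
import torch.distributed as dist


def main():
    # torch.distributed.launch passes --local_rank to every process it spawns.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args()

    # Pin this process to its GPU and join the process group; with
    # init_method="env://" the master address/port come from the launcher.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    model = torch.nn.Linear(10, 2).cuda()
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank)

    # DistributedSampler gives each process a disjoint shard of the dataset.
    dataset = torch.utils.data.TensorDataset(torch.randn(128, 10),
                                             torch.randint(0, 2, (128,)))
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=8, sampler=sampler)

    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for inputs, labels in loader:
            loss = loss_fn(model(inputs.cuda()), labels.cuda())
            optimizer.zero_grad()
            loss.backward()  # gradients are all-reduced across processes here
            optimizer.step()


if __name__ == "__main__":
    main()
```

Unlike `DataParallel`, each process keeps its own model replica and optimizer; only gradients cross process (and machine) boundaries, during `backward()`.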

✨ Launching multiple instances of our Python training script

On server 1 (the master, node_rank 0):
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=1234 OUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of our training script)

On server 2 (node_rank 1):
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" --master_port=1234 OUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of our training script)

HuggingFace

Stories @ Hugging Face

Written by Thomas Wolf
Natural Language Processing, Deep learning and Computational Linguistics – Science Lead @Huggingface | thomwolf.io
