How to Scale BERT Training with Nvidia GPUs

Jonathan Hui
Published in NVIDIA · Jan 19, 2020 · 21 min read

In 2015, ResNet-50 and ResNet-101 were introduced with 23M and 45M parameters respectively. Fast forward to 2018, and the BERT-Large model has 330M parameters. Unfortunately, compute speed has not caught up, and it takes months to train the BERT-Large model with a single GPU. In this article, we discuss methods that scale Deep Learning training better. Specifically, we look into Nvidia’s BERT implementation to see how BERT training can be completed in as little as 47 minutes. We will focus on its principles, in particular the new LAMB optimizer that allows large-batch-size training without destabilizing the training. In addition, we will go through some of the specifics of Nvidia’s implementation.

Challenge

Before looking into the solution, let’s examine the scaling problem systematically.

Asynchronous

Deep learning (DL) training is largely based on the optimization of a cost function. Many optimization methods, like gradient descent, iterate: each step depends on the result of the previous one. This limits the possible parallelization.

Nevertheless, this sequential constraint can be slightly relaxed with asynchronous parameter updates, i.e. sequential iterations can be overlapped using slightly outdated model parameters. However, when asynchronous updates are applied alone, training accuracy suffers in experiments. Therefore, synchronous parameter updates remain more popular.

Better accuracy

Because of this sequential constraint, let’s focus on improving the training efficiency of each iteration. Many gradient descent methods apply approximation. By introducing the second-order derivative, we can improve accuracy by taking the curvature of the cost function into consideration. This approach shows promise in reinforcement learning. However, it also increases computational complexity significantly. So even though it is more accurate, its benefit to the overall training time in DL remains questionable.

Better parameter updates

Indeed, most DL training adopts a less complex approach by focusing on tuning the learning rate or introducing better parameter-update schemes based on the first-order derivative. The latter include momentum-based parameter updates and per-parameter adaptive learning rate methods like RMSProp. The popular Adam optimizer combines both concepts. In this article, we will detail and reapply some of these concepts to improve training efficiency.

Better parallelism

To speed up training, we can improve the parallelization in each iteration. There are two common approaches: model parallelism and data parallelism. In model parallelism, we partition a model into parts and run different parts on different GPUs. Nevertheless, the sequential nature of the model still restricts the possible parallelism.

Bigger batch size

In data parallelism, each GPU computes the gradient loss for different data. For example, with a batch size of 1024, we can use 16 GPUs, each responsible for 64 training samples. A larger batch size averages out gradient noise and hopefully produces better gradient descent. Nevertheless, to decrease the overall training time significantly, we also need a more aggressive learning schedule. In many DL training setups, this is achieved by increasing the learning rate. For example, after increasing the batch size by k times, we can also increase the number of GPUs by k times to keep the per-iteration training time constant. Then, we increase the learning rate by k times to speed up the training. Unfortunately, in 2014, Krizhevsky acknowledged that

But very big batch sizes adversely affect the rate at which SGD converges as well as the quality of the final solution.

Better Optimizer

Sometimes, we make a blanket statement that a large batch size hurts training. To be explicit, once we pass a certain batch size (problem-specific), we must up the ante on the learning scheme to match the aggressive descent. Otherwise, the model will not converge well. In this article, we will detail the scheme used to reduce the BERT training time to 47 minutes. In particular, Nvidia’s implementation switches from the Adam optimizer to the LAMB optimizer. This is the centerpiece in scaling the training to a much larger batch size. Figuratively, we will learn how to break the rule that nine women cannot make a baby in one month. But first, we need to understand some basic concepts and other training tips.

BERT Training Strategy

In this section, we will have a quick overview of the training. We assume you have a basic understanding of BERT; if not, please refer to this article.

BERT has two model sizes: BERT-Base and BERT-Large. The table below shows the number of layers and parameters.

Source

For example, the BERT-Base model uses 12 encoder layers, with each encoder having the design shown on the left below.

BERT training has two stages:

  • Pre-training to generate a generic dense vector representation for the input sentence(s), and
  • Fine-tuning to solve a DL problem like question and answer.

Pre-training is done with unlabeled data to produce a generic vector representation of the input sentence(s). It is in the fine-tuning stage that we need a labeled dataset specific to the corresponding DL problem. In terms of model design, we simply add one or two fully connected layers on top of the pre-trained model to produce the fine-tuning model.

Source

Since the BERT-large model achieves better accuracy, we will focus our discussion on this larger model in this article. This model is extremely large and companies use it to demonstrate how they scale the training.

General Tips in BERT training

Rome was not built in a day. Much DL training is done in phases. Some researchers gradually increase the complexity to make the training more stable, faster and/or to avoid nasty local minima. BERT uses 2 phases in pre-training. The first phase uses a shorter input sequence length of 128. The second phase uses fewer training steps but a longer sequence length of 512. Here is the documentation on running the pretraining script in Nvidia’s implementation.

run_pretraining_lamb.sh - Runs pre-training with LAMB optimizer using the run_pretraining.py file in two phases. Phase 1 does 90% of training with sequence length = 128. In phase 2, the remaining 10% of the training is done with sequence length = 512.

Here is the justification quoted from Google’s implementation:

Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128. The fully-connected/convolutional cost is the same, but the attention cost is far greater for the 512-length sequences. Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly.
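To see the arithmetic behind this quote, here is a quick back-of-the-envelope check (my own illustration, not from the implementation). Both batches contain the same number of tokens, so the fully-connected cost matches, while the attention cost grows with the square of the sequence length:

# Per layer, the feed-forward cost scales with the number of tokens,
# while self-attention scales with batch_size * seq_len^2.
def relative_costs(batch_size, seq_len):
    tokens = batch_size * seq_len           # drives the fully-connected cost
    attention = batch_size * seq_len ** 2   # drives the self-attention cost
    return tokens, attention

print(relative_costs(256, 128))   # (32768, 4194304)
print(relative_costs(64, 512))    # (32768, 16777216) -> same tokens, 4x attention cost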

Reuse of pre-trained models

Pre-training a BERT-Large model takes a long time and many GPU or TPU resources. It can be done on-prem or through a cloud service. Fortunately, there are pre-trained models available to jump-start the process. For example, we can load the Transformer with a pre-trained model first and then further pre-train it on a domain-specific corpus using a smaller learning rate. This transfer learning usually takes less time to complete.

For example, the shaded area in the following paragraph is autogenerated by a Transformer trained with a technical journal corpus.

Generated from source using GPT-2 model

BioBERT is another example, which uses PubMed abstracts and PMC full-text articles to further pre-train the BERT model. In the original research paper, this is done with biomedical corpora for 23 days on eight Nvidia V100 GPUs.

Source

The fine-tuning part can be completed in hours with a single GPU. Many fine-tuning runs can be stopped after 2 epochs.

Large Mini-Batch Size

The sequential dependency of the model’s operations and the iterative optimization methods limit how much parallelization is possible. If we cannot design a highly parallel system, we cannot scale the training by adding new GPUs. As discussed in the “Challenge” section, this is not easy.

We can increase the batch size to increase data parallelism. However, to shorten the training time, we also need an aggressive training schedule (a higher learning rate). Nevertheless, this strategy can easily backfire and destabilize training. In the next few sections, we will first go through some basic tips before looking into LAMB.

Linear Scaling Rule

When the mini-batch size n is multiplied by k, some theories suggest multiplying the starting learning rate η by the square root of k. However, experiments from multiple researchers show that linear scaling gives better results, i.e. multiply the starting learning rate by k instead.

Gradual Warmup Strategy

However, we don’t jump to this scaled learning rate immediately. Instead, we start with η and increment it by a fixed amount each step so that it reaches kη after a pre-defined number of warmup steps. This gradual warmup provides training accuracy similar to that of the smaller batch size, as shown below for ImageNet training (using the same number of training samples but fewer training steps).

Source

By applying these two techniques, we can push the mini-batch size up to 8K in ImageNet training without a loss of accuracy.

Source
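Here is a minimal sketch of how the two rules combine into a learning-rate schedule. The base rate, scaling factor k and warmup length are made-up values for illustration only:

# Linear scaling rule + gradual warmup (illustrative values only).
base_lr = 0.1          # learning rate tuned for the original (small) batch size
k = 32                 # batch size (and GPU count) increased k times
target_lr = k * base_lr
warmup_steps = 500     # pre-defined number of warmup steps

def learning_rate(step):
    # Ramp linearly from base_lr to k * base_lr during warmup,
    # then hand over to the normal decay schedule (omitted here).
    if step < warmup_steps:
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr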

Optimizer

To push the mini-batch size higher, we need to switch from the Adam optimizer to the LAMB optimizer. But first, let’s do a quick review of a few optimizers that LAMB borrows ideas from.

Momentum SGD

Conceptually, momentum SGD can be viewed as a weighted average of the current gradient and the previous gradients. This weighting reduces gradient noise. To reduce the effect of older updates, we decay their contributions by a factor m in every iteration (an exponentially weighted average). For example, v₃ = g₃ + m g₂ + m² g₁ + …

Modified from source
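A minimal sketch of the update, with m as the momentum coefficient and lr as the global learning rate (my own notation, not Nvidia’s code):

# Momentum SGD: the velocity is an exponentially weighted sum of past gradients.
def momentum_sgd_step(w, grad, velocity, lr=0.01, m=0.9):
    velocity = m * velocity + grad   # older gradients decay by a factor m each step
    w = w - lr * velocity            # move along the smoothed direction
    return w, velocity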

RMSProp

The major problem with vanilla DL optimizers is having a single learning rate for all model parameters. This one-size-fits-all solution cannot please every descent direction when their curvatures differ. At the other extreme, we could control the learning rate of each model parameter. But since there are millions of parameters, we need to do it implicitly.

The first equation below records a velocity v that behaves as an exponentially weighted average of the squared gradient. To be more precise, its root measures a weighted average of the magnitude of the recent gradients for that parameter.

Modified from Wikipedia

During the parameter update, we divide the global learning rate η by this root. Effectively, we damp the change for a parameter that has recently changed too fast and too much. In combination with the global learning rate, this regulates the learning rate of each individual parameter.
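A minimal sketch of these two equations (beta is the decay rate of the running average and eps avoids division by zero; both are illustrative defaults):

# RMSProp: keep a running average of the squared gradient per parameter
# and divide the global learning rate by its root.
def rmsprop_step(w, grad, v, lr=0.001, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * grad ** 2    # running average of the squared gradient
    w = w - lr * grad / (v ** 0.5 + eps)     # damp parameters that changed a lot recently
    return w, v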

Adam

The Adam optimizer simply combines the ideas of momentum and RMSProp when updating the model parameters, as shown below.

Source
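For reference, here is a sketch of the Adam update with the usual bias correction (hyperparameter values are the common defaults, used here for illustration):

# Adam: momentum-style first moment + RMSProp-style second moment,
# both bias-corrected, then used in the parameter update.
def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad         # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (RMSProp)
    m_hat = m / (1 - beta1 ** t)               # bias correction, t is the step count (>= 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v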

AdamW

L2 regularization and weight decay regularization are equivalent in vanilla SGD — they are mathematically equal in the gradient descent method. But for adaptive methods like Adam, they are not.

AdamW applies the weight decay directly in the parameter update (green highlight below) instead of applying L2 regularization to the gradient. The two produce different results. Empirical results show that AdamW-trained models overfit less and generalize better.

Source
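A sketch of the difference: the weight-decay term is added to the parameter update itself rather than folded into the gradient (marked line below; values are illustrative):

# AdamW: same as Adam except the weight decay is applied directly to the
# parameters (decoupled) instead of being added to the gradient as an L2 term.
def adamw_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * w)  # decoupled decay
    return w, m, v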

Layer-wise Adaptive Rate Scaling (LARS)

In RMSProp, each parameter learns at a different rate, independently of the others. Perhaps we should take a middle ground where the learning rate is adapted at the layer level — somewhere between the global and per-parameter levels.

Experimental observations indicate that the ratio of weight magnitude to gradient magnitude varies widely between layers. This supports controlling the learning rate of each layer separately, which may help slow-learning layers learn faster.

The model parameter changes in gradient descent are proportional to the gradient. A disproportionate or uncontrolled parameter change risks divergence of the model. Many other algorithms, including gradient clipping and weight normalization, were introduced to avoid this runaway train.

LARS introduces the local layer-level learning rate below, which normalizes the gradient by the magnitude of the gradients of that layer.

Modified from source

This new equation allows LARS to move parameters in the direction of steepest gradient descent with a magnitude proportional to ‖wˡ‖. This normalization hopefully mitigates some of the vanishing and exploding gradient problems. With the trust ratio below as part of the learning rate, we can help slow-learning layers use a higher learning rate while also regulating exploding changes.

In addition, for model regularization, LARS introduces β below to perform a weight decay.

Source

Here is the final LARS optimization algorithm, with polynomial learning-rate decay added for the global learning rate, plus momentum SGD.

Modified from source
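A per-layer sketch of the trust-ratio idea (the momentum term and the global learning-rate decay from the algorithm above are omitted for brevity; eta is the trust coefficient and beta the weight decay, both illustrative values):

import numpy as np

# LARS: scale each layer's step by ||w|| / (||g|| + beta * ||w||) so that every
# layer moves by an amount proportional to its own weight magnitude.
def lars_layer_step(w, grad, global_lr, eta=0.001, beta=1e-4):
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    trust_ratio = w_norm / (g_norm + beta * w_norm + 1e-12)
    local_lr = global_lr * eta * trust_ratio
    return w - local_lr * (grad + beta * w)   # weight decay folded into the update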

LARS pushes ResNet-50 training to a batch size of 32K without loss of accuracy. However, it performs poorly for BERT.

LAMB (Layer-wise Adaptive Moments optimizer for Batch training)

LAMB uses the same layer-wise normalization concept as LARS, so the learning rate is layer-sensitive. But for the parameter updates, it replaces the momentum concept with AdamW.

Left side source

In LAMB, the weights and biases are considered as two separate layers because both have very different trust values and therefore should be treated with different learning rates.
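Putting the pieces together, here is a per-layer sketch of the LAMB update: an AdamW-style direction rescaled by the layer-wise trust ratio (bias correction and the learning-rate schedule are left out to keep it short; values are illustrative):

import numpy as np

# LAMB: compute an AdamW-style update per parameter, then rescale it for each
# layer by ||w|| / ||update|| before applying the global learning rate.
def lamb_layer_step(w, grad, m, v, lr=0.001, beta1=0.9, beta2=0.999,
                    eps=1e-6, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    update = m / (np.sqrt(v) + eps) + weight_decay * w        # AdamW-style direction
    trust_ratio = np.linalg.norm(w) / (np.linalg.norm(update) + 1e-12)
    w = w - lr * trust_ratio * update                         # layer-wise rescaling
    return w, m, v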

LAMB pushes the mini-batch size to 32K. This is the centerpiece algorithm in large batch training. Now we are ready to discuss the BERT training with Nvidia GPUs in detail.

BERT Datasets

Pay special attention to this section if you want to try the original BERT pretraining yourself. The BERT training in the original research paper uses 800M words from BookCorpus and 2,500M words from the English Wikipedia for pre-training. The Nvidia implementation provides scripts to download the pre-training datasets. Nevertheless, the site providing the BookCorpus data will block your IP after you download more than 500 articles. To produce 800M words, you need about 12K articles. You can ignore the BookCorpus dataset or use a much smaller BookCorpus. The first approach requires some simple changes to the Nvidia scripts. If you want to collect the BookCorpus, alternative approaches are listed at the bottom of this article.

For me, the English Wikipedia download could not be completed with the Nvidia script, so I downloaded the dump file manually and used the same script to process it (details later). To store and prepare the datasets, you may need 600GB+ of storage.

The Nvidia implementation contains other scripts. Fine-tuning datasets and Google’s pre-trained BERT models (BERT-Base and BERT-Large) can be downloaded easily with those scripts.

Out of Memory Issue

Memory is a sensitive issue in scaling DL training. Both Google’s and Nvidia’s implementations run on powerful hosts with top-of-the-line TPUs and GPUs. Sometimes, an implementation may be less vigilant about resource control, and you may need to modify the code to fit your resource constraints. Nvidia’s implementation originated from Google’s implementation but was then optimized for lower GPU memory consumption and higher speed, in particular by taking advantage of Nvidia’s GPU hardware and AMP.

Let’s focus on Google’s implementation first. All fine-tuning in the BERT paper is done on a single Cloud TPU with 64GB of memory. For most of the fine-tuning experiments in the BERT paper, you need more than 16GB of GPU memory for BERT-Large. The mini-batch assigned to a GPU must fit inside the GPU memory all at once. For example, for a mini-batch size of 64, you cannot split it into two halves and later combine the results; the code is not written this way. After each mini-batch, the model parameters are updated. This does not sound bad, but I will explain the problem next.

As shown below, for a 12GB GPU, the maximum batch size is 12. Unfortunately, this batch size is too small to reproduce the performance reported in the BERT paper. Likely, the noisy gradients make the model harder to converge.

Source

Gradient accumulation

You may ask why we don’t collect gradients over several mini-batches and combine them before the model update. Indeed, this is called gradient accumulation and it is pretty simple to implement. Below is a minimal sketch demonstrating the general concept.
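This toy example (my own, not Nvidia’s code) accumulates gradients over 32 mini-batches of size 8 on a one-parameter least-squares problem before applying a single update, i.e. an effective batch size of 256:

import numpy as np

# Gradient accumulation: sum the gradients of several small mini-batches and
# apply one parameter update, emulating a larger effective batch size.
np.random.seed(0)
x = np.random.randn(256)
y = 3.0 * x + 0.1 * np.random.randn(256)

w, lr = 0.0, 0.1
accumulation_steps, micro_batch = 32, 8
accum_grad = 0.0

for step in range(accumulation_steps):
    xb = x[step * micro_batch:(step + 1) * micro_batch]
    yb = y[step * micro_batch:(step + 1) * micro_batch]
    grad = np.mean(2 * (w * xb - yb) * xb)    # gradient of the mini-batch squared loss
    accum_grad += grad                        # accumulate instead of updating immediately
w -= lr * accum_grad / accumulation_steps     # one update for all 256 samples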

Now, let’s focus on Nvidia’s implementation. It utilizes gradient accumulation and AMP (discussed later) to lower the GPU memory requirement. On a single 16GB GPU, you may be able to train BERT-Large with the 128-token sequence length at an effective batch size of 256 by running a batch size of 8 with 32 accumulation steps, i.e. the results of 32 mini-batches of size 8 are combined to form an effective batch size of 256.

But it still requires enough memory to train at least one single sample. I received an OOM message when running a sequence length of 384 with 11GB of GPU memory. Unfortunately, further memory reduction requires more advanced techniques. If you are interested in those techniques, links are provided in the reference section. Without such modifications, many people train BERT-Base instead if they only have an 11GB GPU like the 2080 Ti.

For the rest of the article, I will use a Titan RTX GPU with 24GB to duplicate the training.

Automatic Mixed Precision (AMP)

Many DL models are trained with 32-bit floating-point math. Mixed precision uses 16-bit precision to compute the node activations and gradients instead. We can cut memory consumption roughly in half. In practice, the saving is less because we still need to keep a master copy of the weights in 32 bits, as well as other aggregated data.

Source

In Nvidia’s BERT implementation, mixed precision can be turned on automatically with the “use_fp16” command-line flag, which simply sets an environment variable in the code. The underlying engine will then automatically use 16-bit precision for the gradient calculations.

if FLAGS.use_fp16:
    # Enable TensorFlow's automatic mixed-precision graph rewrite.
    os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"

Many algorithms save memory at the cost of computational complexity. Nevertheless, since AMP switches to simpler math, it can speed up BERT by 3x when Tensor Cores are used, according to Nvidia’s presentation.

Source

But you do need to choose an Nvidia architecture with Tensor Cores.

In addition, you will need to install Nvidia’s Docker image for BERT. The Nvidia Docker setup is pretty simple. Also, I have simplified the mixed-precision discussion here; more details are in this AMP article.

NVLAMB

Nvidia’s BERT implementation is slightly different from LAMB. There is an extra step (2) below which pre-normalizes the gradient by the norm computed over all parameters. This extra normalization likely makes the training less vulnerable to the scale of the gradients.

Source
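A sketch of what this pre-normalization step could look like, assuming grads is the list of per-layer gradient arrays (the rest of the LAMB update is unchanged):

import numpy as np

# NVLAMB extra step: divide the whole gradient by its global L2 norm
# (computed over all layers) before running the usual LAMB update.
def prenormalize(grads, eps=1e-12):
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    return [g / (global_norm + eps) for g in grads]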

Here is the improvement in loss when this extra step is introduced.

Source

Nvidia’s BERT implementation

For the remaining sections, we will go into more detail on Nvidia’s implementation. For BERT, LAMB can achieve a global batch size of 64K and 32K for input sequence lengths of 128 (phase 1) and 512 (phase 2) respectively. With a single GPU, that would require a mini-batch size of 64 plus 1024 accumulation steps, and it would take months to pre-train BERT.

Source

Nvidia built DGX SuperPOD systems with 92 and 64 DGX-2H machines in 2019 and finished the training in 47 and 67 minutes respectively.

Source

A DGX-2 costs about $400K each. One way to access such infrastructure is through a cloud service, like the announced Microsoft Azure NDv2 offering with 800 Nvidia V100 Tensor Core GPUs. Yet, this depends on the organization’s usage and use cases. For example, if the corpus changes over time or requires multiple trainings, the price of the cloud solution may add up.

On multi-node systems, LAMB can scale up to 1024 GPUs with 17x training speedup compared to Adam optimizer.

Software setup

Before using the Nvidia implementation, you need to set up the Nvidia Docker environment. The ultimate goal is to install the Nvidia GPU-Accelerated Container (a Docker image). This requires Docker, Nvidia Docker, and an NGC container. This article should contain all the software setup you may need, or you can follow Nvidia’s instructions here.

Pre-training, fine-tuning and inferencing BERT

This link is Nvidia’s instruction on pre-training, fine-tuning and inferencing BERT. It is pretty simple, but I will go through the important steps here and share a few important tips. In addition, I encountered a few issues and list the resolutions for your reference at the end of the article. These are very sensitive to your installed versions and setup, so please read them with caution.

  1. Clone the repository
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/LanguageModeling/BERT

2. Build the BERT TensorFlow NGC container. (I encountered 2 issues here; see the issue section if you run into problems.)

bash scripts/docker/build.sh

3. Download and preprocess the dataset

bash scripts/data_download.sh

This script eventually invokes data/create_datasets_from_start.sh. I recommend dividing the tasks in create_datasets_from_start.sh into 6 separate steps, as some of them may fail or take very long to complete. Comment out the others and run data_download.sh again for the specific steps that you need. But do not skip steps, since they may depend on previous steps. Before proceeding, check out the next section as well.

# (Step1)
# Download bookscorpus
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset bookscorpus
# (Step2)
# Download English Wikipedia
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset wikicorpus_en
# (Step3)
# Download pre-trained model and datasets for the fine-tuning
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab

python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset squad
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset "CoLA"
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset "MRPC"
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset "MNLI"
# (Step 4)
# Properly format the text files
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action text_formatting --dataset bookscorpus
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action text_formatting --dataset wikicorpus_en


# (Step 5)
# Shard the text files (group wiki+books then shard)
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action sharding --dataset books_wiki_en_corpus

# (Step 6)
# Create TFRecord files Phase 1
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset books_wiki_en_corpus --max_seq_length 128 \
--max_predictions_per_seq 20 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt


# Create TFRecord files Phase 2
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset books_wiki_en_corpus --max_seq_length 512 \
--max_predictions_per_seq 80 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt

Download and preprocess

Here is a summary of the steps and the issues you may encounter:

  • BookCorpus: The provided scripts will be blocked after downloading 500 articles. See the end of this article for alternatives. This may take some time.
  • English Wikipedia: Download the latest Wikipedia dump directly in a browser. Rename and move it as the file below before running the script.
DeepLearningExamples/TensorFlow/LanguageModeling/BERT/data/download/wikicorpus_en/wikicorpus_en.xml.bz2
  • Google’s pre-trained BERT-large and BERT-base models: Run the script as-is.
  • Fine-tuning datasets: Run the script as-is.
  • Text formatting: Run the script as-is. This script strips unrelated text, such as headers, from the downloaded files and extracts only the needed text.
  • Sharding: It partitions the data into shards required for multi-GPU training. On a single CPU, it will take a few hours to complete. Even with 32GB of host RAM, I ran out of memory while sharding my dataset (about 2,800M words). I modified the code to reduce the memory footprint in TextSharding.py and then completed the run successfully. However, the issue depends on the size of your dataset and your host memory, and an even larger dataset would require more aggressive changes. Unless you are very short on memory relative to the dataset size, the code change should be relatively easy to figure out. Because of this variability, I decided not to discuss or supply the code change here. Alternatively, you can use a cloud service with 64GB or 128GB of host memory to finish the processing; it should take a few hours. Again, whether you need to pre-train the model yourself depends on your problem domain.
  • Create TFRecord files for faster data processing in TensorFlow: Run the script as-is. On a single CPU, it may take a day to complete.

Pre-training and fine-tuning

First, you need to launch the NGC image in Docker. (I encountered one issue here with the legacy nvidia-docker command. See the issue section at the end.)

bash scripts/docker/launch.sh

Once you are inside the docker, you can run other training scripts.

For example, the command below starts the pretraining with LAMB using a per-GPU batch size of 64 on 8 GPUs. This setting is based on a DGX-1 with eight Nvidia V100 GPUs with 32GB of memory.

bash scripts/run_pretraining_lamb.sh 64 8 8 7.5e-4 5e-4 fp16 true 8 2000 200 7820 100 128 512 large

However, if you have multiple GPUs in your system, I suggest running nvidia-smi first to verify which GPUs will be used by your application.

Below are the default settings for the pre-training script (targeted for DGX-1 again).

Source

For my Titan RTX with 24GB of memory, I reduce the batch size to 48 so I don’t get an OOM error. Change the accumulation steps accordingly if you want.

bash scripts/run_pretraining_lamb.sh 
<train_batch_size_phase1> <train_batch_size_phase2> <eval_batch_size>
<learning_rate_phase1> <learning_rate_phase2>
<precision> <use_xla>
<num_gpus>
<warmup_steps_phase1> <warmup_steps_phase2>
<train_steps> <save_checkpoint_steps>
<num_accumulation_phase1> <num_accumulation_steps_phase2> <bert_model>

For users with a GPU that has less than 16GB of memory, you will likely encounter OOM even with a batch size of 1 in pre-training or fine-tuning. Therefore, most people use the smaller BERT-Base model on those GPUs.

To fine-tune the BERT-Large model from a pre-trained checkpoint:

bash scripts/run_squad.sh 10 5e-6 fp16 true 10 384 128 large 1.1 data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/bert_model.ckpt 1.1

Here is the syntax of the SQuAD fine-tuning script.

bash scripts/run_squad.sh 
<batch_size_per_gpu> <learning_rate_per_gpu>
<precision> <use_xla>
<num_gpus>
<seq_length> <doc_stride>
<bert_model> <squad_version> <checkpoint> <epochs>

On the Titan RTX GPU, it takes less than 2 hours to train a single epoch, and the fine-tuning takes about 2 epochs.

Again, refer to the Nvidia Readme for other scripts and commands.

Accuracy & Speed

As shown below, we get a nice 2x speed improvement by quadrupling the batch size and doubling the number of GPUs.

Here are some other key performance numbers for your reference; I will let you interpret the data yourself. The accuracy and speed tables originate from here.

Pre-training training performance with single-node on V100 16G GPUs.

Pre-training training performance with single-node on V100 32G GPUs.

Multiple nodes

DGX1 has 8 GPUs per node and DGX2H has 16 GPUs per node.

Fine-tuning

Fine-tuning training performance for SQuAD on DGX-2 32G

Inference performance

Inference performance with SQuAD on Tesla T4 (1x T4 16G)

Other Issues (Optional)

Here are other issues related to Nvidia’s BERT implementation. This is time-sensitive information: the issues may already be resolved in the version you download, or you may encounter something new.

Problem with the docker build script.

bash scripts/docker/build.sh

If you encounter the following messages,

scripts/docker/launch.sh: line 7: nvidia-docker: command not found

or

docker: Error response from daemon: Unknown runtime specified nvidia.

Replace the “runtime” flag below

docker run --runtime=nvidia -v $PWD:/workspace/bert \
    --rm --shm-size=1g --ulimit memlock=-1 \
    --ulimit stack=67108864 --ipc=host -t -i \
    bert bash -c "bash data/create_datasets_from_start.sh"

with “gpus” and use “docker” instead of “nvidia-docker”.

# For data_download.sh
# Use all GPUs
docker run --gpus all -v $PWD:/workspace/bert \
--rm --shm-size=1g --ulimit memlock=-1 \
--ulimit stack=67108864 --ipc=host -t -i \
bert bash -c "bash data/create_datasets_from_start.sh"
# Use specific GPUs
docker run --gpus '"device=0,1"' -v $PWD:/workspace/bert \
--rm --shm-size=1g --ulimit memlock=-1 \
--ulimit stack=67108864 --ipc=host -t -i \
bert bash -c "bash data/create_datasets_from_start.sh"
# For launch.sh
docker run --gpus all \
--rm --shm-size=1g --ulimit memlock=-1 \
--ulimit stack=67108864 --ipc=host -t -i \
--net=host \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v $PWD:/workspace/bert \
-v $PWD/results:/results \
bert $CMD

BERT Implementations

Google BERT’s implementation

Nvidia BERT’s implementation optimized from Google’s implementation

Hugging Face

Reference & Credits

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT For TensorFlow

Pretraining BERT with Layer-wise Adaptive Learning Rates

NVIDIA Container Support Matrix

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Large batch training of convolution networks

Large batch optimization for Deep Learning: Training BERT in 76 minutes

Reducing BERT Pre-Training Time from 3 Days to 76 Minutes

Optimization Methods for Large-Scale Machine Learning

One weird trick for parallelizing convolutional neural networks

Memory reduction: Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups

Download Book Corpus Dataset

Please study any copyright issues in detail before proceeding further.

Gutenberg Dataset

Replicating the Toronto BookCorpus dataset — a write-up

Homemade BookCorpus
