Tutorial: How to train with multiple GPUs in AllenNLP

Evan Pete Walsh
Aug 24, 2020 · 5 min read

This is part of a series of mini-tutorials to help you with various aspects of the AllenNLP library.

👉 If you’re new to AllenNLP, consider first going through the official guide as these tutorials will be focused on more advanced use cases.

⚠️ Please keep in mind this was written for AllenNLP versions 1.0 and greater, and may not be relevant for older versions.

In this tutorial, we’ll show you how to take an existing AllenNLP configuration file and modify it so that the allennlp train command can utilize multiple GPUs on your machine.

Under the hood, AllenNLP uses PyTorch’s distributed training framework, torch.distributed, and so we often use the terms “distributed training” and “multi-GPU training” interchangeably (though technically, distributed training could be done entirely on CPU).

Using torch.distributed has several advantages over the alternative for multi-GPU training, torch’s DataParallel wrapper.

… although DataParallel is very easy to use, it usually does not offer the best performance. This is because the implementation of DataParallel replicates the model in every forward pass, and its single-process multi-thread parallelism naturally suffers from GIL contentions.

- https://pytorch.org/tutorials/beginner/dist_overview.html#torch-nn-dataparallel

Unlike DataParallel, torch.distributed runs a separate Python process for each GPU, so GIL contention is not an issue. The model also doesn’t need to be replicated on every forward pass.

As a result, torch.distributed is usually much faster, especially when training on a large number of GPUs.

The other benefit of using torch.distributed is that we’ll be able to support multi-node training in the future, i.e. training a single model using GPUs spread across multiple servers. But since we don’t have that implemented yet, this tutorial is just going to focus on single-node training.

Stay tuned for future releases though 😉 🚀

Now let’s dive into it 👇

For a concrete example, we’ll use the TransformerQA training config from allennlp-models.
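An abridged sketch of that config’s main sections is below. The field values here are illustrative and may differ from the actual file in allennlp-models; the data paths in particular are placeholders:

```jsonnet
local transformer_model = "roberta-large";

{
    "dataset_reader": {
        "type": "transformer_squad",
        "transformer_model_name": transformer_model,
        "skip_invalid_examples": true
    },
    "train_data_path": "...",       // path or URL to the SQuAD train set
    "validation_data_path": "...",  // path or URL to the SQuAD dev set
    "model": {
        "type": "transformer_qa",
        "transformer_model_name": transformer_model
    },
    "data_loader": {
        "batch_size": 8
    },
    "trainer": {
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 2e-5
        },
        "num_epochs": 5
    }
}
```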

If you were to run the allennlp train command on this configuration file as-is, it would only utilize a single GPU (provided one is available, otherwise it falls back to CPU).

To make it so AllenNLP utilizes multiple GPUs, all you have to do is add a “distributed” section to your config like this:

"distributed": {
    "cuda_devices": [0, 1, 2, 3]
}

In this example, the “cuda_devices” array tells AllenNLP to use the four GPUs with IDs 0, 1, 2, and 3.

The full configuration file is otherwise unchanged; the only difference is the addition of the “distributed” section at the bottom. Still, there are a few caveats to keep in mind.

First of all, it’s important to know that the “batch_size” parameter to the data loader now represents the batch size per GPU. That means the effective batch size is actually batch_size multiplied by the number of GPUs, or 8 * 4 = 32 in this case.

So with a larger effective batch size, you may need to adjust other hyperparameters in your config, such as the learning rate.
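To make the arithmetic concrete, here is a tiny sketch. These helpers are hypothetical, not part of AllenNLP, and the linear learning-rate scaling rule shown is a common heuristic to start from, not a requirement:

```python
# Hypothetical helpers illustrating how the effective batch size grows with
# the number of GPUs, and one common heuristic (the "linear scaling rule")
# for adjusting the learning rate to compensate.

def effective_batch_size(per_gpu_batch_size: int, num_gpus: int) -> int:
    # Each distributed worker processes its own batch, so one optimizer
    # step sees per_gpu_batch_size * num_gpus instances in total.
    return per_gpu_batch_size * num_gpus

def linearly_scaled_lr(base_lr: float, num_gpus: int) -> float:
    # Linear scaling rule: multiply the single-GPU learning rate by the
    # number of workers. Treat this as a starting point, then tune.
    return base_lr * num_gpus

print(effective_batch_size(8, 4))   # 32, matching the example above
print(linearly_scaled_lr(2e-5, 4))
```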

Another thing to keep in mind is how your data will be loaded and split across different GPUs.

By default, each GPU worker will only use a fraction of the total instances from each data file. In this case, since there are 4 GPUs, each worker will get 1/4 of the total number of instances. However, your dataset reader needs to be “aware” of distributed training in order for this to be done efficiently.

Using a naive dataset reader means each worker still has to read and create every single instance from each data file; the instances the worker doesn’t need are simply filtered out afterward.

This is, of course, not very efficient. Ideally, each worker’s dataset reader should only have to read and create the subset of instances that the worker needs. But for that to happen, the dataset reader needs to have some “distributed-aware” logic in its _read() method.

While it’s usually not too difficult to modify any dataset reader to handle this logic, it’s beyond the scope of this tutorial.

For information about implementing a “distributed-aware” reader, see the DatasetReader API documentation.
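To illustrate the idea only (this is not AllenNLP’s actual DatasetReader API; in a real reader you would obtain the worker’s rank and world size from torch.distributed inside _read()), the core sharding logic looks something like:

```python
# Sketch of "distributed-aware" reading: each worker yields only every
# world_size-th instance, offset by its rank, instead of building every
# instance and throwing most of them away afterward. rank and world_size
# are plain arguments here so the logic can run anywhere.
from typing import Iterable, Iterator

def read_shard(lines: Iterable[str], rank: int, world_size: int) -> Iterator[str]:
    for i, line in enumerate(lines):
        # Skip instances belonging to other workers *before* doing any
        # expensive instance creation for them.
        if i % world_size == rank:
            yield line

data = [f"instance_{i}" for i in range(8)]
print(list(read_shard(data, rank=1, world_size=4)))  # ['instance_1', 'instance_5']
```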

An alternative way to improve your data loading efficiency out-of-the-box is to use a ShardedDatasetReader. This does, however, require you to manually split each data file into a number of smaller data files, or “shards”, each of which should contain roughly the same number of instances. The number of shards should be equal to (or a multiple of) the number of GPUs you’re using.
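As a rough sketch of how you might produce shards for a line-based data format (the helper below is hypothetical; a nested format like SQuAD’s JSON would need format-aware splitting, and for plain text the Unix split command works just as well):

```python
# Hypothetical helper: deal lines out round-robin into num_shards files so
# shard sizes differ by at most one instance.
import os

def write_shards(lines, num_shards: int, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    files = [
        open(os.path.join(out_dir, f"shard_{i}.txt"), "w")
        for i in range(num_shards)
    ]
    try:
        for i, line in enumerate(lines):
            files[i % num_shards].write(line + "\n")
    finally:
        for f in files:
            f.close()

# Split 10 lines into 4 shards, e.g. for 4 GPUs.
write_shards([f"line {i}" for i in range(10)], num_shards=4, out_dir="shards")
```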

But once you have your shards, switching over to the ShardedDatasetReader is quite simple. All you have to do is change the data paths in your config to point to the directory or archive file where your shards are, and then make a slight modification to the “dataset_reader” section. In our TransformerQA example, we would change our “dataset_reader” section to look like this:

"dataset_reader": {
    "type": "sharded",
    "base_reader": {
        // this is all the same as our old "dataset_reader" section
        "type": "transformer_squad",
        "transformer_model_name": transformer_model,
        "skip_invalid_examples": true
    }
}

Finally, note that there are a few metrics that currently don’t work with multi-GPU training due to the complexities of synchronizing the calculations across workers. But we have an open issue that’s tracking these metrics, and it’s on our list to fix before the 1.2 release.

Despite the caveats, if you’re lucky enough to have multiple GPUs, you should use them! 😉

That’s it! Happy NLP-ing!

I hope this tutorial was helpful. If you find any issues, please leave a comment or open a new issue in the AllenNLP repo and give it the “Tutorials” tag.

Follow @allen_ai and @ai2_allennlp on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
