Introducing PyTorch-accelerated

A simple but powerful library for training PyTorch models with minimal boilerplate

Chris Hughes
Towards Data Science


pytorch-accelerated is a lightweight library designed to accelerate the process of training PyTorch models by providing a minimal, but extensible training loop — encapsulated in a single Trainer object — which is flexible enough to handle most use cases, and capable of utilising different hardware options with no code changes required. pytorch-accelerated offers a streamlined feature set, and places a huge emphasis on simplicity and transparency, to enable users to understand exactly what is going on under the hood, but without having to write and maintain the boilerplate themselves!

The key features are:

  • A simple and contained, but easily customisable, training loop, which should work out of the box in straightforward cases; behaviour can be customised using inheritance and/or callbacks.
  • Handles device placement, mixed precision, DeepSpeed integration, and multi-GPU and distributed training with no code changes.
  • Uses pure PyTorch components, with no additional modifications or wrappers, and easily interoperates with other popular libraries such as timm, transformers and torchmetrics.
  • A small, streamlined API ensures that there is a minimal learning curve for existing PyTorch users.

Significant care has been taken to ensure that every part of the library — both internal and external components — is as clear and simple as possible, making it easy to customise, debug and understand exactly what is going on behind the scenes at each step; most of the behaviour of the trainer is contained in a single class! In the spirit of Python, nothing is hidden, and everything is accessible.

The purpose of this article is to introduce this library, which I have decided to name pytorch-accelerated (for reasons which will hopefully become obvious later!), describe some of the motivations behind creating it, and demonstrate how you can use it to accelerate your development with PyTorch. The purpose is not to undermine or draw comparisons with any existing solutions; I am a firm believer that they all have their place, and I take no offence if you prefer how something is done in library X and would rather use that!

If you would rather just jump straight in, or learn more by doing than reading, a quickstart guide is available here, the documentation is available here or check out the examples on GitHub. Don’t forget to add stars if you find it useful!

Training MNIST using pytorch-accelerated.

Why another PyTorch high-level API?

As part of the Data Intelligence and Design team in Microsoft CSE, we work on a wide range of machine learning projects across many different domains. Due to the complexity of these problems, our work could often be described as applied research — taking inspiration from existing state of the art approaches, adapting, and applying these in new and interesting ways — whilst also emphasising strong software engineering practices such as SOLID; making sure that the code is as simple and maintainable as possible.

As the use and popularity of deep learning continues to rise, many of our projects involve developing deep learning solutions, with PyTorch being my tool of choice in these cases. As such, we need a solution that is flexible enough to handle a huge number of use cases and easy to extend, while letting us get up to speed as quickly as possible; as many of our projects are very large, we often need to be able to make use of distributed training from day 1! As we often prefer to leverage pretrained models or existing architectures, rather than building models from scratch each time, anything we use needs to interoperate easily with other libraries containing state-of-the-art models, with my personal favourites being timm and transformers.

In the CSE working model, we ‘code with’ our customers, working as a single team for the duration of the engagement. As we work alongside teams with varying levels of ML maturity, many of our teammates are being introduced to PyTorch for the first time, so we need to ensure that the learning curve is as shallow as possible; we don't want to overwhelm them with a huge amount of information to learn, and we want to make the most of the time we are working together on the project. We also must be mindful that we want customers to maintain any solutions produced going forward without us, so it is imperative that we keep things as simple to understand and maintain as possible.

Despite being a long-term user of several different solutions in this space, I repeatedly found myself in situations where the existing tools I was using didn't quite fit my use case, whether due to the level of complexity they introduced, the intricacies of learning yet another tool that sits on top of PyTorch, or because the way their abstractions were defined didn't quite align with what I was trying to do.

As a result, for my personal use, I decided to create a very simple generic training loop, to ease customers into PyTorch and accelerate development, whilst maintaining the level of flexibility that we require; leveraging existing tools where possible. This was intentionally designed to be a very thin abstraction, with a streamlined feature set, which makes it easy to understand, modify and debug. After receiving positive feedback from both customers and colleagues, I was persuaded to make this available on PyPI for anyone else that may find such a solution useful.

Who is pytorch-accelerated aimed at?

  • Users who are familiar with PyTorch but would like to avoid writing the common training loop boilerplate, so that they can focus on the interesting parts.
  • Users who like, and are comfortable with, selecting and creating their own models, loss functions, optimizers and datasets.
  • Users who value a simple and streamlined feature set, where the behaviour is easy to debug, understand, and reason about!

When shouldn’t I use pytorch-accelerated?

  • If you are looking for an end-to-end solution, encompassing everything from loading data to inference, which helps you to select a model, optimizer or loss function, you would probably be better suited to fastai. pytorch-accelerated focuses only on the training process, with all other concerns being left to the responsibility of the user.
  • If you would like to write the entire training loop yourself, just without all of the device management headaches, you would probably be best suited to using Hugging Face accelerate! Whilst it is possible to customize every part of the Trainer, the training loop is fundamentally broken up into a number of different methods that you would have to override. But, before you go, is writing those for loops really important enough to warrant starting from scratch again 😉.
  • If you are working on a custom, highly complex, use case which does not fit the patterns of usual training loops and want to squeeze out every last bit of performance on your chosen hardware, you are probably best off sticking with vanilla PyTorch; any high-level API becomes an overhead in highly specialized cases!

Why does PyTorch need a High-level API?

For those unfamiliar with PyTorch, you may be wondering, why does PyTorch even need a high-level API? Is it really so complex that all of these different solutions have to build on top of it, and if so, why use it at all?!

Due to its flexibility, ease of debugging and a coding style consistent with Python OOP practices, PyTorch is rapidly becoming the tool of choice for machine learning research in Python. However, despite its many advantages, PyTorch leaves one gap for the everyday practitioner: it has no generic training loop, or ‘fit’ function. As a result of this absence, several libraries have attempted to fill this gap, using various approaches.

Whilst the lack of a generic loop is often cited as an advantage, as it forces practitioners to take responsibility for all parts of the training process, this results in similar boilerplate code being required at the start of each new project. This invites implementation errors, and the lack of a common structure leads to no consistency between training loops in different projects; implementations can vary wildly, which can make it difficult to understand and maintain the codebase unless you are familiar with the project — even with a good understanding of PyTorch!

Additionally, whilst PyTorch provides good, clear abstractions for components, making it easy to get started for simple applications, the complexity quickly increases due to factors such as introducing distributed and mixed precision training, metric calculations, and logging; obscuring the ‘simple’ style loops often presented in tutorials. As the transfer of data to devices has to be explicitly managed, this alone can add a non-trivial amount of boilerplate!
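To make this concrete, here is a minimal sketch of a bare-bones PyTorch training loop with manual device placement; the toy model and data are purely illustrative, but the device-handling lines are the kind of boilerplate being referred to:

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# toy data and model, purely for illustration
train_dataloader = DataLoader(
    TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,))),
    batch_size=32,
    shuffle=True,
)
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_func = nn.CrossEntropyLoss()

# device placement has to be managed explicitly...
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(2):
    model.train()
    for inputs, targets in train_dataloader:
        # ...and every batch has to be moved to the device by hand
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_func(model(inputs), targets)
        loss.backward()
        optimizer.step()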

Whilst pure PyTorch is undoubtedly the best approach for highly complex, custom tasks where lots of flexibility is required, it appears as though this level of granularity is too low-level for ‘most’ use-cases.

So, why is it called pytorch-accelerated?

Aside from the fact that the primary purpose of this library is to help you become productive with PyTorch faster, it is also a homage to an important underlying component that the library is built upon.

At least for me, the primary motivation for using a high-level tool was device management, and the potential headaches that this can cause when setting up distributed training. Just to be clear, it isn't that this is overly complex to do in pure PyTorch, but it does involve making multiple changes to a training script, such as: adding different wrappers around the model, syncing and gathering results between processes, making sure that certain environment variables are set on each node, and using a specialised launch command to kick off a training run.

However, in April 2021, Hugging Face released the excellent accelerate library, which encapsulates all these concerns into a single accelerator object. This makes it incredibly easy to leverage distributed training, without having to introduce all of the specialist code into your scripts; remaining reasonably transparent about how it is working under the hood.
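To give a flavour of the accelerate pattern, here is a minimal sketch, reusing the toy model, data loader, optimizer and loss function from the earlier snippet; the calls shown (Accelerator(), accelerator.prepare(...) and accelerator.backward(loss)) are the core of the API, but see the accelerate documentation for the full picture:

from accelerate import Accelerator

accelerator = Accelerator()

# accelerate handles device placement (and distributed wrapping) for us
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

for epoch in range(2):
    model.train()
    for inputs, targets in train_dataloader:
        optimizer.zero_grad()
        loss = loss_func(model(inputs), targets)
        # replaces loss.backward(), so that mixed precision and
        # distributed training are handled correctly
        accelerator.backward(loss)
        optimizer.step()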

As you may have guessed, I was an instant fan and adopted it almost immediately. However, by design, accelerate does not provide any other functionality, and requires users to write and maintain their own training loop. After a few use cases, I noticed that the majority of my accelerate scripts started to look very similar, and I began to take increasing notice of that little voice in my head nagging me that starting each new project by copying and pasting code from the previous one probably wasn't the best approach…

For me, this was the final push needed to consolidate these learnings and create my own library!

pytorch-accelerated is proudly and transparently built on top of Hugging Face Accelerate, which is responsible for the movement of data between devices, DeepSpeed integration, and launching of training configurations.

Now that all of that is out of the way, let’s jump in!

Getting Started: Training a Classifier on MNIST

Let’s start with a very simple example using MNIST, as would we really be doing deep learning if we didn’t?!

The first thing to do is to install the package. To help us get started, let's also include any packages that we may need to run the examples:

pip install pytorch-accelerated[examples]

Creating the training script

Now, let’s create our training script.

First, we need to download the data. As MNIST is practically the deep learning equivalent of Hello, World at this point, we can do this directly from torchvision, which even has a built-in dataset for it!

import os

from torch.utils.data import random_split
from torchvision import transforms
from torchvision.datasets import MNIST

dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
train_dataset, validation_dataset, test_dataset = random_split(
    dataset, [50000, 5000, 5000]
)

Here, we are using a standard transform to convert this data to PyTorch tensors, and then randomly split the dataset into train, validation and test sets.

Now that we have the data ready to go, we need to agree on a model architecture, a loss function and an optimizer. As this is a very straightforward task, let's define a simple feedforward neural net and use SGD with momentum as our optimizer. As this is a classification task, let's use CrossEntropy as our loss function. Note that, as CrossEntropyLoss includes a softmax computation, we don't need to include one as part of our model architecture.

from torch import nn, optim


class MNISTModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.main = nn.Sequential(
            nn.Linear(in_features=784, out_features=128),
            nn.ReLU(),
            nn.Linear(in_features=128, out_features=64),
            nn.ReLU(),
            nn.Linear(in_features=64, out_features=10),
        )

    def forward(self, x):
        return self.main(x.view(x.shape[0], -1))


model = MNISTModel()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
loss_func = nn.CrossEntropyLoss()

This is usually the point where we would have to start to think about writing our training loop. However, thanks to pytorch-accelerated, this is taken care of for us! To get started, all we have to do is import the Trainer.

from pytorch_accelerated import Trainer

trainer = Trainer(
    model,
    loss_func=loss_func,
    optimizer=optimizer,
)

The Trainer is designed to encapsulate an entire training loop for a specific task, bringing together the model, loss function and optimizer, and providing a specification of the behaviour to execute for each step of the training process.

The main entry point to the Trainer is the train method, which will run the training and evaluation loops over the datasets that we provide. This is also where we set the specific configuration for the training run, such as how many epochs to train for and the batch size to use, and where we would manage concerns such as learning rate scheduling, tweaking our DataLoaders, or accumulating gradients; but as this is a simple example, we'll skip those for now!

trainer.train(
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    num_epochs=2,
    per_device_batch_size=32,
)

Additionally, pytorch-accelerated supports distributed evaluation, so after our training run has finished, let’s evaluate the model on our test set. As we won’t be calculating any gradients during evaluation, let’s double the batch size.

trainer.evaluate(
    dataset=test_dataset,
    per_device_batch_size=64,
)

Now that we have seen all of the key steps needed, let’s combine all of these into a training script:
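Put together, the snippets above give us a script that looks roughly like the following; I've wrapped everything in a main function here, but that is purely a stylistic choice (the complete example is available in the examples folder on GitHub):

# train_mnist.py
import os

from torch import nn, optim
from torch.utils.data import random_split
from torchvision import transforms
from torchvision.datasets import MNIST

from pytorch_accelerated import Trainer


class MNISTModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.main = nn.Sequential(
            nn.Linear(in_features=784, out_features=128),
            nn.ReLU(),
            nn.Linear(in_features=128, out_features=64),
            nn.ReLU(),
            nn.Linear(in_features=64, out_features=10),
        )

    def forward(self, x):
        return self.main(x.view(x.shape[0], -1))


def main():
    # download the data and split it into train, validation and test sets
    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    train_dataset, validation_dataset, test_dataset = random_split(
        dataset, [50000, 5000, 5000]
    )

    model = MNISTModel()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    loss_func = nn.CrossEntropyLoss()

    trainer = Trainer(
        model,
        loss_func=loss_func,
        optimizer=optimizer,
    )

    trainer.train(
        train_dataset=train_dataset,
        eval_dataset=validation_dataset,
        num_epochs=2,
        per_device_batch_size=32,
    )

    trainer.evaluate(
        dataset=test_dataset,
        per_device_batch_size=64,
    )


if __name__ == "__main__":
    main()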

Launching training

Now that we have written our training script, all that is left to do is to launch training. As mentioned above, we will utilise Hugging Face accelerate for this. Accelerate provides the accelerate CLI, so that we have a consistent command to launch training irrespective of our underlying hardware setup.

To get started, we need to create a config for our training run. This can be done in one of two ways: we can either save a default config for our machine, or we can write the config to a specific file. For transparency, I prefer to use the latter. We can do this using the following command:

accelerate config --config_file train_mnist.yaml

This command will ask a series of questions, which will be used to generate the config file.

The output of the accelerate config command

Answering the questions as above produces the following config file:

train_mnist.yaml config file produced by `accelerate config --config_file train_mnist.yaml`

Note: To change this configuration, it is recommended to run the command again, rather than to edit this file directly!

Now, to launch training, we can use the command:

accelerate launch --config_file train_mnist.yaml train_mnist.py

This will produce an output similar to the following:

Output produced when running the training script

Exactly what is produced in terms of output can be customised depending on which callbacks are used. Here, we are just using the Trainer’s defaults.

That was all we needed to do to train a model using 2 GPUs!

For more info on execution order within the training loop, see: What goes on inside the Trainer?.

Tracking Metrics

As you may have noticed, by default, the only values tracked by the Trainer are the losses for each epoch. This is because the appropriate metrics depend heavily on the task! Let’s examine how we can track additional metrics.

To calculate our metrics, we are going to use torchmetrics, whose metric implementations are compatible with distributed training, so we won't need to gather results from different processes before computing them.

There are two different approaches that we can take to do this:

  1. Subclassing the Trainer
  2. Using a callback

The decision of which approach to take largely comes down to the preference of the user. The following guidance is given in the docs:

It is recommended that callbacks are used to contain ‘infrastructure’ code, which is not essential to the operation of the training loop, such as logging, but this decision is left to the judgement of the user based on the specific use case.

As calculating metrics doesn't impact our training code, it may be a good fit for a callback, and would mean that we don't have to subclass the trainer. However, as callbacks are executed sequentially, we have to make sure that this callback is called before the metrics are printed!

Let’s compare both approaches.

Subclassing the Trainer

Whilst the Trainer should work out of the box in straightforward use cases, subclassing the Trainer and overriding its methods is intended and encouraged; think of the base implementation as a set of ‘sensible defaults’!

The Trainer has lots of different methods that can be overridden, which are described in the documentation here. The main thing to remember is that methods prefixed with a verb such as create or calculate are expected to return a value, whereas all other methods are used to set internal state (e.g. calling optimizer.step()).

Let's create a subclass of the Trainer which tracks a set of classification metrics; a sketch is shown below. At the end of each evaluation batch, we update our metrics (to avoid having to rewrite the actual evaluation logic, we simply call the default implementation and use its outputs) and then compute them at the end of each evaluation epoch.

The Trainer maintains a run history, which is used to track the losses by default; we can also use this to record our calculated metrics. We could, of course, manage the tracking manually, but this approach has the added benefit that any metrics contained in the run history will be logged at the end of each epoch!
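A sketch of what this subclass might look like is given below; the hook names (calculate_eval_batch_loss, eval_epoch_end), the structure of the batch output dictionary, and run_history.update_metric are based on my reading of the Trainer docs, so double-check them against the documentation for the version you are using:

# note: newer torchmetrics releases may require e.g. Accuracy(task="multiclass", num_classes=...)
from torchmetrics import Accuracy, MetricCollection, Precision, Recall

from pytorch_accelerated import Trainer


class TrainerWithMetrics(Trainer):
    def __init__(self, num_classes, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # assumption: the Trainer's default callbacks move nn.Module attributes
        # (which torchmetrics objects are) to the correct device; if not, move
        # self.metrics manually at the start of the run
        self.metrics = MetricCollection(
            {
                "accuracy": Accuracy(num_classes=num_classes),
                "precision": Precision(num_classes=num_classes),
                "recall": Recall(num_classes=num_classes),
            }
        )

    def calculate_eval_batch_loss(self, batch):
        # reuse the default evaluation logic, then update our metrics
        batch_output = super().calculate_eval_batch_loss(batch)
        preds = batch_output["model_outputs"].argmax(dim=-1)
        self.metrics.update(preds, batch[1])
        return batch_output

    def eval_epoch_end(self):
        # record the computed metrics in the run history so that they are
        # logged at the end of the epoch, then reset them for the next one
        computed = self.metrics.compute()
        for name, value in computed.items():
            self.run_history.update_metric(name, value.cpu())
        self.metrics.reset()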

Using our new Trainer, the rest of the script stays exactly the same.
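The only substantive change is to construct the hypothetical subclass from the sketch above in place of the base Trainer; here, num_classes is set to ten for the MNIST digits:

num_classes = 10  # one class per MNIST digit

trainer = TrainerWithMetrics(
    num_classes=num_classes,
    model=model,
    loss_func=loss_func,
    optimizer=optimizer,
)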

Launching this using:

accelerate launch --config_file train_mnist.yaml train_with_metrics_in_loop.py

we can see that the metrics are printed at the end of each epoch!

Output produced when running the training script with metrics in the loop

Using a Callback

For a small tweak such as adding metrics, you may feel that it is excessive to have to subclass the Trainer, and would prefer to use the base implementation. In this case, we can extend the behaviour of the default trainer using a callback.

To create a new callback, we can subclass TrainerCallback and override the relevant methods; these are described in the documentation here. To avoid confusion with the Trainer's methods, all callback methods are prefixed with on_.

Let’s create a new callback to track our classification metrics:
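A sketch of such a callback is shown below; again, the hook names (on_training_run_start, on_evaluation_run_start, on_eval_step_end, on_eval_epoch_end) and the trainer attributes used (device, run_history) are based on my reading of the docs and should be verified against your installed version:

from torchmetrics import Accuracy, MetricCollection, Precision, Recall

from pytorch_accelerated.callbacks import TrainerCallback


class ClassificationMetricsCallback(TrainerCallback):
    def __init__(self, num_classes):
        self.metrics = MetricCollection(
            {
                "accuracy": Accuracy(num_classes=num_classes),
                "precision": Precision(num_classes=num_classes),
                "recall": Recall(num_classes=num_classes),
            }
        )

    def _move_to_device(self, trainer):
        # the trainer exposes the device for the current context,
        # so we don't need to work it out ourselves
        self.metrics.to(trainer.device)

    def on_training_run_start(self, trainer, **kwargs):
        self._move_to_device(trainer)

    def on_evaluation_run_start(self, trainer, **kwargs):
        self._move_to_device(trainer)

    def on_eval_step_end(self, trainer, batch, batch_output, **kwargs):
        # update the metrics using the outputs of the default evaluation step
        preds = batch_output["model_outputs"].argmax(dim=-1)
        self.metrics.update(preds, batch[1])

    def on_eval_epoch_end(self, trainer, **kwargs):
        # record the computed metrics in the run history, then reset them
        computed = self.metrics.compute()
        for name, value in computed.items():
            trainer.run_history.update_metric(name, value.cpu())
        self.metrics.reset()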

Here, we can see that the code is almost identical to what we added to the Trainer in the previous example. The only difference is that we need to manually move the metrics to the correct device at the start of a training or evaluation run. Thankfully, the Trainer makes this easy for us by exposing the correct device depending on the context!

All that we need to do to use this callback is to include it in the list of callbacks that we pass to our Trainer at the time of creation. As we want to preserve the default behaviour, we place it before the default callbacks.

trainer = Trainer(
    model,
    loss_func=loss_func,
    optimizer=optimizer,
    callbacks=(
        ClassificationMetricsCallback(
            num_classes=num_classes,
        ),
        *DEFAULT_CALLBACKS,
    ),
)

We can now integrate this into our training script:

Launching this as before:

accelerate launch --config_file train_mnist.yaml train_with_metrics_in_callback.py

we will observe the same output as in the previous example.

Conclusion

Hopefully that has provided an introduction to pytorch-accelerated and enough information to demonstrate how to get started in your own use cases. To learn more, the documentation is available here and more complex training examples are available on GitHub. Don’t forget to add stars if you find it useful!

Look out for more posts to come covering how to tackle some more advanced use-cases, as well as how to apply some of my favourite PyTorch tips and tricks during training.

Acknowledgements

Many aspects behind the design and features of pytorch-accelerated were greatly inspired by a number of excellent libraries and frameworks such as fastai, timm, PyTorch-lightning and Hugging Face accelerate. Each of these tools has made an enormous impact on both this library and the machine learning community, and their influence cannot be overstated!

pytorch-accelerated has taken only inspiration from these tools, and all of the functionality contained has been implemented from scratch in a way that benefits this library. The only exceptions to this are some of the scripts in the examples folder in which existing resources were taken and modified in order to showcase the features of pytorch-accelerated; these cases are clearly marked, with acknowledgement being given to the original authors.

I would also like to thank all of my awesome CSE colleagues who have provided feedback on pytorch-accelerated as well as ideas on what a good solution should look like!

Chris Hughes is on LinkedIn.
