Part 4 / PyTorch: Production-Ready Deep Learning

PyTorch has played a crucial role in shaping the current state of AI, particularly in deep learning, and has become arguably indispensable to the field.

Originally, PyTorch was research-focused: it offered a great Python API that let researchers iterate quickly on different architectures. Over the years, however, it has also become a tool capable of efficiently deploying deep learning models in production environments.

However, in research environments, PyTorch code is often unoptimised, slow, and inefficient, which makes it hard to use in production without a complete refactoring.

In this article, we will explore concepts that are important for optimising PyTorch code, as well as some best practices. Researchers can use them to write more optimised code from the start, and engineers can use them to properly refactor unoptimised models.

Vectorisation

As described previously, Python code is slow. To work around that, you can use libraries that leverage faster programming languages like C or C++.

Most of the time, these libraries are used to apply a specific operation to a whole vector of values instead of applying it iteratively to each value. The operation still loops through each value one by one, but because the loop runs in C++ (or another fast language), the speedup compared to Python is significant.

This process is called vectorisation. The most common library when it comes to vectorising operations is numpy. It is a great library for working with multi-dimensional arrays, and it implements a lot of highly optimised operations. As such, it is common to see numpy being used in deep learning code.

However, using numpy within PyTorch code is actually a bad practice. Instead of numpy arrays, you should use PyTorch tensors, and instead of numpy operations, you should use their PyTorch equivalents. In general, avoid mixing non-PyTorch operations with PyTorch code whenever possible.

The advantage of sticking to PyTorch-only code is that you benefit from device acceleration (like GPUs with CUDA, as we will see later). It also reduces the chances of creating unnecessary copies of your data and makes it easier to ensure reproducibility.
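
To illustrate, here is a minimal sketch comparing a pure-Python loop over a tensor with the equivalent vectorised PyTorch operation (the tensor size and the squaring operation are arbitrary choices for the example):

# Example of a Python loop versus a vectorised PyTorch operation
# -------------------------------------------------------------------------

import time

import torch

# -------------------------------------------------------------------------

values = torch.rand(1_000_000)

# Slow: looping over each value in Python
start_time = time.time()
squared = torch.empty_like(values)
for index in range(values.shape[0]):
    squared[index] = values[index] ** 2
print(f"Python loop: {time.time() - start_time:.3f}s")

# Fast: applying the operation to the whole tensor at once
start_time = time.time()
squared = values ** 2
print(f"Vectorised: {time.time() - start_time:.3f}s")

# -------------------------------------------------------------------------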

Memory Management

We already talked about memory management in Python; this part focuses on the PyTorch specifics and assumes you are familiar with the Python side.

When working with deep learning, you will often create models that handle tensors with thousands, if not millions, of values. Whether these tensors represent the parameters of your model or the data going through it, they are often large. You should always focus on reducing the number of data copies you make: copying is a very expensive operation, especially when working with big datasets. When possible, modify your data in place instead.

# Example of in-place operation and copy operation
# -------------------------------------------------------------------------

import torch

# -------------------------------------------------------------------------

# Example of in-place operation

tensor = torch.tensor([0, 1, 2, 3])
tensor += 1

# -------------------------------------------------------------------------

# Example of copy operation

tensor = torch.tensor([0, 1, 2, 3])
tensor = tensor + 1

# Here the result is the same, but a copy has been made,
# which is a heavier and slower operation
# -------------------------------------------------------------------------

When indexing a torch tensor, be aware that advanced indexing creates a copy, while basic indexing creates a view.

# Example of indexing, view and copy
# -------------------------------------------------------------------------

import torch

# -------------------------------------------------------------------------

# Example of basic indexing, creating a view

tensor = torch.tensor([0, 1, 2, 3])
subset = tensor[0:2]

subset[0] = -1

print(tensor)

# Output: tensor([-1, 1, 2, 3])

# Modifying the subset modifies the tensor: it's not a copy
# -------------------------------------------------------------------------

# Example of advanced indexing, creating a copy

tensor = torch.tensor([0, 1, 2, 3])
subset_index = torch.tensor([0, 1])
subset = tensor[subset_index]

subset[0] = -1

print(tensor)

# Output: tensor([0, 1, 2, 3])

# Modifying the subset does not modify the tensor: it is a copy
# -------------------------------------------------------------------------

You should also carefully consider how to get your data into PyTorch tensors. In some cases, your original data will come from numpy arrays. The most natural way to turn these arrays into torch tensors is to use the tensor constructor, but be aware that doing so creates a copy of your data. Instead, prefer the from_numpy(...) (and numpy(...)) functions.

# Example of data transfer between numpy and torch
# -------------------------------------------------------------------------

import numpy as np
import torch

# -------------------------------------------------------------------------

# Example copying the data

ndarray = np.array([0, 1, 2, 3])
tensor = torch.tensor(ndarray)

tensor[0] = -1

print(ndarray)

# Output: [0 1 2 3]
# -------------------------------------------------------------------------

# Example of transfer without copy

ndarray = np.array([0, 1, 2, 3])
tensor = torch.from_numpy(ndarray)

tensor[0] = -1

print(ndarray)

# Output: [-1  1  2  3]
# -------------------------------------------------------------------------

Finally, you should be aware that PyTorch supports DLPack, an open in-memory data structure. It allows sharing data with other frameworks easily, without performing any copies, and should be used when possible.
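
As a minimal sketch (assuming a numpy version recent enough to implement the DLPack protocol), here is how a numpy array can be shared with PyTorch without any copy:

# Example of data sharing through DLPack
# -------------------------------------------------------------------------

import numpy as np
import torch

# -------------------------------------------------------------------------

ndarray = np.array([0, 1, 2, 3])

# torch.from_dlpack accepts any object exposing the DLPack protocol
tensor = torch.from_dlpack(ndarray)

tensor[0] = -1

print(ndarray)

# Output: [-1  1  2  3]

# The tensor and the array share the same memory: no copy was made
# -------------------------------------------------------------------------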

Leverage GPU Using CUDA

We previously talked about GPGPU and how it can be useful for speeding up machine learning applications. One of the leaders in GPU hardware is NVIDIA. Along with its hardware, NVIDIA maintains CUDA (Compute Unified Device Architecture), a proprietary programming interface that allows GPGPU programming on almost any NVIDIA GPU. We will focus on CUDA in this section, as it is the most popular interface, but others can be used (for instance, the Metal programming interface for Apple silicon chips).

PyTorch offers great support for CUDA and implements a lot of fully optimised kernels. It also makes use of some NVIDIA libraries like cuBLAS for high-performance linear algebra operations.

Running operations on CUDA is very easy: using the to(device=..., ...) function, you can move your tensors to a CUDA device. Once that is done, every operation performed on these tensors is executed on the GPU. PyTorch models implement the same function, allowing you to move the weights of a model to a CUDA device and run the model's operations on the GPU as well.
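
Here is a minimal sketch of moving a tensor and a model to a CUDA device (the linear layer and tensor shapes are arbitrary choices for the example):

# Example of moving data and a model to a CUDA device
# -------------------------------------------------------------------------

import torch
import torch.nn as nn

# -------------------------------------------------------------------------

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tensor = torch.rand(8, 16).to(device=device)
model = nn.Linear(16, 4).to(device=device)

# Both the input and the weights are on the GPU: the forward pass runs there
output = model(tensor)

# Moving the result back to the CPU is an expensive transfer: do it sparingly
output = output.to(device="cpu")

# -------------------------------------------------------------------------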

However, you should be careful when moving your data between the CPU and the GPU. This is an expensive operation, as such, you should reduce its use as much as possible. In some rare cases, the benefits and speed-ups gained from running your operations on the GPU will be smaller than the time lost to move the data.

Finally, remember that operations that are not easily parallelisable might be slower on GPUs, which are inefficient at running sequential operations. Just like when coding in pure Python, avoid for-loops as much as possible and vectorise operations whenever you can.

Profiler and Benchmarking

When writing production-ready code, it is important to test it. Unit tests and integration tests are good and important practices, but when it comes to deep learning, it is also important to profile and benchmark your models.

Profiling consists of running specific analyses that measure the space and time complexity of your code. In the case of PyTorch models, profiling helps find bottlenecks and understand how much memory is used, ultimately helping to optimise the code. We previously mentioned that some operations can create copies of your data and that moving data to or from the GPU can be slow: these issues can easily be spotted when profiling the code. For this purpose, PyTorch provides a complete profiler that can be used to deeply analyse any PyTorch code.

# Example of PyTorch code profiling
# -------------------------------------------------------------------------

from torch.profiler import profile, ProfilerActivity

# -------------------------------------------------------------------------

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as prof:
    ...  # Insert PyTorch code to profile here

# You can then extract profiling information from the `prof` object
# -------------------------------------------------------------------------

Benchmarking, on the other hand, focuses on measuring the performance of your code as a whole. It helps you understand how well your code performs under different conditions. In the context of deep learning, that means analysing your models with different data inputs.

It also helps ensure that the performance of your model stays the same after you modify it. This concerns the accuracy of your model, but also its running time and, in some cases, its memory usage.

Note that when benchmarking PyTorch code that uses CUDA, it can be tricky to measure the exact execution time because GPU operations are asynchronous. Use torch.cuda.synchronize() to make sure that all pending CUDA operations are finished before reading the clock.

# Example of PyTorch code benchmarking
# -------------------------------------------------------------------------

import torch
import time

# -------------------------------------------------------------------------

torch.cuda.synchronize()
start_time = time.time()

# Insert PyTorch code using CUDA to benchmark here

torch.cuda.synchronize()
end_time = time.time()

# -------------------------------------------------------------------------
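
PyTorch also ships a small benchmarking utility, torch.utils.benchmark, whose Timer takes care of CUDA synchronisation for you. Here is a minimal sketch (the matrix multiplication is an arbitrary workload used only for illustration):

# Example of benchmarking with torch.utils.benchmark
# -------------------------------------------------------------------------

import torch
import torch.utils.benchmark as benchmark

# -------------------------------------------------------------------------

x = torch.rand(1024, 1024, device="cuda")

# Timer handles CUDA synchronisation and warm-up internally
timer = benchmark.Timer(
    stmt="x @ x",
    globals={"x": x},
)

print(timer.timeit(100))

# -------------------------------------------------------------------------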

Deep Learning Best Practices

We covered the most important concepts that should be addressed when writing production-ready PyTorch code. However, some best practices can be helpful in making sure that your code and models are easy to maintain.

First of all, PyTorch code should follow what has been described in the Software Engineering best practices article, specifically the section about Modularity and Reusability. When writing different deep learning layers, try to break them down into small PyTorch modules instead of one big module.

# Example showing how to break down a custom Module
# -------------------------------------------------------------------------

import torch
import torch.nn as nn
import torch.nn.functional as F

# -------------------------------------------------------------------------

# Single layer that could be broken down into smaller parts

class CustomLayer(nn.Module):
    def __init__(self) -> None:
        super().__init__()

        # Convolution
        self.convolution_1 = nn.Conv2d(1, 20, 5)
        self.convolution_2 = nn.Conv2d(20, 20, 5)

        # Multi-Layer Perceptron
        self.linear_1 = nn.Linear(20, 40)
        self.linear_2 = nn.Linear(40, 20)
        self.linear_3 = nn.Linear(20, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_convolution_1 = F.relu(self.convolution_1(x))
        x_convolution_2 = F.relu(self.convolution_2(x_convolution_1))

        x_linear_1 = F.relu(self.linear_1(x_convolution_2))
        x_linear_2 = F.relu(self.linear_2(x_linear_1))
        output = F.relu(self.linear_3(x_linear_2))

        return output

# -------------------------------------------------------------------------

# Cleaner implementation of the CustomLayer using a second sub-layer

class CustomMLP(nn.Module):
    def __init__(self, input_size: int, output_size: int) -> None:
        super().__init__()

        self.linear_1 = nn.Linear(input_size, 40)
        self.linear_2 = nn.Linear(40, 20)
        self.linear_3 = nn.Linear(20, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_linear_1 = F.relu(self.linear_1(x))
        x_linear_2 = F.relu(self.linear_2(x_linear_1))
        x_linear_3 = F.relu(self.linear_3(x_linear_2))

        return x_linear_3


class CustomLayer(nn.Module):
    def __init__(self) -> None:
        super().__init__()

        # Convolution
        self.convolution_1 = nn.Conv2d(1, 20, 5)
        self.convolution_2 = nn.Conv2d(20, 20, 5)

        self.mlp = CustomMLP(input_size=20, output_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_convolution_1 = F.relu(self.convolution_1(x))
        x_convolution_2 = F.relu(self.convolution_2(x_convolution_1))

        output = self.mlp(x_convolution_2)

        return output

# -------------------------------------------------------------------------

When parametrising your models, try to remove every conditional statement or loop from the forward function. Instead, move them to the __init__ function of your module. This way, the conditions are evaluated only once, at initialisation time, instead of every time you run inference on your model.

# Example showing how to use conditional statement in a custom Module
# -------------------------------------------------------------------------

from typing import Literal, Callable

import torch
import torch.nn as nn
import torch.nn.functional as F

# -------------------------------------------------------------------------

# Layer implementation with a condition in the forward function

class CustomLayer(nn.Module):
    def __init__(
        self,
        activation_function: Literal["relu", "leaky_relu"],
    ) -> None:
        super().__init__()

        self.activation_function = activation_function

        self.linear = nn.Linear(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_linear = self.linear(x)

        if self.activation_function == "relu":
            output = F.relu(x_linear)
        elif self.activation_function == "leaky_relu":
            output = F.leaky_relu(x_linear)
        else:
            raise ValueError(
                f"Activation function {self.activation_function} invalid"
            )

        return output

# -------------------------------------------------------------------------

# Cleaner implementation: conditions have been moved to the constructor

class CustomLayer(nn.Module):
    def __init__(
        self,
        activation_function: Literal["relu", "leaky_relu"],
    ) -> None:
        super().__init__()

        self.activation_function: Callable[[torch.Tensor], torch.Tensor]
        if activation_function == "relu":
            self.activation_function = F.relu
        elif activation_function == "leaky_relu":
            self.activation_function = F.leaky_relu
        else:
            raise ValueError(
                f"Activation function {activation_function} invalid"
            )

        self.linear = nn.Linear(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_linear = self.linear(x)

        output = self.activation_function(x_linear)

        return output

# -------------------------------------------------------------------------

Finally, when documenting your PyTorch functions or modules, make sure to specify the shapes of your tensors. This significantly helps with understanding the model's code and makes it easier to debug.

# Example of documented custom Module
# -------------------------------------------------------------------------

import torch
import torch.nn as nn
import torch.nn.functional as F

# -------------------------------------------------------------------------

class CustomLayer(nn.Module):
    """CustomLayer class"""

    def __init__(self) -> None:
        """Constructor of the CustomLayer module."""

        super().__init__()

        self.linear = nn.Linear(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward function.

        Args:
            x: input data, shape: (batch_size, 1)

        Returns:
            output of the model, shape: (batch_size, 2)
        """
        output = F.relu(self.linear(x))

        return output

# -------------------------------------------------------------------------

Going Further

This series of articles focused on the most important concepts when it comes to making your PyTorch models production-ready. These ensure that your code is optimised, easy to maintain, and efficient. But sometimes these are not enough to reduce the execution time of your models. Here are some concepts that can help in that matter.

  • Automatic Mixed Precision (AMP): going from float32 to float16 significantly improves the running time and memory usage of your model while keeping good accuracy (see the sketch after this list).
  • JIT compilation: using torch.compile(...) to compile PyTorch code into optimised kernels is a game-changing feature; combined with recent NVIDIA GPUs, it can dramatically speed up your models (also shown below).
  • Quantization: PyTorch allows some computations to be performed at even lower bit-widths than floating point (float32 or float16). Using int8 (instead of float32), quantization divides the memory usage by 4 and can make your model run up to 4 times faster.
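
As a minimal sketch of the first two points (assuming PyTorch 2.0 or later and a CUDA device; the model, optimiser, and data below are arbitrary placeholders for the example):

# Example of torch.compile and Automatic Mixed Precision in a training step
# -------------------------------------------------------------------------

import torch
import torch.nn as nn

# -------------------------------------------------------------------------

device = torch.device("cuda")

# Small illustrative model, optimiser and synthetic data
model = nn.Linear(16, 1).to(device=device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_function = nn.MSELoss()
inputs = torch.rand(32, 16, device=device)
targets = torch.rand(32, 1, device=device)

# JIT-compile the model into optimised kernels
model = torch.compile(model)

# Scales the loss to avoid float16 gradient underflow
scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad()

# Run the forward pass in mixed precision
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = loss_function(model(inputs), targets)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# -------------------------------------------------------------------------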
