Speed Up your Algorithms Part 1 — PyTorch

Speed Up your PyTorch models

Puneet Grover
Towards Data Science
8 min read · Sep 23, 2018


(Edit, 28/11/18): Added torch.multiprocessing section.


  1. Introduction
  2. How to check the availability of cuda?
  3. How to get more info on cuda devices?
  4. How to store Tensors and run Models on GPU?
  5. How to select and work on GPU(s) if you have multiple of them?
  6. Data Parallelism
  7. Comparison of Data Parallelism
  8. torch.multiprocessing
  9. References

This post accompanies a Jupyter Notebook available in my repo on GitHub: [SpeedUpYourAlgorithms-Pytorch]

1. Introduction:

In this post I will show how to check for and initialize GPU devices using torch and pycuda, and how to use them to make your algorithms faster.

PyTorch is a machine learning library built on top of torch and backed by Facebook’s AI research group. Since its fairly recent release it has gained a lot of popularity because of its simplicity, its dynamic graphs, and its pythonic nature. It does not lag behind in speed either, and in many cases it can even outperform other frameworks.

pycuda lets you access NVIDIA’s CUDA parallel computation API from Python.

2. How to check the availability of cuda?

To check whether a cuda device is available using torch, you can simply run:

import torch
torch.cuda.is_available()
# True
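
A common pattern built on this check is to pick the device once and reuse it, so that the same script runs with or without a GPU. A minimal sketch (the names `device` and `x` are only illustrative):

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors created with `device=` live on the selected device.
x = torch.ones(2, 3, device=device)
print(x.device.type)  # "cuda" on a GPU machine, "cpu" otherwise
```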

3. How to get more info on your cuda devices?

To get basic info on your devices you can use torch.cuda. To get more detail you can use pycuda, a Python wrapper around the CUDA library. You can use something like:

import torch
import pycuda.driver as cuda
cuda.init() # pycuda's driver API must be initialized before use
## Get Id of default device
torch.cuda.current_device()
# 0
cuda.Device(0).name() # '0' is the id of your GPU
# Tesla K80

Or:

torch.cuda.get_device_name(0) # Get name of device with ID '0'
# 'Tesla K80'

I wrote a simple class to get information on your cuda compatible GPU(s):
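
A minimal sketch of such a helper follows. It is not the class mentioned above, only an illustration built on `torch.cuda.get_device_properties`. The function name `list_cuda_devices` and the dictionary keys are my own choices for this example:

```python
import torch

def list_cuda_devices():
    """Return one dict of basic properties per visible CUDA device."""
    devices = []
    if not torch.cuda.is_available():
        return devices  # no GPU: empty list
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        devices.append({
            "id": idx,
            "name": props.name,
            "total_memory_gb": props.total_memory / 1024 ** 3,
            "compute_capability": (props.major, props.minor),
            "multiprocessors": props.multi_processor_count,
        })
    return devices

for info in list_cuda_devices():
    print(info)
```

On a machine without a GPU the function returns an empty list, so the loop prints nothing.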

To get the current memory usage you can use PyTorch's functions such as:

import torch
# Returns the current GPU memory usage by
# tensors in bytes for a given device
torch.cuda.memory_allocated()
# Returns the current GPU memory managed by the
# caching allocator in bytes for a given device
# (named memory_cached() in older PyTorch releases)
torch.cuda.memory_reserved()

And after you have run your application, you can clear your cache using a simple command:

# Releases all unoccupied cached memory currently held by
# the caching allocator so that it can be used by other
# GPU applications and is visible in nvidia-smi
torch.cuda.empty_cache()

However, this command will not free the GPU memory occupied by tensors, so it cannot increase the amount of GPU memory available to PyTorch.

These memory methods are only available for GPUs, which is where they are actually needed.
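
To see these counters move, here is a small sketch that measures memory before and after allocating and releasing a tensor. The helper name `gpu_memory_report` is hypothetical, and on a machine without a GPU it just reports zeros:

```python
import torch

def gpu_memory_report(device_index=0):
    """Bytes held by tensors and bytes reserved by the caching allocator."""
    if not torch.cuda.is_available():
        # The memory counters only exist for CUDA devices.
        return {"allocated": 0, "reserved": 0}
    return {
        "allocated": torch.cuda.memory_allocated(device_index),
        "reserved": torch.cuda.memory_reserved(device_index),
    }

before = gpu_memory_report()
if torch.cuda.is_available():
    x = torch.empty(1024, 1024, device="cuda")  # about 4 MB of float32
    print(gpu_memory_report())                  # "allocated" has grown
    del x                                       # drop the only reference
    torch.cuda.empty_cache()                    # release the unused cached block
after = gpu_memory_report()
print(before, after)
```

The order matters here: the tensor has to be deleted first, because `empty_cache()` only releases memory that no tensor is using any more.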

4. How to store Tensors and run Models on GPU?

The .cuda magic.

If you want to store something on cpu, you can simply write:

a = torch.DoubleTensor([1., 2.])

This vector is stored on the cpu, and any operation you do on it will be done on the cpu. To transfer it to the gpu you just have to call .cuda():

a = torch.FloatTensor([1., 2.]).cuda()

Or:

a = torch.cuda.FloatTensor([1., 2.])

And this will select the default device for it, which can be seen with the command:

torch.cuda.current_device()
# 0

Or, you can also do:

a.get_device()
# 0

You can also send a model to the GPU device. For example, consider a simple module made from nn.Sequential:

import torch.nn as nn

sq = nn.Sequential(
    nn.Linear(20, 20),
    nn.Linear(20, 4),
)

To send this to GPU device, simply do:

model = sq.cuda()

To check whether the model is on the GPU device, check whether its parameters are on the GPU:

# From the discussions here: discuss.pytorch.org/t/how-to-check-if-model-is-on-cuda
next(model.parameters()).is_cuda
# True
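
The inputs have to live on the same device as the model's parameters. A minimal device-agnostic sketch using `.to(device)`, which also runs on a machine without a GPU (the batch size of 8 is arbitrary):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model's parameters to the selected device.
model = nn.Sequential(nn.Linear(20, 20), nn.Linear(20, 4)).to(device)

# Create the input directly on the same device as the model.
batch = torch.randn(8, 20, device=device)

with torch.no_grad():
    out = model(batch)

print(out.shape, next(model.parameters()).device)
```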

5. How to select and work on GPU(s) if you have multiple of them?

You can select a GPU for your current application/storage which can be different from the GPU you selected for your last application/storage.

As already seen in part (3), we can get all our cuda compatible devices and their Ids using pycuda, so we will not discuss that here.

Considering you have 3 cuda compatible devices, you can initialize and allocate tensors to a specific device like this:

cuda0 = torch.device('cuda:0')
cuda1 = torch.device('cuda:1')
cuda2 = torch.device('cuda:2')
# If you use 'cuda' only, Tensors/models will be sent to
# the default (current) device. (default = 0)
# Note: torch.tensor is used here because the legacy
# torch.Tensor constructor rejects a cuda device argument.
x = torch.tensor([1., 2.], device=cuda1)
# Or
x = torch.Tensor([1., 2.]).to(cuda1)
# Or
x = torch.Tensor([1., 2.]).cuda(cuda1)
# If you want to change the default device, use:
torch.cuda.set_device(2) # where '2' is Id of device
# And if you want to use only 2 of the 3 GPUs, you
# will have to set the environment variable
# CUDA_VISIBLE_DEVICES equal to say, "0,2" if you
# only want to use first and third GPUs. Now if you
# check how many GPUs you have, it will show two (0, 1).
# Set it before the first CUDA call, or it has no effect.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"

When you do any operation on these Tensors, which you can do irrespective of the selected device, the result will be saved on the same device as the Tensor.

x = torch.Tensor([1., 2.]).to(cuda2)
y = torch.Tensor([3., 4.]).to(cuda2)
# This Tensor will be saved on 'cuda2' only
z = x + y

If you have multiple GPUs, you can split your application’s work among them, but this comes with an overhead of communication between them. If your application doesn’t need to relay messages too much, you can give it a go.

Actually, there is one more problem. In PyTorch all GPU operations are asynchronous by default. PyTorch does the necessary synchronization when copying data between CPU and GPU or between two GPUs, but if you create your own stream with torch.cuda.Stream() you will have to look after the synchronization of instructions yourself.

To give an example from PyTorch's documentation, this is incorrect:

cuda = torch.device('cuda')
s = torch.cuda.Stream() # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # because sum() may start execution before normal_() finishes!
    B = torch.sum(A)
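
One way to make this example safe is to synchronize the two streams explicitly with `wait_stream`. The sketch below is my own variant, not part of the documentation example. The function name `sum_on_side_stream` is hypothetical, and the code falls back to plain CPU execution when no GPU is present:

```python
import torch

def sum_on_side_stream(n=100):
    """Sum a random matrix on a side stream, with explicit synchronization."""
    if not torch.cuda.is_available():
        # Streams are a CUDA concept; on CPU the operations simply run in order.
        return torch.empty((n, n)).normal_(0.0, 1.0).sum()

    cuda = torch.device("cuda")
    s = torch.cuda.Stream()  # create a new stream
    A = torch.empty((n, n), device=cuda).normal_(0.0, 1.0)

    # Make the side stream wait for the work already queued on the
    # current stream (here: normal_()) before it runs sum().
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        B = torch.sum(A)

    # Make the current stream wait for the side stream before B is used on it.
    torch.cuda.current_stream().wait_stream(s)
    return B

total = sum_on_side_stream()
print(total.item())
```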

If you want to use multiple GPUs to their full potential, you can:

  1. use all GPUs for different tasks/applications,
  2. use each GPU for one model in an ensemble or stack, each GPU having a copy of data (if possible), as most processing is done during fitting to the model,
  3. use each GPU with sliced input and copy of model in each GPU. Each GPU will compute result separately and will send their results to a destination GPU where further computation will be done, etc.

6. Data Parallelism

In data parallelism we split the data, a batch, that we get from Data Generator into smaller mini batches, which we then send to multiple GPUs for computation in parallel.

In PyTorch data parallelism is implemented using torch.nn.DataParallel.

But we will look at a simple example to see what is going on under the hood. To do that we will have to use some of the functions of nn.parallel, namely:

  1. Replicate: To replicate Module on multiple devices.
  2. Scatter: To distribute the input in the first dimension among those devices.
  3. Gather: To gather and concatenate the input in first dimension from those devices.
  4. parallel_apply: To apply a set of distributed inputs, which we got from Scatter, to the corresponding set of distributed Modules, which we got from Replicate.

# Replicate module to devices in device_ids
replicas = nn.parallel.replicate(module, device_ids)
# Distribute input to devices in device_ids
inputs = nn.parallel.scatter(input, device_ids)
# Apply the models to corresponding inputs
outputs = nn.parallel.parallel_apply(replicas, inputs)
# Gather result from all devices to output_device
result = nn.parallel.gather(outputs, output_device)

Or, simply:

model = nn.DataParallel(model, device_ids=device_ids)
result = model(input)
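
The four steps can also be imitated on a single CPU to check that splitting a batch and concatenating the partial outputs gives the same result as one full forward pass. This is only an emulation for illustration: it uses `copy.deepcopy`, `torch.chunk` and `torch.cat` instead of the nn.parallel functions, and `n_workers` stands in for the number of GPUs:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 3)
batch = torch.randn(8, 10)
n_workers = 2  # stands in for the number of GPUs

# Replicate: one copy of the module per worker.
replicas = [copy.deepcopy(model) for _ in range(n_workers)]
# Scatter: split the batch along the first dimension.
chunks = torch.chunk(batch, n_workers, dim=0)

with torch.no_grad():
    # parallel_apply: each replica processes its own chunk (sequentially here).
    outputs = [replica(chunk) for replica, chunk in zip(replicas, chunks)]
    # Gather: concatenate the partial results along the first dimension.
    result = torch.cat(outputs, dim=0)
    # Reference: the original model applied to the whole batch at once.
    expected = model(batch)

print(torch.allclose(result, expected, atol=1e-5))
```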

7. Comparison of Data Parallelism

I don’t have multiple GPUs, but I was able to find a great post by Ilia Karmanov (see reference 5) and his GitHub repo comparing most frameworks using multiple GPUs.

His results:

[Last updated: Jun 19, 2018, i.e. his GitHub repo.] The launch of PyTorch 1.0 and TensorFlow 2.0, and also new GPUs, might have changed this …

So, as you can see, parallel processing definitely helps, even if it has to communicate with the main device at the beginning and at the end. PyTorch gives results faster than all the other frameworks except Chainer, which is ahead only in the multi-GPU case. PyTorch makes it simple too, with just one call to DataParallel.
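
As a sketch of that single call, the snippet below wraps a model in nn.DataParallel only when more than one GPU is visible, so the same script still runs on one GPU or on CPU. The layer sizes and batch size are arbitrary:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 20), nn.Linear(20, 4))

if torch.cuda.device_count() > 1:
    # Several GPUs: replicate the model and split each batch across them.
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

batch = torch.randn(16, 20, device=device)
with torch.no_grad():
    result = model(batch)  # same call whether or not the model is wrapped
print(result.shape)
```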

8. torch.multiprocessing

torch.multiprocessing is a wrapper around Python's multiprocessing module, and its API is 100% compatible with the original module. So you can use the Queues, Pipes, Arrays etc. from Python’s multiprocessing module here. On top of that, to make things faster, it adds a method, share_memory_(), which moves data into a state where any process can use it directly, so passing that data as an argument to different processes won’t make a copy of it.

You can share Tensors and model parameters, and you can share them on CPU or GPU as you like.

Warning from PyTorch (regarding sharing on GPU):
CUDA API requires that the allocation exported to other processes remains valid as long as it’s used by them. You should be careful and ensure that CUDA tensors you shared don’t go out of scope as long as it’s necessary. This shouldn’t be a problem for sharing model parameters, but passing other kinds of data should be done with care. Note that this restriction doesn’t apply to shared CPU memory.

You can use the methods described in the “Pool and Process” section here, and to get more speedup you can use the share_memory_() method to share a Tensor (say) among all processes without it being copied.

# Training a model using multiple processes:
# (data_loader, loss_fn, optimizer, n_in, n_h1 and n_out are
# assumed to be defined elsewhere)
import torch.nn as nn
import torch.multiprocessing as mp

def train(model):
    for data, labels in data_loader:
        optimizer.zero_grad() # Clear gradients from the previous step
        loss_fn(model(data), labels).backward()
        optimizer.step() # This will update the shared parameters

model = nn.Sequential(nn.Linear(n_in, n_h1),
                      nn.Linear(n_h1, n_out))
model.share_memory() # Required for 'fork' method to work
processes = []
for i in range(4): # No. of processes
    p = mp.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes: p.join()
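
For a single Tensor, the effect of share_memory_() can be checked with is_shared(). A minimal CPU-only sketch that does not spawn any processes:

```python
import torch

t = torch.zeros(3)
print(t.is_shared())  # state before sharing

t.share_memory_()     # move the underlying storage to shared memory, in place
print(t.is_shared())  # state after sharing

# In-place updates write to the same shared storage, so a process that
# received `t` as an argument would see them without a copy being made.
t.add_(1)
print(t)
```

Only in-place operations such as add_() keep writing to the shared storage. An assignment like t = t + 1 would create a new, unshared tensor.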

You can also work with a cluster of machines. For more info see here.

9. References:

  1. https://documen.tician.de/pycuda/
  2. https://pytorch.org/docs/stable/notes/cuda.html
  3. https://discuss.pytorch.org/t/how-to-check-if-model-is-on-cuda
  4. https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
  5. https://medium.com/@iliakarmanov/multi-gpu-rosetta-stone-d4fa96162986

Suggestions and reviews are welcome.
Thank you for reading!



