Improving Control and Reproducibility of PyTorch DataLoader with Sampler Instead of Shuffle Argument

hengtao tantai
Feb 22, 2023

DataLoader is a PyTorch class that provides an iterable over a given dataset and can be used to load data efficiently, in parallel, during training or testing of a neural network model.

The shuffle argument of PyTorch's DataLoader constructor is a boolean that controls whether the data is reshuffled at every epoch. It defaults to False. When shuffle is set to True, the data is randomly reshuffled at the start of each epoch, so the samples appear in a different order every epoch.
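For reference, here is a minimal sketch of the shuffle flag in use (MyDataset stands in for any Dataset implementation):

from torch.utils.data import DataLoader

dataset = MyDataset()  # placeholder Dataset

# reshuffle the data at the start of every epoch
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)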

Shuffling the data can be useful in training deep learning models because it helps prevent the model from memorizing the order of the data. If the data is not shuffled, the model may learn to predict the next sample based on the order of the data, rather than learning the underlying patterns in the data.

However, shuffling the data can be computationally expensive for large datasets, and it can also make the training process less reproducible. To address these issues, PyTorch provides several built-in implementations of the Sampler class, which can be used instead of the shuffle argument to control how the data is sampled.


What is a Sampler in DataLoader

A Sampler is an object that defines the strategy for sampling elements from the dataset that the DataLoader will use.

There are different types of samplers available in PyTorch. Some commonly used samplers are:

  1. SequentialSampler: Samples elements from the dataset sequentially.
  2. RandomSampler: Samples elements from the dataset randomly without replacement.
  3. SubsetRandomSampler: Samples elements randomly from a given list of indices.
  4. WeightedRandomSampler: Samples elements from the dataset with given weights.
  5. DistributedSampler: Samples elements for distributed training.

Use a sampler instead of the shuffle argument

It is generally recommended to use a sampler instead of the shuffle argument when creating a DataLoader, especially when you want more control over the sampling process. Here are some reasons why you might want to use a sampler instead:

  1. Reproducibility: When you use the shuffle argument, the order of the data can be different each time you run your code, which can make it difficult to reproduce your results. Using a sampler with a fixed seed allows you to control the order of the data, making your results more reproducible.
  2. Customization: With a sampler, you can define your own sampling strategy, such as sampling a fixed subset of the data, or sampling the data with a certain probability distribution. This can be useful in scenarios where you have imbalanced classes or want to implement a custom sampling strategy.
  3. Flexibility: When you use the shuffle argument, the data is shuffled before each epoch, which can be computationally expensive for large datasets. With a sampler, you have more control over how often the data is shuffled, which can be useful for optimizing performance.

Here is an example of how to create a DataLoader with a sampler instead of the shuffle argument:

from torch.utils.data import DataLoader, SubsetRandomSampler
dataset = MyDataset()

# create a sampler that samples a fixed subset of the data
indices = [0, 1, 2, 3, 4]
sampler = SubsetRandomSampler(indices)

# create a DataLoader with the sampler
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

Note that when you pass a sampler to a DataLoader, the shuffle argument must be left at its default of False; sampler and shuffle are mutually exclusive, and the sampler already defines the sampling strategy.
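For the reproducibility point above, here is a minimal sketch (again using the placeholder MyDataset) that pairs RandomSampler with a seeded generator so the sampling order is identical on every run:

import torch
from torch.utils.data import DataLoader, RandomSampler

dataset = MyDataset()  # placeholder Dataset

# a dedicated generator with a fixed seed makes the sampling order repeatable
generator = torch.Generator()
generator.manual_seed(42)

sampler = RandomSampler(dataset, generator=generator)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)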

SequentialSampler

SequentialSampler is a sampler that samples elements from a dataset sequentially. It simply returns the indices of the dataset elements in sequential order, starting from index 0.

SequentialSampler is the sampler a DataLoader falls back to when no sampler is specified and shuffle=False (the default). This means that by default, the elements in the dataset will be returned in sequential order.

Here’s an example of how to use SequentialSampler with a DataLoader:

from torch.utils.data import DataLoader, SequentialSampler
from my_dataset import MyDataset

my_dataset = MyDataset()
batch_size = 32

dataloader = DataLoader(
    dataset=my_dataset,
    batch_size=batch_size,
    sampler=SequentialSampler(my_dataset),
    num_workers=4,
    pin_memory=True
)

When using SequentialSampler with a DataLoader, the elements of the dataset will be returned in sequential order, which may not be suitable for some types of training or testing. For example, if the dataset has some inherent ordering, such as time-series data, then using a SequentialSampler may not be appropriate. In such cases, other samplers like RandomSampler or SubsetRandomSampler may be more appropriate.

RandomSampler

RandomSampler samples elements randomly from a given dataset without replacement. When used with a DataLoader, the RandomSampler shuffles the indices of the dataset at the beginning of each epoch, and then uses these shuffled indices to create batches of data.

The RandomSampler has the following parameters:

  • data_source (required): The dataset to sample from.
  • replacement (default: False): If True, samples are drawn with replacement.
  • num_samples (default: None): If specified, the sampler will draw num_samples random samples from the dataset with or without replacement (depending on the value of replacement).
  • generator (default: None): A random number generator used to sample indices. By default, the global random generator provided by the torch.random module is used.

The data_source parameter is required, and it should be set to the dataset from which you want to sample. The replacement parameter controls whether or not to sample with replacement. When replacement=True, the sampler samples with replacement, which means that the same index can be selected multiple times. When replacement=False, the sampler samples without replacement, which means that each index is selected only once.

The num_samples parameter allows you to specify the number of samples to draw from the dataset. If num_samples is not specified, the sampler will sample all the indices in the dataset.

The generator parameter allows you to specify a random number generator to use when sampling indices. By default, the global random generator provided by the torch.random module is used.

In short, RandomSampler samples elements randomly: without replacement it iterates over a shuffled permutation of the dataset, and with replacement you can specify num_samples to draw.
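Since this section has no snippet of its own, here is a minimal sketch (still using the placeholder MyDataset) that draws a fixed number of samples with replacement in each epoch; the value 1000 is an arbitrary choice for illustration:

from torch.utils.data import DataLoader, RandomSampler

dataset = MyDataset()  # placeholder Dataset

# draw 1000 indices per epoch, allowing the same index to be picked more than once
sampler = RandomSampler(dataset, replacement=True, num_samples=1000)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)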

SubsetRandomSampler

The SubsetRandomSampler samples randomly from a given list of indices without replacement. This ensures that the same index is not sampled more than once within an epoch, which helps avoid duplicate samples in the training data.

SubsetRandomSampler has only two parameters:

  • indices (sequence) — a sequence of indices
  • generator (Generator) — Generator used in sampling.

from torch.utils.data import DataLoader, SubsetRandomSampler
dataset = MyDataset()

# create a sampler that samples a fixed subset of the data
indices = [0, 1, 2, 3, 4]
sampler = SubsetRandomSampler(indices)

# create a DataLoader with the sampler
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
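A common use of SubsetRandomSampler is splitting one dataset into separate training and validation loaders. Here is a minimal sketch of that idea; the 80/20 split is an arbitrary choice for illustration:

import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

dataset = MyDataset()  # placeholder Dataset

# randomly permute all indices, then carve out an 80/20 train/validation split
indices = torch.randperm(len(dataset)).tolist()
split = int(0.8 * len(indices))
train_sampler = SubsetRandomSampler(indices[:split])
val_sampler = SubsetRandomSampler(indices[split:])

train_loader = DataLoader(dataset, batch_size=32, sampler=train_sampler)
val_loader = DataLoader(dataset, batch_size=32, sampler=val_sampler)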

WeightedRandomSampler

WeightedRandomSampler samples elements from a dataset with given weights. It is useful when you have an imbalanced dataset, where some classes or samples are more common than others, and you want to ensure that your model is trained on a representative subset of the data.

from torch.utils.data import DataLoader, WeightedRandomSampler
from my_dataset import MyDataset

my_dataset = MyDataset()
batch_size = 32

# Create a list of weights, one per element in the dataset.
# This toy example assumes the dataset has 5 elements; the weights
# are relative and do not need to sum to 1.
weights = [0.2, 0.3, 0.1, 0.1, 0.3]

sampler = WeightedRandomSampler(
    weights=weights,
    num_samples=len(my_dataset),
    replacement=True
)

dataloader = DataLoader(
    dataset=my_dataset,
    batch_size=batch_size,
    sampler=sampler,
    num_workers=4,
    pin_memory=True
)

The weights argument should contain one weight per data element in the dataset; the sampler uses these weights to determine the probability of selecting each element, and they do not have to sum to 1. The num_samples argument specifies how many samples to draw per epoch, and replacement=True indicates that sampling with replacement is allowed.

How to calculate weights for WeightedRandomSampler

One common approach is to use the inverse of the frequency of each class or sample in the dataset. Here’s an example of how to calculate the weights for a dataset with two classes:

from torch.utils.data import WeightedRandomSampler
from my_dataset import MyDataset

my_dataset = MyDataset()
class_counts = [0, 0] # Initialize counts for each class

# Count the number of samples in each class
for sample in my_dataset:
    class_counts[sample["label"]] += 1

# Calculate the weight for each sample
weights = [1.0 / class_counts[sample["label"]] for sample in my_dataset]

# Create a sampler with the calculated weights
sampler = WeightedRandomSampler(
    weights=weights,
    num_samples=len(my_dataset),
    replacement=True
)

In this example, we loop through each sample in the dataset and increment the count for the corresponding class, which gives us the number of samples in each class. We then calculate the weight for each sample as the inverse of the count of its class. Finally, we create a WeightedRandomSampler with the calculated weights.

How to get different data in each epoch with WeightedRandomSampler

Set the replacement parameter to True (this is in fact the default for WeightedRandomSampler). When replacement is False, each element can be selected at most once per epoch, so if num_samples equals the dataset size every element appears exactly once and only the order changes between epochs.

When replacement is set to True, elements can be selected more than once, so the set of samples drawn can differ from epoch to epoch.

Here is an example of using WeightedRandomSampler with replacement to get different data in each epoch:

import torch
from torch.utils.data import WeightedRandomSampler, DataLoader

# create a toy dataset with imbalanced class distribution
class_a = torch.randn(100, 3)
class_b = torch.randn(10, 3)
data = torch.cat([class_a, class_b], dim=0)
labels = [0] * len(class_a) + [1] * len(class_b)

# compute class weights based on the frequency of each class in the dataset
class_counts = torch.bincount(torch.tensor(labels))
class_weights = 1.0 / class_counts.double()

# WeightedRandomSampler expects one weight per sample, so map each
# sample's label to the weight of its class
sample_weights = class_weights[torch.tensor(labels)]

# create a weighted random sampler with replacement
sampler = WeightedRandomSampler(weights=sample_weights, num_samples=len(data), replacement=True)

# create a dataloader using the weighted random sampler
dataloader = DataLoader(dataset=torch.utils.data.TensorDataset(data, torch.tensor(labels)),
                        batch_size=10,
                        sampler=sampler)

# iterate over the dataloader for multiple epochs
for epoch in range(3):
    print(f"Epoch {epoch+1}:")
    for batch_idx, (batch_data, batch_labels) in enumerate(dataloader):
        print(f"Batch {batch_idx}: {batch_labels}")

We first create a toy dataset with an imbalanced class distribution: 100 samples from class A and 10 samples from class B. We then compute the class weights from the frequency of each class and expand them into one weight per sample, so every class-A sample gets a weight of 0.01 and every class-B sample a weight of 0.1, reflecting the fact that class B is underrepresented in the dataset.

We create a WeightedRandomSampler with replacement, passing in the per-sample weights and the total number of samples in the dataset, build a DataLoader with it, and iterate over it for three epochs. In each epoch we print the labels of the samples in each batch, and the output shows different samples each time:

Epoch 1:
Batch 0: tensor([0, 1, 0, 0, 0, 0, 1, 1, 1, 0])
Batch 1: tensor([1, 0, 0, 0, 1, 0, 0, 0, 1, 0])
Batch 2: tensor([0, 1])
Epoch 2:
Batch 0: tensor([0, 0, 1, 1, 0, 0, 1, 0, 0, 0])
Batch 1: tensor([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])
Batch 2: tensor([0, 1])
Epoch 3:
Batch 0: tensor([0, 0, 0, 1, 1, 0, 0, 0, 1, 0])
Batch 1: tensor([0, 0, 1, 1, 0, 0, 0, 1, 0, 0])
Batch 2: tensor([1, 0])

Different samples are selected in each epoch, showing that WeightedRandomSampler with replacement draws a fresh random sample every time the DataLoader is iterated.

DistributedSampler

torch.utils.data.distributed.DistributedSampler implements a sampler for distributed training of machine learning models. It is typically used in conjunction with the DataLoader class to load batches of data for training a machine learning model in a distributed setting.

In distributed training, the data is partitioned across multiple processes, with each process working on a different subset of the data. The DistributedSampler ensures that each process samples a non-overlapping subset of the data, even if the data is not evenly divisible by the number of processes.
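Here is a minimal sketch of how DistributedSampler is typically wired up, assuming the process group has already been initialized (for example via torch.distributed.init_process_group) and MyDataset is a placeholder:

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dataset = MyDataset()  # placeholder Dataset

# each process gets a non-overlapping shard of the dataset
sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank(), shuffle=True)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    # set_epoch changes the shuffling seed so each epoch sees a different order
    sampler.set_epoch(epoch)
    for batch in dataloader:
        pass  # training step goes here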

Conclusion

The Sampler class is typically used in conjunction with the DataLoader class to load data in batches. By passing a Sampler instance to the sampler argument of the DataLoader constructor, you can control how data is sampled from the dataset.

PyTorch’s Sampler class provides a flexible and extensible way to control how data is sampled from a dataset, making it easy to customize the training process to your specific needs.

It is recommended to use a sampler instead of the shuffle argument when creating a DataLoader, especially when you want more control over the sampling process. This can improve the reproducibility and efficiency of your training process.

If this article is helpful to you, please clap👏 and follow me😊.
