Comparison of PyTorch Dataset and TorchData DataPipes

Karina Ovchinnikova
Deelvin Machine Learning
Nov 29, 2022 · 5 min read

Deep neural networks can take a long time to train. Training speed is affected by the complexity of the architecture, the batch size, the GPU used, the size of the training dataset, and so on. It may turn out that computations on the GPU are fast enough, but the batches needed for gradient descent are formed too slowly. As a result, GPU utilization and training speed stay low.

In PyTorch, torch.utils.data.Dataset and torch.utils.data.DataLoader are commonly used to load datasets and generate batches. Beginning with version 1.11, PyTorch introduced the TorchData library (in beta), which implements a different approach to loading datasets.

In this article, I want to compare these approaches on the task of classifying face images when the training set does not fit into RAM. To do this, let’s take two face image datasets: CelebA and DigiFace1M. Table 1 shows their comparative characteristics. The ResNet-50 model is used for training. To compare the approaches, we will measure the time needed to train one epoch.

Table 1. Comparative characteristics of CelebA and DigiFace1M.

The code for the experiments is available on GitHub.

All tests were performed on a computer with the following configuration:

  • CPU: Intel(R) Core(TM) i9–9900K CPU @ 3.60GHz (16 cores)
  • GPU: GeForce RTX 2080 Ti 12Gb
  • Driver version 515.65.01 / CUDA 11.7 / CUDNN 8.4.0.27
  • Docker version 20.10.21
  • PyTorch version 1.12.1
  • TorchData version 0.4.1

The training procedure is the same for all approaches.
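The full training script is in the repository linked above; the minimal sketch below only illustrates the idea, assuming a plain ResNet-50 classifier with cross-entropy loss. NUM_CLASSES and the optimizer settings are placeholders rather than the exact values used in the experiments.

```python
import torch
import torch.nn as nn
import torchvision

NUM_CLASSES = 1000  # placeholder: set to the total number of identities in the combined dataset

def train_one_epoch(model, loader, optimizer, criterion, device):
    # One pass over the data; this loop is identical no matter how `loader` builds its batches.
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.resnet50(num_classes=NUM_CLASSES).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# train_one_epoch(model, loader, optimizer, criterion, device)  # `loader` is built in the sections below
```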

Data loading using Dataset

PyTorch supports two types of datasets: map-style and iterable-style. A map-style Dataset is convenient when the number of elements is known in advance; it implements the __getitem__() and __len__() methods. If reading by index is too expensive or even impossible, you can use an iterable-style Dataset, which implements the __iter__() method. In our case, map-style datasets are suitable, since for both CelebA and DigiFace1M we know the number of images.
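To make the difference concrete, here is a toy illustration (not from the original code) of the two styles:

```python
from torch.utils.data import Dataset, IterableDataset

class MapStyleExample(Dataset):
    # Map-style: length is known in advance and elements are accessed by index.
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return self.items[idx]

class IterableStyleExample(IterableDataset):
    # Iterable-style: elements can only be read sequentially, e.g. from a stream.
    def __init__(self, source):
        self.source = source
    def __iter__(self):
        yield from self.source
```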

Let’s create a CelebADataset class to describe the loading logic. For CelebA, the class labels are stored in the identity_CelebA.txt file. Face images in CelebA and DigiFace1M are cropped differently, so to reduce this difference, the __getitem__ method crops each loaded image slightly from all sides.
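The exact implementation is in the repository; the sketch below shows what such a class might look like, assuming the standard img_align_celeba layout, a fixed crop margin, and an optional torchvision transform.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class CelebADataset(Dataset):
    # Map-style dataset for CelebA; the paths and the crop margin are assumptions.
    def __init__(self, root, transform=None, crop=20):
        self.root = root
        self.transform = transform
        self.crop = crop
        self.samples = []
        # identity_CelebA.txt contains "<file_name> <identity>" pairs.
        with open(os.path.join(root, "identity_CelebA.txt")) as f:
            for line in f:
                file_name, label = line.split()
                self.samples.append((file_name, int(label)))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        file_name, label = self.samples[idx]
        image = Image.open(os.path.join(self.root, "img_align_celeba", file_name))
        # Crop a small margin from all sides to bring CelebA cropping closer to DigiFace1M.
        width, height = image.size
        image = image.crop((self.crop, self.crop, width - self.crop, height - self.crop))
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```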

In the DigiFace1M dataset, all images of the same class are stored in a separate folder. The two datasets use the same label values, so to keep them distinct, labels in DigiFace1M are not taken as-is but are offset by the number of classes in CelebA; the add_to_class variable is introduced for this. Images in DigiFace1M are stored in “RGBA” format, so they also need to be converted to “RGB”.
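Again, only a hedged sketch: the folder-per-class layout with numeric folder names is an assumption based on the description above.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class DigiFace1MDataset(Dataset):
    # Map-style dataset for DigiFace1M, where each class has its own folder.
    # add_to_class offsets the labels by the number of CelebA classes so that
    # the two datasets do not share label values.
    def __init__(self, root, add_to_class=0, transform=None):
        self.transform = transform
        self.samples = []
        for class_dir in sorted(os.listdir(root)):
            label = int(class_dir) + add_to_class
            class_path = os.path.join(root, class_dir)
            for file_name in sorted(os.listdir(class_path)):
                self.samples.append((os.path.join(class_path, file_name), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        # DigiFace1M images are stored as RGBA, so convert them to RGB.
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```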

Now we can combine the two datasets into one with torch.utils.data.ConcatDataset, create a DataLoader, and run training.
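For example, reusing the classes sketched above (the paths, the transform, and the label offset are illustrative; the offset should equal the number of CelebA identities):

```python
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

celeba = CelebADataset("data/celeba", transform=transform)
digiface = DigiFace1MDataset("data/digiface1m", add_to_class=10_177,  # assumed CelebA class count
                             transform=transform)

dataset = ConcatDataset([celeba, digiface])
loader = DataLoader(dataset, batch_size=600, shuffle=True, num_workers=10)
```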

Data loading using the TorchData API

Like PyTorch Datasets, TorchData supports iterable-style and map-style DataPipes. It is recommended to use IterDataPipe and to convert it to MapDataPipe only when necessary.

TorchData provides well-optimized data-loading utilities that allow you to build flexible data pipelines. Here are some of them:

  • IterableWrapper. Wraps an iterable object to create an IterDataPipe.
  • FileLister. Given path(s) to the root directory, yields the file pathnames (path + filename) of files within that root directory.
  • Filter. Filters out elements from the source DataPipe according to the input filter_fn (functional name: filter).
  • Mapper. Applies a function to each item from the source DataPipe (functional name: map).
  • Concater. Concatenates multiple Iterable DataPipes (functional name: concat).
  • Shuffler. Shuffles the input DataPipe with a buffer (functional name: shuffle).
  • ShardingFilter. A wrapper that allows a DataPipe to be sharded (functional name: sharding_filter).
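Each of these DataPipes can be used either through its class or through the functional name registered on IterDataPipe. A toy example of the functional form:

```python
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10))       # wrap a plain iterable into an IterDataPipe
dp = dp.filter(lambda x: x % 2 == 0)  # Filter
dp = dp.map(lambda x: x * x)          # Mapper
dp = dp.shuffle()                     # Shuffler
print(list(dp))                       # e.g. [16, 0, 64, 4, 36] -- order depends on the shuffle
```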

In order to build a data pipeline for CelebA and DigiFace1M, we need to perform the following steps:

1. For the CelebA dataset, create a list of (file_name, label, ‘celeba’) tuples and wrap it into an IterDataPipe using IterableWrapper.

2. For DigiFace1M:

  • Use FileLister to create an IterDataPipe that yields the paths to all image files.
  • Apply Mapper with the collate_ann function to each item of the pipeline. This function takes an image path as input and returns a (file_name, label, ‘DigiFace1M’) tuple.

3. After steps 1 and 2, we have two DataPipes that return (file_name, label, data_name). Now they need to be combined into one DataPipe using Concater.

4. Add Shuffler. Data will be shuffled only if shuffle=True is set in DataLoader.

5. Split the DataPipe into shards using ShardingFilter. In this case, each worker processes its own 1/n-th part of the elements of the original DataPipe, where n is the number of workers.

6. At the very end of the data pipeline, add loading of images from the disk.

Importantly, Shuffler must be placed before ShardingFilter so that the data is shuffled globally, before it is split into shards. ShardingFilter, in turn, must come before any expensive operations, so that they are not repeated on every worker.
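Putting the six steps together, the pipeline might look like the sketch below. The paths, the label offset, and the helper functions (including collate_ann, whose real implementation lives in the repository) are assumptions about the layout described above.

```python
import os
from PIL import Image
from torchdata.datapipes.iter import IterableWrapper, FileLister
from torch.utils.data import DataLoader
from torchvision import transforms

CELEBA_ROOT = "data/celeba"       # paths and the offset below are assumptions
DIGIFACE_ROOT = "data/digiface1m"
ADD_TO_CLASS = 10_177             # assumed number of CelebA classes
TRANSFORM = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def load_celeba_annotations():
    # Step 1: build (file_name, label, 'celeba') tuples from identity_CelebA.txt.
    samples = []
    with open(os.path.join(CELEBA_ROOT, "identity_CelebA.txt")) as f:
        for line in f:
            file_name, label = line.split()
            samples.append((os.path.join(CELEBA_ROOT, "img_align_celeba", file_name),
                            int(label), "celeba"))
    return samples

def collate_ann(path):
    # Step 2: derive (file_name, label, 'DigiFace1M') from an image path,
    # assuming the parent folder name is the class id.
    label = int(os.path.basename(os.path.dirname(path))) + ADD_TO_CLASS
    return path, label, "DigiFace1M"

def load_image(sample):
    # Step 6: read the image from disk only at the very end of the pipeline.
    path, label, _data_name = sample
    image = Image.open(path).convert("RGB")
    return TRANSFORM(image), label

celeba_dp = IterableWrapper(load_celeba_annotations())
digiface_dp = FileLister(root=DIGIFACE_ROOT, masks="*.png", recursive=True).map(collate_ann)

datapipe = (celeba_dp.concat(digiface_dp)  # step 3: Concater
            .shuffle()                     # step 4: Shuffler, before sharding for a global shuffle
            .sharding_filter()             # step 5: ShardingFilter, before the expensive image loading
            .map(load_image))              # step 6: load and transform images

loader = DataLoader(datapipe, batch_size=600, shuffle=True, num_workers=10)
```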

The PyTorch DataLoader accepts both Datasets and DataPipes as input.

How else can batch formation be accelerated?

One of the slowest operations in batch formation is reading an image from disk. To reduce the time spent on it, you can read all images once, split them into small shards of, say, 10,000 images each, and save each shard as a .pickle file. The data must be shuffled before it is split, because later the Shuffler in the DataPipe will shuffle the .pickle shards, not the individual images, which can hurt training convergence.
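A hedged sketch of this preparation step is shown below; storing decoded PIL images inside the .pickle files and the shard naming scheme are assumptions, not necessarily what the repository does.

```python
import os
import pickle
import random
from PIL import Image

def pack_to_pickle_shards(samples, out_dir, shard_size=10_000):
    # `samples` is a list of (image_path, label) pairs covering both datasets.
    # Shuffle BEFORE sharding: later, Shuffler will only shuffle whole .pickle files.
    random.shuffle(samples)
    os.makedirs(out_dir, exist_ok=True)
    for start in range(0, len(samples), shard_size):
        shard = []
        for path, label in samples[start:start + shard_size]:
            shard.append((Image.open(path).convert("RGB"), label))
        shard_name = f"shard_{start // shard_size:05d}.pickle"
        with open(os.path.join(out_dir, shard_name), "wb") as f:
            pickle.dump(shard, f)
```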

Now, to build a data pipeline over the prepared data, it is enough to collect the paths to the .pickle shards with FileLister, shuffle them, split them across workers, and load the .pickle data on each worker.
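A possible sketch of that pipeline, reusing the shards produced above (the shard directory and the transform are assumptions):

```python
import pickle
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import FileLister
from torchvision import transforms

TRANSFORM = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def load_pickle_shard(path):
    # Read one .pickle shard and return its list of (image_tensor, label) samples.
    with open(path, "rb") as f:
        return [(TRANSFORM(image), label) for image, label in pickle.load(f)]

datapipe = (FileLister(root="data/shards", masks="*.pickle")
            .shuffle()                 # shuffles the order of shards, not of individual images
            .sharding_filter()         # each worker gets its own subset of shards
            .map(load_pickle_shard)    # load a whole shard on the worker
            .unbatch())                # flatten shards back into individual samples

loader = DataLoader(datapipe, batch_size=600, shuffle=True, num_workers=10)
```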

Summary

As a result, I compared the three ways of loading data for different numbers of workers. For all tests, batch_size = 600.

Table 2. Time spent on learning during one epoch for different approaches.

When training with a DataPipe on the unprepared data (without pickle), I noticed that the first few hundred batches are formed very quickly and GPU utilization is close to 100%, but then the speed gradually drops, and in the end this approach turned out to be even slower than using Datasets with n_workers=10. This was surprising, since I expected roughly the same speed from the two approaches: essentially the same operations are performed in both.

The optimal number of DataLoader workers can differ significantly depending on the task (the size of the images or shards, the complexity of image preprocessing) and the computer configuration (HDD vs. SSD).

When training on datasets with a huge number of small images, it is better to prepare the data once, combining the small files into several larger ones, to reduce the time spent reading from disk. This approach works well with DataPipes. However, you need to shuffle the data thoroughly before writing it into shards to avoid degrading training convergence, and choose a reasonable shard size: large enough to avoid disk bottlenecks, yet small enough for the Shuffler in the DataPipe to mix the data effectively.

This project was conducted by Deelvin. Check out our Deelvin Machine Learning blog for more articles on deep neural network training.
