Creating a custom Dataset and Dataloader in Pytorch

Vineeth S Subramanyam
Published in Analytics Vidhya
9 min read, Jan 29, 2021

Training a deep learning model requires us to convert the data into a format that the model can process. For example, the model might require images with a width of 512 and a height of 512, while the data we collected contains images with a width of 1280 and a height of 720. We therefore need a way to convert the data we have into the exact format the model requires.

In simple terms, a dataloader is a function that iterates through all our available data and returns it in batches. For example, if we have a dataset of 100 images and choose a batch size of 4, the dataloader would process the data and return 25 batches of 4 images each.
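The batching idea can be sketched without torch at all. The snippet below is only an illustration: iterate_batches and fake_images are hypothetical names, and plain integers stand in for real images.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each holding at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Stand-in for a dataset of 100 images: integers instead of real image data.
fake_images = list(range(100))

batches = list(iterate_batches(fake_images, batch_size=4))
print(len(batches), len(batches[0]))  # 25 4
```

A real dataloader adds shuffling, parallel loading and tensor conversion on top of this loop, which is where torch saves us the work.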

Creating a dataloader can be done in many ways, and does not require torch by any means. Using torch, however, makes the task a lot easier. Keeping that in mind, let's start by understanding what the Torch Dataset and DataLoader classes contain.

Torch Dataset:

  1. The Torch Dataset class is an abstract class representing the dataset. It allows us to treat the dataset as an object of a class, rather than a set of data and labels.
  2. The main task of the Dataset class is to return a pair of [input, label] every time it is indexed. We can define functions inside the class to preprocess the data, and return it in the format we require.
  3. The class must contain two main functions:
    __len__(): This is a function that returns the length of the dataset.
    __getitem__(): This is a function that returns one training example for a given index.
  4. The torch Dataset class can be imported with: from torch.utils.data import Dataset
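As a rough sketch of the two required functions, here is a toy Dataset that needs no files at all. SquaresDataset is a hypothetical class made up for illustration (it pairs a number with its square); it is not part of torch or of the cats and dogs example.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Toy dataset: the input is a number, the label is its square."""

    def __init__(self, n):
        self.values = list(range(n))

    def __len__(self):
        # Number of examples in the dataset
        return len(self.values)

    def __getitem__(self, idx):
        # Return one [input, label] pair for the given index
        x = torch.tensor([float(self.values[idx])])
        y = x ** 2
        return x, y


toy_dataset = SquaresDataset(5)
x, y = toy_dataset[3]              # indexing calls __getitem__(3)
print(len(toy_dataset), x, y)
```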

Torch Dataloader:

  1. The Torch DataLoader not only allows us to iterate through the dataset in batches, but also gives us access to built-in features such as multiprocessing (loading multiple batches of data in parallel, rather than one batch at a time), shuffling, etc.
  2. The torch DataLoader takes a torch Dataset as input, and calls the __getitem__() function of the Dataset class once per sample to build each batch of data.
  3. The torch DataLoader class can be imported with: from torch.utils.data import DataLoader
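To make the relationship between the two classes concrete, the sketch below uses a hypothetical RecordingDataset that remembers every index it is asked for. It assumes the default num_workers=0, so that loading happens in the main process and the recorded list is visible afterwards.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RecordingDataset(Dataset):
    """Hypothetical dataset that records every index passed to __getitem__."""

    def __init__(self):
        self.requested = []

    def __len__(self):
        return 6

    def __getitem__(self, idx):
        self.requested.append(idx)
        return torch.tensor(idx)


recording_dataset = RecordingDataset()
# shuffle=False keeps the indices in order, which makes the behaviour easy to follow
loader = DataLoader(recording_dataset, batch_size=3, shuffle=False)

batches = [batch.tolist() for batch in loader]
print(batches)                       # [[0, 1, 2], [3, 4, 5]]
print(recording_dataset.requested)   # [0, 1, 2, 3, 4, 5]
```

Each batch is built from individual __getitem__ calls, so any preprocessing placed in __getitem__ runs once per sample.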

Code:

  • I won't go into the entire process of training a model, but I will explain, step by step, the process of creating the dataset class and the dataloader.
  • The requirements for the code will be:
    numpy: pip3 install numpy
    opencv: pip3 install opencv-python
    torch: pip3 install torch
    glob: part of the Python standard library, so no installation is needed
  • I have created a sample dataset for the task of a classification model, to classify between cats and dogs. The folder structure is as follows. We have the Project folder that contains the code Main.py, and a folder called Dog_Cat_Dataset. This folder called Dog_Cat_Dataset is the dataset folder that contains 2 subfolders inside it called dogs and cats. Both the dogs and cats folders have 5 images each.
- Project
    - Main.py
    - Dog_Cat_Dataset
        - dogs
            - 1.jpeg
            - 2.jpeg
            - 3.jpeg
            - 4.jpeg
            - 5.jpeg
        - cats
            - 1.jpeg
            - 2.jpeg
            - 3.jpeg
            - 4.jpeg
            - 5.jpeg
  • Let's start the code Main.py with a few imports.
import glob
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
  • glob: allows us to retrieve paths of data inside sub folders easily
    cv2: is used as the image processing library to read and preprocess images
    numpy: is used for matrix operations
    torch: is used to create the Dataset and Dataloader classes, and for converting data to tensors.
class CustomDataset(Dataset):
    def __init__(self):
        self.imgs_path = "Dog_Cat_Dataset/"
        # One entry per class folder (dogs, cats)
        file_list = glob.glob(self.imgs_path + "*")
        print(file_list)
        # Collect [image path, class name] pairs for every image
        self.data = []
        for class_path in file_list:
            class_name = class_path.split("/")[-1]
            for img_path in glob.glob(class_path + "/*.jpeg"):
                self.data.append([img_path, class_name])
        print(self.data)
        self.class_map = {"dogs" : 0, "cats": 1}
        self.img_dim = (416, 416)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_path, class_name = self.data[idx]
        img = cv2.imread(img_path)
        img = cv2.resize(img, self.img_dim)
        class_id = self.class_map[class_name]
        # NumPy image array -> tensor, then channels-last -> channels-first
        img_tensor = torch.from_numpy(img)
        img_tensor = img_tensor.permute(2, 0, 1)
        class_id = torch.tensor([class_id])
        return img_tensor, class_id
  • Let's go through this class in detail.
class CustomDataset(Dataset):
  • We create a class called CustomDataset, and pass the argument Dataset, to allow it to inherit the functionality of the Torch Dataset Class.
    def __init__(self):
        self.imgs_path = "Dog_Cat_Dataset/"
        file_list = glob.glob(self.imgs_path + "*")
        print(file_list)
  • We define the init function to initialize our variables. The variable self.imgs_path contains the base path to our Dog_Cat_Dataset folder
  • We then use glob to retrieve the list of all folders inside the base folder we specified. In our example, the Dog_Cat_Dataset folder contains two subfolders called dogs and cats. The "*" appended to self.imgs_path indicates we want to search for all folders or files inside the specified path. (Note that "*" matches every entry inside the folder, files as well as folders. So make sure there is nothing other than the dogs and cats folders in this location.)
  • The print statement would produce output like the following (glob does not guarantee the order of the entries):
    ['Dog_Cat_Dataset/dogs', 'Dog_Cat_Dataset/cats']
  • This file_list variable contains the list of all the classes in the dataset (dogs and cats).
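  • If stray files in the dataset folder are a concern, one possible safeguard is to keep only the directories that glob returns. The sketch below builds a throwaway folder with tempfile so that it can run anywhere; the notes.txt file and the use of os.path.isdir are my additions for illustration, not part of Main.py.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Recreate the layout: two class folders plus one stray file
    for name in ("dogs", "cats"):
        os.makedirs(os.path.join(root, name))
    open(os.path.join(root, "notes.txt"), "w").close()

    # "*" matches every entry, files included
    everything = glob.glob(os.path.join(root, "*"))

    # Keep only the folders, sorted so the order is reproducible
    folders_only = sorted(p for p in everything if os.path.isdir(p))
    class_names = [os.path.basename(p) for p in folders_only]

print(len(everything), class_names)  # 3 ['cats', 'dogs']
```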
        self.data = []
        for class_path in file_list:
            class_name = class_path.split("/")[-1]
            for img_path in glob.glob(class_path + "/*.jpeg"):
                self.data.append([img_path, class_name])
        print(self.data)
  • We now start creating the data list, which will contain the paths to all the images in our dataset.
  • We iterate over all the classes in our file list (dogs and cats), and for each class, we start by extracting the actual class name. As you can see from the printed file list, each class is represented with respect to its base path, i.e. Dog_Cat_Dataset/dogs. Since we would much rather refer to the classes as dogs and cats than by their paths, we create a class_name variable by splitting class_path on "/". (Splitting on "/" returns the list ["Dog_Cat_Dataset", "dogs"]. Taking the [-1] index picks the last entry in the list, which in this case is "dogs".)
  • Now that we are iterating through each class of the dataset (dogs and cats), we want to retrieve each image in its folder (i.e. 1.jpeg, 2.jpeg, etc). To do this we once again use glob, this time to return all files in the folder with the extension ".jpeg". The complete string we pass to glob is Dog_Cat_Dataset/dogs/*.jpeg. The "*.jpeg" indicates we want every file that has an extension of ".jpeg".
  • We append the file path for each image to the list self.data, along with its corresponding class name. This gives us a way to retrieve the input image along with its corresponding label. Printing the list would return the following output.
[['Dog_Cat_Dataset/dogs/4.jpeg', 'dogs'],
 ['Dog_Cat_Dataset/dogs/3.jpeg', 'dogs'],
 ['Dog_Cat_Dataset/dogs/5.jpeg', 'dogs'],
 ['Dog_Cat_Dataset/dogs/1.jpeg', 'dogs'],
 ['Dog_Cat_Dataset/dogs/2.jpeg', 'dogs'],
 ['Dog_Cat_Dataset/cats/4.jpeg', 'cats'],
 ['Dog_Cat_Dataset/cats/3.jpeg', 'cats'],
 ['Dog_Cat_Dataset/cats/5.jpeg', 'cats'],
 ['Dog_Cat_Dataset/cats/1.jpeg', 'cats'],
 ['Dog_Cat_Dataset/cats/2.jpeg', 'cats']]
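  • The same list-building loop can be tried on a temporary folder, as in the sketch below. Two things differ from Main.py and are my own substitutions: os.path.basename replaces split("/") (on Windows, glob may return paths with backslashes, where splitting on "/" would not isolate the folder name), and sorted() is applied because glob does not promise any particular order. The empty .jpeg files are placeholders, not real images.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Two class folders with two placeholder files each
    for class_name in ("dogs", "cats"):
        class_dir = os.path.join(root, class_name)
        os.makedirs(class_dir)
        for i in (1, 2):
            open(os.path.join(class_dir, f"{i}.jpeg"), "w").close()

    data = []
    for class_path in sorted(glob.glob(os.path.join(root, "*"))):
        # Folder name without relying on a specific path separator
        class_name = os.path.basename(class_path)
        for img_path in sorted(glob.glob(os.path.join(class_path, "*.jpeg"))):
            # Store only the file name here so the printed result is short
            data.append([os.path.basename(img_path), class_name])

print(data)
```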
  • We additionally define a class map and an image dimension. The self.class_map dictionary allows us to convert the string name of a class to a number, i.e. "dogs" corresponds to class 0 and "cats" corresponds to class 1. The image dimension is the size that we will resize all the images to, so that they all have the same size.
        self.class_map = {"dogs" : 0, "cats": 1}
        self.img_dim = (416, 416)
  • Now that we have created a list with all our data, we start coding the function for __len__(), which is mandatory for a Torch Dataset object.
    def __len__(self):
        return len(self.data)
  • The size of our dataset is just the number of individual images we have, which can be obtained through the length of the self.data list. (Torch internally uses this function to understand the size of the dataset in its dataloader, to call the __getitem__() function with an index within this dataset size)
    def __getitem__(self, idx):
        img_path, class_name = self.data[idx]
  • We start by defining the __getitem__ function to take the instance itself (self) and an idx as input. The idx is the index this function will be called with, and it determines which entry of our self.data list needs to be returned.
  • We retrieve the image path and class name corresponding to the idx from the self.data list.
        img = cv2.imread(img_path)
        img = cv2.resize(img, self.img_dim)
        class_id = self.class_map[class_name]
  • We use opencv as the image processing library to load the image and resize it to the required dimension.
  • We read the image using the imread function that loads an image from the given image path, and then resize it to the dimensions we specified in the self.img_dim variable.
  • Deep learning models for classification generally make use of a number id associated with the class, rather than a name (“dogs” -> 0). The self.class_map dictionary just provides the mapping from the name to the number.
        img_tensor = torch.from_numpy(img)
        img_tensor = img_tensor.permute(2, 0, 1)
        class_id = torch.tensor([class_id])
  • Once we have loaded the image, and obtained its corresponding class id, we convert the variables to tensors. Training models with torch requires us to convert variables to the torch tensor format, that contain internal methods for calculating gradients, etc.
  • We create the variable img_tensor that is the tensor form of the img we loaded. Opencv uses the library numpy to represent images as matrices, and the torch.from_numpy function allows us to convert a numpy array to a torch tensor.
  • Torch convolutions require images to be in a channels-first format. For example, a 3 channel image (Red, Green and Blue channels) loaded through OpenCV is represented in numpy as (Height, Width, Channels), whereas torch requires (Channels, Height, Width).
  • For this conversion we use the permute function of torch, which allows us to change the ordering of the dimensions of a torch tensor. The arguments we pass to it correspond to the new ordering of dimensions we want. In our case, we start with (Height, Width, Channels):
    (Height -> 0), (Width -> 1), (Channels -> 2)
    We want to reorder these dimensions to make channels first, so we use img_tensor.permute(2, 0, 1), which moves dimension 2 (Channels) to the front, followed by dimensions 0 and 1.
  • We then convert the integer value of class_id to a torch tensor, and also increase its dimensionality by wrapping it in a list as [class_id]. Each label then has shape [1], so a batch of labels has shape [batch_size, 1], i.e. [batch_size, label_dimension]. Using just class_id rather than [class_id] would give a final shape of [batch_size], as each class_id is a single value. Which shape is needed depends on the loss function: nn.BCEWithLogitsLoss, for example, expects targets with the same shape as the model output, while nn.CrossEntropyLoss expects class indices of shape [batch_size].
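  • The shapes involved in these three lines can be checked on synthetic data, as sketched below. The zero-filled array stands in for an image loaded with OpenCV, and torch.stack is used here to imitate how per-sample tensors are combined along a new first dimension when a batch is formed.

```python
import numpy as np
import torch

# Stand-in for an OpenCV image: (height, width, channels)
img = np.zeros((720, 1280, 3), dtype=np.uint8)

img_tensor = torch.from_numpy(img)       # still (720, 1280, 3)
chw = img_tensor.permute(2, 0, 1)        # channels first: (3, 720, 1280)
print(tuple(img_tensor.shape), tuple(chw.shape))

# Labels for a batch of 4 samples, with and without the extra [ ]
class_ids = [0, 1, 1, 0]
wrapped = torch.stack([torch.tensor([c]) for c in class_ids])
plain = torch.stack([torch.tensor(c) for c in class_ids])
print(tuple(wrapped.shape), tuple(plain.shape))  # (4, 1) (4,)
```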
        return img_tensor, class_id
  • We return the img_tensor and class_id, so that any time the __getitem__ function is called with an idx, it returns an image together with its corresponding label.
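  • Note: The image tensor created from an OpenCV image has type uint8, and torch might additionally require the returned tensors to be converted to type float. That can be done with the following modification (the label type to use depends on the loss function; nn.CrossEntropyLoss, for instance, expects integer class labels):
        return img_tensor.float(), class_id.float()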
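  • A quick way to see what the conversion does is to inspect the dtypes before and after, as in this sketch on a tiny synthetic image. The division by 255 is an optional scaling step that I am adding as an assumption; it is not part of Main.py.

```python
import numpy as np
import torch

# Tiny synthetic image with every pixel at the maximum uint8 value
img = np.full((4, 4, 3), 255, dtype=np.uint8)
img_tensor = torch.from_numpy(img).permute(2, 0, 1)

as_float = img_tensor.float()        # uint8 -> float32, values unchanged
scaled = as_float / 255.0            # optional: map pixel values to [0, 1]

label_long = torch.tensor([1])       # integer label (int64 by default)
label_float = label_long.float()     # float label, as in the modification above

print(img_tensor.dtype, as_float.dtype, label_long.dtype, label_float.dtype)
```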
  • Now that we have created the dataset, we can use this class in the torch dataloader to iterate through this data while training.
if __name__ == "__main__":
    dataset = CustomDataset()
    data_loader = DataLoader(dataset, batch_size=4, shuffle=True)
  • To test out the dataset and our dataloader, in the main block of our script, we create an instance of the CustomDataset class and call it dataset.
  • The Torch DataLoader takes this dataset as input, along with other arguments such as batch_size and shuffle, which specify how many images we want in each batch of data and whether we want to randomize the order in which the idx values used by __getitem__ are generated.
    for imgs, labels in data_loader:
        print("Batch of images has shape: ",imgs.shape)
        print("Batch of labels has shape: ", labels.shape)
  • We can test the working of our dataloader by iterating through it, and printing the shape of our input images, and our labels.
  • The torch DataLoader also accepts a number of extra arguments, such as:
  1. num_workers: The number of worker processes used to load the data in parallel. Using more workers can speed up the data loading process.
  2. pin_memory: When training on a GPU device, we want to speed up the transfer of the loaded data from CPU to GPU. Setting it to True places the batches in pinned (page-locked) memory, which can speed up this transfer.
  3. drop_last: It skips the last batch of data if it has fewer samples than batch_size. For example, in our case we have 10 images in our dataset and we specified a batch size of 4. This would mean that the last batch has only 2 images. Setting drop_last=True would skip this last batch, as it has fewer than 4 images.
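  • To see the batch shapes and the effect of drop_last without needing the image files, the sketch below swaps in a hypothetical SyntheticImageDataset that returns blank tensors of the same shape our CustomDataset would produce. It is a stand-in for illustration only.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SyntheticImageDataset(Dataset):
    """Hypothetical stand-in for CustomDataset: 10 blank 3x416x416 images."""

    def __init__(self, num_samples=10):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        img_tensor = torch.zeros(3, 416, 416, dtype=torch.uint8)
        class_id = torch.tensor([idx % 2])   # alternate between class 0 and 1
        return img_tensor, class_id


synthetic_dataset = SyntheticImageDataset()

# 10 samples with a batch size of 4 -> batches of 4, 4 and 2 images
loader = DataLoader(synthetic_dataset, batch_size=4, shuffle=True)
shapes = [(tuple(imgs.shape), tuple(labels.shape)) for imgs, labels in loader]
for img_shape, label_shape in shapes:
    print("images:", img_shape, "labels:", label_shape)

# With drop_last=True the incomplete final batch is skipped
loader_drop = DataLoader(synthetic_dataset, batch_size=4, shuffle=True, drop_last=True)
print(len(loader), len(loader_drop))  # 3 2
```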

Note:

  1. The Dataset class of torch can be used like any other class in python, and can have any number of additional functions in it, as long as it has the 2 required functions (__len__ and __getitem__).
  2. The returned data can also be passed to the GPU, if available. For example: img_tensor = img_tensor.to("cuda"). (The .to() call returns a new tensor rather than modifying the tensor in place, so the result must be assigned.)
  3. The torch dataloader has an additional list of arguments that can be used
    Link
  4. I have uploaded the complete code for this post on github: Code
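
For the GPU note above, a common pattern is to pick the device once and move each batch to it inside the training loop. The sketch below falls back to the CPU when no GPU is present; the blank tensors are placeholders for a batch coming out of the dataloader.

```python
import torch

# Use the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder batch: 4 images and 4 labels
imgs = torch.zeros(4, 3, 8, 8)
labels = torch.zeros(4, 1)

# .to() returns a new tensor, so the result is assigned back
imgs = imgs.to(device)
labels = labels.to(device)
print(imgs.device.type, labels.device.type)
```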
