Building Custom Datasets for PyTorch Deep Learning Image Classification

Joshua Phuong Le
9 min read · Nov 22, 2022



Check out the full PyTorch implementation using this dataset in my other articles (pt.1, pt.2).

1. Introduction

After some time using built-in datasets such as MNIST and CIFAR, which load directly from common machine learning frameworks, you have practiced building your first deep learning image classifiers. The natural next step is to use your own datasets. Maybe you have a very specific use case at work and want to train a custom model on your company's image database, or you simply want to practice on images scraped from the internet. Luckily, PyTorch makes this (rather) easy, as long as you organize your images neatly in folders.

This article uses the Intel Image Classification dataset, which can be found here. Once downloaded, images of the same class are grouped inside a folder named after that class (e.g., “buildings”, “forest”...), and these class labels are consistent between the training and testing sets. This is a common way of organizing images into folders in practice, and hence it serves as a good sample dataset. Here, everything is placed in a parent folder called `data`, as seen below.

the structure of the Intel dataset
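Based on the class folders in this dataset, the layout looks roughly like this (a sketch, not an exhaustive listing):

data
├── seg_train
│   ├── buildings
│   ├── forest
│   ├── glacier
│   ├── mountain
│   ├── sea
│   └── street
└── seg_test
    ├── buildings
    ├── forest
    ├── glacier
    ├── mountain
    ├── sea
    └── street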

2. Annotation File Preparation

Firstly, we define the folder directories that contain the training and testing data.

train_folder = r'.\data\seg_train'
test_folder = r'.\data\seg_test'

To construct the custom dataset later, it is useful to organize the images into an annotation file, so that we can tell PyTorch that an image with a specific path belongs to a specific class. It is of course not practical to rename every image in a class folder to its class name plus some numerical ID. Instead, we will use the folder name as the class name, and each image's full path as its unique address.

This can be done with the function build_csv below. The explanation for each step is included in the comments. Essentially, the function starts by writing the column headers to a blank csv file, then goes through each sub-folder in the specified directory (e.g., the training folder) and uses the sub-folder names as class names. Each image's full path, together with its class name and class index, is then written to the csv file row by row.

import csv
import os

def build_csv(directory_string, output_csv_name):
    """Builds a csv annotation file for PyTorch training from a directory of class folders of images.
    Note: csv and os are part of the Python standard library.
    Args:
        directory_string: string of directory path, e.g. r'.\data\train'
        output_csv_name: string of output csv file name, e.g. 'train.csv'
    Returns:
        csv file with file names, file paths, class names and class indices
    """
    directory = directory_string
    class_lst = os.listdir(directory) # returns a LIST containing the names of the entries (folder names in this case) in the directory
    class_lst.sort() # IMPORTANT: sort so class indices are consistent across train and test sets
    with open(output_csv_name, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(['file_name', 'file_path', 'class_name', 'class_index']) # create column names
        for class_name in class_lst:
            class_path = os.path.join(directory, class_name) # join the directory path and the class folder name
            file_list = os.listdir(class_path) # get list of files in the class folder
            for file_name in file_list:
                file_path = os.path.join(directory, class_name, file_name) # full path to the image file
                writer.writerow([file_name, file_path, class_name, class_lst.index(class_name)]) # write one row per image
    return

import pandas as pd

build_csv(train_folder, 'train.csv')
build_csv(test_folder, 'test.csv')
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

After defining this helper function, we create two csv annotation files, one for training and one for testing. Then they are re-imported as two dataframes for downstream steps. The output csv looks something like this.

Annotation CSV file

You could also return the dataframe directly from the function if that is more convenient. Writing a csv file simply makes it easier to pass the data and annotations elsewhere, and writing csv rows is arguably more straightforward than assembling a Pandas dataframe.
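For example, a minimal variant that returns a dataframe instead could look like this (a sketch; build_dataframe is a hypothetical name, not part of the original code):

import os
import pandas as pd

def build_dataframe(directory_string):
    """Builds an annotation dataframe from a directory of class folders of images."""
    class_lst = sorted(os.listdir(directory_string)) # sort for consistent class indices
    rows = []
    for class_index, class_name in enumerate(class_lst):
        class_path = os.path.join(directory_string, class_name)
        for file_name in os.listdir(class_path):
            rows.append({
                'file_name': file_name,
                'file_path': os.path.join(class_path, file_name),
                'class_name': class_name,
                'class_index': class_index,
            })
    return pd.DataFrame(rows)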

Note that we deliberately write the full paths of the image files, as they are needed to load the images in the dataset class later.

Another important note is highlighted as IMPORTANT in the function: sorting the class list. This step is necessary because the order of entries returned by os.listdir is arbitrary (it depends on the filesystem), so it can differ between the training and testing folders, and hence the index of a class may not be consistent between the training and testing datasets without sorting.

To verify this point, we can zip the class names and indices, then compare them between the training and testing datasets. As you can see, the class names and their indices are consistent in these two sets.

class_zip = zip(train_df['class_index'], train_df['class_name'])
my_list = []
for index, name in class_zip:
    tup = tuple((index, name))
    my_list.append(tup)
unique_list = list(set(my_list))
print('Training:')
print(sorted(unique_list))
print()

class_zip = zip(test_df['class_index'], test_df['class_name'])
my_list = []
for index, name in class_zip:
    tup = tuple((index, name))
    my_list.append(tup)
unique_list = list(set(my_list))
print('Testing:')
print(sorted(unique_list))

The outputs show that the class names and indices are consistent.

Training:
[(0, 'buildings'), (1, 'forest'), (2, 'glacier'), (3, 'mountain'), (4, 'sea'), (5, 'street')]

Testing:
[(0, 'buildings'), (1, 'forest'), (2, 'glacier'), (3, 'mountain'), (4, 'sea'), (5, 'street')]
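As a side note, a more compact way to get the same unique (index, name) pairs would be pandas' drop_duplicates (a sketch, equivalent in effect to the loop above):

# unique (class_index, class_name) pairs, sorted by index
pairs = (train_df[['class_index', 'class_name']]
         .drop_duplicates()
         .sort_values('class_index'))
print(list(pairs.itertuples(index=False, name=None)))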

Finally, we can extract the class names from either dataset (preferably the training set, as it should contain all classes) to serve as the final reference.

class_names = list(train_df['class_name'].unique())
print(class_names)
# ['buildings', 'forest', 'glacier', 'mountain', 'sea', 'street']
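If you later need to decode a predicted class index back to its name, a simple mapping can be built from the annotation dataframe (a small sketch; idx_to_class is a hypothetical helper name):

# map class indices back to class names, e.g. for labelling predictions
pairs = train_df[['class_index', 'class_name']].drop_duplicates()
idx_to_class = dict(zip(pairs['class_index'], pairs['class_name']))
print(idx_to_class[0]) # expected: 'buildings'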

3. Creating Customized Training And Testing Datasets

After taking care of the annotation files, we will build custom training and testing datasets with the Dataset class in torch.utils.data. As the documentation tutorial (link) explains, Dataset is an abstract class representing a dataset. To build a custom dataset, we subclass Dataset and override its __len__ and __getitem__ methods for our use case.

Again, the explanation for each step is included in the comments. In summary, we pass two important arguments to the __init__ method: the annotation csv file and the image transformation (more on image transformations later).

class IntelDataset(torch.utils.data.Dataset): # inheriting from Dataset class
    def __init__(self, csv_file, root_dir="", transform=None):
        self.annotation_df = pd.read_csv(csv_file)
        self.root_dir = root_dir # root directory of images; leave "" if using the full-path column in __getitem__
        self.transform = transform

    def __len__(self):
        return len(self.annotation_df) # return length (number of rows) of the dataframe

    def __getitem__(self, idx):
        image_path = os.path.join(self.root_dir, self.annotation_df.iloc[idx, 1]) # use image path column (index = 1) in csv file
        image = cv2.imread(image_path) # read image with cv2
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # convert from BGR to RGB for matplotlib
        class_name = self.annotation_df.iloc[idx, 2] # use class name column (index = 2) in csv file
        class_index = self.annotation_df.iloc[idx, 3] # use class index column (index = 3) in csv file
        if self.transform:
            image = self.transform(image)
        return image, class_name, class_index

The end result is that for each instance of the customized class, we will have the transformed image itself, and its corresponding class name and index according to the annotation file prepared above.

Note that here, we make use of the full image paths to read the images with OpenCV (cv2). One important caveat: cv2.imread returns pixel arrays in Blue-Green-Red (BGR) channel order, not the usual RGB order. Thus, the conversion from BGR to RGB is included for downstream steps where RGB is expected (e.g., matplotlib).

Another note is that we leave the `root_dir` argument blank (an empty string), because we already store the full image paths in the annotation file.

Hence you can see that preparing a good annotation file helps keep everything we need neatly in place.
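As a side note, if you prefer PIL over OpenCV, the image-reading lines in __getitem__ could be swapped for something like the following (a sketch, assuming Pillow is installed; PIL reads in RGB order, so no BGR-to-RGB conversion is needed, and torchvision transforms accept PIL images directly):

from PIL import Image

image = Image.open(image_path).convert('RGB') # already in RGB channel order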

Now, we have finished the necessary steps to build our own dataset. Let's create a sample untransformed training dataset from the class above and visualize some random images with their class names and indices by indexing into the dataset.

#test dataset class without transformation:
train_dataset_untransformed = IntelDataset(csv_file='train.csv', root_dir="", transform=None)

#visualize 10 random images from the loaded dataset
plt.figure(figsize=(12, 6))
for i in range(10):
    idx = random.randint(0, len(train_dataset_untransformed) - 1) # randint is inclusive on both ends
    image, class_name, class_index = train_dataset_untransformed[idx]
    ax = plt.subplot(2, 5, i+1) # create an axis
    ax.title.set_text(class_name + '-' + str(class_index)) # title the axis with the class name and index
    plt.imshow(image) # show the image

untransformed_sample_images

4. Image Transformation

A common practice in image classification tasks is transforming the input images. Transforming converts an image from one form to another, in terms of size, shape, pixel range, etc., while keeping the essential image information largely unchanged. Ideally, this increases the robustness of the classifier, as it is exposed to many variations of each image class rather than only 'nice looking' ones.

An important step is converting the image array to a tensor, the data format PyTorch works with, instead of a numpy or PIL array. torchvision's ToTensor also reorders the axes: the resulting tensor has the shape Channel × Height × Width (C × H × W) instead of the original (H × W × C), and pixel values are scaled from [0, 255] to [0.0, 1.0]. Hence, if we want to visualize an image, `permute` is used to restore the original order.
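A quick sketch of what ToTensor does to shapes and value ranges (using a dummy uint8 array standing in for an image):

import numpy as np
from torchvision import transforms

dummy = np.random.randint(0, 256, size=(150, 150, 3), dtype=np.uint8) # H x W x C, values 0-255
tensor = transforms.ToTensor()(dummy)
print(tensor.shape) # torch.Size([3, 150, 150]) -> C x H x W
print(tensor.min().item(), tensor.max().item()) # values scaled into [0.0, 1.0]
print(tensor.permute(1, 2, 0).shape) # torch.Size([150, 150, 3]) -> back to H x W x C for plotting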

Another important transformation we will use is normalization. A channel of a standard 24-bit color image has an intensity range from 0 to 255 (8 bits per channel), with potentially very different distributions across images. Normalizing the channels to a small range centred near zero keeps the input values on a comparable scale, which gives better-conditioned gradients and typically faster training. More information on normalization and other common transformation techniques can be found in the link below.

https://inside-machinelearning.com/en/why-and-how-to-normalize-data-object-detection-on-image-in-pytorch-part-1/
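With mean=0.5 and std=0.5 per channel (the values used below), Normalize maps each pixel value x (already in [0, 1] after ToTensor) to (x - 0.5) / 0.5, i.e. into the range [-1, 1]. A tiny worked example:

import torch
from torchvision import transforms

normalize = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
x = torch.tensor([0.0, 0.5, 1.0]).reshape(3, 1, 1) # one pixel per channel, shape C x H x W
print(normalize(x).flatten()) # tensor([-1.,  0.,  1.])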

In PyTorch, common image transformation methods are available in the torchvision.transforms module. Multiple transformation steps, such as resizing, augmenting, and normalizing, can be chained together using Compose. Now, let's create the transformation pipeline, pass it to the Dataset class's transform argument, and visualize some resulting images.

# create a transform pipeline
image_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    transforms.Resize((224, 224), interpolation=PIL.Image.BILINEAR) # newer torchvision prefers transforms.InterpolationMode.BILINEAR
])
#create datasets with transforms:
train_dataset = IntelDataset(csv_file='train.csv', root_dir="", transform=image_transform)
test_dataset = IntelDataset(csv_file='test.csv', root_dir="", transform=image_transform)

#visualize 10 random images from the loaded transformed train_dataset
plt.figure(figsize=(12, 6))
for i in range(10):
    idx = random.randint(0, len(train_dataset) - 1) # randint is inclusive on both ends
    image, class_name, class_index = train_dataset[idx]
    ax = plt.subplot(2, 5, i+1) # create an axis
    ax.title.set_text(class_name + '-' + str(class_index)) # title the axis with the class name and index
    # the transformed tensor is (C * H * W) instead of the original (H * W * C),
    # hence use permute to restore the order for plotting
    plt.imshow(image.permute(1, 2, 0)) # show the image

transformed_sample_images

You can see that these images appear “darker” than the original versions. After normalization with mean 0.5 and std 0.5, the pixel values lie between -1 and 1, and matplotlib clips float values to the [0, 1] range for display, so the negative values are rendered as black.
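If you want to display the images with their original appearance, you can undo the normalization before plotting (a small sketch; unnormalize is a hypothetical helper, not part of the article's code):

def unnormalize(tensor, mean=0.5, std=0.5):
    """Reverse Normalize(mean, std) so the tensor is back in [0, 1] for display."""
    return tensor * std + mean

plt.imshow(unnormalize(image).permute(1, 2, 0))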

5. Conclusion

With this, we have completed building our custom dataset for training CNN models. Stay tuned for my future articles on how to build a DataLoader and use it to feed images to model training, as well as transfer learning / ensemble transfer learning.
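As a quick preview, the datasets built above plug straight into PyTorch's DataLoader (a minimal sketch; the batch size and shuffling here are illustrative choices):

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
for images, class_names, class_indices in train_loader:
    print(images.shape) # torch.Size([32, 3, 224, 224])
    break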


Joshua Phuong Le

I’m a data scientist having fun writing about my learning journey. Connect with me at https://www.linkedin.com/in/joshua3112/