Implementing CNN in PyTorch with Custom Dataset and Transfer Learning

Ajinkya Pahinkar
Published in Analytics Vidhya · 10 min read · Aug 21, 2020

This article is a guide to implementing CNN algorithms in PyTorch. It assumes you have some knowledge of CNNs and their common models/architectures; the focus here is on the implementation, with best coding practices for PyTorch. Inception is used in this particular use case because its modules were designed to address computational expense as well as overfitting, among other issues.

Key Takeaways:

  • Working with file directories in Python
  • Creating custom datasets in PyTorch with Dataset and DataLoader
  • Using transfer learning for cats-and-dogs image classification
  • How to move data to the GPU for training and create efficient training loops

Dataset: https://www.kaggle.com/c/dogs-vs-cats/data

The dataset consists of images of cats and dogs, and our task is to classify each image into its respective category. It contains a train folder and a test folder, along with a sample submission file (for Kaggle submissions, which are beyond the scope of this article).

Creating train_csv

import pandas as pd
import os
import torch

device = ("cuda" if torch.cuda.is_available() else "cpu")

# Build a dataframe mapping each training image file name to its label
train_df = pd.DataFrame(columns=["img_name", "label"])
train_df["img_name"] = os.listdir("train/")
for idx, name in enumerate(train_df["img_name"]):
    if "cat" in name:
        train_df.loc[idx, "label"] = 0
    if "dog" in name:
        train_df.loc[idx, "label"] = 1

train_df.to_csv("train_csv.csv", index=False, header=True)

After importing the requisite libraries, we set device to cuda in order to utilize GPU resources for training. To check whether the GPU is being used, print(device); the output will be either "cuda" or "cpu" depending on the availability of a GPU on your system. Those trying to utilize a GPU for training must install PyTorch with the matching cudatoolkit version; see the official PyTorch installation guide.
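As a quick sanity check (a minimal sketch, assuming only the imports above), you can confirm the device and verify that tensors can be moved to it:

print(device)  # prints "cuda" or "cpu"

# moving a tensor works the same way as moving a model or a batch later on
x = torch.zeros(2, 3).to(device)
print(x.device)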

For the first part we need to create a csv file with the image filenames and their corresponding labels for the images in the train folder. Hence we create a pandas DataFrame with "img_name" and "label" as the headings. Then we use os.listdir to get a list of all file names in the "train/" directory. All file names have "cat" or "dog" as part of the name, so we use this as a conditional to assign a 0 or 1 label and add it to the label column of the dataframe. Finally, we save the file so that we do not have to rerun the code every time we need the dataframe.
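For reference, the Kaggle training files are named like cat.0.jpg and dog.0.jpg, so the resulting dataframe should look roughly like this (the row order depends on how your system lists the files):

print(train_df.head())
#     img_name  label
# 0  cat.0.jpg      0
# 1  cat.1.jpg      0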

Creating Custom Dataset

from torch.utils.data import Dataset
import pandas as pd
import os
from PIL import Image
import torch

class CatsAndDogsDataset(Dataset):
    def __init__(self, root_dir, annotation_file, transform=None):
        self.root_dir = root_dir
        self.annotations = pd.read_csv(annotation_file)
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        img_id = self.annotations.iloc[index, 0]
        img = Image.open(os.path.join(self.root_dir, img_id)).convert("RGB")
        y_label = torch.tensor(float(self.annotations.iloc[index, 1]))

        if self.transform is not None:
            img = self.transform(img)

        return (img, y_label)

Dataset is a PyTorch utility that allows us to create custom datasets. PIL is a popular computer vision library that lets us load images in Python and convert them to RGB format. Our objective here is to combine the images from the train folder with the filenames and labels from our train_csv file to return (img, label) tuples. For this task we use the CatsAndDogsDataset class; it takes root_dir (where the training images are stored) and annotation_file (train_csv) as parameters. transform is set to None for now and will be set later to perform a certain set of transformations on the images to match the input requirements of the inception model used later for the CNN, so if you don't understand this yet, just hold up!

The __init__ is an initializer which sets the parameters defining the class. The __len__ function returns the length of the dataset; in this case we return the length of the self.annotations dataframe, since it holds one row per training file, i.e. the number of entries in the train_csv file. The __getitem__ function defines the (x, y) or (img, label) pair and how to extract it. Note that index is supplied by PyTorch's data loading machinery, which uses it to fetch datapoints, create batches, and keep track of which samples have been loaded and which are yet to be loaded; it takes care of all the bookkeeping of the dataset, and this is one of the convenient features of a PyTorch custom dataset. img_id is set to the file name of the image (from train_csv, hence [index, 0] where 0 is the img_name column). os.path.join uses the path separator to combine root_dir ("train/") and the image file name from the csv file, and then PIL is used to load the image and convert it to RGB format. Finally, the y label is extracted from the train_csv file ([index, 1] where 1 is the label column). Note that index selects the row of the csv file, while 0 or 1 selects the column. We also wrap the label in float and tensor to meet the loss function requirements, since all data must be in tensor form before being fed to a CNN model.
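A quick way to verify that the class behaves as expected (a minimal sketch, assuming the train/ folder and the train_csv.csv file created earlier):

dataset = CatsAndDogsDataset(root_dir="train", annotation_file="train_csv.csv")
img, label = dataset[0]   # calls __getitem__ with index 0
print(len(dataset))       # calls __len__; matches the number of rows in train_csv.csv
print(type(img), label)   # a PIL image (no transform yet) and a tensor label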

Creating Model

import torch.nn as nn
import torchvision.models as models

class CNN(nn.Module):
    def __init__(self, train_CNN=False, num_classes=1):
        super(CNN, self).__init__()
        self.train_CNN = train_CNN
        # load inception_v3 pre-trained on ImageNet, without auxiliary outputs
        self.inception = models.inception_v3(pretrained=True, aux_logits=False)
        # replace the final fully connected layer for our binary task
        self.inception.fc = nn.Linear(self.inception.fc.in_features, num_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)
        self.sigmoid = nn.Sigmoid()

    def forward(self, images):
        features = self.inception(images)
        return self.sigmoid(self.dropout(self.relu(features))).squeeze(1)

The torchvision module has several built-in CNN models like VGG16, AlexNet, ResNet, etc. for computer vision and other tasks. In our example we will be using the inception_v3 architecture. For those not familiar with the inception model, I highly recommend reading about it first before implementing it in code.

Transfer learning is a powerful technique in which we use pre-trained models whose weights have already been trained over large datasets (millions of images) and open-sourced for all developers. The only important thing is that the last few layers have to be modified according to the needs of the developer's project (fine-tuning). Here we use the train_CNN variable and set it to False; this will be used as a flag to make the parameters of the inception model either trainable or non-trainable. The convolutional weights will be used as-is, and the fully connected layer will be modified from the original 1000 ImageNet classes to a single sigmoid output for our binary classification problem.

As seen in the code above, self.inception.fc has been replaced with a linear layer that takes the number of input features of the original fc layer of the inception model and maps it to num_classes (binary classification). pretrained=True loads the ImageNet weights; later, the train_CNN flag will be used to freeze all layers except the new fc layer. aux_logits is a feature of the inception model wherein output is also returned from intermediate hidden layers by attaching fc and softmax/sigmoid heads at a few places other than the last layer (read more about it online); for our case it has been set to False. Dropout is used for regularization, with a 0.5 probability of dropping activations in the fc layer.
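To see concretely what the head replacement does (an illustrative sketch; running it downloads the pre-trained weights):

import torchvision.models as models
import torch.nn as nn

net = models.inception_v3(pretrained=True, aux_logits=False)
print(net.fc)   # Linear(in_features=2048, out_features=1000, bias=True) -- the ImageNet head
net.fc = nn.Linear(net.fc.in_features, 1)   # new head: 2048 -> 1 output for binary classification
print(net.fc)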

For example, if we have a batch of 32 images, then after applying the inception, relu, dropout, and sigmoid layers in turn we get an output of shape [32, 1]. However, BCELoss requires the output tensor to have the same shape as the target, meaning we need [32] as the output size. Hence we use squeeze(1), which removes the dimension of size 1 at position 1 of the tensor shape.
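A small standalone shape check illustrates this:

import torch

scores = torch.rand(32, 1)       # what the sigmoid layer produces for a batch of 32
print(scores.shape)              # torch.Size([32, 1])
print(scores.squeeze(1).shape)   # torch.Size([32]) -- matches the [32] label tensor BCELoss expects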

Importing Libraries for training Loop

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision.transforms as transforms
from Model import CNN
from Dataset import CatsAndDogsDataset
from tqdm import tqdm
device = ("cuda" if torch.cuda.is_available() else "cpu")

Transformations

transform = transforms.Compose(
    [
        transforms.Resize((356, 356)),
        transforms.RandomCrop((299, 299)),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ]
)

The torchvision.transforms library allows us to do preprocessing and data augmentation on images during training. The images we load from our custom dataset will undergo these transformations in the order defined above. Resize ensures that all images have the same dimensions so that training can occur in batches, and also resizes images toward the recommended input size for standard CNN models. RandomCrop crops the images at random locations (here to 299×299, the input size inception_v3 expects). Finally we convert the image to a tensor and normalize it. For transforms.Normalize, the first tuple is the mean for each of the three channels (RGB) and the second tuple is the standard deviation for each of the three channels (RGB). It then uses the following formula to normalize the images, where μ is the mean and σ is the standard deviation. Normalization is essential for speeding up training. Note that this setup uses a value of 0.5 for μ and σ across all channels.

Normalization formula: x' = (x - μ) / σ, applied per channel.
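To see the effect on a single image (a minimal sketch; cat.0.jpg is just an example file name from the Kaggle train folder):

from PIL import Image

img = Image.open("train/cat.0.jpg").convert("RGB")
x = transform(img)
print(x.shape)            # torch.Size([3, 299, 299]) -- the input size inception_v3 expects
print(x.min(), x.max())   # within [-1, 1] after Normalize with mean = std = 0.5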

Hyperparameters

num_epochs = 10
learning_rate = 0.00001
train_CNN = False
batch_size = 32
shuffle = True
pin_memory = True
num_workers = 1

pin_memory is an important flag. Since our custom dataset runs all of its operations on the CPU, the data is also loaded into CPU memory; it is only during training that batches of images are moved to the GPU. Setting pin_memory=True places the loaded batches in page-locked (pinned) host memory, which makes this CPU-to-GPU transfer efficient and fast. The num_workers attribute tells the data loader instance how many sub-processes to use for loading data in parallel. By default, num_workers is set to zero, which means data is loaded in the main process.

Setting the dataset and dataloader

dataset = CatsAndDogsDataset("train", "train_csv.csv", transform=transform)
train_set, validation_set = torch.utils.data.random_split(dataset, [20000, 5000])
train_loader = DataLoader(dataset=train_set, shuffle=shuffle, batch_size=batch_size, num_workers=num_workers, pin_memory=pin_memory)
validation_loader = DataLoader(dataset=validation_set, shuffle=shuffle, batch_size=batch_size, num_workers=num_workers, pin_memory=pin_memory)

The training data is divided into train and validation split to allow us to use early stopping later on to grab the model that gives best validation accuracy.
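The early stopping itself is not shown in this article; one hedged sketch of the idea, saving a checkpoint whenever validation accuracy improves (best_acc and maybe_save are hypothetical helpers, not part of the original code), could look like this:

best_acc = 0.0

def maybe_save(val_acc, model, path="best_model.pth"):
    # keep the checkpoint with the highest validation accuracy seen so far;
    # val_acc should be a float (check_accuracy below returns a formatted string,
    # so convert with float(...) if you reuse it here)
    global best_acc
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), path)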

model = CNN().to(device)

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# freeze everything except the newly added fc layer
for name, param in model.inception.named_parameters():
    if "fc.weight" in name or "fc.bias" in name:
        param.requires_grad = True
    else:
        param.requires_grad = train_CNN

The flag which we set earlier is now used to make the fc layer trainable and all other layers non-trainable, to avoid back-propagation through those layers. CNN().to(device) moves the model to the GPU; note that for GPU training both the model and the data must be loaded onto the GPU. Refer to the torch docs for the input formats expected by BCELoss and the Adam optimizer.
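A quick way to confirm the freeze worked (an illustrative check on the model defined above):

# count trainable vs. total parameters; only the new fc head should be trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable:,} trainable / {total:,} total parameters")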

Accuracy Check

def check_accuracy(loader, model):
    if loader == train_loader:
        print("Checking accuracy on training data")
    else:
        print("Checking accuracy on validation data")

    num_correct = 0
    num_samples = 0
    model.eval()  # evaluation mode: disables dropout, etc.

    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device)
            y = y.to(device=device)

            scores = model(x)
            # round the sigmoid outputs to hard 0/1 predictions
            predictions = torch.tensor([1.0 if i >= 0.5 else 0.0 for i in scores]).to(device)
            num_correct += (predictions == y).sum()
            num_samples += predictions.size(0)

    print(
        f"Got {num_correct} / {num_samples} with accuracy {float(num_correct)/float(num_samples)*100:.2f}"
    )
    model.train()  # back to training mode before returning
    return f"{float(num_correct)/float(num_samples)*100:.2f}"

We check whether the loader is the train or validation loader and set the printed message accordingly. model.eval() puts the model in evaluation mode, and torch.no_grad() disables gradient tracking, so we are simply applying the model weights to get predictions for calculating the training/validation accuracy. As seen above, the images and labels are moved to the device after being loaded from the loader, and then a predictions tensor is built by rounding the final values returned by the sigmoid layer to 0 or 1 (0 for cat, 1 for dog) and moved to the GPU. We also keep track of the number of samples by incrementing num_samples by the batch size as the batches keep loading. num_correct compares the predictions to the true labels and accumulates the total number of correct predictions. Finally, the function returns the accuracy over the entire dataset (training or validation, depending on the loader passed in). Note that it is important to put the model in eval mode (model.eval()) so that layers like dropout behave correctly during the accuracy calculation. It is also important to note that after the accuracy check we will continue training in search of better accuracy, hence at the end the model is set back to train mode (model.train()).

Training Loop

def train():
    model.train()
    for epoch in range(num_epochs):
        loop = tqdm(train_loader, total=len(train_loader), leave=True)
        # check validation accuracy every other epoch
        if epoch % 2 == 0:
            loop.set_postfix(val_acc=check_accuracy(validation_loader, model))
        for imgs, labels in loop:
            imgs = imgs.to(device)
            labels = labels.to(device)
            outputs = model(imgs)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            loop.set_description(f"Epoch [{epoch}/{num_epochs}]")
            loop.set_postfix(loss=loss.item())

if __name__ == "__main__":
    train()

For each epoch we iterate through all batches of images and labels in the train loader and move them to the GPU (batch-wise). We then compute the model's output and calculate the loss using the BCELoss function. Before back-propagating to calculate gradients we must call optimizer.zero_grad(); this empties the gradient tensors from the previous batch so that the gradients for the new batch are calculated anew. We then perform back-propagation with loss.backward() and finally update the weight parameters using optimizer.step() with the newly calculated gradients. As noticed in the code above, a loop variable is defined using the tqdm library, which comes in handy for creating a progress bar during training in the terminal/console. leave=True ensures that the older progress bars stay visible as the epochs progress; setting it to False would clear the progress bars from previous epochs and display only the current one. The loss is also attached to the tqdm bar, along with the validation accuracy (which is updated every two epochs to see how the model performs on the validation set). The if __name__ == "__main__" guard at the end is essential for running the code as a script; for notebooks it is not necessary.

Conclusion

We built an end-to-end image classification pipeline in PyTorch: creating a label csv from the raw file names, wrapping the images in a custom Dataset with a DataLoader, fine-tuning a pre-trained inception_v3 model via transfer learning, and training it on the GPU with an efficient training loop and periodic validation accuracy checks.
