Using PyTorch for Kaggle’s famous Dogs vs. Cats challenge Part 1 (preprocessing and training)

Published in Predict · 9 min read · Nov 12, 2018

For machine learning beginners who want to try image classification, building a binary classification model is a good exercise. The Dogs vs. Cats challenge is just that! The concept is really easy: you just need to teach a computer to tell dogs and cats apart. One can argue it is a ‘Hello world!’ of machine learning, along with MNIST. However, for complete beginners it may be difficult to choose a good architecture, produce output in the right format for the submission, and so on. For that reason, I am writing this post to describe what I did for this competition. I also tried to create a Kaggle kernel, but realized that a Kaggle kernel is a read-only system, so I could not move my data around or create my submission file there; instead I will copy and paste my code into this post. FYI, I did this competition on my MacBook without a single GPU.

What you can expect in this post is 1) organizing the train/validation datasets, 2) transfer learning, 3) saving / loading the best model, 4) making inferences on the test dataset, 5) making a submission file in the right format and submitting it to Kaggle, and some more. Without further ado, let’s get to it.

1. Organizing data

Your data comes as train data and test data. The train data contains both cats and dogs, with the class encoded in the file name (cat.<id>.jpg for cat images, dog.<id>.jpg for dog images). Since PyTorch supports loading image data from subfolders of a data directory, we have to put all cat images into a cats folder and all dog images into a dogs folder. We also have to set apart a validation set to check that our model is learning properly. So create subfolders cats and dogs inside the train folder, create a val folder next to it under the data folder, and create the same subfolders inside the val folder. The test data is unlabelled, and it’s OK to leave it as it is.
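The folders described above can be created by hand, but a few lines of Python do the same job. The following is only a minimal sketch: the helper name make_class_dirs is made up for illustration, and the demo runs in a temporary directory rather than in the real ./data folder.

```python
import os
import tempfile


def make_class_dirs(data_root, splits=("train", "val"), classes=("cats", "dogs")):
    """Create <data_root>/<split>/<class> folders; existing folders are left alone."""
    created = []
    for split in splits:
        for cls in classes:
            path = os.path.join(data_root, split, cls)
            os.makedirs(path, exist_ok=True)  # no error if the folder already exists
            created.append(path)
    return created


# Demo in a throwaway directory (assumption: in the notebook you would pass "./data").
demo_root = tempfile.mkdtemp()
created_dirs = make_class_dirs(demo_root)
print(sorted(os.listdir(os.path.join(demo_root, "val"))))
```

Because exist_ok=True is used, running the helper a second time does not fail or touch images that are already in place.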

import os
train_dir = "./data/train"
train_dogs_dir = f'{train_dir}/dogs'
train_cats_dir = f'{train_dir}/cats'
val_dir = "./data/val"
val_dogs_dir = f'{val_dir}/dogs'
val_cats_dir = f'{val_dir}/cats'
print("Printing data dir")
print(os.listdir("data")) # Shows train, val folders are under data
print("Printing train dir")
!ls {train_dir} | head -n 5 # Shows image files are in train folder
print("Printing train dog dir")
!ls {train_dogs_dir} | head -n 5 # Check the (empty) folder exist
print("Printing train cat dir")
!ls {train_cats_dir} | head -n 5 # Check the (empty) folder exist
print("Printing val dir")
!ls {val_dir} | head -n 5  # Shows subfolder dogs and cats exist
print("Printing val dog dir")
!ls {val_dogs_dir} | head -n 5 # Check the (empty) folder exist
print("Printing val cat dir")
!ls {val_cats_dir} | head -n 5 # Check the (empty) folder exist

Run the code above in a Jupyter notebook and check that we have prepared the right folder structure. The next step is to move the files to the right folders.

import shutil
import re

files = os.listdir(train_dir)

# Move all train cat images to the cats folder, dog images to the dogs folder.
# Matching on the "cat." / "dog." file name prefix skips the cats and dogs
# subfolders themselves, which are also returned by os.listdir(train_dir).
for f in files:
    catSearchObj = re.match(r"cat\.", f)
    dogSearchObj = re.match(r"dog\.", f)
    if catSearchObj:
        shutil.move(f'{train_dir}/{f}', train_cats_dir)
    elif dogSearchObj:
        shutil.move(f'{train_dir}/{f}', train_dogs_dir)

Let’s check if we moved files correctly.

print("Printing train dir") # shows cats, dogs subfolders only
!ls {train_dir} | head -n 5
print("Printing train dog dir") # there are now dog images in the dogs folder
!ls {train_dogs_dir} | head -n 5
print("Printing train cat dir") # there are now cat images in the cats folder
!ls {train_cats_dir} | head -n 5

Now let’s set aside some images for the validation set. In a lot of cases you will want to hold out 20% of your whole data as the validation set. In this case there are 25,000 images in the training set, which is quite a lot given that cats and dogs resemble ImageNet data. I thought 20% of that was too many and that it was enough to take out 1,000 images each for cats and dogs.

files = os.listdir(train_dogs_dir)
for f in files:
    validationDogsSearchObj = re.search(r"5\d\d\d", f)
    if validationDogsSearchObj:
        shutil.move(f'{train_dogs_dir}/{f}', val_dogs_dir)

print("Printing val dog dir")
!ls {val_dogs_dir} | head -n 5

The code above moves the dog images whose ids range from 5000 to 5999 to the validation folder. Now do the same for the cat images.

files = os.listdir(train_cats_dir)
for f in files:
    validationCatsSearchObj = re.search(r"5\d\d\d", f)
    if validationCatsSearchObj:
        shutil.move(f'{train_cats_dir}/{f}', val_cats_dir)

print("Printing val cat dir")
!ls {val_cats_dir} | head -n 5
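As a quick sanity check on the 5\d\d\d pattern used in the two snippets above, the sketch below counts how many file names it would select per class without touching any files. It assumes each class has ids 0 to 12499 (12,500 images per class, consistent with the 25,000 training images mentioned earlier); the helper name matched_ids is made up for illustration.

```python
import re

# Same pattern as in the moving code: a 5 followed by three more digits.
PATTERN = re.compile(r"5\d\d\d")


def matched_ids(prefix, n_ids=12500):
    """Return the ids whose file name <prefix>.<id>.jpg matches the pattern."""
    return [i for i in range(n_ids) if PATTERN.search(f"{prefix}.{i}.jpg")]


dog_val_ids = matched_ids("dog")
cat_val_ids = matched_ids("cat")
print(len(dog_val_ids), dog_val_ids[0], dog_val_ids[-1])  # 1000 5000 5999
```

Under that assumption the pattern picks exactly the ids 5000 to 5999, so 1,000 images per class end up in the validation set. With a larger id range (for example ids of 15000 and above) the same pattern would match more files, so it is worth re-checking if the data set differs.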

2. Training model

Now that the data is in the right structure, it’s time to train our model. First, I import what I need for this notebook. This is not the complete list of imports; we will import the rest as they are needed.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
import time
import os
import copy
import math

print(torch.__version__)
plt.ion()   # interactive mode

Let’s define training data augmentation and validation data transform.

# Data augmentation and normalization for training
# Just normalization for validation
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomRotation(5),
        transforms.RandomHorizontalFlip(),
        transforms.RandomResizedCrop(224, scale=(0.96, 1.0), ratio=(0.95, 1.05)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize([224,224]),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

As data augmentation I apply a little bit of rotation, random horizontal flips, and a random resized crop. The scale of the crop is 0.96–1.0. I avoid a scale below 0.96 because this range still gives the wanted variation in the data while keeping the risk of cutting off some important part of the image much lower (e.g. the head of the cat or dog; without the head, it is much harder for the machine to learn what each class should look like). A moderate change of aspect ratio should also be OK for our purpose (thinner or fatter, a cat is a cat, right?). For normalization, I used hard-coded values for the mean and standard deviation. These are the per-channel statistics of ImageNet; they are known to work well and are used frequently. Check this Facebook AI engineer’s recommendation for using those values and also the official PyTorch example using the same values.
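To make the normalization step less of a black box, here is a small pure-Python sketch of the per-channel arithmetic that transforms.Normalize applies, (value - mean) / std, together with its inverse. The pixel value is a made-up example, and the helper names normalize and denormalize are only for illustration.

```python
# Per-channel mean and standard deviation used in the transforms above (R, G, B).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize(pixel):
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return [(v - m) / s for v, m, s in zip(pixel, MEAN, STD)]


def denormalize(pixel):
    """Invert normalize(): value * std + mean for each channel."""
    return [v * s + m for v, m, s in zip(pixel, MEAN, STD)]


rgb = [0.5, 0.5, 0.5]  # hypothetical mid-gray pixel, already scaled to [0, 1]
normalized = normalize(rgb)
restored = denormalize(normalized)
print([round(v, 3) for v in normalized])
print([round(v, 3) for v in restored])
```

The inverse matters whenever a normalized image has to be displayed again, since plotting libraries expect values back in the [0, 1] range.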

data_dir = 'data'
CHECK_POINT_PATH = 'checkpoint.tar'
SUBMISSION_FILE = 'submission.csv'
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                              shuffle=True, num_workers=4)
              for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print(class_names) # => ['cats', 'dogs']
print(f'Train image size: {dataset_sizes["train"]}')
print(f'Validation image size: {dataset_sizes["val"]}')

The code above defines some needed constants and the datasets (train and val). You should see 23000 for the train image size and 2000 for the validation image size if you followed everything correctly. FYI, as I am working without a GPU and it takes frustratingly long to process that much data, I deleted a whole bunch of data and proceeded with 950 train images and 71 validation images in total. For me that resulted in satisfactory accuracy. My purpose was not to make the most accurate model but to practice using PyTorch and the Kaggle website, which is why I chose not to use so much data; of course, you would not want to delete data if you were training a model for production. Another reason it was possible to train my model with so little data is that I use a pretrained model: I replaced and trained only the last layer and used all the other layers as they were, because the model was already well trained on ImageNet data.
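The class order printed by the snippet above (['cats', 'dogs']) is not arbitrary: to my knowledge, ImageFolder sorts the subfolder names and numbers them from 0. The sketch below imitates only that mapping with the standard library, on throwaway folders; class_to_index is a hypothetical helper, not a torchvision function.

```python
import os
import tempfile


def class_to_index(root):
    """Map each class subfolder name to an index, in sorted name order."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


# Demo folders, created in reverse order on purpose.
demo_train = tempfile.mkdtemp()
for name in ("dogs", "cats"):
    os.makedirs(os.path.join(demo_train, name))

mapping = class_to_index(demo_train)
print(mapping)  # {'cats': 0, 'dogs': 1}
```

This is why label 0 means cat and label 1 means dog in everything that follows, regardless of the order in which the folders were created.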

Let’s take a look at what a mini batch (4 images) from the training set looks like using the next snippet of code.

def imshow(inp, title=None):
    """Imshow for Tensor."""
    inp = inp.numpy().transpose((1, 2, 0))
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)
    plt.imshow(inp)
    if title is not None:
        plt.title(title)
    plt.pause(0.001)  # pause a bit so that plots are updated

# Get a batch of training data
inputs, classes = next(iter(dataloaders['train']))

# Make a grid from batch
sample_train_images = torchvision.utils.make_grid(inputs)

imshow(sample_train_images, title=classes)

You will see 4 randomly selected images, and the title will say 0 for cat and 1 for dog. Next, let’s define a function that trains our model and returns some metrics.

def train_model(model, criterion, optimizer, scheduler, num_epochs=2, checkpoint = None):
    since = time.time()

    if checkpoint is None:
        best_model_wts = copy.deepcopy(model.state_dict())
        best_loss = math.inf
        best_acc = 0.
    else:
        print(f'Val loss: {checkpoint["best_val_loss"]}, Val accuracy: {checkpoint["best_val_accuracy"]}')
        model.load_state_dict(checkpoint['model_state_dict'])
        best_model_wts = copy.deepcopy(model.state_dict())
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
        best_loss = checkpoint['best_val_loss']
        best_acc = checkpoint['best_val_accuracy']

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for i, (inputs, labels) in enumerate(dataloaders[phase]):
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # print the running average loss every 200 mini batches
                if i % 200 == 199:
                    print('[%d, %d] loss: %.3f' %
                          (epoch + 1, i, running_loss / (i * inputs.size(0))))

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            if phase == 'train':
                # Step the scheduler once per epoch, after the optimizer updates
                # (the order expected by PyTorch 1.1 and later)
                scheduler.step()

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_loss < best_loss:
                print(f'New best model found!')
                print(f'New record loss: {epoch_loss}, previous record loss: {best_loss}')
                best_loss = epoch_loss
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f} Best val loss: {:.4f}'.format(best_acc, best_loss))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, best_loss, best_acc

The function first checks whether a saved checkpoint was passed. If so, it loads the saved parameters and resumes training from where it left off. If not, it starts training the model it was passed (we still start from a pretrained model). The function updates parameters only in the train phase and prints metrics every epoch and whenever it finds a new best loss.
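The "keep the best model, optionally resuming from a checkpoint" logic is easier to see without the training loop around it. The sketch below isolates just that bookkeeping in plain Python; the losses and the dictionaries standing in for model weights are invented numbers, and track_best is a hypothetical helper rather than part of the notebook.

```python
import copy
import math


def track_best(val_losses, weights_per_epoch, checkpoint=None):
    """Return the lowest validation loss seen and a snapshot of its weights."""
    if checkpoint is None:
        best_loss, best_weights = math.inf, None
    else:
        # Resume: the previous best is the bar that new epochs have to beat.
        best_loss = checkpoint["best_val_loss"]
        best_weights = checkpoint["weights"]
    for loss, weights in zip(val_losses, weights_per_epoch):
        if loss < best_loss:
            best_loss = loss
            best_weights = copy.deepcopy(weights)  # snapshot, not a live reference
    return best_loss, best_weights


# First run: three epochs, the second one is the best.
first_loss, first_weights = track_best(
    [0.30, 0.12, 0.18], [{"w": 1}, {"w": 2}, {"w": 3}])

# Second run resumes from the first; no epoch beats 0.12, so nothing changes.
resumed_loss, resumed_weights = track_best(
    [0.20, 0.15], [{"w": 4}, {"w": 5}],
    checkpoint={"best_val_loss": first_loss, "weights": first_weights})
print(first_loss, first_weights, resumed_loss, resumed_weights)
```

The deep copy is the important detail: state_dict() returns references to the live parameters, so without a copy the "best" weights would silently keep changing as training continues.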

Now let’s define our model by downloading a pretrained model. The next code takes some time to run if you have never run it before.

model_conv = torchvision.models.resnet50(pretrained=True)

resnet50 is a convolutional neural network architecture that is really powerful for solving computer vision problems. Less powerful but less resource-consuming models you might be interested in include resnet18 and resnet34. By passing pretrained=True as an argument you download a model with parameters trained on the ImageNet data set. Since we need to adapt the model to our needs (binary classification), we replace the last fully connected layer and define a loss function that is useful for classification problems (cross entropy loss, which combines log softmax and the negative log likelihood loss). The optimizer is stochastic gradient descent, and the scheduler is a step scheduler (StepLR, even though the variable is named exp_lr_scheduler) that reduces the learning rate by a factor of 10 every 7 epochs (in reality I only trained 6 epochs).
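To make the "log softmax plus negative log likelihood" description concrete, here is a pure-Python sketch of the loss for a single sample with two classes. The logits are made-up numbers; nn.CrossEntropyLoss does the equivalent on tensors and, by default, averages over the batch.

```python
import math


def log_softmax(logits):
    """Numerically stable log(softmax(logits))."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target):
    """Negative log likelihood of the target class under softmax(logits)."""
    return -log_softmax(logits)[target]


logits = [2.0, 0.5]  # hypothetical raw model outputs for [cats, dogs]
loss_if_cat = cross_entropy(logits, 0)  # the model favours cats, so this is small
loss_if_dog = cross_entropy(logits, 1)  # wrong favourite, so this is larger
print(round(loss_if_cat, 4), round(loss_if_dog, 4))
```

The loss is small when the largest logit belongs to the true class and grows as the model becomes confidently wrong, which is the signal the optimizer uses to adjust the new last layer.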

for param in model_conv.parameters():
    param.requires_grad = False

# Parameters of newly constructed modules have requires_grad=True by default
num_ftrs = model_conv.fc.in_features
model_conv.fc = nn.Linear(num_ftrs, 2)

model_conv = model_conv.to(device)

criterion = nn.CrossEntropyLoss()

# Observe that only parameters of final layer are being optimized
optimizer_conv = optim.SGD(model_conv.fc.parameters(), lr=0.001, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_conv, step_size=7, gamma=0.1)
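The effect of StepLR with step_size=7 and gamma=0.1 can be written as a one-line formula: the learning rate is the base rate multiplied by gamma once for every 7 completed scheduler steps. The sketch below computes that schedule in plain Python; step_lr is a hypothetical helper that mirrors the formula, not the PyTorch class itself.

```python
def step_lr(base_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate for the first 15 epochs, starting from lr=0.001 as above.
schedule = [step_lr(0.001, e) for e in range(15)]
for epoch, lr in enumerate(schedule):
    print(epoch, f"{lr:.6f}")
```

With only 6 epochs of training, the rate would stay at 0.001 the whole time under this formula, so the decay only starts to matter for longer runs.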

We can finally get into actual training.

try:
    checkpoint = torch.load(CHECK_POINT_PATH)
    print("checkpoint loaded")
except FileNotFoundError:
    # No checkpoint file yet: start from the pretrained model
    checkpoint = None
    print("checkpoint not found")

model_conv, best_val_loss, best_val_acc = train_model(model_conv,
                                                      criterion,
                                                      optimizer_conv,
                                                      exp_lr_scheduler,
                                                      num_epochs = 3,
                                                      checkpoint = checkpoint)

torch.save({'model_state_dict': model_conv.state_dict(),
            'optimizer_state_dict': optimizer_conv.state_dict(),
            'best_val_loss': best_val_loss,
            'best_val_accuracy': best_val_acc,
            'scheduler_state_dict' : exp_lr_scheduler.state_dict(),
            }, CHECK_POINT_PATH)

The code first checks whether a checkpoint was saved from previous training. If so, it passes the checkpoint to the train_model function. Here I specified 3 epochs, and the function returns the model together with the loss and accuracy from the epoch where the validation loss was lowest. We save what we got from the function to a checkpoint. You can adjust the number of epochs or rerun this snippet as many times as you like. If you see that the model is not improving anymore, you can stop. As I had only 71 validation images, I reached accuracy 1.0 with loss 0.036 in only 2 runs, so I decided to stop training.

Notice that we only needed to train the last layer, the one we replaced in the original resnet50. It is also possible to train all the parameters in all the layers, but when I tried that, I only saw the loss and accuracy getting worse. If you want to try updating all parameters, you can do so as follows.

for param in model_conv.parameters():
    param.requires_grad = True

model_conv = model_conv.to(device)

# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(model_conv.parameters(), lr=0.001, momentum=0.9)

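And then run the same training loop code as before (the code block that starts with try), this time passing optimizer_ft and a scheduler created for it, so that all parameters are actually updated.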

Maybe this story is getting too lengthy already, so I will cover the part where you actually make inferences on the test data set and submit your answer to Kaggle in a separate story. Tune in for the next part. Let me know if you succeeded in following my code and whether you got a satisfactory result using this method.

The second (last) part is out. If you want to continue to read, click here. You can also see the full code in my Github repo. Code for data preprocessing can be found in datapreprocessor.ipynb and training, inferences and submitting can be found in catsanddogs.ipynb.

Written by Won Seob Seo