Residual Neural Network on CIFAR-10

Sometimes, skipping over is better than dealing one by one

This article is based on “Deep Residual Learning for Image Recognition” by He et al. [2] (Microsoft Research).

Residual Network (ResNet) is a convolutional neural network (CNN) architecture that can support hundreds of convolutional layers or more. Earlier architectures tended to lose effectiveness with each additional layer beyond a certain depth, whereas ResNet keeps performing well as layers are added.
ResNet proposed a solution to the “vanishing gradient” problem.

ResNet Block

Neural networks are trained with gradient descent, using backpropagation to compute the gradient of the loss function with respect to each weight. When more layers are added, the repeated multiplication of their derivatives can make the gradient vanishingly small by the time it reaches the early layers, so additional layers may fail to improve performance or may even reduce it.


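ResNet addresses this with “identity shortcut connections”: the input of a block skips over the block's layers and is added, unchanged, to their output. The layers inside the block then only need to learn a residual F(x) = H(x) - x on top of the identity, so a block whose weights are still close to zero initially does almost nothing and simply passes its input along.

Early in training the network therefore behaves much like one with only a few layers, which speeds up learning. As training continues, the residual layers start to contribute and help the network explore more of the feature space. The shortcuts also give gradients a direct path back to the earlier layers.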
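The effect of a shortcut on the gradient can be illustrated with a small numeric sketch in plain Python. It is not taken from the paper: it assumes, purely for illustration, that every layer contributes the same local derivative, and the function names are hypothetical. In a plain stack the gradient reaching the first layer is a product of per-layer derivatives; with a shortcut, y = x + F(x), each factor becomes 1 + F'(x) instead.

```python
def plain_chain_gradient(local_grad, depth):
    """Gradient through `depth` stacked layers, each with derivative `local_grad`."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad
    return grad


def residual_chain_gradient(local_grad, depth):
    """Same stack, but each layer computes y = x + F(x), so dy/dx = 1 + F'(x)."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + local_grad
    return grad


# Illustrative value only: a small per-layer derivative of 0.05.
for depth in (5, 20, 50):
    print(depth,
          plain_chain_gradient(0.05, depth),
          residual_chain_gradient(0.05, depth))
```

With the plain stack the product collapses towards zero as depth grows, while the identity term keeps the residual version from vanishing.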

Built-In PyTorch ResNet Implementation:

The torchvision package provides torchvision.models, which includes several deep learning models that can be loaded with weights pre-trained on the ImageNet dataset and are ready to use.

Pre-training lets you leverage transfer learning — once the model has learned many objects, features, and textures on the huge ImageNet dataset, you can apply this learning to your own images and recognition problems.

torchvision.models includes the following ResNet implementations: ResNet-18, 34, 50, 101 and 152 (the numbers indicate the number of layers in the model). It also includes DenseNet-121, 161, 169, and 201, which are a related but separate architecture rather than ResNet variants.

ResNet Blocks

There are two main types of blocks used in ResNet, depending mainly on whether the input and output dimensions are the same or different.

  • Identity Block: When the input and output activation dimensions are the same.
  • Convolution Block: When the input and output activation dimensions are different from each other.

For example, to reduce the activation dimensions (HxW) by a factor of 2, you can use a 1x1 convolution with a stride of 2.
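The halving can be checked with the standard output-size formula for a convolution without dilation, floor((size + 2 * padding - kernel_size) / stride) + 1. The helper below is a small illustrative sketch; `conv_output_size` is a hypothetical name, not a PyTorch function.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# 1x1 convolution with stride 2 on a 32x32 activation map
print(conv_output_size(32, kernel_size=1, stride=2))

# 3x3 convolution with stride 1 and padding 1, as used inside the blocks
print(conv_output_size(32, kernel_size=3, stride=1, padding=1))
```

The first call gives the halved size used on the shortcut of a convolution block, while the second shows why a 3x3 convolution with padding 1 leaves the size unchanged, which is what an identity block relies on.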

The figure below shows what a residual block looks like and what is inside it.

Source: Coursera: Andrew NG

Step 1: Prepare data set

Download the dataset and create PyTorch datasets to load the data.

There are a few important changes we’ll make while creating the PyTorch datasets:

  1. Use test set for validation: Instead of setting aside a fraction (e.g. 10%) of the data from the training set for validation, we’ll simply use the test set as our validation set. This just gives a little more data to train with.
  2. Channel-wise data normalization: We will normalize the image tensors by subtracting the mean and dividing by the standard deviation across each channel.
  3. Randomized data augmentations: We will apply randomly chosen transformations while loading images from the training dataset. Specifically, we will pad each image by 4 pixels, and then take a random crop of size 32 x 32 pixels, and then flip the image horizontally with a 50% probability.
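
import torchvision.transforms as tt
from torchvision.datasets import ImageFolder

# Data transforms (normalization & data augmentation)
stats = ((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))  # per-channel (means, stds)
train_tfms = tt.Compose([tt.RandomCrop(32, padding=4, padding_mode='reflect'),
                         tt.RandomHorizontalFlip(),  # flips with probability 0.5
                         tt.ToTensor(),
                         tt.Normalize(*stats)])
valid_tfms = tt.Compose([tt.ToTensor(), tt.Normalize(*stats)])

# PyTorch datasets (data_dir is the folder the dataset was downloaded to)
train_ds = ImageFolder(data_dir+'/train', train_tfms)
valid_ds = ImageFolder(data_dir+'/test', valid_tfms)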
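To make the channel-wise normalization concrete, here is a small sketch in plain Python that applies the same arithmetic to a single RGB pixel, using the means and standard deviations from the stats tuple. The helper names are hypothetical; the inverse function is what you would use to undo the normalization before displaying an image.

```python
MEANS = (0.4914, 0.4822, 0.4465)
STDS = (0.2023, 0.1994, 0.2010)


def normalize_pixel(rgb, means=MEANS, stds=STDS):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return tuple((v - m) / s for v, m, s in zip(rgb, means, stds))


def denormalize_pixel(rgb, means=MEANS, stds=STDS):
    """Invert normalize_pixel, e.g. before showing an image."""
    return tuple(v * s + m for v, m, s in zip(rgb, means, stds))


pixel = (1.0, 0.5, 0.0)  # an arbitrary pixel with values in [0, 1]
normalized = normalize_pixel(pixel)
print(normalized)
print(denormalize_pixel(normalized))
```

A pixel equal to the channel means maps to zero in every channel, which is the point of the transform: each channel ends up roughly centred on zero with unit spread.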

Next create data loaders for retrieving images in batches. We’ll use a relatively large batch size of 400 to utilize a larger portion of the GPU RAM. You can try reducing the batch size & restarting the kernel if you face an “out of memory” error.

from torch.utils.data import DataLoader
batch_size = 400
# PyTorch data loaders
train_dl = DataLoader(train_ds, batch_size, shuffle=True, num_workers=3, pin_memory=True)
valid_dl = DataLoader(valid_ds, batch_size*2, num_workers=3, pin_memory=True)

Step 2: Using a GPU

To seamlessly use a GPU, if one is available, we define a couple of helper functions (get_default_device & to_device) and a helper class DeviceDataLoader to move our model & data to the GPU as required.

device = get_default_device()
train_dl = DeviceDataLoader(train_dl, device)
valid_dl = DeviceDataLoader(valid_dl, device)
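The helpers themselves are not shown in this article. A minimal sketch of what they might look like follows; the names match the ones used above, but the bodies are an assumption rather than the author's exact code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def get_default_device():
    """Pick the GPU if one is available, otherwise fall back to the CPU."""
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')


def to_device(data, device):
    """Move a tensor, a module, or a list/tuple of them to the chosen device."""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)


class DeviceDataLoader:
    """Wrap a data loader so each batch is moved to the device as it is yielded."""

    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        # Number of batches; schedulers that need steps_per_epoch rely on this
        return len(self.dl)
```

Moving batches lazily inside `__iter__` means only one batch at a time occupies GPU memory, rather than the whole dataset.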

Step 3: Residual block

Here is a very simple residual block

import torch.nn as nn

class SimpleResidualBlock(nn.Module):
    def __init__(self):
        super().__init__()  # required so nn.Module can register the layers below
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()

    def forward(self, x):
        out = self.conv1(x)
        out = self.relu1(out)
        out = self.conv2(out)
        return self.relu2(out) + x  # ReLU can be applied before or after adding the input

simple_resnet = to_device(SimpleResidualBlock(), device)

# Sanity check on a single batch: the output shape should match the input shape
for images, labels in train_dl:
    out = simple_resnet(images)
    print(out.shape)
    break

del simple_resnet, images, labels

Here is our ResNet architecture, ResNet9:

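def conv_block(in_channels, out_channels, pool=False):
    # Assumed completion of the truncated list: conv -> batch norm -> ReLU
    layers = [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_channels),
              nn.ReLU(inplace=True)]
    if pool: layers.append(nn.MaxPool2d(2))  # halves height and width
    return nn.Sequential(*layers)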

class ResNet9(ImageClassificationBase):
    def __init__(self, in_channels, num_classes):
        super().__init__()

        # Shapes in the comments assume 32 x 32 input images
        self.conv1 = conv_block(in_channels, 64)       # 64 x 32 x 32
        self.conv2 = conv_block(64, 128, pool=True)    # 128 x 16 x 16
        self.res1 = nn.Sequential(conv_block(128, 128), conv_block(128, 128))

        self.conv3 = conv_block(128, 256, pool=True)   # 256 x 8 x 8
        self.conv4 = conv_block(256, 512, pool=True)   # 512 x 4 x 4
        self.res2 = nn.Sequential(conv_block(512, 512), conv_block(512, 512))

        self.classifier = nn.Sequential(nn.MaxPool2d(4),  # 512 x 1 x 1
                                        nn.Flatten(),     # 512, needed before nn.Linear
                                        nn.Linear(512, num_classes))

    def forward(self, xb):
        out = self.conv1(xb)
        out = self.conv2(out)
        out = self.res1(out) + out  # residual connection
        out = self.conv3(out)
        out = self.conv4(out)
        out = self.res2(out) + out  # residual connection
        out = self.classifier(out)
        return out

model = to_device(ResNet9(3, 10), device)
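ResNet9 extends ImageClassificationBase, which is not defined in this article. The sketch below shows one plausible version that supplies the methods the training code calls (training_step, validation_step, validation_epoch_end and epoch_end). The method bodies, the accuracy helper and the 'val_loss' / 'val_acc' keys are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def accuracy(outputs, labels):
    """Fraction of predictions (argmax over classes) that match the labels."""
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))


class ImageClassificationBase(nn.Module):
    def training_step(self, batch):
        images, labels = batch
        out = self(images)                   # forward pass
        return F.cross_entropy(out, labels)  # loss used for backpropagation

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        return {'val_loss': loss.detach(), 'val_acc': accuracy(out, labels)}

    def validation_epoch_end(self, outputs):
        # Average the per-batch metrics into one value per epoch
        epoch_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        epoch_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], train_loss: {:.4f}, val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result['train_loss'], result['val_loss'], result['val_acc']))
```

Keeping the loss and metric logic in a base class lets the model class itself contain nothing but the architecture.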

Step 4: Training the model

Before we train the model, we’re going to make a bunch of small but important improvements to our fit function:

  • Learning rate scheduling: Instead of using a fixed learning rate, we will use a learning rate scheduler, which will change the learning rate after every batch of training. There are many strategies for varying the learning rate during training, and the one we’ll use is called the “One Cycle Learning Rate Policy”, which involves starting with a low learning rate, gradually increasing it batch-by-batch to a high learning rate for about 30% of epochs, then gradually decreasing it to a very low value for the remaining epochs.
  • Weight decay: We also use weight decay, a regularization technique that keeps the weights from becoming too large by adding a penalty term to the loss function.
  • Gradient clipping: Apart from the layer weights and outputs, it is also helpful to limit the values of gradients to a small range to prevent undesirable changes in parameters due to large gradient values. This simple yet effective technique is called gradient clipping.

import torch
import torch.nn as nn

@torch.no_grad()
def evaluate(model, val_loader):
    model.eval()  # put batch norm layers in evaluation mode
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

def fit_one_cycle(epochs, max_lr, model, train_loader, val_loader,
                  weight_decay=0, grad_clip=None, opt_func=torch.optim.SGD):
    history = []

    # Set up custom optimizer with weight decay
    optimizer = opt_func(model.parameters(), max_lr, weight_decay=weight_decay)
    # Set up one-cycle learning rate scheduler
    sched = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, epochs=epochs,
                                                steps_per_epoch=len(train_loader))

    for epoch in range(epochs):
        # Training phase
        model.train()
        train_losses = []
        lrs = []
        for batch in train_loader:
            loss = model.training_step(batch)
            train_losses.append(loss.detach())
            loss.backward()

            # Gradient clipping
            if grad_clip:
                nn.utils.clip_grad_value_(model.parameters(), grad_clip)

            optimizer.step()
            optimizer.zero_grad()

            # Record & update learning rate
            lrs.append(get_lr(optimizer))
            sched.step()

        # Validation phase
        result = evaluate(model, val_loader)
        result['train_loss'] = torch.stack(train_losses).mean().item()
        result['lrs'] = lrs
        model.epoch_end(epoch, result)
        history.append(result)
    return history
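To see what weight decay and gradient clipping do to a single update, here is a simplified sketch in plain Python. It assumes a plain SGD step rather than Adam, and treats weight decay the way PyTorch's built-in weight_decay option does, by adding weight_decay * w to each gradient. The function names are hypothetical.

```python
def clip_grad_value(grads, clip_value):
    """Clamp every gradient entry to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


def sgd_step(weights, grads, lr, weight_decay=0.0, grad_clip=None):
    """One plain SGD update with optional value clipping and weight decay."""
    if grad_clip is not None:
        grads = clip_grad_value(grads, grad_clip)
    # The decay term pulls every weight towards zero in proportion to its size
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


weights = [1.0, -2.0, 0.5]
grads = [0.5, -0.3, 0.05]  # the first two exceed the clip value below
print(clip_grad_value(grads, 0.1))
print(sgd_step(weights, grads, lr=0.1, weight_decay=1e-4, grad_clip=0.1))
```

The clipping happens before the optimizer step, mirroring the order in the training loop: gradients are computed, clipped, and only then used to update the weights.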

Instead of SGD (stochastic gradient descent), we’ll train our model with the Adam optimizer, which uses techniques like momentum and adaptive learning rates for faster training.

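epochs = 8
max_lr = 0.01
grad_clip = 0.1
weight_decay = 1e-4
opt_func = torch.optim.Adam

history = []  # collects one result dict per epoch
history += fit_one_cycle(epochs, max_lr, model, train_dl, valid_dl,
                         grad_clip=grad_clip,
                         weight_decay=weight_decay,
                         opt_func=opt_func)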
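For a feel of the learning rates these settings produce, the sketch below approximates the one-cycle schedule in plain Python. It assumes the documented defaults of PyTorch's OneCycleLR (30% warm-up, cosine annealing, a starting rate of max_lr / 25 and a final rate 10,000 times smaller than that) and the 50,000 training images of CIFAR-10. It is a simplified re-implementation for illustration, not the library code, and `one_cycle_lr` is a hypothetical name.

```python
import math


def cosine_interp(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))


def one_cycle_lr(step, total_steps, max_lr,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Approximate learning rate at a given batch index (0-based)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1
    if step <= warmup_end:
        # Warm-up phase: rise from initial_lr to max_lr
        return cosine_interp(initial_lr, max_lr, step / warmup_end)
    # Annealing phase: fall from max_lr to min_lr
    remaining = total_steps - 1 - warmup_end
    return cosine_interp(max_lr, min_lr, (step - warmup_end) / remaining)


steps_per_epoch = math.ceil(50000 / 400)  # training images / batch size
total_steps = 8 * steps_per_epoch         # epochs * batches per epoch
for step in (0, total_steps // 4, total_steps // 2, total_steps - 1):
    print(step, one_cycle_lr(step, total_steps, max_lr=0.01))
```

The rate starts well below max_lr, peaks roughly 30% of the way through training, and ends several orders of magnitude lower.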

Our model trained to over 90% accuracy in just 4 minutes!

Step 5: Accuracy plot

Plotting accuracy vs. number of epochs:

Accuracy vs. number of epochs

Plotting loss vs. number of epochs:

Loss vs. number of epochs

It’s clear from the trend that our model isn’t overfitting to the training data just yet. Finally, let’s visualize how the learning rate changed over time, batch by batch, over all the epochs.

learning rate vs Batch no
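The plotting code is not included in this article. One way the three plots could be produced from the history list is sketched below. It assumes each entry holds the 'train_loss' and 'lrs' values recorded by fit_one_cycle, plus 'val_loss' and 'val_acc' keys, which are an assumption about what the validation step returns. All function names are hypothetical.

```python
def accuracy_series(history):
    """Validation accuracy per epoch."""
    return [epoch['val_acc'] for epoch in history]


def loss_series(history):
    """Training and validation loss per epoch."""
    train = [epoch['train_loss'] for epoch in history]
    val = [epoch['val_loss'] for epoch in history]
    return train, val


def lr_series(history):
    """Learning rate per batch, concatenated over all epochs."""
    return [lr for epoch in history for lr in epoch.get('lrs', [])]


def plot_history(history):
    # Imported here so the helpers above can be used without matplotlib
    import matplotlib.pyplot as plt

    fig, (ax_acc, ax_loss, ax_lr) = plt.subplots(1, 3, figsize=(15, 4))

    ax_acc.plot(accuracy_series(history), '-x')
    ax_acc.set(xlabel='epoch', ylabel='accuracy', title='Accuracy vs. number of epochs')

    train_losses, val_losses = loss_series(history)
    ax_loss.plot(train_losses, '-bx', label='Training')
    ax_loss.plot(val_losses, '-rx', label='Validation')
    ax_loss.set(xlabel='epoch', ylabel='loss', title='Loss vs. number of epochs')
    ax_loss.legend()

    ax_lr.plot(lr_series(history))
    ax_lr.set(xlabel='batch no.', ylabel='learning rate', title='Learning rate vs. batch no.')

    plt.show()
```

Calling plot_history(history) after training would draw all three charts side by side.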

