Convolutional Layers in Digit Recognition: More the Merrier?

An investigation into the relationship between test and validation metrics and number and nature of convolutional layers in a handwritten digit recognition network

17 min read · Aug 19, 2023

Introduction

Motivation

Being an absolute beginner to deep learning, building my first neural network was a trial by fire. It was a multiclass classification convolutional neural network as a part of a project to identify constellations from photographs. Though the network didn’t pan out too well for a few reasons (mostly the pitiable size of the dataset), it raised several questions that I never managed to get good answers to. How many hidden layers do I need? How many convolutional layers? What about maxpools? Dropouts?

It seemed the answer to most architectural questions was trial and error.

Unhelpful (but accurate) answers notwithstanding, this entire experience got me thinking about how the model architecture affected metrics like the accuracy of the model. Drawing on my high school experimental (mis)adventures, I decided to devise a method to actually investigate this.

Since my (limited) experience in deep learning was with image processing and classification, I decided to play around with probably the most common tools for the job: convolutional neural networks.

Background

So, what is a convolutional neural network? To answer this, it will suffice to know what convolutional and linear layers are.

Linear, or fully-connected, layers are, as the name suggests, sets of neurons, each with weights (multiplication factors) and a bias (a value to be added), in which every neuron connects to all of the neurons in the previous layer.

Mathematically, a linear layer takes in a matrix of data from the previous layer as input, applies a linear transformation (multiplying the input X by the weight matrix W and adding the bias vector B), and returns another matrix F(X) = XW + B.

Linear layers (usually in combination with activation functions) are often used for classification tasks, i.e. once you have feature data, how do you know what class the input belongs to? They take these feature data and assign a score to each class. Using activation functions like softmax, we can convert these class scores to classwise probabilities (the likelihood that the input belongs to a particular class). Because of this, linear layers are often placed at the end of the network, after the convolutional layers.
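
To make the scoring step concrete, here is a minimal pure-Python sketch of a linear layer followed by softmax. The helper names (`linear`, `softmax`) and all of the numbers are made up for illustration; a real network would use a library such as PyTorch rather than hand-written loops.

```python
import math

def linear(x, weights, bias):
    """Return x @ W + b for a single input vector x."""
    return [
        sum(x_i * w_ij for x_i, w_ij in zip(x, column)) + b_j
        for column, b_j in zip(zip(*weights), bias)
    ]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy values: 2 input features, 3 classes.
x = [1.0, 2.0]
W = [[0.5, -1.0, 0.0],   # weights from feature 0 to each class
     [0.25, 0.5, 1.0]]   # weights from feature 1 to each class
b = [0.0, 0.5, -1.0]

scores = linear(x, W, b)   # one raw score per class
probs = softmax(scores)    # classwise probabilities
print(scores, probs)
```

The class with the highest score also receives the highest probability, since softmax preserves the ordering of its inputs.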

Convolutional layers, on the other hand, are specifically designed to extract features from input data. They apply a set of filters to the input data, each of which is designed to detect a specific feature, such as edges or corners. The filters are applied to input data regions of sizes determined by the kernel, and the results are combined to produce a feature map. This process is repeated for each filter, resulting in multiple feature maps that represent different aspects of the input data.

Example of a Convolutional Layer (IndoML, 2019)

Mathematically, the convolutional layer is based on the mathematical operation of convolution. It can be described as a function that takes in an input image (represented as a matrix) and a filter matrix. The filter (also called the kernel) is slid over the image, and the element-wise product of the filter and the covered region is taken, and the values are summed to produce a single output. This is repeated for each position of the kernel to produce a new matrix, called the convolution of the input and the kernel. Therefore, a convolution can be described by the following equation:
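
A form consistent with this description is given here under assumed notation: I is the input image, F is an m × n filter, and (i, j) indexes the output position (the top-left corner of the region the filter covers).

```latex
(I * F)(i, j) = \sum_{u=0}^{m-1} \sum_{v=0}^{n-1} I(i + u,\, j + v)\, F(u, v)
```

Deep learning libraries such as PyTorch compute exactly this sliding sum of products, which is technically a cross-correlation; a textbook convolution would additionally flip the filter.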

Then, the equation for a convolutional layer with k filters, input matrix I, filter matrices F, and bias vectors B becomes:
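
One way to write this, under assumed notation (O_j denotes the j-th output feature map, * the sliding-window operation described above, and any activation function is left out), is:

```latex
O_j = I * F_j + B_j, \qquad j = 1, \dots, k
```

Here B_j is added to every element of the j-th feature map.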

All of the k feature maps are then stacked and passed to the next layer as a multichannel input.
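
The sliding-window computation and the stacking of feature maps can be sketched in a few lines of pure Python. This is an illustrative toy (the function names, the 4 x 4 image, and the two edge filters are all assumptions), using stride 1 and no padding.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide kernel over image (stride 1, no padding) and return one feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + u][j + v] * kernel[u][v]
                for u in range(kh) for v in range(kw)) + bias
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def conv_layer(image, kernels, biases):
    """Apply k filters and stack the k feature maps as output channels."""
    return [conv2d_valid(image, k, b) for k, b in zip(kernels, biases)]

# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
vertical_edge = [[-1, 1], [-1, 1]]     # responds to left-to-right brightening
horizontal_edge = [[-1, -1], [1, 1]]   # responds to top-to-bottom brightening

feature_maps = conv_layer(image, [vertical_edge, horizontal_edge], [0.0, 0.0])
for fmap in feature_maps:
    print(fmap)
```

The first feature map lights up only in the column where the vertical edge sits, while the second stays at zero because the image contains no horizontal edge.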

What makes convolutional layers so good for image recognition? First, unlike in a linear layer, not every neuron is connected to every neuron in the previous layer. Connecting all of the input to all of the neurons would ignore the spatial structure of the data. To account for this structure, each neuron is connected only to a small region of the input, called its receptive field. This form of connectivity, called sparse local connectivity, ensures that local features can be recognised and then combined into higher-level representations of the data in successive layers. Second, convolutional layers can recognise a feature or pattern regardless of its location in the image, a property usually described as translation invariance (strictly, the feature map shifts along with the input, which is translation equivariance). This is achieved through shared weights: the same set of weights (i.e. filters) is applied to different regions of the input data, allowing the same features to be detected at different locations in the image.
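
The effect of shared weights can be seen in a one-dimensional toy example: the same filter produces the same peak response wherever the pattern appears, and the peak simply moves with the pattern. The signal and filter values below are hypothetical.

```python
def correlate1d(signal, kernel):
    """Slide kernel over signal (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + u] * kernel[u] for u in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [1, 2, 1]                      # hypothetical filter tuned to a small "bump"
bump_left = [0, 1, 2, 1, 0, 0, 0, 0]    # pattern near the start
bump_right = [0, 0, 0, 0, 1, 2, 1, 0]   # same pattern, shifted 3 places right

resp_left = correlate1d(bump_left, kernel)
resp_right = correlate1d(bump_right, kernel)

# The location of the strongest response tracks the location of the pattern.
peak_left = resp_left.index(max(resp_left))
peak_right = resp_right.index(max(resp_right))
print(peak_left, peak_right)
```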

A convolutional neural network (CNN) is a combination of these linear and convolutional layers. Usually, a set of convolutional layers takes the input image and breaks it down into its features, and the linear layer uses the resulting feature maps as inputs to assign classwise probabilities.

Now that we know what a convolutional neural network is, how do we measure its performance?

There are two main metrics: accuracy and loss.

Loss is a measure of how far the model's predictions are from the true labels; lower is better. This is what the optimisation process tries to minimise during training. The value is usually calculated using a loss function like cross-entropy loss, which measures the dissimilarity between the predicted probability distribution generated by the model and the true distribution. Loss is primarily used in the training process to find the best parameter values, but it can also be evaluated on validation and test data, as is done in this investigation.

Accuracy, on the other hand, is the fraction of inputs the model classifies correctly, calculated as the ratio of correct predictions to total predictions. Measured on new, unseen data, it indicates how well the model generalises.
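
As a small worked example of both metrics, the sketch below computes the mean cross-entropy loss and the accuracy for a batch of four predictions. The probabilities and labels are invented for illustration, and the helper names are assumptions.

```python
import math

def cross_entropy(probs, label):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probs[label])

def accuracy(batch_probs, labels):
    """Fraction of examples whose highest-probability class matches the label."""
    correct = sum(
        1 for probs, label in zip(batch_probs, labels)
        if probs.index(max(probs)) == label
    )
    return correct / len(labels)

# Hypothetical predictions over 3 classes for 4 examples.
batch_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.4, 0.5, 0.1],   # misclassified: the true class is 0
    [0.2, 0.2, 0.6],
]
labels = [0, 1, 0, 2]

mean_loss = sum(cross_entropy(p, y) for p, y in zip(batch_probs, labels)) / len(labels)
acc = accuracy(batch_probs, labels)
print(mean_loss, acc)
```

The two metrics can move independently: making a wrong prediction more confident raises the loss but leaves the accuracy unchanged, since accuracy only counts which class scored highest.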

Research Question

What is the relationship between the test and validation metrics and the number and nature of convolutional layers in a handwritten digit recognition network?

Objectives:

To identify patterns and trends in the relationship between the test and validation metrics and the number and nature of convolutional layers in a handwritten digit recognition network.
To determine the number and nature of convolutional layers that achieve the best test and validation metrics in a handwritten digit recognition network.
To develop a better understanding of how the architecture of a CNN affects its performance on a handwritten digit recognition task.

Methodology

Dataset

Examples of handwritten digits in the MNIST database

For this investigation, I used the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits. It is a standard database of 28 x 28 pixel greyscale images for image recognition projects, with 60,000 training images and 10,000 testing images.

Variables

Independent:

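Independent variables are manipulated over the course of the investigation to observe changes in dependent variables. Here, they are the number of convolutional layers and their growth status (constant or doubling channel counts).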

Dependent:

Dependent variables change as a result of manipulating the independent variables. Here, they are the validation and test accuracy and loss.

Controlled:

Controlled variables are kept constant throughout the investigation. Here, they include the learning rate, batch size, loss function, and activation functions.

Experimental Setup

The experiment was conducted using PyTorch Lightning, a research framework and high-level interface for PyTorch.

The data were loaded in using PyTorch Lightning’s DataModule class. The data were prepared, transformed to tensors, and split into training, validation, and testing datasets.

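import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST

class MNISTDataLoader(pl.LightningDataModule):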
    def __init__(self, batch_size=32):
        super().__init__()
        self.batch_size = batch_size

    def prepare_data(self):
        MNIST(root="./data", train=True, download=True)
        MNIST(root="./data", train=False, download=True)

    def setup(self, stage=None):
        if stage == "fit" or stage is None:
            mnist_full = MNIST(root="./data", train=True, transform = transforms.ToTensor())
            self.train_dataset, self.val_dataset = random_split(mnist_full, [55000, 5000])

        if stage == "test" or stage is None:
            self.test_dataset = MNIST(root="./data", train=False, transform = transforms.ToTensor())

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size)

For the model architecture, the independent variables (number of convolutional layers and their growth status) were defined outside the recognition model class. They were varied throughout the investigation.

# Number and nature of convolutional layers
conv_layers = 8
growing = True

The model class's initialiser first sets the controlled variables (learning rate, batch size, loss function, activation functions), and then uses the independent variables to define a working architecture for the model. This involves defining the linear layer, and then making a list of convolutional layers with the appropriate number of input and output channels.

In the non-growing case (i.e. growing = False), all of the convolutional layers are first set to have 32 in- and out-channels. Then, the first layer is changed to have only 1 in-channel, and the last layer is changed to have 784 out-channels, corresponding to the number of pixels in the image (28 x 28).

In the growing case (i.e. growing = True), convolutional layers are defined with an exponentially increasing number of in- and out-channels, doubling the number of channels for every convolutional layer. Then, similarly to the non-growing case, the in- and out-channels of the first and last layers are changed.

These two cases (constant channels and growing channels) were chosen to provide a deeper and more thorough understanding of convolutional layers and their impact on performance metrics.

No other layers or operations like dropout or maxpooling were included to isolate the effect of adding convolutional layers.
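
The two channel schemes can be summarised with a small helper that lists the (in, out) channel pair of every layer. `channel_plan` is a hypothetical function written for this illustration and is not part of the notebook; it follows the description above and treats a single layer as going straight from 1 to 784 channels.

```python
def channel_plan(conv_layers, growing):
    """Return a list of (in_channels, out_channels) pairs, one per convolutional layer."""
    if conv_layers == 0:
        return []
    if conv_layers == 1:
        # A single layer must map the 1-channel image straight to 784 channels.
        return [(1, 784)]
    if growing:
        # Layer i maps 2**(i + 4) -> 2**(i + 5) channels: 16->32, 32->64, 64->128, ...
        plan = [(2 ** (i + 4), 2 ** (i + 5)) for i in range(conv_layers)]
    else:
        plan = [(32, 32)] * conv_layers
    plan[0] = (1, plan[0][1])       # greyscale input has 1 channel
    plan[-1] = (plan[-1][0], 784)   # last layer outputs 784 channels
    return plan

print(channel_plan(4, growing=False))
print(channel_plan(4, growing=True))
```

With two layers, both schemes reduce to the same pair of layers (1 to 32, then 32 to 784), so the growing and non-growing architectures only start to differ from three layers onwards.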

# Model
import pytorch_lightning as pl
import torch.nn as nn
import torchmetrics as tm

class recognitionModel(pl.LightningModule):
    def __init__(self, lr=0.001, batch_size=32):
        super().__init__()

        # Define and save hyperparameters
        self.lr = lr
        self.batch_size = batch_size

        # Model Architecture
        # Each 3x3 convolution (stride 1, no padding) shrinks the image by 2 pixels
        # per dimension, so the linear layer's input size depends on conv_layers.
        # With conv_layers = 8 this is 784 * 12 * 12 = 112896.
        side = 28 - 2 * conv_layers
        channels = 784 if conv_layers > 0 else 1
        self.linear = nn.Linear(channels * side * side, 10)

        # Function definitions
        self.loss = nn.CrossEntropyLoss()
        self.softmax = nn.Softmax(dim=1)
        self.relu = nn.ReLU()

        if conv_layers == 1:
            # A single layer maps the 1-channel image straight to 784 channels.
            # This case is checked first; the general branch would give it 32 input channels.
            self.conv = nn.ModuleList([nn.Conv2d(1, 784, 3, 1)])
        elif conv_layers > 1:
            if not growing:
                # Convolutional layers (constant number of channels in each layer [32])
                self.conv = nn.ModuleList([nn.Conv2d(32, 32, 3, 1) for i in range(conv_layers)])
                # First layer has 1 input channel
                self.conv[0] = nn.Conv2d(1, 32, 3, 1)
                # Last layer has 784 output channels
                self.conv[conv_layers-1] = nn.Conv2d(32, 784, 3, 1)
            else:
                # Convolutional layers (doubles number of channels in each layer starting from 32)
                self.conv = nn.ModuleList([nn.Conv2d(2 ** (i + 4), 2 ** (i + 5), 3, 1) for i in range(conv_layers)])
                # First layer has 1 input channel
                self.conv[0] = nn.Conv2d(1, 32, 3, 1)
                # Last layer has 784 output channels
                self.conv[conv_layers-1] = nn.Conv2d(2 ** (conv_layers - 1 + 4), 784, 3, 1)

        # Metrics
        self.train_acc = tm.Accuracy(task="multiclass", num_classes=10)
        self.val_acc = tm.Accuracy(task="multiclass", num_classes=10)
        self.test_acc = tm.Accuracy(task="multiclass", num_classes=10)

Then, the forward propagation step is defined, where the convolutional layers (if they exist) are applied to the input first, and then the output from the convolutional layers is flattened and passed to the linear layer for classification.

    # Forward propagation step
    def forward(self, x):

        if (conv_layers > 0):
            # Convolutional layer(s)
            for i in range(conv_layers):
                x = self.conv[i](x)
                x = self.relu(x)

        # linear layer
        x = nn.Flatten()(x)
        x = self.linear(x)

        return x
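
The size of the flattened tensor handed to the linear layer follows from simple arithmetic, assuming (as in the listing) 3 x 3 kernels with stride 1 and no padding: every convolution trims 2 pixels off each side length, and the last layer outputs 784 channels. `flattened_features` is a hypothetical helper for illustration, not part of the notebook.

```python
def flattened_features(conv_layers, image_side=28, kernel=3, out_channels=784):
    """Number of values reaching the linear layer after conv_layers convolutions."""
    if conv_layers == 0:
        return image_side * image_side  # raw 28 x 28 greyscale image, 1 channel
    side = image_side - conv_layers * (kernel - 1)  # each layer trims kernel - 1 pixels
    return out_channels * side * side

sizes = {n: flattened_features(n) for n in (0, 2, 4, 6, 8)}
for n, size in sizes.items():
    print(n, size)
```

For eight layers this gives 784 x 12 x 12 = 112896 input features; other layer counts need a correspondingly different input size for the linear layer.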

Finally, the training, validation, and testing steps were defined in the recognition model class. All of them are relatively similar. First, a batch is defined using a set of inputs and their corresponding labels. These inputs are forward-propagated through the model to generate logits (output of the last layer before the activation function is applied), and then used to calculate the loss. Finally, the calculated metrics for the step (the accuracy and the loss) are logged, and the loss is returned.

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        logits = self.forward(inputs)
        loss = self.loss(logits, labels)

        self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        self.log('train_acc', self.train_acc(self.softmax(logits), labels), on_step=True, on_epoch=True, prog_bar=True, logger=True)

        return loss

    def validation_step(self, batch, batch_idx):
        inputs, labels = batch
        logits = self.forward(inputs)
        loss = self.loss(logits, labels)

        self.log('val_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        self.log('val_acc', self.val_acc(self.softmax(logits), labels), on_step=True, on_epoch=True, prog_bar=True, logger=True)

        return loss

    def test_step(self, batch, batch_idx):
        inputs, labels = batch
        logits = self.forward(inputs)
        loss = self.loss(logits, labels)

        self.log('test_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        self.log('test_acc', self.test_acc(self.softmax(logits), labels), on_step=True, on_epoch=True, prog_bar=True, logger=True)

        return loss
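
A note on the logits: `nn.CrossEntropyLoss` expects raw logits because it applies the log-softmax internally, which is why the loss is computed before any softmax. For accuracy, only the highest-scoring class matters, and softmax does not change which class that is. The pure-Python check below illustrates this with hypothetical logits; the helper names are assumptions.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for a batch of 3 examples over 3 classes.
logits_batch = [[2.0, -1.0, 0.5], [0.1, 0.2, 0.3], [-3.0, -0.5, -2.0]]

pred_from_logits = [argmax(row) for row in logits_batch]
pred_from_probs = [argmax(softmax(row)) for row in logits_batch]
print(pred_from_logits, pred_from_probs)
```

Both lists of predictions are identical, so applying softmax before the accuracy metric is harmless but does not affect the predicted class.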

Outside the recognition model class, the dataloader, the model, and the trainer are initialised using the requisite controlled variables and the PyTorch Lightning framework. In addition, a TensorBoard is also initialised as a tool to log and visualise the metrics of the model.

%load_ext tensorboard
%tensorboard --logdir=./lightning_logs
from pytorch_lightning import loggers as pl_loggers
tensorboard = pl_loggers.TensorBoardLogger(save_dir="")
trainer = pl.Trainer(logger = tensorboard, max_epochs = 100)
model = recognitionModel()
print(model)
dataLoader = MNISTDataLoader()

In the main function, the trainer trains, validates, and tests the model, and a checkpoint of the trained model is saved.

# Main
if __name__ == "__main__":
    trainer.fit(model, dataLoader)
    trainer.validate(model, dataLoader)
    trainer.test(model, dataLoader)
    trainer.save_checkpoint("model.ckpt")

Link to the full experimental setup (Google Colab notebook): Digit Recognition, Testing Code (colab.research.google.com)

Execution & Data Collection

For testing, the runtime was cleared, the independent variables were changed, and the code was executed on a Google Colab T4 GPU runtime. The metrics were then recorded in a tabular format in a Google Spreadsheet up to 15 decimal places, and the generated TensorBoard was saved to TensorBoard.dev.

Results & Discussion

Raw Data

Note: the results for 0 growing convolutional layers and 2 growing convolutional layers had the same architecture as their non-growing counterparts. Therefore, their results are duplicated.

Non-Growing Convolutional Layers

In the non-growing case, as shown in Graph A, we observe that both the validation and test accuracy increased significantly when convolutional layers were added, going from 0.917 and 0.925 to 0.984 and 0.985 respectively. However, subsequent additions of convolutional layers yielded negligible improvements in accuracy, settling around a final validation and test accuracy of 0.986 and 0.988 respectively.

This may be because convolutional layers allow the model to learn local features and patterns, enabling the sharp rise in accuracy observed between 0 and 2 layers. It is likely that the model is around maximum capacity after 4 to 6 convolutional layers, and thus, the inclusion of subsequent convolutional layers does not allow the model to learn any more useful information from the data. This would explain why the accuracy of the model stabilised after a certain number of layers.

From Graph B, we can see that the loss values are much more volatile; both are monotonically increasing between 0 and 4 convolutional layers, and then decreasing between 4 and 6. While the validation loss continues to decrease, the test loss increases slightly between 6 and 8 convolutional layers.

This volatility observed in the validation and test loss could be due to a number of reasons. Firstly, a constant learning rate, as chosen in this investigation, may cause problems with optimisation; a high learning rate may cause the model to overshoot the optimal solution and diverge, while a low learning rate could cause the model to get stuck in non-optimal local minima and converge too slowly. Secondly, it could be due to gradients being too small, introducing numerical instability to the model’s weights, leading to inaccuracies during backpropagation. These reasons, combined with a poor initialisation of the weights and biases could lead to the erratic loss patterns observed.

Metrics of growing convolutional layers (left figure: **Graph C**, right figure: **Graph D**)

In the growing case, as shown in Graph C, the trends in accuracy observed are similar to the non-growing case. The addition of convolutional layers leads to an initial increase in validation and test accuracy, and subsequent layers contribute to negligible increases in accuracy.

Similar to Graph B, the loss values were observed to be volatile in Graph D, steeply rising from 0.305 to a maximum validation loss of 1.134, and then decreasing to about 0.173. While validation loss had a steeper increase and a higher peak, it was only marginally higher than the test loss as the number of layers increased from 4 to 8.

Accuracy vs # Convolutional Layers (left figure: **Graph E**, right figure: **Graph F**)

Comparing the growing and non-growing cases for validation and test accuracy, we observe that the non-growing case consistently had a slightly higher validation accuracy compared to the growing case as the number of convolutional layers increased from 2 to 8. However, this difference was not significant and had little bearing on the final test accuracy, with the growing and non-growing cases recording a roughly equal test accuracy throughout the investigation. In both cases, the overall trend was that increasing the number of convolutional layers led to an increase in the accuracy of the model.

Loss vs # Convolutional Layers (left figure: **Graph G**, right figure: **Graph H**)

Comparing the growing and non-growing cases for validation and test loss, it is observed that the test and validation loss are quite volatile in both cases. In the non-growing case, the validation loss is relatively stable, showing much less variation than the validation loss for the growing case, which rises sharply between 0 and 4 layers but declines between 4 and 8, falling below the validation loss of the non-growing case. From the test loss trends observed in Graph H, it is evident that both the growing and the non-growing case followed a similar increasing trend between 0 and 4 layers. However, they diverged between 4 and 8 layers: the growing case consistently declined, falling below its initial loss to a minimum of 0.132, whereas the non-growing case declined between 4 and 6 layers, but rose slightly between 6 and 8 layers.

Overall, the loss for growing convolutional layers seemed to decline below the initial loss measured at 0 layers, while the loss for non-growing convolutional layers seemed to stay above the initial loss over the course of the investigation.

Correlation

From Table I, we can see that overall, there is a strong positive correlation between the number of convolutional layers and both measures of accuracy, as indicated by the high Pearson product-moment correlation (PPMC) coefficient values. When differentiated by growth status, this strong positive correlation still seems to hold. The correlation between the accuracy metrics and the number of non-growing convolutional layers is marginally stronger than the correlation between the accuracy metrics and the number of growing convolutional layers.

For the loss metrics, we observe that there is a very weak, almost insignificant negative correlation between the number of convolutional layers and both measures of loss. When differentiated by growth status, there is a somewhat significant (but still very weak) negative correlation in the growing case, where the correlation coefficients are higher in magnitude (-0.106 and -0.231 respectively for validation and test loss) than the overall coefficients. This trend is not observed for the non-growing case, where the PPMC coefficient for the correlation between validation loss and number of non-growing convolutional layers is almost zero. Surprisingly, a very weak but significant positive correlation was observed between the test loss and number of non-growing convolutional layers.
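
For reference, the PPMC coefficient can be computed as in the sketch below. The function name and both data series are hypothetical and serve only to illustrate the calculation; they are not the recorded results.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

layers = [0, 2, 4, 6, 8]
# Hypothetical values for illustration only, not the recorded results.
saturating_metric = [0.90, 0.98, 0.985, 0.986, 0.986]   # jumps, then plateaus
perfectly_linear = [1.0, 2.0, 3.0, 4.0, 5.0]

r_saturating = pearson_r(layers, saturating_metric)
r_linear = pearson_r(layers, perfectly_linear)
print(r_saturating, r_linear)
```

Because the coefficient measures linear association, a series that jumps once and then plateaus still yields a clearly positive (though not perfect) value, which is worth keeping in mind when only five layer counts are available per series.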

Summary

As evidenced by Graphs A, C, E, F, and Table I, a significant positive correlation was observed between both the accuracy metrics and the convolutional layer count between 0 and 4 convolutional layers, after which the trend stabilised and the growth became insignificant as the number of layers increased. No statistically significant difference was observed in the accuracy metrics when differentiated by growth status.

As Graphs B, D, G, and H show, the loss metrics were too erratic and volatile to determine any statistically significant correlation. This may be due to errors in the experimental setup, such as the lack of a learning rate scheduler. Moreover, as Table I shows, when the convolutional layer counts were differentiated by their growth status, a weak negative correlation was observed in the growing case and a weak positive correlation was observed in the non-growing case. However, when the data were combined, the overall correlation between convolutional layer count and test loss was close to zero. This instance of Simpson’s paradox may also point to a confounding variable that was not controlled for in the experimental setup.

Limitations & Extensions

Experimental Setup

In order to isolate the effects of the nature and number of convolutional layers, this experimental setup did not account for a myriad of other factors that go into the architecture of convolutional neural networks. For example, CNNs used in the real world often contain layers and operations like batch normalisation, dropout layers, and pooling layers. These additions can help reduce overfitting, improve accuracy, and speed up convergence to reduce loss.

In addition, instead of a fixed set of hyperparameters as used in this investigation, most real-world CNNs employ schedulers to tweak their hyperparameters (like learning rate) during training to help improve the convergence and stability of the model. Some models also use techniques like early stopping to automatically stop training when the performance of the model stops improving.
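
The idea behind early stopping can be sketched in plain Python: training halts once the validation loss has stopped improving for a set number of epochs. The function name, the `patience` and `min_delta` parameters, and the loss values are assumptions for illustration; PyTorch Lightning provides an `EarlyStopping` callback for use in practice.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its best
    value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: training ran to the last epoch

# Hypothetical validation losses: improving at first, then drifting upward.
val_losses = [0.90, 0.50, 0.30, 0.25, 0.27, 0.31, 0.36, 0.45]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```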

In future iterations of this investigation, a more realistic experimental setup incorporating these techniques and layers could be devised to increase the applicability of the findings to real-world CNNs.

Range & Number of Trials

Due to the limited computational power and time available on Google Colab, this investigation was unable to include a larger range of convolutional layer counts; it was already prohibitively expensive at 8 growing layers on Colab in terms of computation time, and going further would require access to more powerful machines. The range used in the current investigation does not lend itself well to generalisation to real-world CNNs.

Due to the limited computation time, the number of trials was also limited. Since training a CNN is not deterministic (the weights are initialised randomly), averaging the results of many trials could have yielded more useful and precise findings.
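
Reporting repeated trials could be as simple as a mean and a sample standard deviation per configuration, as in this sketch. The accuracies are hypothetical placeholders and `summarise_trials` is an assumed helper name.

```python
import statistics

def summarise_trials(accuracies):
    """Return the mean and sample standard deviation across repeated trials."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Hypothetical test accuracies from 5 runs of one configuration with different seeds.
trial_accuracies = [0.984, 0.986, 0.985, 0.987, 0.983]
mean_acc, std_acc = summarise_trials(trial_accuracies)
print(mean_acc, std_acc)
```

A difference between two configurations that is smaller than the spread across seeds would be hard to attribute to the architecture itself.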

Thus, future iterations of this investigation could include a greater range of convolutional layer counts and a larger number of trials.

Task Selection

In the real world, CNNs are used for a variety of tasks, ranging from neural style transfer to multilabel classification. This investigation was only conducted on the task of multiclass image classification, a more specific case of multilabel classification. Thus, the findings of this investigation may not be generalisable to the different tasks that CNNs are used for.

Future iterations of this investigation could include obtaining metrics from CNNs over a variety of tasks for more generalisable results.

Conclusion

This investigation aimed to explore the relationship between the test and validation metrics and the number and nature of convolutional layers in a handwritten digit recognition network. The results showed a significant positive correlation between the accuracy metrics and the number of convolutional layers up to 4 layers, after which the growth became insignificant. However, the loss metrics were too erratic to determine any statistically significant correlation. An instance of Simpson’s paradox was also observed, suggesting the presence of a confounding variable that was not controlled for in the experimental setup.

These findings may have implications for the design of CNNs for image classification tasks. By understanding the relationship between the architecture of a CNN and its performance, it may be possible to optimise the design of these networks to achieve higher accuracy and lower loss. However, it is important to note that this investigation had several limitations, including a limited range of convolutional layer counts and a limited number of trials, which severely limits the applicability of these findings to real-world CNNs. Future research could address these limitations by incorporating additional factors into the experimental setup, increasing the range of convolutional layer counts and the number of trials, and conducting similar investigations on different tasks.

Overall, this investigation was able to partially address the research question and objectives, enabling me to develop a better understanding of how the architecture of a CNN affects its performance on a handwritten digit recognition task. However, due to the limitations of the investigation, it failed to establish definite trends or determine an optimal number of convolutional layers.