The Architecture of the Models Changing the World

Discover the dynamic structure of the Convolutional Neural Network and understand the inner workings of the model archetype driving Computer Vision

What is a Convolutional Neural Network?

A convolutional neural network is a machine-learning model that can take in signal data and output a classification or set of classifications. This signal data may be one-dimensional, such as audio data, or more commonly two-dimensional (images). The ability of computers to meaningfully interpret images is the basis for computer vision.

Computer Vision

Computer vision is quickly emerging as one of the most important goals of artificial intelligence. From Face ID to text recognition, computer vision allows our computers to perceive the world in much the same manner we do. This ability for computers to recognize the world around them has enabled us to use them as aids in our day-to-day lives in ways we previously could only imagine. The self-driving car, for example, heavily relies on the perception abilities of Convolutional Neural Networks (CNNs).

Figure 1: An example of CNNs in self-driving cars. The architecture of the CNN is described by the box diagram.

As pictured in the figure above, the car's cameras and other sensors capture its surroundings, and interpreting those captures is the duty of a CNN. In this context, images are preprocessed and fed into the CNN, allowing the car to detect obstacles, traffic signals, lane lines, and other important environmental features. Once these key items are detected, the car can decide how to proceed, fully informed by the CNN.

Convolution

The first step in understanding the architecture of a convolutional neural network is grasping convolution. Simply put, convolution is the act of sliding a weight matrix over an input matrix. At each position, the element-wise product of the weight matrix and the region of the input it overlaps is summed, and that sum becomes the new value at the index of the center of the weight matrix. Convolution can be performed in both 1D and 2D; since we are dealing with images, which are two-dimensional, the convolution here is performed in 2D. For reference, 2D convolution for a grayscale image is given by the equation:

$$g(x, y) = (\omega * f)(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} \omega(s, t)\, f(x - s, y - t)$$

Figure 2: The 2D convolution equation, where f is the image and ω is the kernel.

Every convolution has two key components: an image and a kernel. The kernel is the sliding window of weights, a small matrix (commonly 3×3) that is laid over the image. Naturally, the image is the true picture, represented by pixel values in the form of a matrix. In the summations above, a and b set the extent of the kernel, which spans (2a + 1) × (2b + 1) entries.

Figure 3: A convolution example; here we see the kernel (second matrix) being used as a sliding window (green).

In the above example, a "same" convolution is performed (the kernel has already been rotated); that is, the output image has the same dimensions as the input image. In CNNs, the "valid" shape is also common, in which the output contains only the pixels for which the entire kernel fits within the borders of the image. Whereas "same" convolution requires some sort of padding scheme, "valid" does not introduce any bias via padding. While you lose some data using "valid", data quality is ensured because no values are arbitrarily imputed into the system.
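To make the shape behavior concrete, here is a minimal sketch using SciPy's convolve2d, which, like the example above, performs true convolution with a flipped kernel; the image and kernel values are arbitrary and chosen only for illustration:

```python
import numpy as np
from scipy.signal import convolve2d

# A 5x5 "image" and a 3x3 kernel with arbitrary illustrative values
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0., 1., 0.],
                   [1., -4., 1.],
                   [0., 1., 0.]])  # a simple edge-detecting (Laplacian) kernel

# "same": output matches the input's 5x5 shape; the border is zero-padded
same = convolve2d(image, kernel, mode="same", boundary="fill", fillvalue=0)

# "valid": only positions where the kernel fits entirely inside the image,
# so the output shrinks and no padded values are imputed
valid = convolve2d(image, kernel, mode="valid")

print(same.shape)   # (5, 5)
print(valid.shape)  # (3, 3)
```

With a 5×5 image and a 3×3 kernel, "same" zero-pads the border to preserve the 5×5 shape, while "valid" shrinks the output to 3×3 without imputing any values.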

Figure 4: A small diagram showing how stride affects convolution.

Stride

Within the context of CNNs, the stride parameter is crucial for efficient feature extraction. Essentially, the stride dictates how far the convolutional filter moves across the input data during the convolution operation. By default, the filter is applied at each adjacent index, but with a larger stride value the filter can skip pixels, resulting in fewer computations and altered output dimensions. A stride of 1 means the filter moves one pixel at a time, preserving the input's spatial dimensions (given "same" padding). Increasing the stride to 2 or more causes the filter to skip pixels, producing downsampled feature maps with reduced output dimensions. This downsampling can be advantageous for computational efficiency and helps in capturing higher-level features, with the added benefit of reducing the risk of overfitting. It is important to note that adjusting the stride involves a trade-off: while larger strides accelerate computation and reduce spatial dimensions, they may also discard fine-grained spatial information, potentially affecting the network's performance. A sketch of a strided convolution appears below.
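Here is a minimal NumPy sketch of a strided, "valid"-shape sliding window. Note it is written as cross-correlation (the un-flipped variant most deep learning libraries actually implement), and the image and kernel values are arbitrary:

```python
import numpy as np

def strided_conv2d(image, kernel, stride=1):
    """'Valid'-shape cross-correlation (un-flipped kernel) with a stride."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise product of the kernel with the overlapped patch, summed
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0                        # a simple averaging kernel
print(strided_conv2d(image, kernel, stride=1).shape)  # (4, 4)
print(strided_conv2d(image, kernel, stride=2).shape)  # (2, 2), downsampled
```

The output size along each axis is (input - kernel) // stride + 1, so doubling the stride roughly halves each output dimension.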

The architecture of a CNN for image classification

Figure 5: Sample architecture for image classification, AlexNet

Convolutional Layer

While a CNN can be built with many different hyperparameters (e.g., kernel sizes, number of layers, stride), its essential building blocks are the same. The first convolutional layer works directly on the input image. At this initial stage, simple kernels detect edges and other basic information about the image via convolution of the image with many kernels. These kernels take on random values for their weights at initialization, but through training they are updated into kernels useful to the model. Furthermore, because multiple kernels are applied here, the output of this layer is a stack of feature maps that are fed forward into the later layers of the model. It is common for stride to be applied in these layers as well, which reduces the size of these maps.
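As a rough sketch of such a layer in PyTorch (the channel count, kernel size, and stride below are arbitrary illustrative choices, not values from any particular model):

```python
import torch
import torch.nn as nn

# 16 learnable 3x3 kernels applied to a 3-channel (RGB) input image.
# The weights start random and are updated during training.
conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2)

x = torch.randn(1, 3, 224, 224)   # one dummy 224x224 RGB image
feature_maps = conv1(x)
print(feature_maps.shape)         # torch.Size([1, 16, 111, 111])
```

One input image goes in; a stack of 16 feature maps comes out, each downsampled by the stride of 2.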

ReLU Activation

At each convolutional layer, a ReLU (rectified linear unit) activation function is applied. The ReLU activation is essential to the CNN and other forms of neural networks, as it introduces nonlinearity. The basic principle is that a neuron is either relevant or it is not: a negative value from one neuron dragging down the impact of another neuron firing is undesirable. Thus, a ReLU activation function is used. Its input is the output of the convolution, and its output is the maximum of that input and zero; all negative values are set to zero, and all positive values are kept the same. While there are other activation functions, ReLU is currently preferred due to the simplicity of its computation in both the forward pass and backpropagation.
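In code, ReLU is a one-line, element-wise operation; here is a minimal NumPy version:

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0): negatives are zeroed, positives pass through
    return np.maximum(x, 0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
# [0.  0.  0.  1.5 3. ]
```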

Pooling Layer

After a convolutional layer, it is common to introduce a pooling layer before the activation function. The pooling layer further reduces the dimensionality of the input. Similar to a kernel, the pooling window is laid on top of the image, computes an aggregation of the overlaid area, and is then slid along. Typically the pools do not overlap; however, some models are improved via overlapping pools. The aggregation function is typically the max or the mean. This step is important because it reduces the size of the feature maps, which speeds up training by shrinking the number of activations (and, downstream, the number of parameters to fit). Additionally, it helps prevent overfitting: by aggregating areas of the convolutional layer's output, it makes the model more general.

Figure 6: Exemplified above is a pooling layer with stride. The dimensionality-reduction capabilities of max pooling are clearly evident.
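A minimal NumPy sketch of non-overlapping max pooling (the window size and input values are arbitrary, for illustration):

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    h, w = feature_map.shape
    oh, ow = h // size, w // size
    # Reshape into (oh, size, ow, size) blocks and take the max of each block
    blocks = feature_map[:oh * size, :ow * size].reshape(oh, size, ow, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fm))
# [[ 5.  7.]
#  [13. 15.]]
```

Each 2×2 region collapses to its single largest value, halving both spatial dimensions.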

Layering

Typically, there are multiple convolutional and pooling layers. The number of layers depends on how sophisticated a model you want. A more sophisticated model can provide better accuracy, but it comes at the cost of training time and data required, and it runs the risk of overfitting. Sometimes there are multiple convolutional layers directly after one another; in this way, the model can build on the simple patterns of the previous layer to capture more abstract patterns and complete objects. These layers are not densely connected; that is, not every input node affects every output node, allowing the network to build and follow localized, abstract patterns. This also reduces the number of inputs per output, speeding up training and inference and decreasing the likelihood of overfitting.
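As an illustrative PyTorch sketch, here are two convolution-and-pooling blocks stacked so that the second builds on the patterns extracted by the first (channel counts and sizes are arbitrary):

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges and blobs
    nn.ReLU(),
    nn.MaxPool2d(2),                              # halve spatial dimensions
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # more abstract patterns
    nn.ReLU(),
    nn.MaxPool2d(2),
)

x = torch.randn(1, 3, 64, 64)
print(features(x).shape)  # torch.Size([1, 32, 16, 16])
```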

Fully Connected Layers

The final layers of a CNN are fully connected. More concretely, each input contributes to the output of each node. An activation function is used here as well; as in a typical neural network, the input nodes are connected via weights to the output layer. In this step, the input is flattened so that all input nodes connect directly to the output nodes; explicitly, we lose the notion of a two-dimensional input here. From this point onward, the model looks much like an ordinary neural network. The dot product of the weights is taken with the input nodes, then the activation function is applied to determine whether the neuron fires and with what intensity. If it does, the neuron's value is fed forward as input for the next layer, which is either another fully connected layer or the output layer.

Figure 7: Pictured above is an example of fully connected layers in a CNN.
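A minimal PyTorch sketch of this flattening and fully connected stage (the layer sizes and ten-class output are hypothetical):

```python
import torch
import torch.nn as nn

# Flatten the stack of 2D feature maps into one vector, losing the notion
# of spatial layout, then connect every input to every output node.
classifier = nn.Sequential(
    nn.Flatten(),                  # (1, 32, 16, 16) -> (1, 8192)
    nn.Linear(32 * 16 * 16, 128),  # fully connected: weights @ inputs + bias
    nn.ReLU(),
    nn.Linear(128, 10),            # raw scores for 10 hypothetical classes
)

feature_maps = torch.randn(1, 32, 16, 16)
logits = classifier(feature_maps)
print(logits.shape)  # torch.Size([1, 10])
```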

Softmax Layer

The softmax layer serves as the final step in many CNN architectures. Situated just before the output layer, its nodes correspond to the potential image labels, each holding the probability that the image corresponds to that label. These probabilities are determined by the preceding fully connected layer's outputs, along with the biases associated with those outputs. It is governed by the formula

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where z is the vector of K raw outputs. The exponential function is applied to each output, transforming them into non-negative values; normalization by the denominator then bounds the values in the range zero to one and makes them sum to one, which allows interpretation as probabilities.
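A minimal, numerically stable NumPy version of this formula (the example scores are arbitrary):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick; it does
    # not change the result because softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw outputs from the last FC layer
probs = softmax(scores)
print(probs)                # approx. [0.659 0.242 0.099]: non-negative, sums to 1
print(np.argmax(probs))     # 0, the index of the single most likely class
```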

At this stage, the architect can decide whether to return the single most likely classification or a set of the top classifications based on their associated probabilities. This flexibility enables different strategies for decision-making based on the specific needs of the application.

In summary, the softmax layer transforms the raw output of the network into a probability distribution, enabling clear interpretation and decision-making in CNN-based classification tasks.

Conclusion

The key difference between a CNN and other ANNs is the network's ability to recognize locality as a factor; crucially, the convolution operation affords the network this power. With this basic understanding of the CNN architecture, we can create a very strong image classification model based solely on linear algebra and convolution. Unfortunately, for a self-driving car, we have to progress from image classification to object detection. That topic is outside the scope of this article, but with a foundational understanding of CNNs, object detection is just an extension of what we have learned. The basic principle of object detection is sliding a CNN over an image at different scales, creating a collection of classification problems; in this way, objects are classified as sub-images of the input. For more information on object detection, see A Basic Introduction to Object Detection.
