Convolutional neural networks are widely regarded as the leading technique for classifying images, objects within images, and even individual pixels in an image (semantic segmentation) according to the class they belong to. However, the literature leaves gaps that I believe must be filled to form a complete understanding of why they work as well as how. This article seeks to fill those gaps; a working understanding of basic neural networks is assumed.
The visual cortex
Studies done in the 1950s and 1960s revealed the overall architecture of the visual cortex as a hierarchy of neurons. Neurons at the bottom of the hierarchy treat the entire visual field as a grid, where each neuron takes one cell of the grid and searches it for simple patterns. The next neurons in the hierarchy are more complex in structure and function, taking the inputs from the initial neurons and constructing more complicated patterns. Since these complex neurons take inputs from multiple “simple” neurons, it follows that they respond to a larger area of the grid, called a “receptive field”. The receptive field for a simple neuron would be 1 grid square (arbitrary), and the receptive field for a complex neuron would be the combined area of the grid squares corresponding to the “simple” neurons it takes input from. Grid squares (receptive fields) for simple neurons tend to overlap. For example, if a visual field of 5x5 arbitrary units of length is divided into 25 overlapping receptive fields, laying those fields edge-to-edge without overlap would cover an area larger than 5x5.
The efficacy of the visual cortex speaks for itself. We will refer back to this information throughout the rest of the article.
A computational model for “simple” neurons
To replicate the methods of the visual cortex, we want to begin by completing the following three tasks:
- Breaking the input image into a grid
- Assigning classifiers to receptive fields to search for patterns
- Applying the classifiers to all the receptive fields in an image
Task #1 is to break the input image into a grid. The main choice we must make here is how large we want the receptive fields to be for the simplest layer of neurons. In a perfect world, the ideal shape of a receptive field is a circle, because its symmetry makes no assumptions about what the patterns will look like. Unfortunately, this is not feasible on a grid of pixels, so squares are the next best option for maximum symmetry. The symmetry argument also asserts that we need squares with odd-integer dimensions, so the filters have a center. We will choose small receptive fields, because they are able to extract a large number of simple patterns which can later be assembled into more complex patterns. By starting with more complex patterns, you limit the building blocks available to you to build new things. In addition, the number of parameters in a filter grows in proportion to the area of its receptive field, making smaller receptive fields computationally efficient despite needing more of them. In the context of machine learning, this is known as a win-win. Receptive fields are usually 3x3, 5x5, or 7x7.
Task #2 is to assign a classifier to receptive fields, which we choose to be 3x3, with two pixels of overlap between neighboring regions. The first step is to determine the classes, or features, that we will search for. We can create another 3x3 grid, turn it into a heat map, and fill in values to visually construct simple patterns. Taking the dot product of this grid (called a filter) with a receptive field, that is, multiplying the two grids element by element and summing the results, will return a single number that represents the amount of cross-correlation between the filter and receptive field, i.e. how similar the filter is to the receptive field. We can choose a variety of different filters to search for different patterns in receptive fields. Remember that these “patterns” belonging to the filters are the building blocks to create more complex patterns, so it is important to make sure we have enough.
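As a minimal sketch of that dot product (plain Python, no libraries): the filter values and the two patches below are made up for illustration, and `filter_response` is a hypothetical helper name rather than anything from a particular library.

```python
def filter_response(kernel, patch):
    """Multiply a 3x3 filter with a 3x3 receptive field element by element and sum the result."""
    return sum(
        kernel[r][c] * patch[r][c]
        for r in range(len(kernel))
        for c in range(len(kernel[0]))
    )

# Hypothetical filter: responds to a bright column on the right side of the patch.
right_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# Two made-up receptive fields (pixel intensities of 0 or 1).
bright_on_right = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
]
bright_on_left = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
]

match = filter_response(right_edge, bright_on_right)     # 3
opposite = filter_response(right_edge, bright_on_left)   # -3
```

A patch that looks like the filter gives a large positive number, and a patch that looks like its mirror image gives a large negative one.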
Task #3 is to apply these filters to every receptive field in an image. Recall that the receptive fields usually overlap. Our receptive fields are 3x3, and they overlap by 2 pixels, meaning that for an image that is 28x28 pixels, the first row will contain 28 - 3 + 1 = 26 receptive fields, since we assume the receptive fields cannot extend beyond the visual field. Recall that each of the 26 3x3 receptive fields will produce a single number as an output when a filter is applied. Applying a filter to all rows, our output will be 26x26. This is very similar to the resolution of the original image; in fact, by choosing the right filter, we can perform tasks such as sharpening or blurring an image.
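The sliding procedure can be sketched in a few lines of plain Python. This is an illustration under the assumptions in the text (step of 1 pixel, filter never leaves the image); `cross_correlate` and the averaging filter are names and values I chose for the example.

```python
def cross_correlate(image, kernel):
    """Slide `kernel` over every position where it fully fits inside `image` (step of 1 pixel)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    feature_map = []
    for top in range(out_rows):
        row = []
        for left in range(out_cols):
            total = 0
            for r in range(k):
                for c in range(k):
                    total += kernel[r][c] * image[top + r][left + c]
            row.append(total)
        feature_map.append(row)
    return feature_map

# A blank 28x28 image is enough to check the output dimensions.
image = [[0] * 28 for _ in range(28)]
# Hypothetical 3x3 averaging ("blur") filter.
blur = [[1 / 9] * 3 for _ in range(3)]

feature_map = cross_correlate(image, blur)
output_shape = (len(feature_map), len(feature_map[0]))   # (26, 26)
```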
Worked example in Excel
Let’s take a look at how applying filters can extract “features” from an image by doing this in Excel. We will start with an image from the MNIST dataset, which is made up of 28x28 images of handwritten digits.
As per our procedure outlined above, we begin by creating a 3x3 filter to map over the image. We will choose the vertical edge detector mentioned previously (it should be noted that this edge detector only detects edges on the right).
Now, we simply take the dot product of this filter with the first 3x3 region appearing in the image, and repeat this process over the entire image.
There are a few observations to be made here. The largest (positive and negative) values are where the edges are very close to perfectly vertical. We can also make horizontal edge detectors, and diagonal line detectors, although horizontal and vertical work best since our receptive fields are squares. This is why (as mentioned previously) the ideal shape of a receptive field is a circle; perfect symmetry means edges and curves of all angles are equally easy to detect.
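The same experiment can be tried outside a spreadsheet. The sketch below uses a toy 5x6 image instead of an MNIST digit, and the filter values are an assumed vertical edge detector, not necessarily the exact one used in the spreadsheet; `apply_filter` is a hypothetical helper.

```python
def apply_filter(image, kernel):
    """Dot product of `kernel` with every region of `image` it fits in (step of 1 pixel)."""
    k = len(kernel)
    return [
        [
            sum(kernel[r][c] * image[top + r][left + c] for r in range(k) for c in range(k))
            for left in range(len(image[0]) - k + 1)
        ]
        for top in range(len(image) - k + 1)
    ]

# Toy 5x6 "image": a bright vertical stroke (1s) on a dark background (0s).
image = [
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
]

# Assumed vertical edge filter: negative weights on the left, positive on the right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = apply_filter(image, vertical_edge)
for row in feature_map:
    print(row)   # each row is [3, 3, -3, -3]
```

Positive values mark a dark-to-bright transition going left to right (the left side of the stroke), and negative values mark the opposite transition on the right side.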
Mathematical Insights (optional)
The term “convolutional” in a CNN comes from the mathematical operation of convolution. The process described above, multiplying a 3x3 filter by every 3x3 sub-region appearing in a much larger grid, is not something the ordinary matrix multiplication used by regular neural networks expresses directly. It is described instead by the convolution operator, an arcane yet incredibly useful mathematical technique used throughout science and engineering, particularly in signal processing and the study of waves.
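For reference, the one-dimensional convolution of two continuous functions f and h is conventionally written as below. The symbols follow the discussion in the next paragraph; the infinite integration limits are the textbook convention, not something this article specifies.

```latex
(f * h)(x) = \int_{-\infty}^{\infty} f(x - \tau)\, h(\tau)\, d\tau
```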
This equation is the definition of one-dimensional convolution for two continuous functions f(x) and h(x). Notice that although they are functions of x, the integral is with respect to τ. If we plot each of the functions under the integral with respect to x and pick an arbitrary value for τ, f(x-τ) will look like f(x) shifted by τ, and h(τ) will look like a vertical line representing a number. If we now vary τ slowly, we will see the vertical line start to slide across the function f(x-τ). If we integrate the product of these functions between some τ1 and τ2, we will get the area of overlap of the two functions between τ1 and τ2. In this sense, the convolution operator computes the area of overlap of two functions as one slides across the other.
Now, let us modify the function to repeat this procedure in the discrete case. We will replace the integral with a sum, and sum over an arbitrary number of pixels in a 1d image. Our function f(x) will be a “discontinuous” function representing the image, and h(x) will be the filter.
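A sketch of the discrete version, assuming a 3-element filter indexed from -1 to 1 so that it is centered on x (the index range is my assumption):

```latex
(f * h)[x] = \sum_{\tau = -1}^{1} f[x - \tau]\, h[\tau]
```

Strictly speaking, the minus sign means the filter is traversed in reverse. CNN libraries usually skip this flip and compute cross-correlation, which is the plain dot product with each region described in this article; for learned filters the distinction makes no practical difference.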
This will compute the dot product of the 3x1 filter with each 3x1 region in the 1d input image. We can expand this to two dimensions for a 3x3 filter and 2d image.
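Under the same assumption of a centered index range, the two-dimensional form for a 3x3 filter h and a 2D image f can be written as:

```latex
(f * h)[x, y] = \sum_{\tau_1 = -1}^{1} \sum_{\tau_2 = -1}^{1} f[x - \tau_1,\, y - \tau_2]\, h[\tau_1, \tau_2]
```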
This operator mimics the “sliding” of our filter over every region where it fits on the image, and returns the feature map.
If you take one thing from this section, let it be this: the convolution operator is associative.
Let’s say our input image is 28x28, and we want to convolve it with a 3x3 filter. Then, convolve the output feature map with another 3x3 filter, and do this one more time to get a final feature map. We will denote the filters as F and feature maps as M.
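Written out, with I as an assumed symbol for the input image (the article itself only names F and M), the three steps are:

```latex
M_1 = I * F_1, \qquad M_2 = M_1 * F_2, \qquad M_3 = M_2 * F_3
```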
The output of the input image and filter F1 will be feature map M1, so we can simply substitute to get this statement only in terms of filters and the input image. The critical idea here is that the convolution operator is associative. This means we can arrange the parentheses in any way we choose and still get a correct statement, including:
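One such regrouping, again using I as an assumed symbol for the input image and Fn for the single combined filter:

```latex
M_3 = \bigl( (I * F_1) * F_2 \bigr) * F_3 = I * \bigl( F_1 * (F_2 * F_3) \bigr) = I * F_n, \qquad F_n = F_1 * F_2 * F_3
```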
This tells us that we do not need multiple convolutional layers, and that we could compute one filter Fn that could be convolved with the input image to yield the output map(s). This is mathematically correct, and problematic in practice. We need to introduce nonlinearity to the output maps in order to fix this issue. If we were to apply a linear function, this would effectively just multiply the feature map by some factor and add a constant, which a single filter could still reproduce. With a nonlinearity between them, stacked layers can no longer be collapsed into one filter, so each layer can yield feature maps that no single filter could produce. The function we choose will have to be called hundreds, thousands, or millions of times; once for each pixel of each output map in each layer. The simpler the function, the more efficient our network. The function we apply to the feature map output by each convolution is called an activation function. Convolutional layers almost exclusively use the activation function known as RELU.
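Here is a small sketch of both points, using 1D signals to keep it short. The signal and filter values are made up, `conv1d_full` is a hypothetical helper, and I use "full" convolution (the filter is allowed to hang off the ends) because associativity holds exactly in that form. RELU is taken to be max(0, x).

```python
def conv1d_full(signal, kernel):
    """Full discrete 1D convolution: out[n] = sum over k of signal[n - k] * kernel[k]."""
    out = [0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for k, w in enumerate(kernel):
            out[i + k] += s * w
    return out

def relu(values):
    """RELU applied to every value: the maximum of 0 and the input."""
    return [max(0, v) for v in values]

image = [1, 4, 2, 0, 3, 5]   # made-up 1D "image"
f1 = [1, 0, -1]              # made-up filters
f2 = [2, -1, 1]

# Two linear layers in a row ...
two_layers = conv1d_full(conv1d_full(image, f1), f2)
# ... give exactly the same result as one layer with the combined filter.
one_layer = conv1d_full(image, conv1d_full(f1, f2))
layers_collapse = two_layers == one_layer        # True

# With RELU between the layers, the combined filter no longer reproduces the output.
with_relu = conv1d_full(relu(conv1d_full(image, f1)), f2)
relu_breaks_collapse = with_relu != one_layer    # True
```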
As you can see, it is very simple indeed, and nonlinear.
More about RELU
Let’s loop back to the example in Excel to illustrate why this barely nonlinear function actually works quite well.
When we measure the cross-correlation of our image and filter, we can get positive values, but also negative values that correspond to the “opposite” of that feature. This is not helpful at all! We want separate filters for each pattern that appears, even if one pattern is the opposite of another pattern. The RELU activation function helps us solve this problem by only showing us positive correlation.
If you are not yet convinced that RELU is the natural choice of activation function following convolutional layers, consider this. Backpropagation, used to optimize neural net parameters, computes an “error” term starting from the output layer, and propagates it backwards to the first hidden layer. The amount that the parameters in layer L-1 are tweaked is proportional to the product of the error in the output layer L with the derivative of the activation function evaluated at layer L-1 (dramatic oversimplification). For many activation functions, the derivative is a fraction smaller than one, and since the error in layer L is also a fraction, the product ends up being smaller still. As we repeat this procedure for more and more layers, this factor that drives parameter adjustments becomes smaller and smaller, and our network is unable to change efficiently. This is called the vanishing gradient problem, and RELU largely avoids it because its derivative is exactly one for every positive input (and zero otherwise)! The parameter adjustments passed back through active units are thus on the same scale in each layer.
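To put rough numbers on this, the sketch below compares the factor contributed by the activation derivative across 10 layers. The sigmoid is my choice of comparison (the article does not name one), 10 layers is arbitrary, and the calculation ignores the weights entirely, so treat it as an illustration rather than a model of real training.

```python
import math

def sigmoid_derivative(x):
    """Derivative of the sigmoid 1 / (1 + e^-x); it peaks at 0.25 when x = 0."""
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)

def relu_derivative(x):
    """Derivative of RELU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

num_layers = 10

# Best case for the sigmoid: every unit sits exactly where the derivative is largest.
sigmoid_factor = sigmoid_derivative(0.0) ** num_layers

# RELU on units that are active (positive input) in every layer.
relu_factor = relu_derivative(2.0) ** num_layers

print(sigmoid_factor)   # 0.25 ** 10, roughly 9.5e-07
print(relu_factor)      # 1.0
```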
Extracting more complex features
The previous section contained an example of computing feature maps when each layer in the network has just one filter. Now, we can extend this to multiple features per layer, and illustrate the mechanism for combining smaller features into more complex features.
Let’s say we have 4 filters in the first convolutional layer, and 8 filters in the second. We discussed in the previous section that the first layer will output 4 feature maps, produced by convolving the input image with each of the 4 filters. For a 28x28 input image, each filter will produce a 26x26 output matrix, and each value is passed through the RELU activation function before being officially deemed a feature map.
If we proceed to the next step with what we know so far, we encounter a dimensional issue. The filters we have been using act on an input image, and output one or more feature maps. In our example, a single image has dimensions (28, 28, 1), and the filter has (3, 3, 1). The extra 1 was included because three dimensions are now needed to represent the output of this first layer. Our output will be of size (26, 26, 4).
Option one is to apply our 8 (3, 3, 1) filters to each of the feature maps, which would yield an output size of (24, 24, 32). The problem with this, however, is that it only constructs new features out of each prior feature individually. For example, if one of the four filters in layer 1 was a horizontal edge detector, and one of the 8 filters in layer 2 was a vertical edge detector, applying them both (one of the 32 possible combinations) will yield virtually nothing. This will not work.
Option two is to increase the dimensions of the filter. If our layer 1 output is (26, 26, 4), we simply expand the filter to (3, 3, 4), and use 8 such filters. This yields output dimensions from layer 2 of (24, 24, 8). The mechanism of these 3 dimensional filters needs almost no mathematical modification, despite the additional dimension. They operate on the same receptive field in each of the four input feature maps at the same time. Consider the example of a color image as the input instead. Our dimensions are now (28, 28, 3) due to the 3 color channels. If we apply a (3, 3, 3) filter, it follows that we should be looking in the same place in every channel each time we slide the filter, with each color channel getting its own 3x3 slice of weights and the results summed into a single number.
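The shape bookkeeping for option two can be checked with a short sketch. Both helper names are hypothetical, the step is assumed to be 1 pixel with no padding, and bias terms are left out of the weight count.

```python
def conv_output_shape(input_shape, filter_size, num_filters):
    """Output (height, width, channels) when each filter spans all input channels."""
    height, width, _ = input_shape
    return (height - filter_size + 1, width - filter_size + 1, num_filters)

def num_weights(input_shape, filter_size, num_filters):
    """Each filter holds filter_size x filter_size x input-channels weights (biases ignored)."""
    channels = input_shape[2]
    return filter_size * filter_size * channels * num_filters

layer1_out = conv_output_shape((28, 28, 1), 3, 4)   # (26, 26, 4)
layer2_out = conv_output_shape(layer1_out, 3, 8)    # (24, 24, 8)
layer2_weights = num_weights(layer1_out, 3, 8)      # 3 * 3 * 4 * 8 = 288
```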
Increasing the dimensions of the filter to include all outputs from the previous layer opens the door to combining multiple simple features from the previous layer. For example, the first 3x3 slice of the (3, 3, 4) filter may be all 1’s, along with the third slice, with zeros everywhere else. This has the effect of combining the first and third input feature maps (ignoring the second and fourth) and calling that a new feature. These could be vertical and horizontal edge detectors, which together detect either edge, as well as corners. More complex features are made simply by various combinations of features in the prior layers.
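A minimal sketch of that example at a single filter position: the (3, 3, 4) filter is stored as four 3x3 slices, and the patch values are invented purely to show which feature maps contribute. `multichannel_response` is a hypothetical helper.

```python
def multichannel_response(filter_slices, patches):
    """Sum, over all channels, of the dot product between each filter slice and the matching patch."""
    return sum(
        w * p
        for filt, patch in zip(filter_slices, patches)
        for filt_row, patch_row in zip(filt, patch)
        for w, p in zip(filt_row, patch_row)
    )

ones = [[1] * 3 for _ in range(3)]
zeros = [[0] * 3 for _ in range(3)]

# Listen to feature maps 1 and 3, ignore maps 2 and 4.
combine_1_and_3 = [ones, zeros, ones, zeros]

# Made-up 3x3 patches taken from the same location in each of the four feature maps.
patches = [
    [[1, 0, 0], [1, 0, 0], [1, 0, 0]],   # map 1: e.g. a vertical edge response
    [[5, 5, 5], [5, 5, 5], [5, 5, 5]],   # map 2: ignored by the filter
    [[0, 0, 0], [0, 0, 0], [2, 2, 2]],   # map 3: e.g. a horizontal edge response
    [[7, 7, 7], [7, 7, 7], [7, 7, 7]],   # map 4: ignored by the filter
]

response = multichannel_response(combine_1_and_3, patches)   # 3 + 6 = 9
```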
We can keep increasing our layers to our heart’s content. In the beginning, we should only aim to detect the most basic of patterns, eventually leading to more complex patterns like circles, squares, parallel lines, and eventually patterns like bird legs, elephant trunks, and socket wrenches.
Extracting larger features
Let us briefly return to the example of applying 3x3 filters sequentially to an input image, so each layer has 1 feature map output. We started by extracting the simplest features possible with the smallest useful filter (a 1x1 filter sees 1 pixel, so no patterns!). We continue to make more complex features and feature maps, but always with a 3x3 filter. When our feature map evolves, the only change it undergoes in dimension is a cropping of one pixel around the perimeter. This means that each additional layer widens the region of the original image that a single output value depends on by only one pixel on each side: 3x3 after the first layer, 5x5 after the second, 7x7 after the third. In our final convolutional layer, the confidence of our predictions will therefore depend on features corresponding to small regions of the original image, although those features will be more complex than a single convolutional layer could produce, since nonlinearities are involved. The problem encountered here is that many of the features that can help us classify our images are much larger than a few pixels across.
We can solve this problem by downsampling. To downsample is to reduce the dimensions of the output feature map, which has the effect of expanding the region of the input image that is incorporated into subsequent feature maps. There are three ways to do this.
Method #1 consists of pooling. Pooling consists of breaking up a feature map into a grid of (usually) 2x2 squares. In each of the squares, the maximum (almost always) is taken, and the rest of the pixels in each box are thrown away. This results in a feature map 1/4 the size (26, 26) -> (13, 13). The maximum is almost always used, because it represents the strongest activation, i.e. the highest cross-correlation between the detected feature and the receptive field. Minimum pooling is never done because of the activation function used (RELU), which takes the maximum of 0 and the input. This results in zeros occurring disproportionately frequently, and “minimum” pooling would throw away the activations.
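A minimal sketch of 2x2 max pooling in plain Python, assuming the feature map dimensions divide evenly by the pool size. The 4x4 feature map is invented, and `max_pool` is a hypothetical helper.

```python
def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + dr][c * size + dc]
                for dr in range(size)
                for dc in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# Made-up 4x4 feature map after RELU (note how many zeros there are).
feature_map = [
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 4, 0],
    [0, 5, 0, 1],
]

pooled = max_pool(feature_map)   # [[2, 3], [5, 4]]
```

Taking the minimum of each block here would return all zeros, which is the loss of activations described above.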
Max pooling is by far the most common method of downsampling. It adds no parameters of its own, and the reduction in feature map dimensions leads to far fewer parameters in the fully connected layers that follow, and a quicker training time.
Method #2 is to increase the stride of the convolutions. When we defined the convolution operator earlier, and applied it to our problem, we specified that it slides the filter by one pixel each time. Instead, we could slide it by two pixels. Applying the filter at every other receptive field cuts the expected dimensions in half. Applying a 3x3 filter to a 28x28 image should give us a 26x26 feature map; cutting each dimension in half gives us a 13x13 feature map.
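The arithmetic can be captured in one line. This sketch assumes no padding, so the filter must fit entirely inside the image; `output_size` is a hypothetical helper.

```python
def output_size(input_size, filter_size, stride=1):
    """Number of positions along one dimension where the filter fits entirely inside the input."""
    return (input_size - filter_size) // stride + 1

stride_1 = output_size(28, 3, stride=1)   # 26
stride_2 = output_size(28, 3, stride=2)   # 13
```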
Illustrated here is a solution (using stride) to the problem presented at the beginning of the section. By downsampling, the region of the input image (receptive field) that is used to compute each point in a feature map becomes larger.
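How quickly the receptive field grows can be estimated with a short recurrence. This is a sketch using the usual bookkeeping for stacked layers (each layer adds filter size minus one, scaled by the product of all earlier strides); the layer stacks below are examples I picked, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(layers):
    """Width, in input pixels, of the region that feeds one value in the final feature map.

    `layers` is a list of (filter_size, stride) pairs, first layer first.
    """
    size = 1   # a single value starts out covering one input pixel
    jump = 1   # distance in input pixels between neighboring values at the current layer
    for filter_size, stride in layers:
        size += (filter_size - 1) * jump
        jump *= stride
    return size

three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])     # 7
strided_convs = receptive_field([(3, 2), (3, 2), (3, 1)])   # 15
```

With the same three 3x3 filters, using a stride of 2 in the first two layers roughly doubles the width of the input region seen by each final value.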
This method works to some degree, but has nontrivial drawbacks. In nature, we find that receptive fields do overlap. As discussed extensively in neuroscience literature, the reason for this has to do with spatial correlation. A very small region in a natural image will tend to be somewhat correlated with the regions immediately surrounding it. Think of looking directly at some object, and then shifting your focus ever so slightly in one direction. Are you still looking at the same thing with the same visual properties? Probably, yes. This is spatial correlation.
If we choose a high stride such that there is no overlap between receptive fields, we are saying that one receptive field should not care what is going on right next to it. Imagine putting together a jigsaw puzzle of a landscape where all the pieces are perfect squares. Certainly you could still draw similar conclusions about the patterns within each piece, but the additional information about surrounding pieces offered by the familiar puzzle piece shape makes it much clearer how smaller patterns can be arranged into larger ones.
If our receptive fields overlap, we “smooth” our information. Instead of sharply transitioning from one receptive field to another, we have a “midpoint” that tells us what happens between receptive fields via another receptive field. This additional information makes it easier to piece together smaller patterns to create larger ones. The drawback here is that additional information means additional computations. We find in nature that there is a trade-off between the accuracy gained from recognizing spatial correlation and the redundancy it introduces.
Studies vary widely by methods and calculations, but any random point in the visual field may be contained in 3 to 7 receptive fields. For a 3x3 filter taking steps of 1 pixel, a random pixel far from the edges will be contained in 9 receptive fields. Taking steps of 2, that number drops to at most 4 (a pixel is covered by 1, 2, or 4 receptive fields depending on its position, about 2.25 on average). The size of the step you take in terms of number of pixels is referred to as stride in the literature. Recall that a stride of 1 will return a feature map of similar dimensions to your original image, and a stride of 2 will halve the dimensions. I have found (experimentally) that although a stride of 2 for a 3x3 filter produces overlap more similar to what is observed in nature, a stride of 1 followed by a pooling layer returns better accuracy when downsampling is needed. This is because the trade-off between accuracy and efficiency is not the same for computers.
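These coverage counts are easy to verify by brute force. The sketch below counts, for every pixel of a 28x28 image, how many filter positions contain it; `coverage_counts` is a hypothetical helper, and "far from the edges" is taken to mean at least 5 pixels in from the border.

```python
def coverage_counts(image_size, filter_size, stride):
    """For each pixel, count how many receptive fields (filter positions) contain it."""
    counts = [[0] * image_size for _ in range(image_size)]
    starts = range(0, image_size - filter_size + 1, stride)
    for top in starts:
        for left in starts:
            for r in range(top, top + filter_size):
                for c in range(left, left + filter_size):
                    counts[r][c] += 1
    return counts

stride_1 = coverage_counts(28, 3, 1)
stride_2 = coverage_counts(28, 3, 2)

# Look only at pixels well away from the border.
interior = range(5, 23)
stride_1_values = {stride_1[r][c] for r in interior for c in interior}   # {9}
stride_2_values = {stride_2[r][c] for r in interior for c in interior}   # {1, 2, 4}
```

With a stride of 2, only pixels on even rows and even columns land in 4 receptive fields; the rest land in 2 or 1.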
Method #3 is to increase the size of your filter, and stride. If you increase the size of your filter, a larger stride is actually necessary to reduce the amount of overlap. I find that for every two pixel increase in filter dimension (3x3 -> 5x5), it is good practice to increase stride by one. As I mentioned, stride 1 is generally better for a 3x3 filter, so this implies a stride of 2 for a 5x5 filter. I have observed that following this rule is an excellent starting point when the image dimensions are large. For large images, more downsampling is needed sooner, and it may take longer to form more complex patterns when only using 3x3 filters. Scaling filter size (to a certain extent) to work with larger images (I would advise against filters larger than 7x7) is marginally better than max pooling.
Max pooling is generally the go-to for downsampling, and experimentally has better results than increasing stride.
Increasing filter size and stride is a good option in the initial layers when the image size is large, but otherwise, probably not a good idea.
Downsampling is important to increase the size of the region of the input image that determines the output. We want our features to be computed from larger regions to identify larger features. Even if the image is small, incorporating at least one downsampling procedure will usually improve results.
Last but not least, we can use the features extracted with successive convolutional layers to make predictions. Typically, the feature maps in the last layer are flattened to individual nodes, and those nodes are treated as a dense layer. This dense layer is the “input” layer to a regular neural network, which will give us much more familiar outputs. We typically want small feature map dimensions in the final convolutional layer considering each point will become a node, and fully connected layers have a lot of parameters.
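To see why small final feature maps matter, here is a sketch of the flattening step with invented sizes: a (5, 5, 8) final output and a 10-class dense layer are hypothetical numbers chosen for the example, and biases are left out of the weight count.

```python
def flatten(feature_maps):
    """Turn a (height, width, channels) nested list into one flat list of nodes."""
    return [value for row in feature_maps for pixel in row for value in pixel]

# Hypothetical final convolutional output of shape (5, 5, 8), filled with zeros.
height, width, channels = 5, 5, 8
final_maps = [[[0.0] * channels for _ in range(width)] for _ in range(height)]

nodes = flatten(final_maps)
num_nodes = len(nodes)                    # 5 * 5 * 8 = 200

# Weights needed to connect those nodes to a hypothetical 10-class output layer.
num_classes = 10
dense_weights = num_nodes * num_classes   # 2000
```

Had the final feature maps stayed at 24x24 with the same 8 channels, the same dense layer would need 24 * 24 * 8 * 10 = 46,080 weights.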
Earland, Kevin, et al. “Overlapping Structures in Sensory-Motor Mappings.” PLoS ONE, vol. 9, no. 1, 2014, doi:10.1371/journal.pone.0084240.
Dettmers, Tim. Understanding Convolution in Deep Learning. 26 Mar. 2015, timdettmers.com/2015/03/26/convolution-deep-learning/.
“Convolution.” Convolution — an Overview, ScienceDirect, www.sciencedirect.com/topics/mathematics/convolution.