A Summary of Neural Network Layers

Li Yin
Machine Learning for Li
13 min read · Jul 22, 2018

A core step in learning and applying neural networks to real projects is to understand the different neural network layers: the various convolution layers, pooling layers, deconvolution layers, fully connected layers, and fully connected convolution layers. To understand them, we not only need to understand the visualized operations; it is also essential to understand the math behind them. With that knowledge as a base, we can further analyze the advantages and disadvantages of each type of layer.

Any neural network in the field of computer vision applied to 2-dimensional images is composed of two parts: layers that function as a feature extractor, and layers that are task-specific, for either classification or segmentation. The feature extractor usually includes convolution layers and pooling layers. As for the task-specific layers: (1) for a classification task, a fully connected layer is usually attached behind the feature extractor to convert the 2-dimensional feature maps into a feature vector; (2) for a segmentation task, the fully connected layer is usually replaced by a fully connected convolution layer in order to retain more location-specific information.

Then a specific activation function and cost/loss function are applied. For multi-class classification/segmentation, cross entropy is used; we will talk more about loss functions in the next post. Now, let us focus on understanding the different neural network layers.

Convolution Layer

The role of a convolution layer is to detect local features at different positions of the input feature maps with learnable “filters”/“kernels”.

Convolution Operation

For image processing, the convolution operator is typically denoted with an asterisk:

s(t) = (x*w) (t),

where x(t) is referred to as the input and w as the weights of the kernel. Intuitively, it operates by adding up neighboring entries of a matrix, weighted by the kernel. This operation has the same form as conventional filters: the Gaussian filter, bilateral filtering, and so on.

The convolution operation is expressed mathematically as:

s(i, j) = (x*w)(i, j) = Σ_m Σ_n x(i+m, j+n) · w(m, n)

Equation 1: convolution operation with stride 1 (m and n range over the kernel offsets, e.g. −2, …, 2 for a 5×5 kernel); with suitable padding, the input x and output s will have the same size volume.
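To make Eq. 1 concrete, here is a minimal single-channel NumPy sketch with stride 1 and zero-padding (indices run from the top-left corner of each window rather than its center; this cross-correlation form is what deep learning frameworks actually compute):

import numpy as np

def conv2d_single_channel(x, w, pad=0):
    # Eq. 1 for one channel: each output entry is the weighted sum of a neighborhood of x
    k_h, k_w = w.shape
    x = np.pad(x, pad)                                      # zero-padding, discussed below
    out_h, out_w = x.shape[0] - k_h + 1, x.shape[1] - k_w + 1
    s = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            s[i, j] = np.sum(x[i:i + k_h, j:j + k_w] * w)   # the double sum over m, n
    return s

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0                                   # a 3x3 box (mean) filter
print(conv2d_single_channel(x, w, pad=1).shape)             # (5, 5): same spatial size as the input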

There are three hyperparameters that decide the spatial size of the output feature map:

  • Stride (S) is the step by which we slide the filter. When the stride is 1, we move the filters one pixel at a time. When the stride is 2 (3 or more is rare in practice), the filters jump 2 pixels at a time as we slide them around. Larger strides produce spatially smaller output volumes.
  • Padding (P): As we see from Eq. 1, when i = 0, j = 0 and m = −2, n = −2, we would need the input position with index (−2, −2), which does not exist. Most commonly, zero-padding is used to fill these locations. In neural network frameworks (Caffe, TensorFlow, PyTorch, MXNet), the size of this zero-padding is a hyperparameter. The size of the zero-padding can also be used to control the spatial size of the output volume.
  • Depth (D): The depth of the output volume is a hyperparameter too; it corresponds to the number of filters we use in a convolution layer. In Eq. 1, the depth is 1, and the input has a single channel as well.

Deciding the size of the output

Given W as the width of the input, F as the width of the filter, P as the padding, and S as the stride, the output width will be:

(W + 2P-F)/S+1

Generally, setting P=(F−1)/2 when the stride is S=1 ensures that the input volume and the output volume have the same spatial size.

Invalid configuration: when the input has size W=10, no zero-padding is used (P=0), and the filter size is F=3, then it is impossible to use stride S=2, since (W−F+2P)/S+1=(10−3+0)/2+1=4.5, i.e. not an integer, indicating that the neurons don’t “fit” neatly and symmetrically across the input.
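A small Python helper makes this rule easy to check for both cases above:

def conv_output_size(w, f, p, s):
    # output width for input width w, filter width f, padding p, stride s
    size = (w + 2 * p - f) / s + 1
    if not size.is_integer():
        raise ValueError(f"invalid configuration: ({w} + 2*{p} - {f})/{s} + 1 = {size}")
    return int(size)

print(conv_output_size(10, 3, 1, 1))  # 10: P = (F-1)/2 with S = 1 preserves the width
print(conv_output_size(10, 3, 0, 2))  # raises ValueError: 4.5 is not an integer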

Dilated convolutions

A recent development (e.g. see the paper by Fisher Yu and Vladlen Koltun) is to introduce one more hyperparameter to the CONV layer called the dilation. So far we’ve only discussed CONV filters that are contiguous. However, it is possible to have filters that have spaces between each cell, called dilation. As an example, in one dimension a filter w of size 3 would compute over input x the following: w[0]*x[0] + w[1]*x[1] + w[2]*x[2]. This is dilation of 0. For dilation 1 the filter would instead compute w[0]*x[0] + w[1]*x[2] + w[2]*x[4]; in other words, there is a gap of 1 between the applications. This can be very useful in some settings in conjunction with 0-dilated filters, because it allows you to merge spatial information across the inputs much more aggressively with fewer layers. For example, if you stack two 3x3 CONV layers on top of each other, you can convince yourself that the neurons on the 2nd layer are a function of a 5x5 patch of the input (we would say that the effective receptive field of these neurons is 5x5). If we use dilated convolutions, this effective receptive field grows much quicker.
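In PyTorch, this hyperparameter is the dilation argument of nn.Conv2d. Note the naming difference: PyTorch’s dilation=1 is the contiguous filter called “dilation 0” above, and dilation=2 leaves a gap of 1. A quick sketch:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# ordinary contiguous 3x3 filter ("dilation 0" above)
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1, dilation=1)
# dilated 3x3 filter with a gap of 1 ("dilation 1" above); it covers a 5x5 region of the input
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape, dilated(x).shape)  # both torch.Size([1, 16, 32, 32])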

Now, let us see different types of convolution layer:

Standard convolution layer:

Normal convolution process: first do channel-by-channel convolution to get as many feature maps as there are input channels, then sum the different channels together to get one output channel.

Normally, given an input of size W×H×D1, the filters have size F1×F2×D1×D2. As shown above, each channel of the input is first convolved with the corresponding channel of the i-th set of filters; this gives a set of feature maps of the same size, and the number of maps in this set equals the number of input channels D1. We then sum up this set of feature maps across all channels and get one output feature map. Repeating this for D2 sets of filters gives D2 output channels.

number of parameters=(F1×F2×D1+1)×D2

We can see that in this process, summing across the different channels mixes the signals between channels. There is another convolution layer, the depthwise separable convolution layer, which does not perform this summation.

Depthwise Separable Convolution Layer[1]

In a depthwise separable convolution layer, instead of having D2 sets of F1×F2×D1 filters, only one set of F1×F2×D1 is used. As in a normal convolution, we get D1 feature maps of size W2×H2, but instead of summing them up into one channel, we apply a 1×1 convolution with D1 input channels and D2 output channels to this D1×W2×H2 volume, which gives an output of size D2×W2×H2. These two steps are summarized as:

  • Depthwise convolution, i.e. a spatial convolution performed independently over each channel of an input.
  • Pointwise convolution, i.e. a 1x1 convolution, projecting the channels output by the depthwise convolution onto a new channel space.

Instead, the number of parameters is:

number of parameters=(F1×F2+1)×D1+(1×1×D1+1)×D2

Now, using PyTorch, let us compare how we define a normal convolution and a depthwise separable convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)  # example input with 16 channels

# normal convolution: 16 input channels, 32 output channels
normal_layer = nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2)

# depthwise separable convolution:
# depthwise step: one 5x5 filter per input channel (groups=16 keeps the channels separate)
depthwise_layer = nn.Conv2d(16, 16, kernel_size=5, stride=1, padding=2, groups=16)
# pointwise step: a 1x1 convolution projecting the 16 channels onto 32 output channels
pointwise_layer = nn.Conv2d(16, 32, kernel_size=1, stride=1, padding=0)
out = pointwise_layer(depthwise_layer(x))
A comparison between standard convolution and depthwise convolution: https://ikhlestov.github.io/pages/machine-learning/convolutions-types/
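As a quick check, we can count the trainable parameters of the layers defined above and compare them with the two formulas:

def n_params(*layers):
    return sum(p.numel() for layer in layers for p in layer.parameters())

print(n_params(normal_layer))                      # 12832 = (5*5*16 + 1) * 32
print(n_params(depthwise_layer, pointwise_layer))  # 960   = (5*5 + 1)*16 + (16 + 1)*32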

Flattened Convolutions[2]

Spatial and Cross-Channel convolutions

Pooling Layer

Nowadays, a CNN always exploits extensive weight sharing to reduce the degrees of freedom of the model. A pooling layer helps reduce computation time and gradually builds up spatial and configural invariance. For image understanding, the pooling layer helps extract more semantic meaning.
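For example, a 2×2 max pooling with stride 2 halves each spatial dimension while leaving the number of channels unchanged (a minimal PyTorch sketch):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)               # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)                         # torch.Size([1, 64, 16, 16]): spatial size halved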

Good to know

Now that you know how a convolutional layer works, it’s time to cover some useful details:

  • Number of parameters: When you are designing your network, the number of trainable parameters matters significantly. Therefore, it is good to know how many parameters your convolutional layer will add to your network. What you train in a convolutional layer are its filters and biases. You can easily calculate its number of parameters using the following equation:

number of parameters=(Fw×Fh×di+1)×do

where di and do are the depths (number of channels) of the input and the output, respectively, and Fw and Fh are the filter width and height. Note that the +1 inside the parentheses counts the biases.

  • Locally-Connected Layer: This type of layer is quite similar to the convolutional layer explained in this post, with one (important) difference. In the convolutional layer the filter is common among all output neurons (pixels); in other words, a single filter is used to calculate all neurons (pixels) of an output channel. In a locally-connected layer, however, each neuron (pixel) has its own filter, which means the number of parameters is multiplied by the number of output neurons. This can drastically increase the number of parameters, and if you do not have enough data, you might end up with an over-fitting issue. On the other hand, this type of layer lets your network learn different types of features for different regions of the input. Researchers have benefited from this property of locally-connected layers especially in face verification, e.g. DeepFace and DeepID3. Still, some other researchers use a distinct filter for each region of the input instead of each neuron (pixel), to get the benefit of locally-connected layers with fewer parameters.
  • Convolution layers with 1X1 filter size: Even though a 1X1 filter does not make sense at first glance from an image processing point of view, it can help by adding nonlinearity to your network. In fact, a 1X1 filter calculates a linear combination of all corresponding pixels (neurons) of the input channels and outputs the result through an activation function, which adds the nonlinearity (see the sketch below).
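As a small illustration of the last bullet, a 1×1 convolution is often used to change the channel dimension cheaply while mixing information across channels at every pixel:

import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)
reduce = nn.Conv2d(256, 64, kernel_size=1)  # per-pixel linear combination of the 256 channels
y = torch.relu(reduce(x))                   # activation adds the nonlinearity
print(y.shape)                              # torch.Size([1, 64, 28, 28])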

Normal convolution:

If we have 3 input channels (I = 3) and want 5 output channels (O = 5), then there are 5 filters, each with 3 channels. We can represent them as W with size O×I. So for each input channel I_i, we can do convolution simultaneously with W1_i, W2_i, W3_i, W4_i, and W5_i. Different input channels do not share weights between them.

The number of parameters is (Fw×Fh×di+1)×do.

Deconvolution

Deformable Convolution (Notes from here)

  • Deformable convolution consists of 2 parts: a regular conv. layer and another conv. layer that learns 2D offsets for each input. In this diagram, the regular conv. layer is fed the blue squares instead of the green squares (a code sketch follows below this list).
  • If you are confused (like I was), you can think of deformable convolution as a “learnable” dilated (atrous) convolution in which the dilation rate is learned and can be different for each input. Section 3 is a great read if you’d like to learn more about the relationship of deformable convolution to other techniques.
  • Since the offsets are not integers (they are fractional), bilinear interpolation is used to sample from the input feature map. The authors point out that this can be computed efficiently (see Table 4 for forward-pass time).
  • The 2D offsets are encoded in the channel dimension (e.g. a conv. layer of n channels is paired with an offset conv. layer of 2n channels).
  • Note that the offsets are initialized to 0, and the learning rate for these offset layers is not necessarily the same as for the regular convolution layers (but they are the same by default in this paper).
  • The authors empirically show that deformable convolution is able to “expand” the receptive field for bigger objects. They measure the “effective dilation”, which is the mean distance between the offsets (i.e. the blue squares in Fig. 2). They found that deformable filters centered on larger objects have a larger “receptive field”. See below.
From Fig. 5: red dots are the sampling locations (from the learned offsets) of a deformable convolution filter, and green squares are the corresponding outputs. Filters on larger objects have a larger receptive field.
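To experiment with the idea, torchvision ships a DeformConv2d op. Below is a minimal sketch (assuming a torchvision version that provides torchvision.ops.DeformConv2d, not the paper’s original code), in which a regular convolution predicts the 2·K·K offsets per output position, initialized to zero as noted above:

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    # a conv layer predicts one (dy, dx) offset per kernel position,
    # and the deformable conv samples the input at those shifted locations
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)  # offsets start at 0: a regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_conv(x)            # (N, 2*k*k, H, W)
        return self.deform_conv(x, offsets)

x = torch.randn(1, 16, 32, 32)
print(DeformableConvBlock(16, 32)(x).shape)      # torch.Size([1, 32, 32, 32])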

Deformable ROI Pooling

  • Deformable RoI pooling also consists of 2 parts: a regular RoI pooling layer and another fully connected layer that learns the offsets.
  • Instead of predicting the raw offsets (in pixels), the offsets are normalized (i.e. divided) by the width and height of the RoI region so that they are invariant to the RoI size.
  • There is a curious constant scalar gamma which further scales the normalized offsets. (?)

Depthwise Separable Convolution (More can be found here)

In neural networks, we commonly use something called a depthwise separable convolution. This performs a spatial convolution while keeping the channels separate, and then follows with a pointwise (1x1) convolution. In my opinion, it is best understood with an example.

Let’s say we have a 3x3 convolutional layer with 16 input channels and 32 output channels. What happens in detail is that every one of the 16 channels is traversed by 32 3x3 kernels, resulting in 512 (16x32) feature maps. Next, we merge one feature map out of every input channel by adding them up. Since we can do that 32 times, we get the 32 output channels we wanted.

For a depthwise separable convolution on the same example, we traverse the 16 channels with one 3x3 kernel each, giving us 16 feature maps. Now, before merging anything, we traverse these 16 feature maps with 32 1x1 convolutions each and only then add them together. This results in 656 (16x3x3 + 16x32x1x1) parameters, as opposed to the 4608 (16x32x3x3) parameters from above.

The example is a specific implementation of a depthwise separable convolution where the so called depth multiplier is 1. This is by far the most common setup for such layers.

We do this because of the hypothesis that spatial and depthwise (cross-channel) information can be decoupled. Looking at the performance of the Xception model, this hypothesis seems to hold. Depthwise separable convolutions are also used on mobile devices because of their efficient use of parameters.

Conventional Convolution

A KxK convolution with stride S is the usual sliding window operation, but at every step you move the window by S elements. The elements in the window are always adjacent elements in the input matrix. For S=1, you have the standard convolution. For S>1 you obtain a down-sampling effect. You can also generalize this operation to 0<S<1 (fractionally strided convolution); in this case, you obtain an up-sampling effect.
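In PyTorch these three cases correspond to stride=1, stride>1, and, for the fractionally strided (up-sampling) case, a transposed convolution; a quick sketch of the resulting shapes:

import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)

same = nn.Conv2d(8, 8, kernel_size=3, stride=1, padding=1)  # S = 1: size preserved
down = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)  # S = 2: down-sampling
up = nn.ConvTranspose2d(8, 8, kernel_size=3, stride=2,      # fractional stride: up-sampling
                        padding=1, output_padding=1)

print(same(x).shape, down(x).shape, up(x).shape)
# torch.Size([1, 8, 16, 16]) torch.Size([1, 8, 8, 8]) torch.Size([1, 8, 32, 32])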

This convolution is translation invariant (invariant to location); see details.

Parameter sharing:

Cross-Correlation

Dilated Convolution (also called Atrous Convolution)

Explanation of Convolution

A D-dilated KxK convolution is different. It is even called “convolution with a dilated filter”, because it is equivalent to dilating the filter before doing the usual convolution. Dilating the filter means expanding its size and filling the empty positions with zeros. In practice, no expanded filter is created; instead, the filter elements (the weights) are matched to distant (not adjacent) elements in the input matrix. The distance is determined by the dilation coefficient D. The image below shows how the kernel elements are matched to input elements in a D-dilated 3x3 convolution.

In one dimension, a convolution with dilation coefficient l (the D above) computes s(t) = Σ_τ x(t + l·τ) · w(τ). For plain old convolution this is the case l = 1; in the dilated convolution the kernel only touches the signal at every l-th entry. This formula applies to a 1D signal, but it can be straightforwardly extended to 2D convolutions.
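A direct (if naive) Python sketch of this 1-D formula, with l = 1 recovering the plain convolution:

import numpy as np

def dilated_conv1d(x, w, l=1):
    # s(t) = sum over tau of x(t + l*tau) * w(tau)
    span = l * (len(w) - 1) + 1                 # extent of the dilated kernel
    return np.array([
        sum(x[t + l * tau] * w[tau] for tau in range(len(w)))
        for t in range(len(x) - span + 1)
    ])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, w, l=1))  # plain convolution: sums of 3 adjacent entries
print(dilated_conv1d(x, w, l=2))  # dilated: x[t] + x[t+2] + x[t+4], every 2nd entry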

Kronecker Product

Skip Connection

ResNet and its constituent residual blocks draw their names from the ‘residual’ — the difference between the predicted and target values. The authors of ResNet used residual learning of the form H(x) = F(x) + x. Simply, this means that even in the case of no residual, F(x)=0, we would still preserve an identity mapping of the input, x. The resulting learned residual allows our network to theoretically do no worse (than without it).

Residual connections and skip connections are used interchangeably. These types of connections can skip multiple layers (see page 4 of the original ResNet paper), not just one. In short, residual connections are used to make deeper networks easier to optimize.[1]
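A minimal sketch of a residual block in the H(x) = F(x) + x form (here F is two 3x3 convolutions with batch norm, my choice for illustration, and the input and output shapes are assumed to match so the identity can be added directly):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions; the skip connection adds the input x back afterwards
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)  # H(x) = F(x) + x, then the nonlinearity

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)         # torch.Size([1, 64, 56, 56])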

Peephole Connections

Peephole connections redirect the cell state as input to the LSTM input, output, and forget gates. You can explore these in detail by reading the original papers from Professor Felix Gers [2][3]. These connections are used to learn precise timings.
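Below is a minimal sketch of a single peephole LSTM cell step (my own illustration, not Gers’ original code): the input and forget gates peek at the previous cell state, and the output gate peeks at the updated cell state, through element-wise peephole weights:

import torch
import torch.nn as nn

class PeepholeLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.x2gates = nn.Linear(input_size, 4 * hidden_size)
        self.h2gates = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        # peephole weights: element-wise (diagonal) connections from the cell state to the gates
        self.w_ci = nn.Parameter(torch.zeros(hidden_size))
        self.w_cf = nn.Parameter(torch.zeros(hidden_size))
        self.w_co = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x, state):
        h, c = state
        gi, gf, gg, go = (self.x2gates(x) + self.h2gates(h)).chunk(4, dim=-1)
        i = torch.sigmoid(gi + self.w_ci * c)      # input gate peeks at the previous cell state
        f = torch.sigmoid(gf + self.w_cf * c)      # forget gate peeks at the previous cell state
        g = torch.tanh(gg)
        c_new = f * c + i * g
        o = torch.sigmoid(go + self.w_co * c_new)  # output gate peeks at the updated cell state
        h_new = o * torch.tanh(c_new)
        return h_new, c_new

x = torch.randn(2, 10)
h = c = torch.zeros(2, 20)
h, c = PeepholeLSTMCell(10, 20)(x, (h, c))
print(h.shape, c.shape)                            # torch.Size([2, 20]) torch.Size([2, 20])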

Reference:

[1] Kaiser, Lukasz, Aidan N. Gomez, and Francois Chollet. “Depthwise separable convolutions for neural machine translation.” arXiv preprint arXiv:1706.03059 (2017).

[2] Jin, Jonghoon, Aysegul Dundar, and Eugenio Culurciello. “Flattened convolutional neural networks for feedforward acceleration.” arXiv preprint arXiv:1412.5474 (2014).
