Introduction to Convolutional Neural Networks
Neural networks draw their inspiration from the biological neuron, the fundamental unit of the nervous system of living beings. The first attempt to model the biological neuron was the Perceptron, a linear model for binary classification. It consists of inputs with associated weights; the weighted inputs are summed and passed to a step function with a definite threshold, typically a Heaviside step function with a threshold of 0.5. The net input to a neuron is the sum of the weights on the incoming connections multiplied by the activations arriving on those connections, with a bias term added in every layer to account for bias. The final output of the neuron is the value of the activation function applied to the net input. [1]
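The perceptron described above can be sketched in a few lines of NumPy. The weights, inputs, and bias below are hypothetical values chosen only for illustration; the threshold of 0.5 matches the Heaviside step mentioned in the text.

```python
import numpy as np

def perceptron(x, w, b, threshold=0.5):
    """Single perceptron: weighted sum of inputs plus bias,
    passed through a Heaviside step at `threshold`."""
    net = np.dot(w, x) + b           # net input: weights * activations + bias
    return 1 if net >= threshold else 0

# Illustrative inputs, weights, and bias (hypothetical values)
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.4, 0.3, 0.2])
b = 0.1
print(perceptron(x, w, b))  # net = 0.4 + 0.2 + 0.1 = 0.7 >= 0.5, so output is 1
```

Because the decision is a single linear threshold, such a unit can only separate classes that are linearly separable, which is what motivates stacking units into deeper networks.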
The Convolutional Neural Network (CNN) is an architecture suited to two-dimensional array data and finds inspiration from its biological counterpart [2]: the architecture involves processing units with identical weight vectors and local receptive fields arranged in a spatial array. Its hierarchical architecture encompasses alternating convolution and subsampling layers, which are analogous to the simple and complex cells in the primary visual cortex [3]. CNNs perform mappings between spatially / temporally distributed arrays in arbitrary dimensions and are generally characterized by the following constraints [1]:
- Translation invariance: spatial translation has no effect on the neural weights
- Local connectivity: connections exist only between nodes located in spatially local regions
- A progressive decrease in spatial resolution: the spatial resolution falls gradually as the number of feature maps increases
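The first two constraints can be illustrated with a small sketch: a single shared kernel slides over every local receptive field, and translating the input translates the resulting feature map by the same amount. The image and kernel values below are toy assumptions for demonstration only.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2-D cross-correlation: one shared kernel applied to
    every local receptive field (shared weights + local connectivity)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy image with a single bright pixel, and a copy shifted right by one
img = np.zeros((5, 5))
img[1, 1] = 1.0
shifted = np.roll(img, 1, axis=1)
k = np.ones((2, 2))                  # hypothetical shared kernel

a = conv2d_valid(img, k)
b = conv2d_valid(shifted, k)
# The feature map shifts by the same amount as the input
assert np.allclose(np.roll(a, 1, axis=1), b)
```

Strictly speaking the convolution itself is translation *equivariant* (the output shifts with the input); the invariance stated above emerges once pooling discards the exact positions.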
A classic CNN comprises alternating layers of convolution and pooling. The convolution layers extract patterns located in particular regions of the image: the inner product of a convolving filter with every region of the image yields a feature map, which is passed through a non-linear function to generate activations that are further processed by the pooling layer. The most commonly used pooling functions are average and max-pooling, which select the arithmetic mean and the maximum of the elements in a pooling region, respectively. The alternating convolution and pooling layers extract varied features at each stage. The non-linear function can be chosen as tanh, logistic, softmax, or ReLU. The final layer is a fully connected layer which outputs one unit per class in a recognition task [3].
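One convolution → non-linearity → pooling stage, as described above, can be sketched as follows. The 6×6 input and the edge-style kernel are hypothetical values chosen only to make the shapes concrete; ReLU is used as the non-linearity and max-pooling over non-overlapping 2×2 regions as the subsampling step.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2-D cross-correlation of an image with a filter."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Max-pooling: keep the maximum of each non-overlapping
    size x size region (excess rows/columns are cropped)."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size
    return fmap[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

# Toy 6x6 input and a hypothetical horizontal-difference filter
img = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

fmap = np.maximum(0.0, conv2d_valid(img, kernel))  # convolution + ReLU
pooled = max_pool(fmap)                            # 5x5 map -> 2x2 after pooling
print(pooled.shape)  # (2, 2)
```

Stacking several such stages, followed by a fully connected layer, yields the classic CNN pipeline the paragraph describes; deeper stages operate on progressively smaller maps with more filters.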
[1] Tan, Y.H. and Chan, C.S., 2019. Phrase-based image caption generator with hierarchical LSTM network. Neurocomputing, 333, pp.86–100.
[2] Patterson, J., 2017. Deep Learning. 1st ed. O'Reilly Media, Inc.
[3] Boden, M., 2002. A guide to recurrent neural networks and backpropagation. The Dallas project.