Basics of Artificial Neural Network and Convolutional Neural Network

B.Thushar Marvel
11 min read · Apr 9, 2022


Artificial neural network

A Neural Network, also called an Artificial Neural Network (ANN), is the basic building block of deep learning models.

Neural networks loosely imitate the working of the human nervous system. In simple terms, an artificial neural network is composed of interconnected neurons, much like the neurons in a human being's nervous system. Each neuron in the network is connected to multiple adjacent neurons, which allows information to flow from one neuron to another.

These networks can work with noisy or incomplete data and help extract the relevant features from training data. A neural network is organised into layers, which are generally divided into three regions.

Input Layer: This is the layer where the training data is fed into the network.

Hidden Layers: These are the middle layers, which extract increasingly complex relations within the data. Adding more hidden layers helps capture more complex relations, but also increases the computational power required.

Output Layer: This is the final layer of the network. In classification, the number of neurons in this layer is equal to the number of classes.

The image below shows the working of a single neuron.

Here x1, x2, x3 are the inputs to the neuron, w1, w2, w3 are the weights for those inputs, b is the bias, and a is the output of the neuron.

Here f is the activation function, used to introduce a nonlinear relationship between input and output.

The output a is the activation function applied to the weighted sum of the inputs plus the bias: a = f(w1*x1 + w2*x2 + w3*x3 + b).

Activation functions can be of many types: logistic (sigmoid), exponential, linear, ReLU, etc.
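As a minimal sketch of this computation (plain NumPy, with example input values and a ReLU activation chosen purely for illustration):

```python
import numpy as np

def relu(z):
    # ReLU activation: max(z, 0)
    return np.maximum(z, 0.0)

def neuron_output(x, w, b, f=relu):
    # Weighted sum of inputs plus bias, passed through the activation f
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x1, x2, x3 (illustrative values)
w = np.array([0.8, 0.1, -0.4])   # weights w1, w2, w3
b = 0.2                          # bias
a = neuron_output(x, w, b)       # a = f(w.x + b)
print(a)
```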

As discussed above, the inputs to an ANN are nodes/neurons. In image classification or detection, each pixel of the image becomes an input node to the ANN. Even a modest image of size 224x224x3, when flattened into one dimension, produces an input vector of 150,528 values. An input vector this large, densely connected to the following layers, requires an enormous number of parameters, so passing the entire raw image into an ANN demands high computational power and is time consuming.
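A quick back-of-the-envelope calculation illustrates the problem (the first-hidden-layer size of 1,000 is an assumption made only for this example):

```python
height, width, channels = 224, 224, 3
input_vector = height * width * channels              # 150,528 input values
hidden_units = 1_000                                  # assumed size of the first dense layer
params = input_vector * hidden_units + hidden_units   # weights + biases
print(input_vector, params)                           # 150528, ~150.5 million parameters
```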

This is where CNNs come in. A CNN extracts only the relevant features from the images and uses them to classify each image. To train it, we take a collection of images and manually label each one according to the objects in it, for example trees, cars, signals, people, footpaths, trucks, cycles and animals, and later use this labelled data to train our CNN.

Convolutional Neural Network (CNN)

A Convolutional Neural Network is a deep learning algorithm built on the concepts of the ANN. It takes input data and learns the features on its own. It is a spatial kind of neural network that performs convolution operations on the input data, and convolution works well for many kinds of 2D data (images, 2D audio spectrograms, and so on). CNNs are currently the state-of-the-art technique in image classification research. Input images for a CNN have three dimensions: height, width and number of channels. The first two dimensions give the image resolution and the third represents the number of channels (RGB: intensity values for red, green and blue).

During training, the CNN learns features from the images by means of randomly initialized parameters. These parameters are updated during training to reduce the loss function. When a new image is then fed into the network, it can identify such objects and, for example in a driving scenario, take decisions according to the objects identified in the scene.

A CNN consists of:

∙ Convolution layer

∙ Activation layer (e.g. ReLU)

∙ Pooling layer

∙ Dropout layer

∙ Fully connected layer

Convolution Layer:

This is the core layer of a CNN. It is a repeated application of a filter over the image, and the resulting map of activations is called a feature map. A filter, or kernel, of known size is slid over the image both horizontally and vertically in fixed steps called strides. At each position, the element-wise product of the filter with the patch of the image it covers is computed, and the sum of all values of this product is written to the corresponding position in the convolved feature map. Thus, we get a reduced-dimension feature map of the image.
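A minimal sketch of this sliding-window operation on a single-channel image (plain NumPy, no padding, stride 1; the vertical-edge kernel is just an illustrative choice):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # Valid convolution of a 2D image with a 2D kernel.
    h, w = image.shape
    f = kernel.shape[0]
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            # element-wise product of filter and image patch, then sum
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.random.rand(6, 6)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])        # simple vertical-edge filter (illustrative)
print(conv2d(image, kernel).shape)     # (4, 4)
```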

Filters may be of many kinds, and each filter extracts a different kind of feature from the image. For example, one filter may be responsible for extracting features based on shapes and edges, while another may extract features based on colour intensities.

A CNN's performance depends on these parameters:

Stride — the number of pixels by which the filter is moved, both vertically and horizontally, over the image so that it covers a new set of pixels at each step of the convolution.

Padding — the process of adding zeroes symmetrically around the border of the original image. This is done to avoid losing the information present at the edges of the image while processing, and it lets us control the size of the feature map output. If no padding is added, the convolution is called valid padding; if enough padding is added so that the output has the same spatial size as the input, it is called same padding.

Filters — these are also called kernels. Each filter adds one channel to the depth of the output generated after convolution, so if we use 3 filters, the depth of the output will be 3. The output of a convolution thus depends on three parameters: stride, padding and depth (the number of filters). We must tune these parameters to obtain the desired output.

Input size: n x n x nc

Filter size: f x f x nc

Here p is the padding, s is the stride and nc is the depth of the image (number of channels).

The output feature map size for a convolution operation is calculated as

output size = floor((n + 2p − f) / s) + 1

For example, if the input image shape is 32x32x3 and the filter size is 5x5x3 with 10 filters, with p = 0 (no padding / valid convolution) and s = 1,

then the output size will be (32 + 2*0 − 5)/1 + 1 = 28.

So the output size is 28x28x10.

Total number of learnable parameters = ((f x f x nc) + 1) x number of filters

Each filter is associated with 1 bias, so a 5x5x3 filter has 75 + 1 = 76 parameters.

Therefore total parameters = 76 * 10 = 760
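These two formulas are easy to check in a few lines of Python (small helpers written only for this example):

```python
def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

def conv_params(f, nc, num_filters):
    # (f*f*nc weights + 1 bias) per filter
    return (f * f * nc + 1) * num_filters

print(conv_output_size(32, 5, p=0, s=1))   # 28
print(conv_params(5, 3, 10))               # 760
```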

Pooling Layer

This layer slides a small kernel of known size over the feature map at a fixed stride. There are two common types of pooling operations: max pooling and average pooling. When the kernel covers a patch of the feature map, max pooling forwards the maximum value from that group of values, while average pooling forwards the average of all values that lie under the kernel. The result is a reduced-dimension version of the feature map. This discards sparse or uninformative cells that contribute little to classification, and the reduced dimensionality in turn reduces the computational power required.
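A minimal sketch of max pooling on a small feature map (plain NumPy; the 2x2 window and stride 2 are the usual illustrative defaults):

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    # Max pooling: keep the largest value in each size x size window.
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

fm = np.arange(16).reshape(4, 4)
print(max_pool2d(fm))   # 2x2 output: [[5, 7], [13, 15]]
```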

Activation Layer

Activation functions are used to introduce a non-linear relationship between input and output (the linear activation being the exception).

Some common activation functions are tanh, sigmoid, softmax, linear and ReLU.

Sigmoid is a non-linear activation function, defined as σ(x) = 1 / (1 + e^(−x)). Its output ranges from 0 to 1.

When x is 0, the output is 0.5. This S-shaped curve has the disadvantage of poor learning due to the vanishing gradient problem that arises during backpropagation.

Tanh is the hyperbolic tangent function. It is very similar to the sigmoid function, but its output ranges from -1 to +1. It also leads to the vanishing gradient problem in very deep neural networks.

Linear is an activation function with a linear relation between output and input. Linear functions have no bounds on their output, and this function does not change the weighted sum of the inputs.

Softmax is a non-linear activation function typically applied to the last layer of a neural network. It can be seen as a generalisation of the sigmoid function to multiple classes. It is defined as

σ(z)i = e^(zi) / Σj e^(zj), for i = 1, …, N

This function maps every output neuron to a value between 0 and 1, and the values sum to 1, which makes them usable as class probabilities. In the equation, N is the total number of classes, z is the vector of inputs to the layer, and σ(z)i is the softmax output for class i.

ReLU (Rectified Linear Unit) is another activation function. Its formula is max(x, 0): if the value coming from the node is positive, the output is that same positive value, and if it is negative, the output is zero. This helps overcome the vanishing gradient problem.
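A short sketch of these activation functions in NumPy (the example input vector is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # output in (0, 1)

def tanh(x):
    return np.tanh(x)                  # output in (-1, 1)

def relu(x):
    return np.maximum(x, 0.0)          # max(x, 0)

def softmax(z):
    z = z - np.max(z)                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()                 # outputs are positive and sum to 1

z = np.array([2.0, 1.0, -1.0])
print(sigmoid(0.0))                    # 0.5
print(softmax(z))                      # class probabilities, roughly [0.71, 0.26, 0.04]
```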

Dropout Layer

These layers are used to reduce overfitting in neural networks and are generally applied after a fully connected layer. A dropout layer randomly drops the outputs of some neurons in every iteration, so that the remaining neurons cannot rely on any single input and must learn more robust features. Dropout rates of 20-50% are commonly used. Using a dropout layer typically improves the performance of a network by a small margin, often around 1 to 2%.

Fully Connected Layer

These layers are essentially an artificial neural network placed at the end of the CNN, and they typically narrow down in a pyramid-like structure. They contain the largest number of learnable parameters in the model. The input to these layers is the flattened feature map, treated just like the input of an ANN, and the number of neurons in the final layer depends on the number of classes. Increasing the number of hidden layers can help the model learn better, but the model complexity and the required computational power also increase.
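Putting these layers together, here is a minimal sketch of such an architecture using the Keras API (the layer sizes, dropout rate and 10-class output are illustrative assumptions, not values from this article):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),                      # height, width, channels
    layers.Conv2D(10, kernel_size=5, activation="relu"),  # convolution + ReLU activation
    layers.MaxPooling2D(pool_size=2),                     # pooling layer
    layers.Flatten(),                                     # flatten feature maps for dense layers
    layers.Dense(64, activation="relu"),                  # fully connected layer
    layers.Dropout(0.5),                                  # dropout layer to reduce overfitting
    layers.Dense(10, activation="softmax"),               # one output neuron per class
])
model.summary()
```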

Loss Function

The function that calculates the loss produced by the model is called the loss function. In a neural network, the loss is simply the error in prediction. The loss function acts as a guide to the network during training: it is used to find the gradients, and based on those gradients the parameters are updated during back propagation. If the predictions deviate a lot from the true values, the loss is high.

Mean Square Error (MSE): MSE = (1/n) * Σ (y_i − ŷ_i)²

Mean Bias Error (MBE): MBE = (1/n) * Σ (y_i − ŷ_i)

Cross Entropy Loss: CE = − Σ y_i * log(ŷ_i), where y_i is the true label and ŷ_i the predicted probability.
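These losses can be written in a few lines of NumPy (the label and prediction values below are made-up numbers for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)        # mean square error

def mbe(y_true, y_pred):
    return np.mean(y_true - y_pred)               # mean bias error

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot, y_pred holds predicted class probabilities
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])                # true class is class 1 (one-hot)
y_pred = np.array([0.2, 0.7, 0.1])                # predicted probabilities
print(mse(y_true, y_pred), cross_entropy(y_true, y_pred))
```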

Forward Propagation and Back propagation

In forward propagation, the input data is fed through the network in the forward direction. Each hidden layer accepts the data from the previous layer, processes it according to its weights and activation function, and passes the result to the next layer.

The basic back propagation algorithm minimizes the error of the network using the derivatives of the error function. The calculation of the derivatives flows backwards through the network, hence the name back propagation. These derivatives point in the direction of the maximum increase of the error function, so a small step (scaled by the learning rate) in the opposite direction decreases the (local) error.

The weights are randomly initialized and then updated using the back propagation algorithm.

Back propagation computes the loss function and, from it, the partial derivative (gradient) of the loss with respect to every weight. Based on these gradients, the weights are updated so that the loss decreases continuously and approaches a local minimum.

W_new = W_old − α * ∂L/∂W

where W_new is the new weight, W_old the previous weight, α the learning rate, and ∂L/∂W the partial derivative of the loss function with respect to the weight.

The learning rate is important: if it is too small, training converges extremely slowly; if it is too high, the updates may overshoot and fail to converge to a solution.
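A minimal sketch of this update rule on a toy one-parameter problem (the quadratic loss is chosen only so the gradient is easy to write by hand):

```python
# Gradient descent on a toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
def loss_grad(w):
    return 2.0 * (w - 3.0)           # dL/dw

w = 0.0                              # arbitrarily initialized weight
alpha = 0.1                          # learning rate
for _ in range(100):
    w = w - alpha * loss_grad(w)     # W_new = W_old - alpha * dL/dW
print(round(w, 4))                   # close to 3.0
```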

Optimizers

Optimizers are algorithms that drive the learning process by updating the weight parameters so as to minimize the total loss of the neural network. Using the loss function as a reference, an optimizer continuously updates the parameters with the goal of minimizing that loss. Several optimizers are used for training neural networks:

SGD — Stochastic Gradient Descent. It calculates gradients and uses them to update the weight values.

Adagrad — Adaptive Gradient Algorithm.

Adadelta — an updated version of Adagrad.

RMSProp — Root Mean Square Propagation.

Adam — Adaptive Moment Estimation. This optimizer computes an individual adaptive learning rate for every parameter in the model.

The Adam optimizer is one of the most popular gradient descent optimization algorithms. In our experiments we use Adam as the optimizer for classification tasks.
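Continuing the illustrative Keras sketch from above, choosing the optimizer and loss is a single call (the learning rate and loss shown here are assumptions for the example, not settings from this article):

```python
# Continuing the illustrative Keras model defined earlier:
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # Adam with an assumed learning rate
    loss="categorical_crossentropy",                       # cross-entropy loss for multi-class labels
    metrics=["accuracy"],
)
```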

Metrics for classification

Metrics are used to evaluate the performance of neural network models. Several metrics are commonly used for measuring the performance of deep learning models.

Confusion matrix

Let us use the example of detecting disease in plants.

TN- True negatives, i.e., plants which did not have disease for which we correctly predicted as no disease.

TP- True positives, i.e., plants which have disease for which we correctly predicted as disease.

FN- False negatives, i.e., plants which have disease for which we predicted as no disease.

FP- False positives, i.e., plants which did not have disease for which we incorrectly predicted as disease.

Precision = TP / (TP + FP), which tells us what proportion of the plants we predicted as having disease actually had the disease. In other words, the proportion of TP among the positive predictions.

Recall = TP / (TP + FN), which tells us what proportion of the plants that actually had disease were predicted by us as having disease. In other words, the proportion of TP among the true disease cases.

F1 score is the harmonic mean of Precision and Recall: F1 = 2 * (Precision * Recall) / (Precision + Recall).

Accuracy is the ratio of correct predictions to the total number of data used: Accuracy = (TP + TN) / (TP + TN + FP + FN).
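A small sketch computing these metrics from raw counts (the counts themselves are made-up numbers for a hypothetical plant-disease classifier):

```python
def classification_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Example counts (illustrative only)
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```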

Please leave a clap/follow if the content is useful.
Thank you
