Artificial Neural Networks, Part 1

Saket Chaturvedi · Published in Analytics Vidhya · May 17, 2020 · 6 min read

This post is the first in a series where I will try to explain the concepts behind ANNs, or Artificial Neural Networks.

We will start with a simple perceptron model and try to understand the intuition behind it without getting into a lot of math.

Whenever we hear or read about ANNs, or Artificial Neural Networks, the analogy that most readily comes to mind is that of the neurons in our brain. A neuron has many connections going into and out of a central cell body, which performs the required operations. Refer to the image below of a neuron from Wikipedia.

In the image above, we can see many input sources feeding the nucleus, which performs most of the signal processing and passes the output on to either the next neuron or the final destination.

Single Layer Perceptron

Similarly, an ANN has inputs, a nucleus or node that performs the calculations, and an output. To represent an ANN in its simplest form, consider the image below.

In the image, we can see two inputs coming into the node. The node is where the processing happens, much like the nucleus of the neuron. This model is called a simple perceptron, as it has only one layer of processing. Here f(X) is the function applied to the inputs to produce the desired output.

Neural networks are said to be applicable to a wide variety of problems and to be very good at learning. To make that possible, the network has parameters that are adjusted until a solution is reached. These parameters are called weights (w).

Weights define the importance of an input and how much it contributes to the overall output. If a weight is very low, that input is not important, whereas a high weight makes the impact of that input on the processing very significant. There can be scenarios where a weight becomes 0; in that case, whatever the value of the input is, it doesn't matter and it will always be ignored. To mitigate this, we add a bias term (b), which can be positive or negative. The bias shifts the weighted sum, so the node becomes significant only when the weighted inputs are large enough relative to the bias term.

The node takes a weighted sum of the inputs, adds the bias term, and applies the function f to get the desired result. It is important to remember that the weighted sum is a dot product of the weight vector W and the input vector X.

Here, i is the input index, X is the input value, W is the weight associated with that input, and b is the bias added at the node.
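As a minimal sketch of this calculation, here it is in Python. The input values, weights, and bias are made-up numbers, and the step function standing in for f is an illustrative assumption, not something specified in the article.

```python
# A single perceptron: a weighted sum of the inputs plus a bias,
# passed through a function f.
inputs = [0.5, 0.8]     # X_i (made-up example values)
weights = [0.2, 0.9]    # W_i
bias = 0.1              # b

# Weighted sum of the inputs plus the bias term.
z = sum(w * x for w, x in zip(weights, inputs)) + bias

# f(X): here a simple step function that fires only when z is positive.
output = 1 if z > 0 else 0

print(z, output)        # 0.92, 1
```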

Multi-Layer Perceptron

A multi-layer perceptron consists of multiple single-layer perceptrons connected together to form a network. The output of each node becomes an input to the nodes of the next layer. All the nodes perform their calculations, and the results are passed forward until they reach the output layer.

Consider the image below.

Here the first layer is called the input layer; it is where the data points are passed in, and no calculations are performed in it. The second, third, and fourth layers, with four, four, and three nodes respectively, are called hidden layers, because they sit between the input and output and their values are not directly observed, which makes them difficult to interpret.

The final layer is called the output layer. In the above example, we have only one output node; however, there can be two or more depending on the type of problem we are trying to solve with the network.

Depending on the type of problem we are solving, it is important to put constraints on the output of the function used at every node in the network. For example, in a classification problem, we would like the output to lie between 0 and 1 so that it can be interpreted as a probability. These constraints are applied through Activation Functions. Here "activation" refers to the ON/OFF behaviour of a node, i.e. how much it contributes to the next layer. The activation function maps the output into the required range based on the type of problem we are solving.
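To make this a little more concrete, here is a minimal sketch of a forward pass through a network with the hidden-layer sizes from the figure above (4, 4, and 3 nodes) and a single output node. The number of inputs (three here), the random weights, and the use of a sigmoid activation at every node are illustrative assumptions, not details from the article.

```python
import numpy as np

def sigmoid(z):
    # Squashes any value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 4, 3, 1]   # input, three hidden layers, output

# One weight matrix and one bias vector per connection between layers.
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.normal(size=n) for n in layer_sizes[1:]]

def forward(x):
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)  # weighted sum plus bias, then the activation
    return a

print(forward([0.5, 0.8, 0.2]))  # a single value between 0 and 1
```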

So far we have seen, without any mathematics, what simple and multi-layer neural networks are and how they operate, along with the different components such as nodes, weights, biases, and input and output layers, and why activation functions are required.

There are quite a number of activation functions that can be used. Before we go further, let us revisit the equation and denote this by the letter Z. Hence,

Z = W*X + b

Let's go over the most commonly used ones; a short code sketch of these functions follows the list.

  • Step Function — This is the simplest function that can be used. It outputs either 0 or 1: if Z > 0 the output is 1, otherwise it is 0. The problem is that this hard threshold cannot be interpreted as a probability and cannot handle multi-value outputs.
  • Sigmoid Function — This will feel familiar if you understand logistic regression. The output is between 0 and 1 and can be used as the probability of one class over the other in classification. The issue with the sigmoid function is that when Z becomes very large or very small, the output barely changes, which can make the network learn extremely slowly or stop learning altogether. This is called the problem of vanishing gradients.
  • tanh( ) or Hyperbolic Tangent — One problem with the sigmoid function is that a strongly negative input produces an output near zero, which does not translate well when calculating the other parameters. To overcome this, we can use the tanh( ) function, which returns values between -1 and 1. A negative input to tanh( ) produces a negative output, and only near-zero inputs are mapped to values near zero. However, tanh suffers from the same vanishing-gradient issue as the sigmoid function.
  • ReLU or Rectified Linear Unit — This is often the default activation function. It returns Z if Z is greater than 0 and returns 0 otherwise; in other words, it is max(0, Z). It is widely used because it allows the network to converge faster and works well with backpropagation. The issue, evident from the image below, is that when Z is 0 or negative the output becomes 0, so those nodes can stop learning.
  • Softmax( ) — This is the most widely used function when we are dealing with a multi-class classification problem where each data point belongs to exactly one class (also called mutually exclusive classes). It calculates the probability of each class relative to all the other classes, and the probabilities always sum to 1. For example, if we have 5 output classes, there will be a probability for each of the 5 classes, and we pick the one with the highest probability as the result.
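As a rough illustration, here is how these functions can be written in Python with NumPy. The function names and the use of NumPy are my own choices for this sketch, not an API from any particular deep learning library.

```python
import numpy as np

def step(z):
    # 1 when z > 0, otherwise 0.
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    # Squashes any value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any value into the range (-1, 1).
    return np.tanh(z)

def relu(z):
    # Passes positive values through, clips negatives to 0.
    return np.maximum(0.0, z)

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1.
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.5, 3.0])
print(step(z), sigmoid(z), relu(z), softmax(z), sep="\n")
```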

We have gone through a brief introduction to neural networks and their components, such as nodes, weights, biases, and activation functions, and why they are required, in simple terms. In the next part of the series, we will go over the concept of Gradient Descent.

Part 2 — Understanding Gradient Descent (without the math)
