Artificial Neural Networks - An intuitive approach (Part 1)

A comprehensive yet simple approach to the basics of deep learning

Niketh Narasimhan
Analytics Vidhya


Contents

  1. Artificial Neural Network
  2. Activation functions
  3. Loss functions

Artificial Neural Network

The human brain is the most sophisticated of all supercomputers. An artificial neural network (ANN) is a technique designed to simulate the way the human brain analyzes and processes information. Just as the human brain learns through experience, so does an ANN. An ANN has self-learning capabilities, i.e., as more and more data becomes available, an ANN can improve its predictive/modelling capabilities.

Artificial neural networks are designed to function like the human brain, with neuron nodes interconnected like a web.

ML inspired by the Human brain

An ANN has hundreds or thousands of artificial neurons called processing units, which are interconnected by nodes. These processing units are composed of input and output units. The input units receive various forms and structures of information based on an internal weighting system, and the neural network attempts to learn about the information presented to produce one output report.

Just like humans need a set of rules and guidelines to process information into a result, ANNs are programmed with a set of learning rules called backpropagation, (backward propagation of error), to improve their output results.

An ANN initially goes through a training phase where it learns to recognize patterns in data, whether visually, aurally, or textually. During this supervised phase, the network compares its actual output produced with what it was meant to produce — the desired output. The difference between both outcomes is adjusted using backpropagation. This means that the network works backward, going from the output unit to the input units to adjust the weight of its connections between the units until the difference between the actual and desired outcome produces the lowest possible error.

Let us dive deep into what exactly an ANN structure is!

ANN Structure

Perceptron

The above structure represents an ANN in its most basic form, also called a Perceptron.

A set of inputs denoted as {x1, x2, …, xm} is fed in, each through its own connection with an associated weight denoted as (w1, w2, …, wm). Every connection has a weight attached, which may take either a positive or a negative value. The neuron sums all the signals it receives, with each signal multiplied by the weight on its connection.

This sum is then passed through a transfer/activation function, g(y), that is normally non-linear, to give the final output.
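
To make the weighted sum and activation concrete, here is a minimal sketch of a single perceptron's forward pass in NumPy; the example inputs, weights, bias, and the choice of a sigmoid as g(y) are illustrative assumptions, not values from the text.

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Weighted sum of the inputs followed by a non-linear activation g(y)."""
    y = np.dot(w, x) + b                 # each signal multiplied by its weight, then summed
    return 1.0 / (1.0 + np.exp(-y))      # g(y): sigmoid chosen here as an example

# Illustrative values: three inputs x1..x3 with weights w1..w3
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.7, -0.2])
print(perceptron_forward(x, w, b=0.1))
```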

The back-propagation ANN is a feed-forward neural network structure that takes the input to the network and multiplies it by the weights on the connections between neurons or nodes, summing these products before passing the sum through a threshold function to produce an output. The back-propagation algorithm works by minimizing the error between the output and the target (actual) value by propagating the error back into the network. The weights on each of the connections between the neurons are changed according to the size of the initial error. The input data are then fed forward again, producing a new output and error. The process is reiterated until an acceptably small error is obtained. Each of the neurons uses a transfer/activation function and is fully connected to the nodes on the next layer. Once the error reaches the desired value, the training is stopped. The final model is thus a function that internally represents the output in terms of the inputs. A more detailed discussion of the back-propagation algorithm will be carried out in upcoming articles.
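
As a rough preview of that loop (the full back-propagation derivation is left to the upcoming articles), the sketch below feeds a tiny data set forward, measures the error against the desired output, and nudges the weights to shrink it. The OR-style data, learning rate, and squared-error-style update are illustrative assumptions.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# Toy data: learn OR-like behaviour (illustrative only)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 1], dtype=float)      # desired outputs

w = np.zeros(2)
b = 0.0
lr = 0.5                                     # learning rate (assumed)

for epoch in range(1000):                    # reiterate until the error is small
    out = sigmoid(X @ w + b)                 # feed the inputs forward
    err = out - t                            # difference from the desired output
    grad = err * out * (1 - out)             # error propagated back through the sigmoid
    w -= lr * (X.T @ grad)                   # adjust the weights by the size of the error
    b -= lr * grad.sum()

print(np.round(sigmoid(X @ w + b), 2))       # close to [0, 1, 1, 1]
```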

Activation functions:

(Kindly re-read this topic after going through all the posts, as it contains terms that will be explained later; it is covered here because it adds to the understanding now as well.)

Let us take the example of binary classification. What would the activation function be? Any guesses?

Binary classification/sigmoid function

The above model is an exact replica of the logistic regression model. The sigmoid/logistic function is used in this case.

Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated ("fired") or not, based on whether each neuron's input is relevant for the model's prediction. Activation functions also help normalize the output of each neuron to a range between 0 and 1 or between -1 and 1.

An additional aspect of activation functions is that they must be computationally efficient because they are calculated across thousands or even millions of neurons for each data sample. Modern neural networks use a technique called backpropagation to train the model, which places an increased computational strain on the activation function, and its derivative function.

Linear activation functions:

Binary Step Function

A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; if it is below the threshold, the neuron is not activated.

The problem with a step function is that it does not allow multi-value outputs — for example, it cannot support classifying the inputs into one of several categories.
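
A minimal sketch of a binary step activation; the threshold of 0 is an assumed, conventional choice.

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Fires 1 if the input exceeds the threshold, otherwise 0."""
    return np.where(x > threshold, 1.0, 0.0)

print(binary_step(np.array([-2.0, 0.5, 3.0])))   # [0. 1. 1.]
```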

Linear Activation Function

A linear activation function takes the form:

A = cx

It takes the inputs, multiplies them by the weights for each neuron, and creates an output signal proportional to the input. In one sense, a linear function is better than a step function because it allows multiple outputs, not just yes and no.

However, a linear activation function has two major problems:

1. Not possible to use backpropagation (gradient descent) to train the model — the derivative of the function is a constant, and has no relation to the input, X. So it’s not possible to go back and understand which weights in the input neurons can provide a better prediction.

2. All layers of the neural network collapse into one — with linear activation functions, no matter how many layers in the neural network, the last layer will be a linear function of the first layer (because a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into just one layer.

A neural network with a linear activation function is simply a linear regression model. It has limited power and little ability to handle complex, varying input data.
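
The collapse described in point 2 above is easy to verify numerically: two stacked purely linear layers are equivalent to a single linear layer whose weight matrix is the product of the two. The matrices below are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))        # first linear layer
W2 = rng.normal(size=(2, 4))        # second linear layer
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)          # linear activation A = cx at every layer
one_layer = (W2 @ W1) @ x           # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))   # True
```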

Note: Backpropagation will be covered in depth later

Non-Linear Activation Functions

Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data, such as images, video, audio, and data sets which are non-linear or have high dimensionality.

Almost any process imaginable can be represented as a functional computation in a neural network, provided that the activation function is non-linear.

Non-linear functions address the problems of a linear activation function:

  1. They allow backpropagation because they have a derivative function which is related to the inputs.
  2. They allow “stacking” of multiple layers of neurons to create a deep neural network. Multiple hidden layers of neurons are needed to learn complex data sets with high levels of accuracy.

Nonlinear Activation Functions and How to Choose them

Sigmoid

Sigmoid / Logistic

Advantages

  • Smooth gradient, preventing “jumps” in output values.
  • Output values bound between 0 and 1, normalizing the output of each neuron.
  • Clear predictions — for X above 2 or below -2, the function tends to bring the Y value (the prediction) to the edge of the curve, very close to 1 or 0. This enables clear predictions.

Disadvantages

  • Vanishing gradient (Will be covered in depth later) — for very high or very low values of X, there is almost no change to the prediction, causing a vanishing gradient problem. This can result in the network refusing to learn further, or being too slow to reach an accurate prediction.
  • Outputs not zero centered.
  • Computationally expensive
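
A short sketch of the sigmoid and its derivative; notice how the gradient all but vanishes for large positive or negative X, which is the vanishing-gradient disadvantage listed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.4f}  gradient={sigmoid_grad(x):.6f}")
```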

TanH / Hyperbolic Tangent

Advantages

  • Zero centered — making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
  • Otherwise like the Sigmoid function.

Disadvantages

  • Like the Sigmoid function
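
A quick numerical comparison of the zero-centred property: tanh maps negative inputs to negative outputs, while the sigmoid keeps everything strictly between 0 and 1. The sample inputs are arbitrary.

```python
import numpy as np

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(np.round(np.tanh(x), 3))                  # symmetric around 0
print(np.round(1.0 / (1.0 + np.exp(-x)), 3))    # always positive
```
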
ReLU

ReLU (Rectified Linear Unit)

Advantages

  • Computationally efficient — allows the network to converge very quickly
  • Non-linear — although it looks like a linear function, ReLU has a derivative function and allows for backpropagation

Disadvantages

  • The Dying ReLU problem — when inputs approach zero, or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation and cannot learn.
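
A minimal ReLU sketch; the gradient is 1 for positive inputs and exactly 0 for negative ones, which is what produces the dying-ReLU problem described above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)    # zero gradient for negative inputs

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.]
print(relu_grad(x))   # [0. 0. 0. 1.]
```
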
Leaky ReLU

Leaky ReLU

Advantages

  • Prevents dying ReLU problem — this variation of ReLU has a small positive slope in the negative area, so it does enable backpropagation, even for negative input values
  • Otherwise like ReLU

Disadvantages

  • Results not consistent — leaky ReLU does not provide consistent predictions for negative input values.
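
A sketch of leaky ReLU; the small negative-side slope of 0.01 is a conventional choice assumed here, not a value from the text.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)   # negative inputs leak through, scaled by the slope

print(leaky_relu(np.array([-3.0, -0.5, 2.0])))
```
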
Parametric ReLU

Parametric ReLU

Advantages

  • Allows the negative slope to be learned — unlike leaky ReLU, this function provides the slope of the negative part of the function as an argument. It is, therefore, possible to perform backpropagation and learn the most appropriate value of α.
  • Otherwise like ReLU

Disadvantages

  • May perform differently for different problems.
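
Parametric ReLU has the same shape as leaky ReLU but treats the negative slope α as a learnable parameter; the gradient with respect to α below is what allows backpropagation to tune it. The initial value of 0.25 is an assumption.

```python
import numpy as np

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    """Gradient of the output with respect to alpha (non-zero only for x < 0)."""
    return np.where(x > 0, 0.0, x)

alpha = 0.25                              # learnable, updated during training
x = np.array([-2.0, 1.5])
print(prelu(x, alpha), prelu_grad_alpha(x))
```
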
Softmax

Softmax

Advantages

  • Able to handle multiple classes, where other activation functions handle only one — normalizes the outputs for each class between 0 and 1, and divides by their sum, giving the probability of the input value being in a specific class.
  • Useful for output neurons — typically Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories.
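
A sketch of softmax applied to raw class scores; subtracting the maximum score before exponentiating is a standard numerical-stability trick assumed here.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # stability: shift by the maximum score
    return e / e.sum()                    # probabilities that sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # roughly [0.66, 0.24, 0.10]
```
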
Swish

Swish

Swish is a new, self-gated activation function discovered by researchers at Google. According to their paper, it performs better than ReLU with a similar level of computational efficiency. In experiments on ImageNet with identical models running ReLU and Swish, the new function achieved top-1 classification accuracy 0.6–0.9% higher.
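
Swish is simply x·sigmoid(x), i.e. the input gating itself; a minimal sketch (the basic, non-parameterised form is assumed):

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # equivalent to x * sigmoid(x)

print(swish(np.array([-3.0, 0.0, 3.0])))
```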

Finding the best weights/coefficients (The loss function)

A loss function is a method of evaluating how well a specific algorithm models the given data. If predictions deviate too much from the actual values, the loss function produces a very large number. We therefore define a goodness metric (optimization function) that measures how good the fit (for regression problems) or the separation (for classification problems) is.

Ideal properties of a loss function

  1. Robust: The result does not drastically explode due to the presence of outliers.
  2. Non-ambiguous: Multiple coefficient values should not give the same error.
  3. Sparse: Should use as little data as possible.
  4. Convexity: It should be convex.

Convexity of a loss function

Loss functions cheat sheet

Regression loss functions:

As can be seen in the diagram below, regression losses are simple and self-explanatory. The squared loss (L2) is less robust than the absolute loss (L1) due to the presence of squared terms; L2 loss is easily differentiable compared to L1; Huber's loss is more robust and differentiable, as it combines the best of the L1 and L2 losses.

Common loss functions regression
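
A sketch of the three regression losses mentioned above: squared (L2), absolute (L1), and Huber. The Huber threshold δ = 1.0 and the sample values, including the deliberate outlier, are assumptions for illustration.

```python
import numpy as np

def squared_loss(y, y_hat):               # L2: sensitive to outliers
    return np.mean((y - y_hat) ** 2)

def absolute_loss(y, y_hat):              # L1: more robust, not differentiable at 0
    return np.mean(np.abs(y - y_hat))

def huber_loss(y, y_hat, delta=1.0):      # quadratic near 0, linear for large errors
    err = np.abs(y - y_hat)
    return np.mean(np.where(err <= delta,
                            0.5 * err ** 2,
                            delta * (err - 0.5 * delta)))

y = np.array([1.0, 2.0, 3.0, 100.0])      # last point is an outlier
y_hat = np.array([1.1, 1.9, 3.2, 3.0])
print(squared_loss(y, y_hat), absolute_loss(y, y_hat), huber_loss(y, y_hat))
```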

Classification loss functions:

Binary classification:

Exponential Loss:

Logistic Loss:

Logistic loss function
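
For labels y ∈ {−1, +1} and a raw classifier score f(x), these two losses are usually written as exp(−y·f) and log(1 + exp(−y·f)); a minimal sketch with illustrative scores (the exact notation in the figures above may differ):

```python
import numpy as np

def exponential_loss(y, score):
    return np.exp(-y * score)

def logistic_loss(y, score):
    return np.log(1.0 + np.exp(-y * score))

y = np.array([1, 1, -1])
score = np.array([2.0, -0.5, -1.5])       # raw classifier outputs
print(exponential_loss(y, score), logistic_loss(y, score))
```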

Binary Hinge loss:

Hinge Loss

To better understand the concept of hinge loss, let us take actual and predicted values, and let us choose the margin as K = 0.20.

The table above illustrates hinge loss for a hypothetical SVM (support vector machine). The goal is binary classification. Items can be class -1 or +1 (for example, male/female, or live/die, etc.). An SVM classifier accepts predictor values and emits a value between -1.0 and +1.0, for example +0.3872 or -0.4548. (Actually, that's not entirely true, but assume it is; the following explanation doesn't change.)

If the computed output value is any positive value, the prediction is class +1 and vice versa.

But, SVM has a notion of a margin. Suppose the margin is 0.2 and a set of actual and computed values is as shown in the table. Here’s what’s going on:

For item [0], the actual is +1 and the computed is +0.55 so this is a correct prediction and because the computed value is greater than the margin of 0.2 there is no hinge loss error.

For item [1], the actual is +1 and the computed is +0.25 so the same situation occurs.

For item [3], the actual is +1 and the computed is -0.25, so the classification is wrong and there's a large hinge loss.

For item [6], the actual is -1 and the computed is -0.05, so the classification is correct, but there is a moderate hinge loss because the computed value is too close to zero.

For item [7], the actual is -1 and the computed is +0.25 so the classification is wrong and there’s a large hinge loss. Notice the symmetry with item [3].
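
The walkthrough above can be reproduced with the hinge formula max(0, K − actual·computed) for margin K = 0.2; only the item values quoted in the text are used below, since the full table is not reproduced here.

```python
import numpy as np

def hinge_loss(actual, computed, margin=0.2):
    return np.maximum(0.0, margin - actual * computed)

actual   = np.array([+1, +1, +1, -1, -1])               # items [0], [1], [3], [6], [7]
computed = np.array([+0.55, +0.25, -0.25, -0.05, +0.25])
print(hinge_loss(actual, computed))                      # [0.   0.   0.45 0.15 0.45]
```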

Multiclass Classification:

Hinge Loss/Multi class SVM Loss

In simple terms, the score of the correct category should be greater than the score of each incorrect category by some safety margin (usually one). Hence hinge loss is used for maximum-margin classification, most notably for SVMs. Although not differentiable, it is a convex function, which makes it easy to work with the usual convex optimizers used in the machine learning domain.
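
A sketch of the multi-class SVM (hinge) loss for a single example: each incorrect class contributes max(0, s_j − s_correct + margin). The scores and the margin of 1 are illustrative.

```python
import numpy as np

def multiclass_hinge_loss(scores, correct, margin=1.0):
    """Sum of the margins by which incorrect class scores intrude on the correct one."""
    diffs = scores - scores[correct] + margin
    diffs[correct] = 0.0                      # the correct class contributes nothing
    return np.sum(np.maximum(0.0, diffs))

scores = np.array([3.2, 5.1, -1.7])           # raw scores for 3 classes
print(multiclass_hinge_loss(scores, correct=0))   # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```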

Cross Entropy Loss
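
Cross entropy compares a predicted probability distribution (for example, a softmax output) with the one-hot true distribution; a minimal sketch, with clipping added as an assumption to avoid log(0):

```python
import numpy as np

def cross_entropy(probs, true_class):
    probs = np.clip(probs, 1e-12, 1.0)        # avoid log(0)
    return -np.log(probs[true_class])         # only the true-class term survives the one-hot

print(cross_entropy(np.array([0.7, 0.2, 0.1]), true_class=0))   # ~0.357
```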

Gradient Descent:

For the mathematical intuition and understanding of gradient descent, kindly go through the below link (an excellent article on gradient descent).
