What is a Neural Network and how does it work?

Joe Rackham
Published in Analytics Vidhya · Feb 23, 2021 · 14 min read
Photo by Jesse Martini on Unsplash

In ‘Critique of Pure Reason’, the German philosopher Immanuel Kant described two types of knowledge. A priori knowledge is acquired without experience — for example, reasoning that since cats are mammals, and mammals are animals, cats are animals. A posteriori knowledge is obtained from experience, such as discovering that a country is warm in summer by actually visiting it.[1]

Neural networks are computational systems capable of learning how to perform a task without a priori knowledge. They are characterized by an improvement in performance over time. Essentially, when given a set of data, they can derive conclusions about this data without any context or understanding of what the data is.

Structure of a Neural Network

Artificial neural networks emulate the biological neural networks inside the human brain. The atomic element of a neural network is the neuron. Inside the brain, a neuron collects signals from other input neurons. The magnitude of the signal from each input neuron depends on the strength of the connection between the two neurons. If the total signal received by a neuron exceeds a certain strength, the neuron fires.[2]

A biological and an artificial neuron

As in the brain, artificial neural networks consist of connected neurons, but these neurons output numerical values. Every connection between a neuron and one of its input neurons has a particular weight. The neuron calculates the sum of the outputs of its input neurons, each multiplied by the respective connection weight. It then passes this result through an activation function to produce an output value. The activation function limits the range of possible outputs.

How do Neural Networks Learn?

Neural networks improve over time by modifying the weights of the connections between neurons to yield better results. This process is known as learning. A network learns by supervised learning or unsupervised learning. With supervised learning, the network is given some sample input for which the correct output is already known. The network is adjusted by comparing its output with the correct output until it meets an acceptable performance. In unsupervised learning, the network is simply given some input data and the aim is to see how it interprets and structures that data.

Feedforward Neural Networks

Feedforward networks arrange neurons into a series of layers:

  • A single input layer, whose neurons have values — called activations — from information outside the network.
  • One output layer, whose neurons’ activations are the overall outputs of the network.
  • In between the input and output layers, hidden layers, which allow the network to interpret complex patterns. There can be many hidden layers.

Each neuron in a layer is connected to each neuron in the next. These connections run solely in the direction of the output layer, hence the name feedforward.

Earlier, it was mentioned that neurons calculate their output using an activation function. A common activation function in feedforward networks is the sigmoid function:

σ(x) = 1 / (1 + e^(−x))

This function restricts the range of possible outputs to between 0 and 1.

This is in contrast to the binary output of the perceptron, where even small changes to the weights could push the weighted sum over the threshold and so have a significant impact on the network. Such drastic change is not conducive to learning, as it can alter the network’s output in a way that is not easy to predict. The smooth, continuous output produced by the sigmoid function, on the other hand, is affected more subtly by small adjustments to weights and input activations.

Weights can have any size and can be positive or negative. This means that some neurons have more of an impact than others on the activation of a neuron they feed into, and that increasing the activation of a particular input may even cause the output to decrease. In addition, a bias value is sometimes added to the weighted sum so that the activation is inclined towards a certain value.
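
To make this concrete, here is a minimal sketch of a single artificial neuron as just described; the input activations, weights, and bias are arbitrary illustrative values.

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias):
    # Weighted sum of the input activations, plus the bias term
    weighted_sum = sum(w * a for w, a in zip(weights, inputs)) + bias
    # The activation function limits the range of the output
    return sigmoid(weighted_sum)

# Example: a neuron with three input neurons; note the negative weight,
# through which a rising input activation pushes the output down
print(neuron_output([0.5, 0.1, 0.9], [0.4, -0.6, 1.2], bias=-0.3))
```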

Feedforward networks implement supervised learning; they are built to learn tasks for which there is a correct output for every input. When a Feedforward Network is first created, all the weights and biases are either randomized or set to a default value. Either way, it is unlikely that the network will generate better-than-random output. Hence, before the network can be used, it must go through a training stage using the Backpropagation Algorithm.

Backpropagation

Gradient Descent

Consider a function of multiple variables, f(x₁, x₂, …, xₖ). The partial derivatives of the function are the derivatives with respect to each one of these variables while the other variables are held constant. They are denoted:

∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₖ

The gradient is a vector containing all the partial derivatives:

∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₖ)

Traveling in the direction given by this vector at a certain point makes the function increase at the fastest possible rate, and the magnitude of the vector corresponds to that rate of increase per unit distance traveled. It follows that the negative of the gradient gives the direction in which the function decreases most rapidly. Gradient descent is an iterative process of ‘moving’ around the function in the direction of the negative gradient to find a local minimum of the function. With each iteration, the gradient at the new point is computed, and the distance moved is proportional to the magnitude of the gradient.
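
As a minimal sketch, here is gradient descent applied to an illustrative two-variable function whose minimum is known; the starting point and learning rate are arbitrary choices.

```python
import numpy as np

def f(v):
    # An illustrative function of two variables, minimised at (1, -2)
    return (v[0] - 1) ** 2 + (v[1] + 2) ** 2

def gradient(v):
    # The partial derivatives of f with respect to x1 and x2
    return np.array([2 * (v[0] - 1), 2 * (v[1] + 2)])

v = np.array([5.0, 5.0])  # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    # Step in the direction of the negative gradient; the distance
    # moved is proportional to the gradient's magnitude
    v = v - learning_rate * gradient(v)

print(v)  # approaches the local minimum at (1, -2)
```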

Cost Function

A cost function is a measure of how accurate a neural network is, i.e. how close its outputs are to the desired outputs. A simple cost function is the sum of the squares of the differences between each output neuron’s actual value and its expected value. If the output neurons have actual activations a₁, a₂, …, aₖ and expected activations y₁, y₂, …, yₖ, the cost, C, is:

C = (a₁ − y₁)² + (a₂ − y₂)² + … + (aₖ − yₖ)²
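
Written as a short function (the activation values below are illustrative):

```python
import numpy as np

def quadratic_cost(actual, expected):
    # Sum of the squared differences between actual and expected activations
    return np.sum((np.asarray(actual) - np.asarray(expected)) ** 2)

print(quadratic_cost([0.8, 0.2, 0.1], [1.0, 0.0, 0.0]))  # ≈ 0.09
```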

The Backpropagation Algorithm

The aim is to minimize the output of the cost function, since the lower the cost, the higher the accuracy of the network. To reduce the cost, the activations of the neurons in the output layer need to change. The activation of a neuron can be changed by adjusting the weights of the connections from its input neurons, modifying its bias, or changing the activations of the input neurons. Changing the latter requires modifying those neurons’ own weights, biases, or their own inputs’ activations. Backpropagation is a recursive process that works backwards through the network, adjusting weights in the direction that will reduce the cost.

This algorithm involves a lot of complex matrix maths. Those interested can read about the specifics here.
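
As a rough illustration rather than a full treatment, the sketch below trains a small two-layer feedforward network on the XOR task using backpropagation; the layer size, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Supervised training data for XOR: 4 samples, 2 inputs, 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Randomly initialised weights and biases; one hidden layer of 4 neurons
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 2.0

for _ in range(10000):
    # Forward pass: compute every neuron's activation
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: propagate the error from the output layer towards
    # the input, using sigmoid'(z) = s * (1 - s) for activation s
    delta_out = (output - Y) * output * (1 - output)
    delta_hid = (delta_out @ W2.T) * hidden * (1 - hidden)

    # Move each weight and bias against its gradient to reduce the cost
    W2 -= lr * hidden.T @ delta_out
    b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_hid
    b1 -= lr * delta_hid.sum(axis=0)

print(output.round(2))  # should approach [[0], [1], [1], [0]]
```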

Issues with Backpropagation

The approach to using backpropagation with a feedforward neural network described so far is naïve and simplistic. Improvements and techniques have been developed to accelerate this initial approach and mitigate some of its issues. Firstly, the method described above takes a long time: it converges slowly on a low-cost value and takes many iterations to achieve acceptable accuracy. Many adjustments to the standard method have therefore been suggested to try and speed up convergence. The standard way of calculating the cost, as used above, is called the quadratic error; however, in 1987, Franzini reported a 50% reduction in learning time using an alternative error function.[12]

The concept of stochastic gradient descent was also introduced to reduce the computational load associated with training a feedforward neural network. In this model, we randomly select a ‘mini-batch’ from the entirety of the training data and assume that the changes to the weights and biases suggested by this mini-batch will be similar to those we would have obtained from the whole training set. This method can achieve similar success to training on the whole set, but with less computational effort required.
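
A minimal sketch of the mini-batch selection, assuming the training data is held in NumPy arrays; the helper name and batch size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def minibatches(X, Y, batch_size):
    # Shuffle the training set, then yield it in small random batches; each
    # batch stands in for the whole set when estimating the gradient
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[idx]

# One 'epoch' visits every sample once, a batch at a time
X_train = rng.normal(size=(100, 2))
Y_train = rng.normal(size=(100, 1))
for X_batch, Y_batch in minibatches(X_train, Y_train, batch_size=32):
    print(X_batch.shape)  # gradients would be computed on each batch
```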

The aim is to have a network that generalizes effectively. This means that “the input-output relationship computed by the network is correct (or nearly correct) for input/output patterns never used in training the network.”[13] There is a risk of over-fitting, where the network begins to memorize the training set itself. In this situation, the correct output for the training data can be recreated almost perfectly, but the network fails on new input that is ‘similar’ to a training example: it simply detects that the two are different and fails to produce valuable output.

A useful analogy to this problem is that of fitting a polynomial to a set of data points. Overfitting is analogous to choosing a degree for the polynomial that is too large. The curve still fits the data points provided but fails to capture the general shape of the input-output relation.
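
The analogy can be made concrete with NumPy's polynomial fitting; the data and degrees below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples of an underlying straight line
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.1, size=10)

good_fit = np.polynomial.Polynomial.fit(x, y, deg=1)
over_fit = np.polynomial.Polynomial.fit(x, y, deg=9)

# The degree-9 curve passes through every training point, but between
# them it swings with the noise instead of following the general shape
x_new = 0.55  # a point between training samples
print(good_fit(x_new), over_fit(x_new))
```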

The most basic way to avoid overfitting is to use a sufficiently large set of training data.

Another simple method to encourage valuable generalization is to use a training data set that is representative of the larger data set. There are two types of generalization: interpolation and extrapolation. If we are attempting to predict the output for a new value that is surrounded by training cases, we are interpolating; if the new input lies far from any training examples, we are extrapolating. Networks are consistently successful when interpolating, but extrapolation is more error-prone. Having a representative data set reduces the need for extrapolation and therefore improves performance.

Early Stopping

Early stopping is another method to improve generalization. In this method, the normal training data is split between training and validation. The training subset is used for training as usual, while the validation subset is used to keep track of the network’s correctness on data not used in training. Training is stopped if the validation error begins to increase.
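
A minimal sketch of the loop; train_one_epoch and validation_error are hypothetical stand-ins for whatever training and evaluation routines the network provides, and the simulated error values merely demonstrate the stopping rule.

```python
def train_with_early_stopping(train_one_epoch, validation_error, patience=3):
    # Stop once the validation error has failed to improve for
    # `patience` consecutive epochs, and report the best error seen
    best, stale = float("inf"), 0
    while stale < patience:
        train_one_epoch()
        err = validation_error()
        if err < best:
            best, stale = err, 0
        else:
            stale += 1  # validation error is no longer improving
    return best

# Simulated validation errors that fall and then begin to rise
errors = iter([0.9, 0.5, 0.3, 0.31, 0.33, 0.4, 0.5, 0.6, 0.7])
print(train_with_early_stopping(lambda: None, lambda: next(errors)))  # 0.3
```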

However, this is not always practical. In many situations, the relationship between input and correct output must be established by hand, which is often costly, and in some scientific or historical applications training data is simply difficult to find. Furthermore, in early stopping the validation data does still influence training, so the validation error is not necessarily indicative of the generalization error.

Limitations of Gradient Descent

While gradient descent finds a local minimum of a function, it does not guarantee that this local minimum is the global minimum of the function; the local minimum found may be higher than the global minimum.

Adjusting the Network Structure

A better way to resolve issues with backpropagation may be to change the network structure, and there are processes for adjusting the number of neurons in hidden layers. Pruning Methods begin with a large network and attempt to reduce it, removing redundant and problematic parts to achieve a simpler and more performant solution. One such method is to modify the error function in a process known as ‘Weight Decay’. It is based on the idea that especially large weights are problematic: large weight values in connections to hidden neurons cause discontinuities in the output function, while large weight values in connections to output neurons can produce outputs that are far outside the range of the data. Weight Decay works on the principle of introducing a penalty term into the error function to decrease weights. An example of an error function that achieves this is the following:

E = E₀ + γ Σᵢ wᵢ²

where E₀ is the standard quadratic error and γ is a constant used to control the value of the second term. In this approach, all weights tend to 0 at the same rate. The process can be improved by modifying the formula to make smaller weights decrease more quickly so that they can be removed sooner; alternate error functions accomplish this.
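
As a sketch of the effect of the quadratic penalty above, a single gradient step under weight decay shrinks every weight towards zero in addition to following the task gradient (the values below are illustrative):

```python
import numpy as np

def decayed_update(weights, task_gradient, lr=0.1, gamma=0.01):
    # The penalty term gamma * sum(w^2) contributes 2 * gamma * w to the
    # gradient, so each step also pulls every weight towards zero
    return weights - lr * (task_gradient + 2 * gamma * weights)

w = np.array([3.0, -0.5, 0.05])
print(decayed_update(w, task_gradient=np.zeros(3)))  # all weights shrink
```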

Another approach to Pruning is to use sensitivity-based methods. The key principle is to calculate which neurons and connections are least important to the network’s ability to complete the required task, and to eliminate them. The reduced network then passes through further training. Early attempts at sensitivity-based pruning would remove a connection, test whether it caused the error to increase significantly, and then put it back if needed. It is far more effective to test all connections and remove the one with the least impact on the cost; regardless, both of these methods are very time-consuming. A more sophisticated heuristic method is ‘Skeletonization’, developed by Mozer and Smolensky, in which the partial derivative of the error function with respect to each connection is used to compute a ‘relevance value’ for that connection; connections deemed irrelevant are removed. ‘Optimal Brain Damage’ is another sensitivity-based method. It uses the second derivative of the error with respect to a particular connection to compute the saliency Sᵢⱼ (the importance) of that connection:

Sᵢⱼ = (1/2) (∂²E/∂wᵢⱼ²) wᵢⱼ²

This value indicates how sensitive the error is to small changes in the connection so that small weights that have a big impact on the end result are not removed.
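
A sketch of that computation, with illustrative weight and curvature values; note that the small weight with high curvature is not the one chosen for removal.

```python
import numpy as np

def obd_saliency(weights, second_derivatives):
    # Optimal Brain Damage saliency: S = H * w^2 / 2, where H is the second
    # derivative of the error with respect to each connection's weight
    return second_derivatives * weights ** 2 / 2

w = np.array([0.1, 2.0, -0.4])    # connection weights
h = np.array([5.0, 0.01, 0.2])    # illustrative curvature values
s = obd_saliency(w, h)
print(s, "-> remove connection", np.argmin(s))
```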

Instead of removing elements from a network, other methods progressively add to one. One so-called constructive method is Cascade Correlation, which builds a tree-like network in a bottom-up pattern. The algorithm begins by directly connecting the input and output layers. The network is trained to learn as many associations as it can, and the error is measured. If it is below some threshold chosen by the network’s creator, the algorithm stops. If not, a new hidden neuron is added which is connected to every input and every pre-existing hidden neuron, of which there are initially none. All weights in the network are frozen apart from those which feed into the new neuron. These weights are adjusted to achieve the maximum correlation between the new neuron’s output and the network’s output error. This is achieved by maximizing the following, where Vₚ is the new neuron’s output for training pattern p, Eₚ,ₒ is the error at output neuron o, and bars denote averages over the training patterns:

S = Σₒ | Σₚ (Vₚ − V̄)(Eₚ,ₒ − Ēₒ) |

The input to the new neuron is then also frozen, and it is connected to the output neurons. The network is retrained, only adjusting these new connections. The error is again measured and compared to the same threshold that was used initially. If the network, with the newly added neuron, has sufficiently improved so that it now has an error less than the threshold, the algorithm terminates. If the network still produces too high an error value, another neuron is added and the process repeats. The algorithm iterates until the network’s error is finally less than the threshold. It is imperative to pick a reasonable threshold; for example, a threshold of 0 would require a perfect network which, for most problems, would mean that the algorithm iterates forever.
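
The sketch below captures the grow-until-good-enough structure of this process on a toy regression task. It is a loose simplification: where Cascade Correlation trains each new neuron’s input weights to maximize the correlation above, this sketch assigns them randomly and only retrains the output connections, so the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy regression task: 200 samples, 2 inputs, 1 output
X = rng.normal(size=(200, 2))
Y = np.sin(X[:, :1]) + 0.5 * X[:, 1:]

threshold = 0.01
features = [np.ones((len(X), 1))]  # start with only a bias column
for _ in range(20):
    H = np.hstack(features)
    # Retrain only the output connections (here, by least squares)
    out_w, *_ = np.linalg.lstsq(H, Y, rcond=None)
    mse = float(np.mean((H @ out_w - Y) ** 2))
    if mse < threshold:
        break  # the network is accurate enough: stop growing
    # Otherwise add one hidden neuron connected to every input and every
    # pre-existing hidden neuron (random weights stand in for the
    # correlation-maximizing training step)
    inputs = np.hstack([X, H])
    w = rng.normal(size=(inputs.shape[1], 1))
    features.append(sigmoid(inputs @ w + rng.normal()))

print(len(features) - 1, "hidden neurons, error", round(mse, 4))
```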

Applications of Neural Networks

Time series are series of data points organized in time order; this way of representing information is commonly used with money (stock markets, exchange rates) and weather. Feedforward neural networks have been shown to be especially capable of predicting time series data.[15] A successful attempt to model the power output of a wind turbine from wind forecast data using a feedforward neural network is detailed here.

A further scientific usage is detailed in the following, which describes how feedforward neural networks have been used to model the movement of gases along a porous wall.

Another area where feedforward neural networks have shown aptitude is in recognizing handwritten characters. An early example of this was Le Cun’s attempt in 1989 to recognize zip codes. Le Cun divided the characters into a 16 x 16 grid of pixels, which was fed through a network with 12, 12, and 30 neurons in its three hidden layers. He was able to achieve an error of only around 5% with 2000 training digits.[16] Similarly, in his book ‘Neural Networks and Deep Learning’, Michael Nielsen builds a neural network from scratch to learn to recognize numerical characters; the book details how techniques such as stochastic gradient descent and batch mode helped him improve the network’s accuracy. The applications of effective character recognition are far-reaching: it would allow countless administrative tasks to be automated.

Many applications have been found for neural networks in Chemistry, with the number of articles concerning applications of neural networks in Chemistry “exponentially increasing”[14]. Neural networks have been applied to Spectroscopy, Protein Folding, and Process Control; one article describes how a network is utilized to predict NMR Chemical Shifts.

Perhaps most excitingly, in the late 1980s Carnegie Mellon developed ALVINN, a retrofitted army ambulance capable of using its onboard computer and a feedforward neural network to drive without human interaction. ALVINN was originally trained with images and could achieve speeds of around 5 mph. Later versions fed some of the network’s output back in as input, meaning that they did not strictly stick to the feedforward model. These changes allowed ALVINN to be trained by a human actually driving it: after a few minutes, the driver could press a button and ALVINN would take over the steering. ALVINN has been shown to drive safely at over 70 mph and can even correctly recognise which lane it needs to drive in on two-way roads. This article explains ALVINN in more detail, and the attached video shows ALVINN being used on the Carnegie Mellon campus.

References

  1. I. Kant, Critique of Pure Reason, Germany, Henry G. Bohn, 1855
  2. D. Svozil, V. Kvasnicka and J. Popichal, ‘Introduction to multi-layer feed-forward neural networks’, Chemometrics and Intelligent Laboratory Systems, vol 39, no 1, 1997, p. 44
  3. C. Lemarechal, ‘Cauchy and the Gradient Method’, Documenta Mathematica Extra Volume, 2010, p. 253
  4. E. M. Cliff, ‘In Memory of Henry J. Kelley’, Journal of Optimization Theory and Applications, vol 60, no 1, 1989
  5. M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry, MIT, MIT Press, 1969
  6. J. Anderson, An Introduction to Neural Networks, MIT, MIT Press, 1995, p. 220
  7. P. J. Werbos, The Roots of Backpropagation, Harvard, Wiley, 1994
  8. D. Rumelhart, G. Hinton, R. Williams, Learning Representations by Back-propagating Errors, Carnegie Mellon, 1986, p. 1
  9. D. S. Modha, ‘Introducing a Brain-inspired Computer’, http://www.research.ibm.com/articles/brain-chip.shtml, 2012 (Accessed February 2018)
  10. D. Svozil, V. Kvasnicka and J. Popichal, op. cit., p. 46
  11. D. Svozil, V. Kvasnicka and J. Popichal, op. cit., p. 46
  12. M. A. Franzini, ‘Speech Recognition with Backpropagation’, IEEE 9th Annual Conf. Engineering in Medicine and Biology Society, vol 9, no 1, 1987, pp. 1702–1703
  13. D. Svozil, V. Kvasnicka and J. Popichal, op. cit., p. 47
  14. D. Svozil, V. Kvasnicka and J. Popichal, op. cit., p. 52
  15. H. Bourlard, Y. Kamp, ‘Auto-Association by Multilayer Perceptrons and Singular Value Decomposition’, Biological Cybernetics, vol 59, no 1, 1988
  16. Y. Le Cun, ‘Back-propagation Applied to Handwritten Zip Code Recognition’, Neural Computation, vol 1, 1989
