Demystifying Deep Learning — How Do Neural Networks Learn?

Ambuj Agrawal · Published in DataSeries · Jul 7, 2020 · 5 min read

Deep Learning has contributed massively to the tremendous progress and boom in Artificial Intelligence we see all around the world today. Tasks performed by modern AI systems like text and image classification, instance segmentation in images, question-answering based on textual data, reading comprehension, and lots more — the science-fiction of the past — have become increasingly useful and human-like with the use of deep neural networks.

Photo by Nina Ž. on Unsplash

How exactly do these neural networks learn to perform such complex tasks? What goes on underneath the layers of seemingly endless little math operations these networks perform?

A simple neural network

Let’s dig deeper and understand the theory and fundamentals behind deep neural networks conceptually.

First, let’s talk about the algorithm that’s used by most (if not all) neural networks to “learn” from training data. Training data is nothing but human-annotated data, such as labelled images for an image classification task, or labelled true sentiments for sentiment analysis on textual data.
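For instance, a labelled dataset for sentiment analysis might look like the sketch below (the sentences and labels are made up purely for illustration):

```python
# A tiny, made-up labelled dataset for sentiment analysis.
# Each example pairs an input (a sentence) with its human-annotated label.
training_data = [
    ("I loved this movie, it was fantastic!", "positive"),
    ("The plot was dull and the acting was worse.", "negative"),
    ("An absolute masterpiece from start to finish.", "positive"),
    ("I want my two hours back.", "negative"),
]

for text, label in training_data:
    print(f"{label:>8}: {text}")
```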

The name of the algorithm is Backpropagation. We’ll see why it is called this in a short while.

Here’s a brief overview of the structure of neural networks:

Neural networks map an input to an output. The input can be an image, a piece of text, etc. The input is first converted to a numerical representation. For an image, the numerical pixel values at each pixel position are used. For text, each word is represented as a vector of numbers, which can be a word embedding (a vector in which each number is a score for a particular characteristic of the word) or a one-hot vector (a vector of size ‘n’, composed of n-1 zeros and a single one, where the position of the one indicates the selected word), as in the sketch below.
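Here is a rough sketch of the one-hot idea; the tiny vocabulary below is invented just for illustration:

```python
import numpy as np

# A made-up vocabulary of n = 5 words, just for illustration.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return a vector of n-1 zeros and a single 1 at the word's position."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]
```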

These numerical inputs are then passed through the neural network (a step known as forward-propagation), which involves several rounds of multiplication by the network’s weights, addition of bias terms, and passage through a non-linear activation function. This forward-propagation step is performed for each input in the labelled training data, and the accuracy of the network is measured with a function known as the “loss” or “cost” function. The objective of the network is to minimize the loss function, i.e. maximize its accuracy.

Initially, the network starts out with random values for its parameters (weights and bias terms) and then gradually improves its accuracy and reduces its loss by making continuous improvements to these parameters on each iteration of forward-propagation over the training data. This update of the weights and bias terms (its magnitude, as well as its direction: positive or negative) is determined by the backpropagation algorithm. Let’s see what the backpropagation algorithm is all about, and how it effectively helps neural networks “learn” and minimize loss on training data.
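Here is a minimal NumPy sketch of one forward-propagation pass and a loss computation for a tiny two-layer network; the layer sizes, the sigmoid activation, and the mean-squared-error loss are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialised parameters (weights and biases) of a tiny 3 -> 4 -> 1 network.
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    # Multiply by weights, add bias terms, pass through a non-linear activation.
    h = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)

# One labelled training example: a numerical input and its target output.
x, y_true = np.array([0.5, -1.2, 3.0]), np.array([1.0])

y_pred = forward(x)
loss = np.mean((y_pred - y_true) ** 2)  # mean-squared-error "cost" for this example
print(y_pred, loss)
```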

Forward propagation in a deep neural network

The core of backpropagation lies in figuring out by what value each parameter should be changed, or updated, in order to fit the training data better (i.e. minimize loss / maximize prediction accuracy). The method by which these values are determined is, in essence, quite simple:

In the diagram above, the y-axis represents the cost function and the x-axis represents some parameter (weight) in the network. The initial weight must be decreased to reach the local minimum. How does the network figure out, with the weight at its initial position, that it should decrease it to get to the minimum? It looks at the slope of the function at that initial point.

How is the slope obtained? If you’ve taken a calculus course, you’ll know that the slope of a function at a point is given by its derivative. Voilà! We can calculate the slope, and hence the direction of change (positive or negative) that the weight should undergo to approach the minimum. The magnitude of the slope also indicates how much the initial weight needs to be updated. We update the value of the weight iteratively and eventually reach the minimum, achieving the lowest loss and the greatest accuracy!
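As a small worked example of following the slope downhill (the cost function and learning rate below are invented purely for illustration):

```python
# Toy cost function of a single weight: cost(w) = (w - 3)^2, with its minimum at w = 3.
def cost(w):
    return (w - 3) ** 2

def slope(w):
    # Derivative of the cost with respect to w: d/dw (w - 3)^2 = 2 * (w - 3).
    return 2 * (w - 3)

w = 10.0             # an arbitrary initial weight
learning_rate = 0.1

for step in range(50):
    # Move the weight against the slope: downhill on the cost curve.
    w -= learning_rate * slope(w)

print(w, cost(w))    # w ends up very close to 3, the minimum of the cost
```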

The complication arises when the weights are not directly related to the loss function, but indirectly — as in the case of deep neural networks. The familiar chain-rule from calculus comes into play here.

For example, in the image shown, the output ‘y’ is not directly influenced by the input ‘x’, but indirectly: ‘x’ passes through the functions ‘f’ and then ‘g’ before producing the output ‘y’. Using the chain rule, one can write the derivative of ‘y’ with respect to ‘x’ as shown in the figure, reflecting the dependence of ‘g’ on ‘f’, which in turn depends on ‘x’. This can be done for networks of any depth: the derivative, and hence the slope, of any output with respect to an input is obtained as the product of the derivatives of all the steps the input passes through. This is the essence of backpropagation: the derivative (slope) of the output with respect to each parameter is obtained by multiplying derivatives while traversing backwards through the network until that parameter’s direct derivative is reached, hence the name backpropagation.
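A minimal sketch of that idea, with two made-up functions f and g standing in for layers: the derivative of the output with respect to the input is the product of the local derivatives, and the small numerical check below agrees with the chain-rule result.

```python
import numpy as np

# Two simple "layers": y = g(f(x)), with f(x) = x**2 and g(u) = sin(u).
def f(x):
    return x ** 2

def g(u):
    return np.sin(u)

def dy_dx(x):
    # Chain rule: dy/dx = g'(f(x)) * f'(x), multiplying derivatives backwards
    # from the output towards the input.
    u = f(x)              # forward pass through f
    dg_du = np.cos(u)     # local derivative of g at u
    df_dx = 2 * x         # local derivative of f at x
    return dg_du * df_dx

x = 1.3
analytic = dy_dx(x)
# Numerical check with a small finite difference.
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)
print(analytic, numeric)  # the two values agree
```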

--

Ambuj is a published author and industry expert in Artificial Intelligence and Enterprise Automation (https://www.linkedin.com/in/ambujagrawal/)