Deep Neural Networks

Rochak Agrawal
May 26 · 6 min read

Deep neural networks are neural networks with many hidden layers. The number of hidden layers in such a network can range from three to a few hundred. The first question that comes to mind is: why do we need so many hidden layers? The answer is that we want the network to learn complex functions. The first few layers of a deep network learn simple features, and as we go deeper, the network learns increasingly sophisticated features that are often hard for humans to interpret. In this post, let us understand how deep neural networks work in a mathematical context. I will use a four-layer deep neural network as the running example for the explanations that follow.


The number of layers is L = 4. It is the sum of the hidden layers and the output layer; the input layer is not counted. Here we have three hidden layers and one output layer.

The number of neurons in a layer l is represented by n[l]. Here we have,

  • n[0] = 3 i.e. number of features in the training dataset
  • n[1] = 4
  • n[2] = 4
  • n[3] = 3
  • n[4] = 1 i.e. number of neurons in the output layer

There is no general rule for selecting the number of layers and the number of neurons in a network; the choice is empirical. If you are building a network yourself, begin with a single hidden layer and add more as you go. Evaluate each configuration on a held-out validation dataset and choose the best one for your use case, keeping the test dataset for the final evaluation.

The weights and biases associated with a layer l are represented by the matrices W[l] and b[l] respectively. The dimension of matrix W[l] is (n[l],n[l-1]). This is because all the incoming weights of a particular neuron are arranged in a single row, and there is one row for each neuron in that layer. As each neuron has a bias associated with it, the dimension of matrix b[l] is (n[l],1). Therefore we have,

  • W[1] = (4,3) matrix and b[1] = (4,1) matrix
  • W[2] = (4,4) matrix and b[2] = (4,1) matrix
  • W[3] = (3,4) matrix and b[3] = (3,1) matrix
  • W[4] = (1,3) matrix and b[4] = (1,1) matrix
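
The shapes listed above can be checked with a few lines of NumPy. This is a minimal sketch: the helper name `init_params`, the 0.01 scaling of the random weights and the zero biases are assumptions made for illustration, not choices taken from this post.

```python
import numpy as np

# Layer sizes of the example network: n[0], n[1], ..., n[4].
layer_sizes = [3, 4, 4, 3, 1]

def init_params(layer_sizes, seed=0):
    """Return {l: (W, b)} with W[l] of shape (n[l], n[l-1]) and b[l] of shape (n[l], 1)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        # One row per neuron in layer l, one column per neuron in layer l-1.
        W = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
        # One bias per neuron in layer l (zeros here, an assumed choice).
        b = np.zeros((layer_sizes[l], 1))
        params[l] = (W, b)
    return params

params = init_params(layer_sizes)
for l, (W, b) in params.items():
    print(f"W[{l}]: {W.shape}, b[{l}]: {b.shape}")
```

The printed shapes should match the list above, one line per layer.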

The activations of a layer l are represented by the matrix A[l]. The activation of a neuron can be thought of as the output of that neuron, so the shape of A[l] depends on how much data we feed to the network. In general, the dimension of matrix A[l] is (n[l],m), where m is the number of training examples. The input matrix X can be treated as A[0], with dimension (n[0],m).

One may wonder: why not use only a few layers and put many neurons in each? The answer lies in circuit theory. Informally, it tells us that there are functions which a deep network can compute with relatively few neurons, but for which a shallow network needs exponentially more neurons to reach similar accuracy. To avoid that exponential factor and still allow the network to learn complex functions, we prefer a deep network with many hidden layers.


The weights and biases of the neural network are initialised randomly, so an untrained network outputs what is essentially random noise. To make the network output the correct values, we train it. Training the network is nothing but minimising the loss (a measure of the difference between the predicted values, i.e. the output of the network, and the true values) so that the predictions come close to the true values. Each training step has three parts, which I discuss below.

Forward Propagation

The steps by which we calculate the output from the input are known as forward propagation. It makes use of the input matrix X, the weight matrices W[1], W[2], ..., W[L] and the bias matrices b[1], b[2], ..., b[L]. Mathematically, we compute the output using the following equations:
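
For the four-layer example network, forward propagation can be written out layer by layer. This is a reconstruction in the notation of this post, assuming A[0] = X. The symbol Z[l] is introduced here for the weighted sum of layer l before its activation function is applied, and each activation function g[l] is left unspecified.

```latex
\begin{aligned}
Z^{[1]} &= W^{[1]} X + b^{[1]}, & A^{[1]} &= g^{[1]}(Z^{[1]}) \\
Z^{[2]} &= W^{[2]} A^{[1]} + b^{[2]}, & A^{[2]} &= g^{[2]}(Z^{[2]}) \\
Z^{[3]} &= W^{[3]} A^{[2]} + b^{[3]}, & A^{[3]} &= g^{[3]}(Z^{[3]}) \\
Z^{[4]} &= W^{[4]} A^{[3]} + b^{[4]}, & \hat{Y} = A^{[4]} &= g^{[4]}(Z^{[4]})
\end{aligned}
```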

In the above equations, the function g(x) represents the activation function. Each layer can use a different activation function, so the activation function of layer i is written g[i](x). If you observe, the above equations follow a pattern and can be generalised using the following equations:
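
In general form, each layer l = 1, ..., L computes Z[l] = W[l].A[l-1] + b[l] followed by A[l] = g[l](Z[l]), with A[0] = X. The loop below is a minimal NumPy sketch of that rule. It assumes ReLU for the hidden layers and sigmoid for the output layer, since this post does not fix the activation functions, and the names `forward`, `relu` and `sigmoid` are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Apply Z[l] = W[l] A[l-1] + b[l], A[l] = g[l](Z[l]) for l = 1..L.

    params is a list of (W, b) pairs, one per layer. Returns the list of
    activations [A[0], ..., A[L]] with A[0] = X.
    """
    activations = [X]
    L = len(params)
    for l, (W, b) in enumerate(params, start=1):
        Z = W @ activations[-1] + b        # b broadcasts across the m columns
        g = sigmoid if l == L else relu    # assumed choice of activations
        activations.append(g(Z))
    return activations

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 3, 1]
params = [(rng.standard_normal((sizes[l], sizes[l - 1])) * 0.01,
           np.zeros((sizes[l], 1)))
          for l in range(1, len(sizes))]

m = 5  # number of examples in this toy batch
X = rng.standard_normal((sizes[0], m))
activations = forward(X, params)
for l, A in enumerate(activations):
    print(f"A[{l}] shape: {A.shape}")
```

Every A[l] comes out with shape (n[l], m), which is a quick way to confirm that the matrix dimensions line up from the input to the output.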

Backward Propagation

The way we compute how the weights and biases of the network should change is known as backward propagation. In this phase, the neural network “learns” with the help of gradient descent. Backward propagation applies the chain rule to the calculations of forward propagation to obtain the derivatives of the loss with respect to every weight and bias, and gradient descent then uses these derivatives to update the weights and biases so that the loss decreases. Each step of backward propagation can be generalised using the following equations:
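
One common form of these equations is given here as a reconstruction in the notation of this post. It assumes that Z[l] is the weighted sum of layer l before activation, that dA[l] and dZ[l] hold the per-example derivatives of the loss (one column per training example, with column i written with the superscript (i)), and that dW[l] and db[l] are the derivatives of the loss averaged over the m training examples.

```latex
\begin{aligned}
dZ^{[l]} &= dA^{[l]} * g^{[l]\prime}(Z^{[l]}) \\
dW^{[l]} &= \frac{1}{m}\, dZ^{[l]} \cdot A^{[l-1]\,T} \\
db^{[l]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \\
dA^{[l-1]} &= W^{[l]\,T} \cdot dZ^{[l]}
\end{aligned}
```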

In the above equations, “*” represents element-wise multiplication whereas “.” represents matrix multiplication. The derivation of the above equations requires a detailed understanding of calculus and is beyond the scope of this post. You can refer to my previous posts for getting a clear understanding of how they are derived. Moreover, I would recommend the readers who know calculus to derive the equations themselves for a better understanding of the topic.
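
The backward pass can be sketched in NumPy as well. The code below assumes ReLU hidden layers, a sigmoid output layer and a binary cross-entropy loss averaged over the examples. None of these are specified in this post, and the function names are illustrative. With that loss and a sigmoid output, the element-wise product for the last layer simplifies to A[L] - Y.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Return per-layer caches [(A_prev, Z), ...] and the output A[L]."""
    A, caches = X, []
    for l, (W, b) in enumerate(params, start=1):
        Z = W @ A + b
        caches.append((A, Z))
        A = sigmoid(Z) if l == len(params) else relu(Z)
    return caches, A

def backward(Y, params, caches, AL):
    """Return [(dW, db), ...] per layer for the mean binary cross-entropy loss."""
    m = Y.shape[1]
    grads = [None] * len(params)
    # Sigmoid output + cross-entropy: dA[L] * g'(Z[L]) simplifies to A[L] - Y.
    dZ = AL - Y
    for l in reversed(range(len(params))):   # 0-based index of the layer
        A_prev, _ = caches[l]
        W, _ = params[l]
        dW = (dZ @ A_prev.T) / m              # "." : matrix multiplication
        db = dZ.sum(axis=1, keepdims=True) / m
        grads[l] = (dW, db)
        if l > 0:
            dA_prev = W.T @ dZ
            Z_prev = caches[l - 1][1]
            dZ = dA_prev * (Z_prev > 0)       # "*" : element-wise, ReLU'(Z)
    return grads

rng = np.random.default_rng(1)
sizes = [3, 4, 4, 3, 1]
params = [(rng.standard_normal((sizes[l], sizes[l - 1])), np.zeros((sizes[l], 1)))
          for l in range(1, len(sizes))]
X = rng.standard_normal((3, 6))
Y = (rng.random((1, 6)) > 0.5).astype(float)

caches, AL = forward(X, params)
grads = backward(Y, params, caches, AL)
for l, (dW, db) in enumerate(grads, start=1):
    print(f"dW[{l}]: {dW.shape}, db[{l}]: {db.shape}")
```

Each dW[l] has the same shape as W[l] and each db[l] the same shape as b[l], which is what the update step relies on.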

Updating the Weights and Biases

Now that we have the derivatives of weights and bias with us, we update them using the following equations:
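
In the notation of this post, the standard gradient descent update is applied to every layer l = 1, ..., L. It is reconstructed here on the assumption that dW[l] and db[l] are the derivatives of the loss with respect to W[l] and b[l].

```latex
\begin{aligned}
W^{[l]} &:= W^{[l]} - \alpha \, dW^{[l]} \\
b^{[l]} &:= b^{[l]} - \alpha \, db^{[l]}
\end{aligned}
```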

In the above equations, alpha is known as the learning rate. It is the factor that determines how much the weights and biases change in a single training step. If the learning rate is too high, each step changes the values too much and training can overshoot the minimum or diverge; if it is too low, training becomes very slow. It is crucial to find a suitable learning rate so that the network trains efficiently.
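
The effect of the learning rate is easy to see on a toy problem. The sketch below applies the update step to a single 1x1 "layer" whose loss is w^2 + b^2. That loss is an assumption chosen only because its gradients, 2w and 2b, are known in closed form; the helper names `update` and `run` are illustrative.

```python
import numpy as np

def update(params, grads, alpha):
    """One gradient descent step: W[l] - alpha * dW[l], b[l] - alpha * db[l]."""
    return [(W - alpha * dW, b - alpha * db)
            for (W, b), (dW, db) in zip(params, grads)]

def run(alpha, steps=20):
    """Minimise the toy loss w^2 + b^2 from w = b = 1 and return |w| at the end."""
    params = [(np.array([[1.0]]), np.array([[1.0]]))]
    for _ in range(steps):
        grads = [(2 * W, 2 * b) for W, b in params]  # gradients of the toy loss
        params = update(params, grads, alpha)
    return abs(params[0][0].item())

for alpha in (0.001, 0.1, 1.1):
    print(f"alpha={alpha}: |w| after 20 steps = {run(alpha):.4f}")
```

With the smallest learning rate the weight barely moves from its starting value, with the middle one it shrinks towards the minimum at zero, and with the largest one every step overshoots and the weight grows instead of shrinking.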

I urge the readers to figure out the dimensions of the matrices once by themselves. It would develop a concrete understanding of how various matrices are represented mathematically and how the data flows from the input to the output.

Summing it up

You might have heard that training a neural network takes a considerable amount of time and resources. The parts which I have explained above make up just a single training step. Moreover, I have considered only three input features. In real life, there would be hundreds of features and many hidden layers, and it would take hundreds or thousands of steps to achieve excellent performance. All this increases the required computation resources and time to a great extent. The whole procedure can be described using the following:
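
As a rough sketch, the three parts can be combined into one training loop. The code below makes several assumptions that are not taken from this post: ReLU hidden layers, a sigmoid output, a mean binary cross-entropy loss, He-style scaling of the initial weights, and a synthetic dataset whose label is 1 when the three features sum to a positive number.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, sizes, alpha=0.1, steps=500, seed=0):
    """Repeat forward propagation, backward propagation and the update step."""
    rng = np.random.default_rng(seed)
    # He-style scaling is an assumed initialisation, not taken from the post.
    params = [(rng.standard_normal((sizes[l], sizes[l - 1])) * np.sqrt(2.0 / sizes[l - 1]),
               np.zeros((sizes[l], 1)))
              for l in range(1, len(sizes))]
    m, L, losses = X.shape[1], len(sizes) - 1, []
    for _ in range(steps):
        # 1. Forward propagation, caching A[l-1] and Z[l] for every layer.
        A, caches = X, []
        for l, (W, b) in enumerate(params, start=1):
            Z = W @ A + b
            caches.append((A, Z))
            A = sigmoid(Z) if l == L else np.maximum(0.0, Z)
        A_clip = np.clip(A, 1e-12, 1 - 1e-12)  # avoid log(0)
        losses.append(float(-np.mean(Y * np.log(A_clip)
                                     + (1 - Y) * np.log(1 - A_clip))))
        # 2. Backward propagation (sigmoid + cross-entropy: dZ[L] = A[L] - Y).
        dZ, grads = A - Y, [None] * L
        for l in reversed(range(L)):
            A_prev, _ = caches[l]
            grads[l] = (dZ @ A_prev.T / m, dZ.sum(axis=1, keepdims=True) / m)
            if l > 0:
                dZ = (params[l][0].T @ dZ) * (caches[l - 1][1] > 0)
        # 3. Update of the weights and biases.
        params = [(W - alpha * dW, b - alpha * db)
                  for (W, b), (dW, db) in zip(params, grads)]
    return params, losses

rng = np.random.default_rng(42)
X = rng.standard_normal((3, 200))                      # 3 features, 200 examples
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)   # synthetic labels
params, losses = train(X, Y, sizes=[3, 4, 4, 3, 1])
print(f"loss at step 0: {losses[0]:.3f}, loss at final step: {losses[-1]:.3f}")
```

Even this tiny network repeats the three parts hundreds of times, which gives a feel for why larger networks and datasets need so much computation.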

I want to thank the readers for reading the story. If you have any questions or doubts, feel free to ask them in the comments section below. I’ll be more than happy to answer them and help you out. If you like the story, please follow me to get regular updates when I publish a new story. I welcome any suggestions that will improve my stories.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem
