Introduction to Deep Learning

Mar 20, 2018

“Machine intelligence” has started to show up in many of the devices we use today. From speech-to-text transcription and language translation to self-driving cars and game-playing computers that can beat human champions, Deep Learning and AI are present everywhere. With the rapid growth in research and technology, it is easy to imagine that the magical world of Harry Potter or the pages of a science fiction novel could soon come alive. So how do these intelligent machines come to exist? Who creates them, or rather, how are they created? The answers to these questions touch many areas of technology, namely Artificial Intelligence, Machine Learning, Computer Vision, Deep Learning, Robotics, Information Theory and Data Science. Today, however, we are going to discuss the fundamentals of Deep Learning.

Applications of Deep Learning:

Here are just a few examples of deep learning at work:

1. A self-driving vehicle slows down as it approaches a pedestrian crosswalk or a red traffic signal by processing its surroundings.

2. An ATM rejects a counterfeit bank note and may alert a nearby bank.

3. A smartphone app gives an instant translation of a foreign street sign or converts text in images to actual text.

Deep learning is especially well-suited to identification applications such as face recognition, text translation, voice recognition, and advanced driver assistance systems, including lane classification and traffic sign recognition.

WHAT IS DEEP LEARNING?

Loosely inspired by a model of the human brain, deep learning means training a neural network to learn and identify features in the training data, so that it can recognize similar features and give an appropriate output when new test data is fed into the same network. Let’s look at an example. Say we have images of dogs and cats and we want to apply deep learning to teach a machine to tell them apart. This is an example of a binary classification problem where the output can be 1 (meaning it is a cat picture) or 0 (meaning it is a dog picture). We first label the images in order to have training data for the network. Using this training data, the network can then start to learn each object’s specific features and associate them with the corresponding category.

Each hidden layer in the network takes in data from the previous layer, transforms it, and passes it on to the next layer. As we go deeper into the network, it starts learning more complex features and details from the training dataset. The number of layers in a neural network depends on the complexity of the problem the network is trying to solve. The architecture can always be modified by adding or removing layers to help the network train better.

**Fig — 1.1:** A Neural Network for binary classification of images

In reality, a Deep Learning Neural Network looks like this.

**Fig — 1.2:** This is a fully connected neural network.

Components of a Neural Network:

1. Input [ X -> {x1, x2, x3}] and output [ Y -> {y1, y2, y3}] layers.

2. One or more hidden layers (the layers colored yellow). Layer l = 0 is the input layer and layer l = L is the output layer, as shown in Figure 1.2. The layers form a Markov chain.

3. Each layer contains several nodes, also called neurons or activation units. The number of activation units in layer l, for l = 0, . . . , L, is denoted by n_l. Here n_0 = 3, n_1 = 5, n_2 = 4 and n_3 = 3, and the last layer is L = 3. Since the input layer is not counted, this is a three-layer neural network.

4. Every node in a layer is connected to every node in the next layer, which gives the network its fully connected nature.

5. The number of nodes in the output layer depends on the problem the Neural Network is working on. For example, for a binary classification problem the output layer can contain a single node whose value of 1 or 0 indicates a cat or a dog respectively. For classifying images of many kinds of objects, called multi-class classification, the output layer will contain more than one output node.
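To make the sizes in the list above concrete, here is a minimal Python sketch that lists the parameter shapes of a fully connected network with the layer sizes from Figure 1.2. The names `layer_sizes` and `parameter_shapes` are hypothetical, and the convention that the weight matrix of layer l has shape (n_l, n_(l-1)) is an assumption, chosen so that it can multiply the activations of the previous layer.

```python
# Layer sizes from Figure 1.2: n_0 = 3 inputs, two hidden layers, n_3 = 3 outputs.
layer_sizes = [3, 5, 4, 3]

def parameter_shapes(sizes):
    """Return {layer index: (weight shape, bias shape)} for a fully connected network."""
    shapes = {}
    for l in range(1, len(sizes)):
        # Fully connected: every unit in layer l sees every unit in layer l-1.
        shapes[l] = ((sizes[l], sizes[l - 1]), (sizes[l], 1))
    return shapes

shapes = parameter_shapes(layer_sizes)
for l, (w_shape, b_shape) in shapes.items():
    print(f"layer {l}: W{w_shape}, b{b_shape}")

# Total number of trainable parameters (weights plus biases).
total = sum(w[0] * w[1] + b[0] for w, b in shapes.values())
print("trainable parameters:", total)
```

Under these assumptions even this tiny network has 59 trainable parameters, which hints at why deeper and wider networks need more data and compute.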

WHAT HAPPENS INSIDE THE NETWORK?

Each hidden layer inside a Neural Network implements a mathematical function which enables the network to learn certain features from the input coming from the previous layer. Logistic Regression is an example of a small Neural Network. It is built on the logistic (sigmoid) function, an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits. The logistic function and its graph are shown below:

**Fig — 1.3:** Graph showing the equation representing Logistic Regression

**Fig — 1.4:** The logistic function equation is shown above. Source- https://en.wikipedia.org/wiki/Logistic_regression
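As a small illustration, the logistic function σ(z) = 1 / (1 + e^(−z)) can be written in a few lines of Python. This is only a sketch; the use of NumPy and the sample inputs are choices made here, not something the article prescribes.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(sigmoid(z))  # values rise from near 0 to near 1; sigmoid(0) is exactly 0.5
```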

Computations of a Neural Network:

Neural Networks are organized in terms of a forward pass, called forward propagation, and a backward pass, called backward propagation. Forward propagation lets us calculate the loss, while backward propagation lets us update the weights w and biases b by calculating derivatives.

Forward propagation:

Each layer of a Neural Network has two kinds of parameters: weights w and biases b. These parameters are initialized randomly. They define the strength of the connections between the neurons of adjacent layers. Forward propagation of activations from layer l − 1 to layer l is a mapping A^(l−1) → A^l, defined by a matrix multiplication and a summation as follows:

**Fig — 1.6** : Z^l of layer l is the output generated by taking the activation from the previous layer, A^(l-1) , doing a matrix multiplication with weight parameter W^l of layer l and adding the bias b^l to that layer.

The equation used above has the same form as the one used in linear regression. We will understand this better with an example discussed below.

Now that we have calculated the linear output Z^l of layer l, a non-linear activation function such as ReLU, sigmoid, tanh or Leaky ReLU is applied to it, which gives us the final activation output of layer l.

**Fig — 1.7:** Activation output of layer l is A^l. The non-linear function applied to the output Z^l of layer l is denoted by g^l. g^l can be ReLU, sigmoid, Leaky ReLU etc.
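Written out in symbols, the two steps described in the captions of Fig 1.6 and Fig 1.7 can be reconstructed as follows. The convention A^0 = X (the input is treated as the activation of layer 0) is an assumption, consistent with layer l = 0 being the input layer.

```latex
Z^{l} = W^{l} A^{l-1} + b^{l}, \qquad A^{l} = g^{l}\!\left(Z^{l}\right), \qquad l = 1, \dots, L
```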

In the example discussed below for distinguishing between cat and dog images, we have used the sigmoid function (as in logistic regression) as the non-linear function applied to the output Z. We shall discuss the reasons for choosing it while explaining the example below. The image below shows a small Neural Network and the equations used to represent the forward propagation step.

**Fig — 1.8**: Figure showing forward propagation in one layer NN.

In Fig-1.8, there are three inputs x1, x2, x3, which can be described as an input vector X. The first hidden layer contains 3 neurons and the output layer contains only one neuron. Each neuron of the hidden layer has its own weights {w1, w2, w3}, collected in the matrix W (3*3), and its own bias, collected in the vector b = {b1, b2, b3} (3*1). The outputs Z and A are also vectors, containing the outputs corresponding to the input vector X.
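The forward pass through a network of this shape (three inputs, three hidden neurons, one output neuron) might look like the following NumPy sketch. The input values, the small random weights, the zero biases and the use of the sigmoid in both layers are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Input vector X with three features x1, x2, x3 (arbitrary example values).
X = np.array([[0.5], [-1.2], [2.0]])          # shape (3, 1)

# Hidden layer: 3 neurons, each with 3 weights and 1 bias.
W1 = rng.standard_normal((3, 3)) * 0.01       # shape (3, 3)
b1 = np.zeros((3, 1))                         # shape (3, 1)

# Output layer: 1 neuron reading the 3 hidden activations.
W2 = rng.standard_normal((1, 3)) * 0.01       # shape (1, 3)
b2 = np.zeros((1, 1))

Z1 = W1 @ X + b1       # linear step of the hidden layer
A1 = sigmoid(Z1)       # activation of the hidden layer
Z2 = W2 @ A1 + b2      # linear step of the output layer
A2 = sigmoid(Z2)       # network output, a value strictly between 0 and 1

print(Z1.shape, A1.shape, A2.shape)  # (3, 1) (3, 1) (1, 1)
```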

Backward Propagation:

This method calculates the gradient of the loss function with respect to the neural network’s weights and biases. It maps the derivatives from layer l back to layer l − 1 with respect to both activations and weights. These gradients are then used by gradient descent, a practical method for minimizing the loss function.

**Fig — 1.9:** The figure on the left shows a forward propagation step. The figure on the right shows a backward propagation step from the last layer to the layer before it.

Example:

Let us understand the concept of forward and backward propagation with a binary classification problem. Our training set is a set of images of dogs and cats with their corresponding labels Y (0/1) respectively. That is, it combines the images {x_1, x_2, x_3, …, x_n} with their corresponding labels {y_1, y_2, y_3, …, y_n}.

We want to predict the conditional probability P(Y=1|X). That is, given a new image x_(n+1), we want to predict whether Y is 0 or 1.

Data Preparation:

Each color image of a cat or dog has three channels: Red, Green and Blue. Let’s say we have “m” training examples, each of dimension 64*64 pixels. Then each image can be organized as a column vector of dimension 64*64*3 = 12288. Thus all “m” training examples can be organized as a matrix X whose dimension is 12288*m. A pictorial representation of this vectorization is shown in the image below.

**Fig — 2.0:** The top boxes show a 64*64 image with red, green and blue channels. This image has been converted to a column vector. If we have m images, the matrix X will have dimension n_x*m, where n_x = 12288.
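The vectorization step can be sketched in NumPy as follows. Random pixel values stand in for real photos, `m = 5` is a hypothetical number of training images, and scaling the pixels to the range 0 to 1 is a common preprocessing step that the article itself does not mention.

```python
import numpy as np

m = 5  # hypothetical number of training images
rng = np.random.default_rng(0)

# Stand-in for real photos: m random RGB images of 64 x 64 pixels, values 0..255.
images = rng.integers(0, 256, size=(m, 64, 64, 3), dtype=np.uint8)

# Flatten each image into one row of 64*64*3 = 12288 values, then transpose
# so that every image becomes a column of X.
X = images.reshape(m, -1).T

# Scale pixel values to [0, 1] (an assumption, not stated in the article).
X = X / 255.0

print(X.shape)  # (12288, 5)
```

Note that this reshape interleaves the three channel values of each pixel, which may differ from the ordering drawn in the figure. Any fixed ordering works as long as the same one is used for every image.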

Neural Network Synthesis , Forward Propagation:

Our neural network will use the Logistic Regression equation for the forward propagation step. Logistic regression uses the sigmoid function, denoted by the symbol sigma (σ).

**Fig — 2.1:** Diagram showing one layer Neural network with forward propagation and backward propagation.

**Fig — 2.1.1:** Equation for a logistic regression problem.

Why do we use a sigmoid non-linearity?

We don’t want our model to predict a probability below 0 or above 1, because the model is trained with labels having the values 0 and 1. A linear regression can generate very large values and even negative values, which does not make sense for a probability. The sigmoid function solves this problem. In simpler language, we use the sigmoid function to squash the output into the range between 0 and 1.
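A short sketch makes the point visible: the linear output Z can fall far outside the range 0 to 1, while the sigmoid of Z never does. The function name `forward`, the shape convention (features in rows, examples in columns) and the toy numbers are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, X):
    """Forward propagation for logistic regression.

    w: weights, shape (n_x, 1); b: scalar bias; X: inputs, shape (n_x, m).
    Returns the linear output Z and the prediction A = sigmoid(Z), both shape (1, m).
    """
    Z = w.T @ X + b
    A = sigmoid(Z)
    return Z, A

# Toy data: n_x = 2 features, m = 3 examples (made-up numbers).
X = np.array([[1.0, -4.0, 10.0],
              [2.0,  0.5, -3.0]])
w = np.array([[1.5], [-2.0]])
b = 0.5

Z, A = forward(w, b, X)
print(Z)  # linear outputs: can be negative or larger than 1
print(A)  # after the sigmoid: every value lies strictly between 0 and 1
```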

Last step of forward propagation : Cost Function:

Ideally we would want our Neural Network to perform with 100% accuracy, which means that for every new image fed to the network it would generate the correct result. However, this does not happen in real life, and we need to train our parameters w, b to improve the performance of our network. This is where two metrics come in: the loss function and the cost function. The loss function tells us how good the predicted output ŷ is when the true output is y. We always want our loss to be as small as possible. The formula for calculating the loss function is as follows:

**Fig — 2.2**: Loss function for Neural Network

The cost function is the loss averaged over the entire training set. The formula for calculating the cost function is as follows:

**Fig — 2.3**: Cost function for Neural Network
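The two quantities can be sketched in code. This assumes the cross-entropy loss that is standard for logistic regression, L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ)), with the cost J taken as the average loss over the training set. The labels and predictions below are made up for illustration.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for each example; y is 0 or 1, y_hat is in (0, 1)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Cost J: the average of the per-example losses over the training set."""
    return np.mean(loss(y_hat, y))

y     = np.array([1, 0, 1, 0])          # true labels
y_hat = np.array([0.9, 0.2, 0.6, 0.7])  # hypothetical predictions

print(loss(y_hat, y))  # small for the good predictions, largest for the last one
print(cost(y_hat, y))  # one number summarizing all four examples
```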

The ultimate goal is to find parameters w, b that minimize the cost function J(w,b). Thus, during the training stage, after calculating the output from the last layer of the neural network, we calculate the cost J(w,b) to check how well our network has performed.

Backward Propagation : Updating parameters w , b:

Steps showing back-propagation in a Neural Network for a single training example.

Deriving the formulas for the derivatives in back-propagation in a Neural Network for a single training example.

In the images attached above, we can see how we move from the last layer L to the first layer, updating the weights w and the biases b along the way. For other non-linear activation functions like ReLU, softmax etc., the derivative equations will be different.
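For the logistic regression example, assuming a sigmoid output and the cross-entropy loss, the gradients take a compact form: dZ = A − Y, dw = X·dZᵀ / m and db = mean(dZ). The sketch below computes them and compares them with a finite-difference estimate as a sanity check. All names and data values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    A = sigmoid(w.T @ X + b)
    return float(np.mean(-(Y * np.log(A) + (1 - Y) * np.log(1 - A))))

def gradients(w, b, X, Y):
    """Backward pass: derivatives of the cost with respect to w and b."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)   # forward pass, shape (1, m)
    dZ = A - Y                 # dJ/dZ for sigmoid + cross-entropy
    dw = (X @ dZ.T) / m        # shape (n_x, 1), same as w
    db = float(np.sum(dZ)) / m
    return dw, db

# Tiny made-up dataset: 2 features, 4 examples.
X = np.array([[0.5, -1.0,  2.0, 0.0],
              [1.0,  0.3, -0.5, 1.5]])
Y = np.array([[1, 0, 1, 0]])
w = np.array([[0.1], [-0.2]])
b = 0.05

dw, db = gradients(w, b, X, Y)

# Sanity check: compare db with a finite-difference estimate of dJ/db.
eps = 1e-6
db_numeric = (cost(w, b + eps, X, Y) - cost(w, b - eps, X, Y)) / (2 * eps)
print(dw.ravel(), db, db_numeric)
```

If the analytic and numeric values agree to several decimal places, the backward pass is very likely implemented correctly.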

What is this alpha parameter / the learning rate?

The parameter alpha used here for updating the weights and biases is called the learning rate. It is often kept constant, although it can also be decreased over time. Intuitively speaking, it is how quickly a network replaces old beliefs with new ones. When we train our network with a gradient descent algorithm, at each iteration we use back-propagation to calculate the derivative of the loss function with respect to each weight and bias, and subtract it from that weight or bias. However, if the full derivative is subtracted, the weights can change far too much in each iteration, which makes them “overcorrect”, and the loss can actually increase or diverge. So in practice, we multiply each derivative by a small value called the “learning rate” before subtracting it from its corresponding weight.
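The effect of the learning rate can be seen on a toy problem. The sketch below minimizes J(w) = w², a stand-in chosen here because its gradient 2w is easy to follow, using one small and one large value of alpha.

```python
def gradient_descent(alpha, steps=20, w=1.0):
    """Minimize J(w) = w**2 (minimum at w = 0) with a fixed learning rate alpha."""
    for _ in range(steps):
        grad = 2 * w          # dJ/dw
        w = w - alpha * grad  # update rule: step against the gradient
    return w

small = gradient_descent(alpha=0.1)   # each step multiplies w by 0.8, so w shrinks toward 0
large = gradient_descent(alpha=1.1)   # each step multiplies w by -1.2, so |w| grows: divergence

print(abs(small), abs(large))
```

With the small learning rate the parameter settles near the minimum, while with the large one every update overshoots and ends up further away than before.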

Summary Algorithm:

Given an input example and ground truth labels, we perform the following four steps:

1. Forward propagation: propagate the activations through all layers from input to output, reaching a prediction.
2. Compute the loss function: compute the error between the prediction and the ground truth.
3. Back-propagation: use the chain rule to differentiate and calculate the gradients through the layers in the opposite direction, from the output to the input.
4. Update the weights and biases of each layer by the formula:

**Fig — 2.4**: Updating the weights and biases of layer 1. The value alpha is also called as the learning rate.
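Put together, the algorithm above can be sketched as a short training loop for the logistic regression case. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `train`, the zero initialization (adequate for a single-layer model, whereas deeper networks need random initialization), the cross-entropy cost and the one-feature toy data are all choices made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, iterations=1000):
    """Train logistic regression with gradient descent.

    X: inputs of shape (n_x, m); Y: labels of shape (1, m) with values 0 or 1.
    """
    n_x, m = X.shape
    w = np.zeros((n_x, 1))   # zeros are fine for a single-layer model
    b = 0.0
    costs = []
    for _ in range(iterations):
        # 1. Forward propagation
        A = sigmoid(w.T @ X + b)
        # 2. Cost (clipped so that log never sees exactly 0 or 1)
        A_safe = np.clip(A, 1e-12, 1 - 1e-12)
        costs.append(float(np.mean(-(Y * np.log(A_safe) + (1 - Y) * np.log(1 - A_safe)))))
        # 3. Back-propagation
        dZ = A - Y
        dw = (X @ dZ.T) / m
        db = float(np.sum(dZ)) / m
        # 4. Parameter update
        w = w - alpha * dw
        b = b - alpha * db
    return w, b, costs

# Toy data standing in for the image vectors: one feature, label 1 when it is positive.
X = np.array([[-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]])
Y = np.array([[0,    0,    0,    1,   1,   1]])

w, b, costs = train(X, Y)
predictions = (sigmoid(w.T @ X + b) > 0.5).astype(int)
print(costs[0], costs[-1])  # the cost falls as training proceeds
print(predictions)
```

For the cat and dog example, X would instead be the 12288*m matrix of flattened images and Y the row of 0/1 labels.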

Commonly asked questions about Neural Networks:

Thanks to Albert Zenon Fernandez

What’s the difference between a hidden layer and the input/output layers?

In Deep Learning terminology, the first layer is called the input layer because it contains the input data for training the Neural Network. The layers between the input and the output are called the hidden layers. The term hidden is used mainly because the training set does not tell us what the values in the hidden layers should be. We can see what the inputs are and we know what the output should be, but the values in the hidden layers are not seen in the training set. In our example, the single-node layer is called the output layer and is responsible for generating the predicted value ŷ (y hat).

Does every node perform its own calculations?

Yes, every node performs its own calculation: it computes its own entry of the output Z and then applies a non-linear activation function to it. In our example we have used the sigmoid function. This can be understood with the image shown below.

**Fig — 2.5:** Image showing the activation values calculated for each and every neuron.

Why are layers necessary?

That’s a good question actually. The figure below shows a timeline of machine learning from Least Squares to AlphaZero. The timeline is composed of three parts, in which neural network components were (i) invented in the 1960s, (ii) combined in the 1980s, and (iii) applied at scale in the 2010s.

**Fig — 2.6:** Timeline for evolution of perceptron.

The Perceptron was described in 1957 by Rosenblatt. That is how it all started.

Why is it important for the nodes to be fully connected?

I guess it is important for every node to learn from every input variable coming from the previous layer. This helps the network learn more complex features and make better predictions.

Why is it important to know both the loss and the cost?

Loss is calculated over just one training example. For example, if we train our Neural Network with only one image, then the loss and the cost will be the same. In real life we train our network with many images; that is when the cost comes into the picture, because it helps us understand how the network is performing over the entire training set. The cost gives us one single value/score, which is easier to understand than a list of losses for each individual example.

How many hidden layers should we include in our NN?

Honestly, as this is a hyper-parameter, there is no fixed way to determine the number of hidden layers you should choose. One common rule of thumb is to choose a number of hidden neurons between 1 and the number of input variables. You can always use cross-validation to test your architecture. If the model has high variance, you know it is overfitting and you may need to reduce the number of layers and neurons; if it has high bias, it is underfitting and may need more of them. The basic idea for getting the number of neurons right is to cross-validate the model with different configurations.

References:

Deep Learning Specialization from Andrew Ng Coursera.
Professor Iddo Drori’s lecture notes from NYU.
Wikipedia and google images for image resources.
Deep Learning blogs.
Book by Ian Goodfellow and Yoshua Bengio.
Wikipedia for definitions and references.

Hope you guys have now understood what deep learning is really about. If you enjoyed reading this article, please do clap :)