What do we talk about when we talk about Deep Learning?

Hamid Alipour · Published in 1st principles · 6 min read · Aug 7, 2020

This article is part of a series describing the simple principles behind science and tech.

Deep learning is a branch of machine learning that has driven many of the machine learning breakthroughs of the past couple of years, from speech and image recognition to natural language understanding. When we talk about deep learning, we refer to the learning algorithms used to train deep artificial neural networks. While artificial neural networks can model many complex nonlinear systems, at their core they are built from a simple linear unit known as the perceptron. Perceptrons can be considered the building blocks of artificial neural networks. To draw an analogy with biological neural networks, a perceptron in an artificial neural network plays the same role that a neuron plays in a biological one.

Perceptron architecture is inspired by neurons.
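To make this concrete, here is a minimal NumPy sketch of a single perceptron: a weighted sum of the inputs followed by a simple threshold activation. The inputs, weights, and bias below are arbitrary illustrative values, not anything from a real model.

```python
import numpy as np

def perceptron(x, w, b):
    """A single perceptron: a weighted sum of the inputs
    followed by a simple threshold (step) activation."""
    z = np.dot(w, x) + b          # linear combination of inputs
    return 1.0 if z > 0 else 0.0  # "fire" or stay silent

# Illustrative example: two inputs with hand-picked weights.
x = np.array([0.5, -1.2])   # input signals (like dendrite inputs)
w = np.array([0.8, 0.3])    # connection weights
b = 0.1                     # bias term
print(perceptron(x, w, b))  # -> 1.0 for these particular values
```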

A regular artificial neural network is usually composed of a few perceptrons connected in one or two hidden layers. These networks are often called shallow (or simple multi-layer) networks. Deep neural networks, on the other hand, are composed of many perceptrons organized into multiple dense layers.

Regular vs. deep neural network

An advantage of having multiple dense layers is that it enables models to learn more complex functions, provided they are given enough data. Having many perceptrons has also allowed researchers to arrange these nodes in creative ways, producing innovative network structures suited to a variety of application domains.
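As a rough sketch of what "shallow" versus "deep" looks like in code, here are two illustrative networks built with the Keras Sequential API (assuming TensorFlow is installed); the input size, layer counts, and layer widths are arbitrary choices for the example.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A "shallow" network: a single hidden layer of perceptron-like units.
shallow = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# A "deep" network: the same idea stacked into many dense layers,
# which lets the model represent more complex functions
# (given enough training data).
deep = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

shallow.summary()
deep.summary()
```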

But why now? Neural networks, as we know them today, have been around for almost half a century, so why did it take researchers so long to start applying deeper neural networks to real-world applications? The main reason is that training a deep neural network is computationally expensive. In addition, it takes a huge amount of data to train a deep model properly. In fact, if a deep neural network is not given enough training data, it underperforms smaller neural networks and other traditional machine learning methods. It is only when deep networks are properly trained with enough data that they show their magical results.

The recent advancements in computing technologies, especially the advent of powerful GPUs, along with the abundance of data, have made deep learning solutions an excellent candidate for solving some challenging real-world problems such as speech recognition, image recognition, and natural language understanding. An analysis by ARK Invest shows that the cost of AI training is improving at 50 times the pace of Moore's law. This exponential improvement in cost is the main factor that has made training larger and more powerful deep learning models possible.

The cost of training ResNet-50, a deep image classifier, has dropped roughly 10x over the last two years. (https://ark-invest.com/analyst-research/ai-training/)

But what should we consider when designing a deep learning algorithm?

Like any other machine learning algorithm, a deep learning algorithm's goal is to learn the ideal behavior of a complex system. This ideal behavior is unknown; the algorithm is only given some examples of the system's past behavior, known as training data. A machine learning algorithm defines a set of possible candidate functions (a.k.a. the hypothesis set) to represent the ideal behavior of the system (a.k.a. the target function). It then applies a set of learning techniques to find the best candidate in the hypothesis set, based on performance on the provided training data. This selected candidate becomes the learned model of the system's behavior.
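As a toy illustration of this hypothesis set / target function idea, here is a sketch that uses a family of polynomials as the candidate set; the polynomial family and the sine target are my own simplifications for the example, not something specific to deep learning.

```python
import numpy as np

# Toy "system": the target function is unknown to the learner;
# we only see noisy samples of its past behavior (the training data).
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 30)
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(30)

# Hypothesis set: polynomials of degree 1..8, each a candidate
# model for the target function.
candidates = {deg: np.polyfit(x_train, y_train, deg) for deg in range(1, 9)}

# "Learning": pick the candidate with the lowest error on the
# training data (real algorithms also guard against overfitting).
def train_error(coeffs):
    return np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

best_deg = min(candidates, key=lambda d: train_error(candidates[d]))
print("selected hypothesis: polynomial of degree", best_deg)
```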

So, to design a deep learning model, you first need to design the architecture of the neural network. The architecture specifies how many layers the network has, how the nodes are organized in each layer, and how the nodes are connected to one another. Artificial neural networks with different architectures have been applied to a variety of problem domains, and applying a network with the right architecture to the right problem domain is important for achieving good results. In fact, the architecture of a neural network defines the type of candidate models that will be included in the hypothesis set. The learning algorithm will eventually select one candidate model (a.k.a. the final hypothesis) from the hypothesis set to represent the system, so no matter how good your learning algorithm is, if the right candidate is not included in the hypothesis set, the final model will not perform well. So while one type of network architecture might show promising results in a specific problem domain, it might not perform well in others. A lot of deep learning research has been dedicated to finding the best neural network architectures for different problem domains. For example, Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) have shown impressive results on natural language understanding problems, while Convolutional Neural Networks (CNNs) have proven to be great candidates for image recognition.

Popular deep neural network architectures (https://www.asimovinstitute.org/neural-network-zoo/)
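To show how an architecture choice is expressed in practice, here is a hedged Keras sketch of two of the architectures mentioned above. The input shapes, layer sizes, class count, and vocabulary size are all illustrative assumptions, not values from any particular dataset.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A convolutional architecture, typically a good fit for image
# recognition: Conv2D layers encode the assumption that nearby
# pixels are related, which shapes the hypothesis set accordingly.
cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),        # e.g. grayscale images
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # e.g. 10 classes
])

# A recurrent architecture, typically a fit for sequences such as
# text: the LSTM layer carries state across time steps.
rnn = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),  # token IDs
    layers.Embedding(input_dim=10000, output_dim=64),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),
])
```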

Besides the network architecture, you also need to define the learning algorithm used to train the network. The learning algorithm is the process and criteria used to evaluate the candidate models against the training data and find the best-performing candidate to represent the system. As we discussed, the nodes in a neural network are simple perceptrons with a nonlinear activation function applied to their output. The outputs of the perceptrons in each hidden layer are connected to the inputs of the perceptrons in the next layer. In this structure, each perceptron has a set of weights W = (w0, …, wn) that are applied to its inputs; you can think of these values as parameters assigned to the edges of the neural network. The learning algorithm in deep learning is essentially an optimization algorithm: it finds the values of the network parameters that minimize the model's error on the training data.
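Here is a small NumPy sketch of what these per-edge parameters look like in a dense network: one weight matrix and one bias vector per layer, with each layer's output fed through a nonlinear activation into the next. The layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(layer_sizes):
    """One weight matrix W and bias vector b per layer; these are
    the values the optimization algorithm will tune."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((n_out, n_in)) * 0.1  # edge weights
        b = np.zeros(n_out)                           # bias per node
        params.append((W, b))
    return params

def forward(params, x):
    """Forward pass: each layer applies its weights, then a
    nonlinear activation, and feeds the next layer."""
    for W, b in params[:-1]:
        x = np.maximum(0.0, W @ x + b)  # ReLU activation
    W, b = params[-1]
    return W @ x + b                    # linear output layer

params = init_params([4, 8, 8, 1])  # 4 inputs, two hidden layers, 1 output
print(forward(params, np.ones(4)))
```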

Defining the learning algorithm essentially means defining the loss function that measures the model's error, as well as the optimization algorithm used to minimize that loss function over the training examples. Cross-entropy and mean squared error are common loss functions in deep learning, and stochastic gradient descent (SGD) and the Adam optimizer are two optimization algorithms used in many popular deep learning models.
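For concreteness, here are minimal NumPy versions of these two loss functions; they are simplified sketches, not the exact implementations used by any particular framework.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference; common for regression."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Penalizes confident wrong probabilities; common for
    classification. eps guards against log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])   # true labels
y_pred = np.array([0.9, 0.2, 0.7])   # predicted probabilities
print(mean_squared_error(y_true, y_pred))
print(binary_cross_entropy(y_true, y_pred))
```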

During the learning process, the optimization algorithm starts with random or heuristic initial values for the weights. The optimization then happens as a two-step process. In the first step, known as forward propagation, the algorithm passes the training data through the network and computes the loss function. It also caches some intermediate values for each layer, which are needed in the second step, known as backward propagation. In the backward propagation step, the optimization algorithm updates the model's weights so as to reduce the loss. The learning algorithm iterates through this two-step process until it converges to optimized values for the model parameters, which are used as the weights of the final model.
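Putting the pieces together, here is a minimal sketch of this two-step loop for a tiny one-hidden-layer network, written in plain NumPy with full-batch gradient descent (a simplification of SGD) on a toy regression task I made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: learn y = sin(x) from samples.
X = np.linspace(-2, 2, 64).reshape(-1, 1)
Y = np.sin(X)

# Random initial weights for one hidden layer of 16 units.
W1, b1 = rng.standard_normal((1, 16)) * 0.5, np.zeros(16)
W2, b2 = rng.standard_normal((16, 1)) * 0.5, np.zeros(1)
lr = 0.05  # learning rate

for step in range(2000):
    # --- forward propagation: compute the loss, caching the
    # intermediate values each layer will need later.
    z1 = X @ W1 + b1
    h = np.tanh(z1)              # hidden activations (cached)
    y_hat = h @ W2 + b2          # network output
    loss = np.mean((y_hat - Y) ** 2)

    # --- backward propagation: push the error back through the
    # cached values to get a gradient for every weight.
    n = len(X)
    d_yhat = 2 * (y_hat - Y) / n
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    dh = d_yhat @ W2.T
    dz1 = dh * (1 - h ** 2)      # derivative of tanh
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # Gradient-descent update: nudge each weight downhill.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final training loss:", loss)
```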

https://youtu.be/aircAruvnKk?t=975

To summarize, deep learning is a machine learning approach that uses a large amount of data to train a deep neural network to model a complex system. The network uses a large number of nodes, organized in multiple hidden layers, with many parameters. The algorithm defines a loss function and applies an optimization algorithm to minimize that loss over the training data; the optimized parameter values become the weights of the final model.
