Understanding Deep Learning

Neo · Published in LecleVietnam · 13 min read · Jan 26, 2024

Hello everyone,

Recent breakthroughs in Artificial Intelligence (AI) have come from Deep Learning, a subset of AI that utilizes artificial neural networks to process data and perform tasks such as object detection and speech recognition.

Deep Learning (DL) has created a revolution in supporting autonomous vehicles, providing machines with the ability to interpret image content and beyond.

In this article, we will delve deep into Deep Learning (DL), the architecture of deep neural networks, and the latest modern use cases.

I'm Neo, Admin and Community Manager of Optimus Finance and Growth Marketing at LECLE Vietnam.

1. What is Deep Learning (DL)?

Deep Learning (DL) is a subfield of Machine Learning (ML) and encompasses a collection of neural network architectures designed to tackle complex, advanced problems. These architectures (or models) include Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, among many others.

Deep Learning architectures are termed 'deep' because they consist of multiple layers, and this depth is crucial, as you will see in this section. Deep Learning refers to neural network architectures with many layers that can learn (through a training process) to map an input, such as an image, to one or more outputs. When companies grapple with problems involving substantial data, such as speech-to-text conversion and computer vision, data scientists turn to Deep Learning to address business challenges that linear algorithms cannot tackle.

1.1. Origins

Research on Deep Learning (DL) began in the 1960s, but its benefits have become far more evident over the past decade. The first deep neural network with significant performance was developed in 1967 and consisted of 8 layers. It wasn't until Yann LeCun developed a deep neural network for handwritten zip code recognition in 1989 that the power of this new kind of model became clear. His model required three days of training on example images using the standard backpropagation algorithm (a popular and widely used supervised learning algorithm for multilayer neural networks).

Yann LeCun

Research on Deep Learning (DL) has continued to yield groundbreaking results in various domains, from speech recognition to object detection in images and even natural language processing (NLP). New architectures continue to be constructed to address novel and diverse challenges. By the 2000s, the impact and benefits of Deep Learning (DL) had become evident.

1.2. Features and abstract layers

The depth of a deep neural network is crucial because, in the case of Convolutional Neural Networks (CNNs), it provides the foundation for abstraction. Figure 1 gives an overview of a CNN trained to determine whether an image contains a cat. The network is divided into two distinct parts: a set of convolutional layers and a set of classification layers. We will explore these in more detail later; in short, the convolutional layers filter the input and detect features.

Figure 1: Deep neural network encoding features

In the early stages of the network, these features are simple, such as edges. Later in the network, features are combined into higher-level features (such as ears or eyes). This hierarchy of abstractions acts as a high-level feature detector: the classification layers then decide, based on which essential features are present, whether or not the image contains a cat.

1.3. Utilizing GPUs for support

Training Deep Learning (DL) networks is a computationally intensive task, and it has driven advances in hardware. The availability of computational power is a key enabler for applying deep learning models, and this is where Nvidia's GPU, often called the 'workhorse' of deep learning, comes into prominence. Originally designed for high-quality graphics rendering, GPUs quickly displaced CPUs for this workload thanks to their built-in parallel processing capabilities, which allow them to execute enormous numbers of operations simultaneously.

When GPUs were first employed in Deep Learning (DL) architectures, they reduced the training time for complex networks from several weeks to just a few days (performing a billion vector operations per pass). Today, GPUs have evolved into the processing powerhouse for Deep Learning (DL) architectures, and innovations continue to optimize Deep Learning (DL) further.

Here is an overview of the key differences between GPUs and CPUs: GPUs are highly parallel devices consisting of thousands of individual processing units called cores, whereas CPUs typically have only a handful of cores. This means GPUs can perform operations on many vectors concurrently (the fundamental workload of neural network processing), significantly improving the speed at which neural networks can be trained and deployed.
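To make the GPU/CPU contrast concrete, here is a minimal sketch in Python using PyTorch (an assumption on my part; the article does not prescribe a framework). It runs the same large matrix multiplication, the kind of highly parallel vector work neural networks are built from, on whichever device is available:

```python
import time
import torch

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Two large random matrices; multiplying them is exactly the kind of
# massively parallel arithmetic that GPU cores accelerate.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

start = time.time()
c = a @ b                      # thousands of multiply-accumulate operations in parallel
if device.type == "cuda":
    torch.cuda.synchronize()   # wait for the asynchronous GPU kernel to finish
print(f"{device}: {time.time() - start:.4f} s")
```

Running the same script first on a CPU and then on a GPU makes the difference in parallel throughput described above easy to see.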

2. Deep Learning techniques

Deep Learning (DL) architectures are built on the foundations of neural networks: computational structures made up of simple elements whose connections can be adjusted through training and then applied to a wide variety of problems. Let's go back to the origins of Deep Learning and explore the fundamental principles of neural networks.

2.1. From simple nerve cells to multilayer neural networks

A neural network is a network of neurons that perform a mathematical function. An input is fed into the network (typically in the form of a vector), and individual neurons compute their output at each layer until reaching the final output.

This process is called feedforward and represents the network's computation. Each neuron receives inputs from the previous layer, each through a connection with its own weight. Every input is multiplied by its weight, the products are summed, and the sum is passed through an activation function to determine the neuron's output (Figure 2).

Figure 2: A single neuron and its mathematical equivalence

The activation function can take many forms (step, sigmoid, and so on) and is usually chosen based on the type of network and the specific problem. For multilayer networks, the activation function is chosen to introduce non-linearity, which allows them to tackle relatively complex problems with a small number of neurons.
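As a minimal sketch of the single-neuron computation in Figure 2 (plain NumPy; the input values, weights, and bias below are made-up illustrative numbers):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: three inputs from the previous layer,
# one weight per connection, plus a bias term.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.7, -0.2])
b = 0.1

z = np.dot(w, x) + b    # multiply each input by its weight and sum
y = sigmoid(z)          # pass the sum through the activation function
print(y)                # the neuron's output
```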

Figure 3: Multilayer Neural Network (3 Layers)

The network operates straightforwardly: inputs are applied, and then each layer of the network is computed, starting from the input layer. As the input layer provides input to the hidden layer, the hidden layer is computed next. This behavior is called feedforward because the computation process moves forward and does not create cycles or loops within the network.
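A feedforward pass for a small 3-layer network like the one in Figure 3 can be sketched the same way (NumPy; the layer sizes and random weights are illustrative assumptions, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative network: 4 inputs -> 5 hidden neurons -> 2 outputs.
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # input layer  -> hidden layer
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # hidden layer -> output layer

def feedforward(x):
    hidden = sigmoid(W1 @ x + b1)      # the input layer feeds the hidden layer
    return sigmoid(W2 @ hidden + b2)   # the hidden layer feeds the output layer

print(feedforward(np.array([1.0, 0.5, -0.5, 2.0])))
```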

2.2. Training

Each connection in the network has its own weight, and together these weights define the mapping from input to output. Finding suitable weights is the job of the training process, which iterates over the data many times, adjusting the weights until the mapping from input to output reaches an acceptable level of error.

Multilayer networks are typically trained with an algorithm called backpropagation. The algorithm takes a sample from the training data, applies the input, computes the output, and compares it with the expected output (the forward pass). The difference between the actual and expected output is the error. This error is used to adjust the weights of the network, starting from the output layer and moving backward toward the input layer, which is the process that gives backpropagation its name. Over many training samples, backpropagation adjusts the weights so as to minimize the error across the entire training dataset.
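Here is a hand-written sketch of that loop for a tiny network (NumPy, a toy XOR dataset, and arbitrary layer sizes chosen purely for illustration; real systems rely on optimized libraries for this):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy supervised dataset (XOR): inputs and their expected outputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input  -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights
lr = 0.5                                        # learning rate

for epoch in range(5000):
    # Forward pass: compute the network's output for every training sample.
    H = sigmoid(X @ W1 + b1)
    out = sigmoid(H @ W2 + b2)

    # Error at the output layer (difference from the expected output),
    # scaled by the derivative of the sigmoid activation.
    err_out = (out - Y) * out * (1 - out)
    # Propagate the error backward to the hidden layer.
    err_hidden = (err_out @ W2.T) * H * (1 - H)

    # Adjust the weights in proportion to the error they contributed.
    W2 -= lr * H.T @ err_out
    b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hidden
    b1 -= lr * err_hidden.sum(axis=0)

print(out.round(3))   # should approach [0, 1, 1, 0] after training
```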

The process of applying a training sample to the network, evaluating the response, and then adjusting the network accordingly forms the basis of supervised learning. It is called supervised because the training dataset includes the desired behavior of the network.

3. Applications of Deep Learning

In this section, we will explore some of the advanced applications of Deep Learning in use today.

3.1. Image and object recognition

One of the early successful applications of Deep Learning was in handwritten digit recognition for postal codes, applied since the 1990s. The variability in handwriting made this a challenging problem, but not one that Convolutional Neural Networks (CNNs) couldn’t solve. Building on this success, deep neural networks have been applied to object detection and object recognition (identifying a specific person from their face). In 2012, Google trained a deep neural network to recognize cats in YouTube videos with an accuracy of 70%, surpassing the accuracy of previous methods significantly.

One of the neurons in the artificial neural network, trained on unlabeled still frames from YouTube videos, learned how to recognize cats

3.2. Image and video description

When objects can be identified in images and videos, the natural next step is to generate a brief summary (or caption) for an image or video. This requires a combination of methods: a CNN to detect and recognize the objects in the image or video, followed by an LSTM (recurrent) network to generate the sequence of natural-language words describing that input.

To tackle this specific use case, a large amount of data with many annotated images (each image with multiple captions) is needed for the training process. This particular challenge has been addressed through the use of crowd-sourcing, allowing the community to provide annotations for training purposes.

3.3. Speech recognition

Speech recognition has been a long-standing aspiration of Machine Learning since its early days. Unsurprisingly, deep neural networks have propelled this field to new heights, improving both recognition accuracy and the ability to process speech directly rather than through intermediate representations.

Two main approaches applied in automatic speech recognition involve Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), given the sequential nature of speech. Connectionist Temporal Classification (CTC), a time-based classification method, has also been utilized to train both RNNs and LSTM networks, taking into account temporal variations in speech.

3.4. Language translation

In 2014, the first scientific paper on language translation with neural networks appeared. Since then, competitions have emerged, pitting researchers and their algorithms against each other in tackling language translation problems, predominantly relying on Deep Learning solutions.

Currently, a bidirectional RNN model operating on complete sentences serves as the foundation for Google Translate. In this approach, two independent RNNs process the sequence in opposite directions, so the model can draw on context from both earlier and later parts of the input. Solutions of this kind are now referred to as neural machine translation.

Google Translate is a free, multilingual machine translation service developed by Google for translating text.

3.5. Automatic text summarization

Text summarization refers to the ability to generate a short, automatic summary of a longer piece of text. This functionality has been implemented in two ways: extractive, where important sentences are extracted from the original text to create a summary, and abstractive, where a new summary is generated from the original text data.

LSTM networks have been successfully used in this domain within an encoder/decoder model. In this model, the encoder represents an independent LSTM network that takes the input text data, and the decoder LSTM network constructs the summary as an independent sequence. This encoder/decoder architecture proves to be an ideal choice, handling varying lengths of sequences, such as the original text and the output summary.
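A skeletal sketch of such an encoder/decoder pair in PyTorch follows; the vocabulary, embedding, and hidden sizes are placeholder values, and a real summarizer would also need tokenization, an attention mechanism, and a decoding loop:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10_000, 128, 256        # placeholder sizes

class Seq2SeqSummarizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)   # reads the source text
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)   # writes the summary
        self.out = nn.Linear(HID, VOCAB)                      # scores over the vocabulary

    def forward(self, source_ids, summary_ids):
        # Encode the full input sequence and keep only its final (h, c) state.
        _, state = self.encoder(self.embed(source_ids))
        # Decode the summary conditioned on that state (teacher forcing while training).
        dec_out, _ = self.decoder(self.embed(summary_ids), state)
        return self.out(dec_out)          # one vocabulary distribution per summary token

model = Seq2SeqSummarizer()
logits = model(torch.randint(0, VOCAB, (2, 40)),   # a batch of 2 source texts, 40 tokens each
               torch.randint(0, VOCAB, (2, 10)))   # their 10-token target summaries
print(logits.shape)                                # torch.Size([2, 10, 10000])
```

Because the encoder and decoder are separate LSTMs joined only by the hidden state, the source text and the summary are free to have different lengths, which is exactly the property described above.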

3.6. Question and Answer systems

Question and Answer (Q&A) systems represent a classic and thoroughly researched problem, but Deep Learning has elevated this field to new heights in terms of utility. Q&A systems simulate human conversation while providing meaningful answers to questions posed by humans.

Deep neural networks of the recurrent type have been applied in this context. While they require a large amount of training data, these architectures have been successfully applied to the sequence-to-sequence problem of mapping questions to answers.

4. Deep Learning vs Neural Networks

An important distinguishing feature of deep neural networks is their depth (the number of layers) rather than their width (the number of processing units in each layer). Deep Learning has also pushed these networks well beyond conventional multilayer architectures. CNNs, which are effective with image data, sample and pool pixels from images for processing. RNNs, which are ideal for sequential data such as text, process inputs in order and take into account the inputs that came before (and, in bidirectional variants, the inputs that come after).

The structure of neurons has also changed as Deep Learning has progressed. Instead of merely summing weighted products passed through an activation function, neurons in modern DL networks, such as LSTM networks, include gates to control information flow and even forget information.

5. Types of deep neural networks

We have discussed some of the existing types of deep neural networks as well as the problems they have been applied to. Now, let's delve deeper into these architectures to see how they are structured and how each type of network is trained.

5.1. Recurrent Neural Network (RNN)

Recurrent Neural Networks (RNNs) come in various architectural types, but all of them maintain internal state, which means they can be applied to problems in the time domain. As illustrated in Figure 4, a simple feedforward network is augmented with two context neurons that are fed from the hidden layer and feed their values back into it at the next step (an arrangement known as the Elman network). This cycle introduces a simple form of memory into the neural network, making RNNs well suited to predicting time sequences.

Figure 4: Expanded recurrent neural network

To train a Recurrent Neural Network (RNN), a variant of the backpropagation algorithm called backpropagation through time is needed. This is a generalization of the standard backpropagation algorithm commonly used in multi-layer neural networks.
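To make the feedback loop concrete, here is a minimal Elman-style recurrence written directly in NumPy (all sizes and weights are illustrative; the point is only that the hidden state from one step feeds into the next):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 3-dimensional inputs, 5 hidden units, 2 outputs.
W_xh = rng.normal(size=(5, 3)) * 0.1   # input  -> hidden
W_hh = rng.normal(size=(5, 5)) * 0.1   # hidden -> hidden (the recurrent feedback)
W_hy = rng.normal(size=(2, 5)) * 0.1   # hidden -> output

def run_sequence(inputs):
    h = np.zeros(5)                        # the network's internal "memory"
    outputs = []
    for x in inputs:                       # one time step per input vector
        h = np.tanh(W_xh @ x + W_hh @ h)   # the new state depends on the previous state
        outputs.append(W_hy @ h)
    return outputs

sequence = [rng.normal(size=3) for _ in range(4)]   # a toy 4-step sequence
print(run_sequence(sequence)[-1])
```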

5.2. Convolutional Neural Network (CNN)

The CNN is a standout deep neural network architecture for image classification. It operates through a series of convolution and max-pooling operations on the input (see Figure 5). A convolution slides a small filter over regions of the input image and, for each region, computes a value indicating how strongly that filter's feature is present, producing a smaller matrix (a feature map).

This step is performed across the various regions of the input image. The next step, max-pooling, further reduces the dimensions by keeping only the maximum value in each small window of the feature map produced by the convolutional layer. This process repeats over several layers until the classification stage is reached. The final max-pooling layer feeds a fully connected network with one output per class, and this classification layer uses the resulting high-level features to determine the class of the input image.

Figure 5: Convolutional Neural Network

The backpropagation algorithm is commonly used to train a convolutional neural network through supervised learning (adjusting the network’s weights based on classification errors).
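As a hedged sketch of the convolution/max-pooling/classification pipeline described above, here is a small CNN in PyTorch; the layer sizes are arbitrary choices for a single-channel 28x28 input and do not reproduce the exact network in Figure 5:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolution: detect low-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # max-pooling: 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolution: combine into higher-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # max-pooling: 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                    # fully connected classification layer, 10 classes
)

scores = cnn(torch.randn(1, 1, 28, 28))   # one fake grayscale image
print(scores.shape)                       # torch.Size([1, 10]) -> one score per class
```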

5.3. Long Short-Term Memory Network (LSTM)

LSTM networks are RNNs with internal memory. LSTM blocks can be assembled into a unidirectional network (as shown in Figure 6) or a bidirectional one; when the network is unrolled in time, each block passes information to the next time step (the block to its right) and to the layer above it. An LSTM block contains three gates that control the flow of information within it: the input gate controls how much new information enters the block; the forget gate controls when the internal memory is discarded; and the output gate controls how the block's output is computed. The connections between blocks and gates are weighted, so they can be adjusted during training.

Figure 6: Long Short-Term Memory Network and Block

LSTM networks, like other RNNs, are well suited to sequence problems. One of the most interesting applications of LSTM networks is building human-like language descriptions of an image (with the image features supplied by a CNN). LSTM networks can be trained using backpropagation through time with supervised learning.
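The three gates can also be written out directly for a single LSTM block at one time step (NumPy; the weights and sizes below are illustrative, and biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

H, X = 4, 3                                       # hidden size and input size (illustrative)
Wi, Wf, Wo, Wc = (rng.normal(size=(H, H + X)) * 0.1 for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])        # previous block output and current input together
    i = sigmoid(Wi @ z)                    # input gate: how much new information to let in
    f = sigmoid(Wf @ z)                    # forget gate: how much old memory to discard
    o = sigmoid(Wo @ z)                    # output gate: how much of the memory to expose
    c = f * c_prev + i * np.tanh(Wc @ z)   # update the internal memory cell
    h = o * np.tanh(c)                     # compute the block's output
    return h, c

h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H))
print(h)
```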

6. Deep Learning frameworks

Deep Learning frameworks typically provide libraries and tools that can be used throughout the development cycle. For instance, you will find tools for cleaning and preparing your data for training and validation, as well as tools for testing your model in a production environment. These frameworks also offer detailed documentation to help you quickly familiarize yourself with them, enabling you to build and deploy your own Deep Learning solutions.

Here is a table listing some of the key frameworks and the deep learning architectures that can be built using them.
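As one concrete illustration of how much work these frameworks take off your hands, here is a minimal sketch in TensorFlow/Keras that defines, compiles, and trains a small network; the data and layer sizes are placeholder values:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 1,000 samples with 20 features and 3 classes.
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 3, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),   # one probability per class
])

# The framework supplies the training loop, backpropagation, and the optimizer.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
```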

7. Advancements in Deep Learning

Deep Learning has made advances across many business domains. In one critical area, healthcare, CNNs have achieved accuracy comparable to that of board-certified dermatologists in classifying images of skin lesions as benign or malignant. As DL architectures and their methodologies continue to evolve, they will be applied to new challenges and pushed even further.

A significant challenge in DL is the training process. GPUs have reduced the time required to train deep neural networks, but as new problem domains are explored, more complex architectures requiring ever-larger datasets emerge. Hardware continues to be applied to these areas, with GPU innovations increasingly focused on deep learning workloads. At the most basic level, increasing the number of GPU cores and the available memory bandwidth directly benefits deep neural network training. Newer technologies that allow GPUs to communicate directly with one another, rather than through their host server, have also shown advantages for certain training tasks.

Whereas Deep Learning was previously confined to servers with multi-core processors and high-performance GPUs, it is now expanding its reach to small embedded devices such as smartphones. This development is progressing on two fronts: adapting deep neural network algorithms so they fit within the limited resources of embedded environments, and designing custom, more energy-efficient processors for these workloads.

A recent study indicated that training a medium-sized deep neural network for a natural language processing application can produce CO2 emissions comparable to those of several average cars in the United States over their period of use. An important research direction for addressing this issue is transfer learning. Transfer learning means reusing a deep neural network that was previously trained on a similar problem domain. The network does not need to be trained from scratch; instead, it starts from the pretrained weights and is then fine-tuned for the specific problem. This approach has shown great promise in reducing training time as well as the amount of new data required for training.
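A hedged sketch of transfer learning with PyTorch and torchvision (using a pretrained ResNet-18 as an example backbone and an assumed 5-class target task; the weights API shown is that of recent torchvision releases):

```python
import torch.nn as nn
from torchvision import models

# Start from a network already trained on ImageNet instead of training from scratch.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature-extraction layers so they are not retrained.
for param in model.parameters():
    param.requires_grad = False

# Replace only the final classification layer for the new problem domain
# (here, a hypothetical 5-class task) and fine-tune just that part.
model.fc = nn.Linear(model.fc.in_features, 5)
```

Only the small replaced layer (and, optionally, a few of the top layers) is then trained on the new data, which is what makes the approach cheap in both time and data.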

8. Closing thoughts

Deep Learning has moved beyond the concept stage and is being deployed by businesses of all sizes. Deep Learning frameworks are also making these architectures more accessible and widespread. The next wave of innovation in Deep Learning will likely come from next-generation processors and from frameworks capable of exploiting them by analyzing and optimizing 'high-level' machine learning code.

What are your thoughts? If you want to learn more about this topic, don't hesitate to share your ideas with us! 😀

This post is for educational purposes only. The materials are drawn from a variety of reference sources. I hope you enjoy it; please follow us, and feel free to reach out if you would like to exchange information. Cheers! 🍻

#leclevn #leclevietnam #Deeplearning #DL #AI
