From AlexNet to BERT: The simplest review of the most important ideas in deep learning


In this article, Denny Britz reviews the most important ideas in deep learning in chronological order. It is highly recommended for newcomers: nearly every major idea since 2012 is covered, and each of them has underpinned countless follow-up papers. In order, they are:

  • AlexNet and Dropout: AlexNet single-handedly opened the era of deep learning and laid down the basic structure of CNN models in computer vision ever since. Dropout, needless to say, has become standard equipment.
  • Atari with deep reinforcement learning: the pioneering work of deep RL. DQN opened a new direction, and everyone began trying it on all kinds of games.
  • Seq2Seq+Attention: its impact on NLP needs no introduction. For a while it seemed as if any NLP task could be solved with Seq2Seq+Attention, and this work in fact laid the groundwork for the pure-attention Transformer.
  • Adam optimizer: not much to say; training models gained a dependable helper.
  • Generative Adversarial Networks (GANs): a frenzy ever since 2014, with everyone building GAN variant after GAN variant. It was not until last year's StyleGAN(v2) that the rush cooled off. Deepfakes, which have caused plenty of controversy, are one of its products, and people have recently been seen using the technique to fabricate fake material.
  • Residual Networks: like Dropout and Adam, now standard equipment; deep models depend on it.
  • Transformers: the pure-attention model outright replaced LSTMs in NLP, has gradually achieved strong results in other fields as well, and laid the foundation for the later BERT pre-trained models.
  • BERT and fine-tuned NLP models: take a highly scalable Transformer, add a huge amount of data and a simple self-supervised training objective, and you get an extremely powerful pre-trained model that sweeps task after task. The most recent example is GPT-3: since its API was released, all manner of fancy demos have appeared online, essentially auto-completion in every imaginable form.

Here the author reviews ideas that have seen wide use in deep learning after standing the test of time. The list cannot be fully comprehensive, but even so, the techniques introduced below already cover most of the basics needed to understand modern deep learning research. If you are a newcomer to the field, great: this will be a very good starting point for you.

Deep learning is a rapidly changing field, and the flood of research papers and ideas can feel overwhelming. Even experienced researchers are sometimes at a loss to tell genuine breakthroughs from company PR. On the principle that time is the only test of truth, this article reviews studies that have withstood that test: they, or their improvements, have been used again and again in research and applications, with results that are plain for everyone to see.

If you expect to be fully up to speed right after reading this article, you are expecting too much. The best approach is to understand and reproduce the classic papers mentioned below, which will give you a very solid foundation and will also help you understand newer research and develop your own projects later. It also pays to browse the papers in the chronological order given here, to get a sense of where current techniques come from and why they were invented in the first place. In short, this article aims to summarize as briefly as possible while covering most of the basics needed to understand modern deep learning research.

One characteristic of deep learning is that its application fields, including computer vision, natural language, speech, and reinforcement learning, all use similar techniques. Someone who has done deep learning for computer vision, for example, can get up to speed quickly in NLP research: even though the specific network architectures differ somewhat, the concepts, methods, and code carry over. This article introduces research from several different fields, but a few disclaimers are needed before getting into it:

This article does not aim to give in-depth explanations or code examples for the research below, because long and complex papers are hard to compress into a short paragraph. Instead, the author briefly outlines each technique and its historical context, and provides links to papers and implementations. If you really want to learn something, it is best to reproduce the experiments from the papers from scratch in PyTorch, without relying on existing codebases or high-level libraries.

Limited by the author's own knowledge and familiar areas, this list is probably not comprehensive, and many sub-fields worth mentioning are left out. But the mainstream fields most people would recognize, including computer vision, natural language, speech, and reinforcement learning, are all covered.

The author also only discusses research that has runnable official or semi-official open-source implementations. Work that involves enormous engineering effort and is not easily reproduced, such as DeepMind's AlphaGo or OpenAI's Dota 2 AI, is not covered.

Some choices are inevitably somewhat arbitrary, since similar techniques are often published around the same time, and the goal here is not a comprehensive survey but an introduction to research across fields for newcomers. For example, there may be hundreds of GAN variants, but whichever one you want to study, you must first know the basic concepts of GANs.

2012: Processing ImageNet dataset with AlexNet and Dropout

Related papers:

ImageNet Classification with Deep Convolutional Neural Networks [1]:

Improving neural networks by preventing co-adaptation of feature detectors[2]:

One weird trick for parallelizing convolutional neural networks [14]:

Implementation code:

PyTorch version:

TensorFlow version:

Illustration source: [1]

It is generally believed that AlexNet kicked off the recent wave of deep learning and artificial intelligence research. AlexNet is in fact a deep convolutional network building on LeNet, which Yann LeCun had proposed years earlier. What made it special was that by combining the raw power of GPUs with algorithmic improvements, AlexNet achieved a huge leap, far surpassing all other methods for classifying the ImageNet dataset, and proved that neural networks really do work. AlexNet was also one of the earliest models to use Dropout [2], which has since become a key component for improving the generalization of all kinds of deep learning models.

The AlexNet architecture is a series of modules composed of convolutional layers, ReLU nonlinearities, and max pooling, all of which have since been widely adopted as the standard structure of computer vision networks. Today, because libraries like PyTorch are so powerful, AlexNet is very simple compared to the latest architectures and can be implemented in a few lines of code. Note that many implementations of AlexNet actually use a variant incorporating a trick from the paper One weird trick for parallelizing convolutional neural networks.
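To make the "few lines of code" point concrete, here is a minimal sketch of one AlexNet-style module (convolution, ReLU, max pooling) in PyTorch. The layer sizes match the first block of the common torchvision-style implementation, but this is an illustration of the pattern, not the full network.

```python
import torch
import torch.nn as nn

# One AlexNet-style block: conv -> ReLU -> max pool.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.randn(1, 3, 224, 224)   # one RGB image at ImageNet resolution
y = block(x)                      # -> shape (1, 64, 27, 27)
```

The full AlexNet simply stacks several such blocks, followed by fully connected layers with Dropout in between.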

2013: Playing Atari games with deep reinforcement learning

Related papers:

Playing Atari with Deep Reinforcement Learning [7]:

Implementation code:

PyTorch version:

TensorFlow version:

Illustration source:

Building on recent advances in image recognition and GPUs, DeepMind successfully trained a neural network to play Atari games from raw pixel input. Moreover, the same network learned to play seven different games without being given any game-specific rules, demonstrating the generality of the approach.

Youtube video:

Reinforcement learning differs from supervised learning (such as image classification) in that the agent must learn to maximize the sum of rewards over a period of time (such as a game), not just predict a label. Because the agent interacts directly with the environment and each action affects the next state, the training data is not independent and identically distributed. This makes training many reinforcement learning models very unstable, but the problem can be mitigated by techniques such as experience replay.
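The experience replay idea mentioned above can be sketched in a few lines of plain Python: store transitions in a bounded buffer and train on random minibatches drawn from it, which breaks the temporal correlation in the agent's experience. This is a minimal illustration, not DeepMind's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (state, action, reward, next_state, done) transitions and
    sample random minibatches from them for training."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates consecutive experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

In a DQN-style loop, the agent pushes every transition it observes and periodically samples a batch to compute the temporal-difference loss.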

Although there was no obvious algorithmic innovation, this work cleverly combined existing techniques, such as training convolutional networks on GPUs and experience replay, with some data-processing tricks, and achieved impressive results that exceeded almost everyone's expectations. This gave people the confidence to scale deep reinforcement learning up to more complex tasks such as Go, Dota 2, and StarCraft 2.

Ever since this paper, Atari games have been a standard benchmark for reinforcement learning research. The original method surpassed human performance in only 7 of the games, but over the following years these ideas were extended to beat humans in more and more of them. Only recently has the technology conquered all 57 games at super-human level; among them, Montezuma's Revenge, notorious for requiring long-term planning, was considered one of the hardest to crack.

2014: Encoder-decoder network plus attention mechanism (Seq2Seq+Atten model)

Related papers:

Sequence to Sequence Learning with Neural Networks [4]:

Neural Machine Translation by Jointly Learning to Align and Translate [3]:


PyTorch version:

TensorFlow version:

Illustration source: the open-source Seq2Seq framework in TensorFlow:

Many of deep learning's most impressive early results were on vision-related tasks and driven by convolutional neural networks. Although NLP had some success with LSTMs and encoder-decoder architectures for language modeling and translation, it was not until the emergence of the attention mechanism that the field achieved truly remarkable results.

When processing language, each token (which can be a character, a word, or something in between) is fed into a recurrent network (such as an LSTM) that keeps a memory of previously processed inputs. In other words, a sentence is treated like a time series, with each token a time step. These recurrent models easily "forget" earlier inputs as they process a sequence, so long-range dependencies are hard to handle. And because gradients must be propagated through many time steps, causing exploding and vanishing gradients, recurrent models are difficult to optimize with gradient descent.

The introduction of the attention mechanism helps alleviate this problem. By adding direct connections, it gives the network an adaptive way to "look back" at earlier time steps. These connections let the network decide which inputs matter when producing a particular output. Translation is the simple example: when generating an output word, the attention mechanism typically selects one or more specific input words as the reference.
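The core computation is small enough to sketch in plain Python: score each encoder state against the current decoder state, normalize the scores with a softmax, and take the weighted sum as the context vector. Real implementations use learned projections and batched tensor operations; this minimal dot-product version just shows the mechanism.

```python
import math

def softmax(xs):
    m = max(xs)                                   # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Dot-product attention: weights say which inputs (keys) matter for
    this output step; the context is the weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights
```

A query that aligns with one particular key receives the largest weight, which is exactly the "one or more specific input words are selected" behavior described above.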

2014: Adam optimizer

Related papers:

Adam: A Method for Stochastic Optimization [12]:


Python version:

PyTorch version:

Figure: probability of finding the optimal solution (y-axis) versus the budget for hyperparameter optimization, i.e. the number of model trainings (x-axis).

Neural networks are generally trained by using an optimizer to minimize a loss function; the optimizer's job is to figure out how to adjust the network's parameters so that it learns the specified objective. Most optimizers are improvements on stochastic gradient descent (SGD). But note that many optimizers themselves have tunable parameters, such as the learning rate. Finding the right settings for a given problem not only shortens training time but can also find a better local optimum of the loss, often yielding a better model.

Previously, deep-pocketed research labs would run expensive hyperparameter searches to come up with learning-rate schedules for SGD. That could exceed previous state-of-the-art results, but it usually meant spending a lot of money tuning the optimizer. Such details are rarely spelled out in papers, so researchers without the budget to tune optimizers the same way were stuck with worse results, with nothing to be done about it.

Adam brought good news to those researchers: it automatically adapts the learning rate using the first and second moments of the gradient, and experiments showed it to be very reliable and not very sensitive to hyperparameter choices. In other words, Adam works out of the box, without the extensive tuning other optimizers need. A well-tuned SGD may still get better results, but Adam makes research easier: when something goes wrong, you know it is unlikely to be a badly tuned optimizer.
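The "first and second moments of the gradient" are exactly two exponential moving averages. Here is a plain-Python sketch of the Adam update rule from the paper, applied to the toy problem of minimizing f(x) = x^2 (the learning rate and step count here are illustrative):

```python
def adam_step(param, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad     # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the
    v_hat = v / (1 - beta2 ** t)                  # zero-initialized averages
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v

# Minimize f(x) = x^2 starting from x = 5; gradient is 2x.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):                          # t is 1-indexed for bias correction
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

Dividing by the square root of the second moment is what makes the effective step size per parameter self-adjusting, which is why Adam is so insensitive to the global learning rate.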

2014/2015: Generative Adversarial Networks (GANs)

Related papers:

Generative Adversarial Networks [6]:

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [17]:


PyTorch version:

TensorFlow version:

Figure 2: visualization of model samples. The rightmost column shows the nearest neighboring training examples, to demonstrate that the model has not memorized the training set. Samples were drawn at random, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distribution, not conditional means given samples of hidden units. Moreover, the samples are uncorrelated, because the sampling process does not rely on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and deconvolutional generator)


The goal of generative models (such as the Variational Autoencoder, VAE) is to generate data samples that look real, such as faces of people who do not exist. Here the model must capture the entire data distribution (a lot of pixels!) rather than just classifying cats versus dogs like a discriminative model, which makes such models hard to train. The Generative Adversarial Network (GAN) is one such model.

The basic idea of a GAN is to train two networks simultaneously, a generator and a discriminator. The generator's goal is to produce samples that fool the discriminator, while the discriminator is trained to tell real images from generated ones. As training proceeds, the discriminator gets better at spotting fakes, and the generator gets better at fooling the discriminator with ever more realistic samples; that is where the "adversarial" comes from. Early GANs produced blurry, low-resolution images and were quite unstable to train, but as the technique matured, variants and improvements such as DCGAN [17], Wasserstein GAN [25], CycleGAN [26], and StyleGAN(v2) [27] have produced photorealistic high-resolution images and videos.
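The alternating two-network training described above can be sketched as one step of each update in PyTorch. This is a toy: the generator and discriminator are tiny MLPs on 2-D data, and all sizes and learning rates here are illustrative, not from the papers.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 16), nn.ReLU(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, data_dim) + 3.0            # stand-in "real" data

# Discriminator step: push real toward label 1, generated toward label 0.
fake = G(torch.randn(32, latent_dim)).detach()    # detach: don't update G here
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make the discriminator call fakes real (label 1).
fake = G(torch.randn(32, latent_dim))
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

A full training run just repeats these two steps over many batches; most of the later GAN variants change the loss or the architecture, not this basic loop.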

2015: Residual Networks (ResNet)

Related papers:

Deep Residual Learning for Image Recognition [13]:


PyTorch version:

Building on AlexNet, researchers invented better-performing convolutional architectures such as VGGNet [28] and Inception [29], and ResNet was the most important breakthrough in this line of progress. Today, ResNet variants serve as baseline model architectures for all kinds of tasks, as well as the backbone of more complex architectures.

What makes ResNet special, beyond winning the ILSVRC 2015 classification challenge, is its depth compared with other architectures. The deepest network in the paper has 1,000 layers; although slightly worse on the benchmark task than the 101- and 152-layer versions, it still performs well. Training such deep networks is actually very challenging because of the vanishing gradient problem, which sequence models share, and before this, few researchers believed training networks this deep could be so stable.

ResNet uses shortcut connections to help gradients flow. One way to understand it is that each ResNet layer only needs to learn the residual, the "difference" from one layer to the next, which is easier than learning a full transformation. The residual connections in ResNet are also a special case of Highway Networks [30], which in turn were inspired by the gating mechanism in LSTMs.
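The "learn only the difference" idea is visible directly in the code: the block's output is the input plus a learned residual. A minimal sketch of a ResNet-style basic block in PyTorch (channel counts illustrative; the paper's blocks also use batch normalization, omitted here for brevity):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Output = ReLU(F(x) + x): the layers only model the residual F(x),
    and the '+ x' shortcut gives gradients a direct path backward."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # shortcut connection

block = ResidualBlock(8)
y = block(torch.randn(2, 8, 16, 16))   # shape is preserved: (2, 8, 16, 16)
```

Because the shortcut is an identity, stacking many such blocks cannot make the signal path worse than the identity mapping, which is one intuition for why very deep ResNets remain trainable.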

2017: Transformers
Related papers:

Attention is All You Need [5]:


PyTorch version:

TensorFlow version:

Transformers library:

Figure 1: Transformer-model architecture


The Seq2Seq+Attention models described earlier work well, but their recurrent nature means they must be computed in sequence order: only one step can be processed at a time, and each step depends on the previous one. This makes them hard to parallelize and hard to use on long sequences. Even with attention, they still struggle to model complex long-range dependencies, since most of the work happens in the recurrent layers.

Transformers tackle these problems head-on by dropping the recurrence entirely and replacing it with multiple feed-forward self-attention layers that process all inputs in parallel and create relatively short paths between inputs and outputs (easy to optimize with gradient descent). This makes them very fast to train, easy to scale, and able to handle much more data. To inject input position information (which is implicit in recurrent models), Transformers also use positional encodings. To learn more about how the Transformer works, I recommend reading this illustrated blog.
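The positional encodings mentioned above have a closed form in the paper: interleaved sines and cosines at geometrically spaced frequencies. A plain-Python sketch of that formula (real implementations compute this as one tensor operation):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)   # one row per position
```

Each position gets a distinct pattern, and because the frequencies are fixed, the encoding of position pos + k is a fixed linear function of the encoding of pos, which makes relative offsets easy for attention to pick up.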


To say merely that Transformers performed better than almost anyone expected would be an understatement. Over the next few years they not only performed better but outright displaced RNNs, becoming the standard architecture for most NLP and other sequence tasks, and even finding use in computer vision.

2018: BERT and fine-tuned NLP models

Related papers:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [9]:


HuggingFace implementation for fine-tuning BERT:

Pre-training means training a model on one task and then using the learned parameters to initialize training on a related task. This is quite intuitive: a model that has learned to classify images of cats and dogs should have picked up some general knowledge about images and furry animals, so when that model is fine-tuned to classify foxes, we can expect it to do better than a model trained from scratch. Likewise, a model that has learned to predict the next word in a sentence should have learned something about human language in general, so its parameters should be a good initialization for related tasks such as translation or sentiment analysis.

Pre-training plus fine-tuning has succeeded in both computer vision and NLP. While it has long been standard in computer vision, making it work well in NLP seemed to face some challenges, and most state-of-the-art results still came from fully supervised models. With the arrival of methods such as ELMo [34] and ULMFiT [35], NLP researchers could finally do serious pre-training (word vectors arguably counted as an earlier form), and the application of Transformers in particular produced a series of methods such as GPT and BERT.

BERT is a relatively recent pre-training result, and many believe it opened a new era of NLP research. Rather than being trained to predict the next word like most pre-trained models, it predicts words that have been masked (deliberately removed) from a sentence, and whether two sentences are adjacent. Note that these tasks require no annotated data: BERT can be trained on any text, and on a lot of it! The pre-trained model thus learns general properties of language and can then be fine-tuned to solve supervised tasks such as question answering or sentiment prediction. BERT performed extremely well across a range of tasks and topped the leaderboards on release. Companies like HuggingFace rode the wave, making fine-tunable BERT models for NLP tasks easy to download and use. BERT has since been extended and surpassed by newer models such as XLNet [31], RoBERTa [32], and ALBERT [33], and by now practically everyone in the field knows it.
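The "predict the masked words" objective needs no labels precisely because the labels are manufactured from the text itself. A plain-Python sketch of how masked-LM training examples are made (simplified: BERT additionally replaces some selected tokens with random words or leaves them unchanged, which is omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a fraction of tokens; the model must predict the originals.
    Unmasked positions get no label and contribute no loss."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)        # target: recover the original token
        else:
            masked.append(tok)
            labels.append(None)       # ignored by the loss
    return masked, labels
```

Run over billions of sentences, this turns raw text into an endless supply of supervised examples, which is what lets BERT learn general language properties before any task-specific fine-tuning.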

2019/2020 and beyond: big language models and self-supervised learning?

Looking across the whole history of deep learning, perhaps the clearest trend is what Sutton called the "bitter lesson": algorithms that can exploit more parallel computation (and more data) and more model parameters beat so-called "smarter techniques" time and time again. This trend seems to continue into 2020. OpenAI's GPT-3, a huge language model with 175 billion parameters, shows unexpectedly strong generalization despite its simple training objective and architecture, as all kinds of impressive demos attest.

Similar trends appear in contrastive self-supervised learning and related methods, such as SimCLR, which make better use of unlabeled data. As models get larger and training gets faster and faster, techniques that can effectively exploit the huge amounts of unlabeled data on the web and transfer general knowledge to other tasks are becoming more and more valuable.

Related reports: