Research Papers in Artificial Intelligence

A History of AI (Part 3)

2010 to 2014 (AlexNet to Adam)

Nuwan I. Senaratna
On Technology

--

This article is the third in a series in which I present a history of Artificial Intelligence by reviewing the most important research papers in the field.

AlexNet

ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012)

We trained a large, deep convolutional neural network to classify the 1.3 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 39.7% and 18.9% which is considerably better than the previous state-of-the-art results. The neural network, which has 60 million parameters and 500,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and two globally connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the globally connected layers we employed a new regularization method that proved to be very effective.

A large, deep convolutional neural network was trained to classify 1.3 million high-resolution images from the ImageNet dataset into 1000 different categories. The network, which includes 60 million parameters and 500,000 neurons, consists of five convolutional layers, some with max-pooling, followed by two fully connected layers and a final 1000-way softmax. The network achieved significantly lower error rates compared to previous methods by using non-saturating neurons, efficient GPU implementations, and a new regularization technique to reduce overfitting.

This paper demonstrated the potential of deep convolutional neural networks to achieve unprecedented accuracy in image classification, significantly advancing the field of computer vision and influencing a great deal of subsequent AI research and applications.

  • Deep convolutional neural network: A type of artificial neural network specifically designed to process structured grid data, such as images, by using multiple layers to automatically learn and extract features.
  • LSVRC-2010 ImageNet training set: A large dataset containing 1.3 million labeled high-resolution images used for training machine learning models to recognize various objects across 1000 different categories.
  • Top-1 error rate: The percentage of test images for which the correct label was not the most probable predicted label.
  • Max-pooling layers: Layers that reduce the spatial dimensions of the data by selecting the maximum value from a group of neighboring pixels, making the model more efficient and less sensitive to small changes.
  • Softmax: A mathematical function that converts the output of the network into probabilities for each class, summing to 100%.
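
To make the softmax and top-1/top-5 error terms concrete, here is a minimal NumPy sketch. It is my own illustration, not code from the paper (which used a custom GPU implementation and 1000 classes rather than the toy 10 used here): it converts raw network outputs into class probabilities and checks whether the true label appears among the top predictions.

    import numpy as np

    def softmax(logits):
        # Subtract the max for numerical stability, then normalize to probabilities.
        z = logits - logits.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def top_k_error(logits, true_labels, k):
        # A prediction counts as correct if the true label is among the k most probable classes.
        top_k = np.argsort(-logits, axis=1)[:, :k]
        correct = np.any(top_k == true_labels[:, None], axis=1)
        return 1.0 - correct.mean()

    # Toy example: 4 images, 10 classes (the real network predicts over 1000 classes).
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 10))
    labels = np.array([3, 1, 7, 0])
    probs = softmax(logits)
    print("probabilities sum to:", probs.sum(axis=1))        # ~1.0 per image
    print("top-1 error:", top_k_error(logits, labels, k=1))
    print("top-5 error:", top_k_error(logits, labels, k=5))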

Word Representations in Vector Space

Efficient Estimation of Word Representations in Vector Space, by Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013)

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

The paper introduces two innovative model architectures for generating continuous vector representations of words from large datasets. These models demonstrate superior performance in word similarity tasks compared to previous methods, achieving high accuracy with significantly reduced computational costs. The study also shows that the generated word vectors excel in syntactic and semantic similarity tests, highlighting their practical utility.

This paper is crucial because it significantly advanced the field of natural language processing by introducing efficient methods for word embeddings, which have become foundational in many modern AI applications, including language models, search engines, and recommendation systems.

  • Continuous vector representations: A way to represent words as vectors (points) in a multi-dimensional space, allowing computers to understand and manipulate the meanings of words more effectively.
  • Word similarity task: A test to determine how well the model understands the similarity between different words, akin to how humans perceive word relationships.
  • Word embeddings: Another term for continuous vector representations, where words are represented as vectors in a way that captures their meanings and relationships.
  • Syntactic similarity: The degree to which words are related in terms of grammar and sentence structure.
  • Semantic similarity: The degree to which words are related in terms of meaning and context.
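
As a rough illustration of what continuous vector representations make possible, the sketch below uses tiny, hand-made toy vectors (not vectors learned by the paper's models, which are trained on billions of words and have hundreds of dimensions) to show how word similarity and simple analogies can be read off with cosine similarity.

    import numpy as np

    # Toy 3-dimensional word vectors, invented purely for illustration.
    vectors = {
        "king":   np.array([0.9, 0.8, 0.1]),
        "queen":  np.array([0.9, 0.1, 0.8]),
        "man":    np.array([0.1, 0.9, 0.1]),
        "woman":  np.array([0.1, 0.1, 0.9]),
        "prince": np.array([0.8, 0.7, 0.2]),
    }

    def cosine(a, b):
        # Cosine similarity: 1.0 means the vectors point in the same direction.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # The well-known analogy test: vector("king") - vector("man") + vector("woman") ≈ vector("queen").
    target = vectors["king"] - vectors["man"] + vectors["woman"]
    best = max((w for w in vectors if w not in ("king", "man", "woman")),
               key=lambda w: cosine(target, vectors[w]))
    print("king - man + woman ≈", best)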

Variational Autoencoders

Auto-Encoding Variational Bayes, by Diederik P. Kingma, Max Welling (2013)

How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.

This paper introduces a novel stochastic variational inference and learning algorithm designed for efficient inference in probabilistic models with continuous latent variables and large datasets. The approach involves reparameterizing the variational lower bound to enable optimization via standard stochastic gradient methods and using an approximate inference model to fit the intractable posterior distribution, making posterior inference more efficient.

This paper is significant as it provided a scalable and efficient method for variational inference in complex models, influencing subsequent developments in machine learning, particularly in the training of generative models like Variational Autoencoders (VAEs).

  • Directed Probabilistic Models: Models that use directed graphs to represent variables and their conditional dependencies.
  • Continuous Latent Variables: Hidden variables in a model that can take any value within a continuous range.
  • Intractable Posterior Distributions: Probability distributions that are too complex to compute exactly.
  • Stochastic Variational Inference: A method that uses randomness and optimization techniques to approximate complex probability distributions.
  • Reparameterization of the Variational Lower Bound: A technique that transforms a complex problem into a simpler form that can be optimized more easily.
  • Standard Stochastic Gradient Methods: Common optimization techniques that use random samples to find the minimum or maximum of a function.
  • Approximate Inference Model (Recognition Model): A model used to estimate the intractable posterior distribution in a more computationally manageable way.
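
The reparameterization trick at the heart of the paper can be shown in a few lines. The sketch below is illustrative only: it assumes a hypothetical encoder (recognition model) has already produced a mean and log-variance for one datapoint, then draws one latent sample and computes the closed-form KL term of the variational lower bound for a standard Gaussian prior.

    import numpy as np

    rng = np.random.default_rng(0)

    # Outputs of a hypothetical encoder: mean and log-variance of the approximate posterior q(z|x).
    mu = np.array([0.5, -1.0])
    log_var = np.array([-0.2, 0.1])
    sigma = np.exp(0.5 * log_var)

    # Reparameterization trick: instead of sampling z ~ N(mu, sigma^2) directly,
    # sample noise eps ~ N(0, I) and compute z deterministically from mu and sigma.
    # The randomness is now external, so gradients can flow through mu and sigma.
    eps = rng.standard_normal(mu.shape)
    z = mu + sigma * eps

    # Closed-form KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I).
    # This is one of the two terms in the variational lower bound; the other is the
    # expected reconstruction log-likelihood, which depends on the decoder.
    kl = 0.5 * np.sum(mu**2 + sigma**2 - log_var - 1.0)
    print("sample z:", z)
    print("KL term:", kl)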

Generative Adversarial Networks

Generative Adversarial Nets, by Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (2014)

We propose a new framework for estimating generative models via adversarial nets, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

This paper introduces a novel framework for estimating generative models through adversarial networks, where a generative model (G) learns to produce data similar to the training set and a discriminative model (D) learns to distinguish between real and generated data. The generative model is trained to maximize the likelihood of the discriminative model making incorrect classifications, effectively creating a minimax game. By using multilayer perceptrons and backpropagation, the system can train without complex inference methods, showcasing its effectiveness through various experiments.

This paper is crucial as it laid the foundation for Generative Adversarial Networks (GANs), significantly advancing the field of AI by enabling the generation of highly realistic synthetic data and inspiring a plethora of research and applications in image synthesis, data augmentation, and more.

  • Generative model (G): A system designed to create data that mimics a given dataset.
  • Discriminative model (D): A system trained to differentiate between real data and data generated by the generative model.
  • Minimax game: A strategic game where one player’s gain is another player’s loss, and both aim to minimize their maximum possible loss.
  • Approximate inference networks: Methods used to estimate complex probabilities, often required in traditional generative models but not in GANs.
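
The adversarial training loop can be sketched compactly. The example below is my own toy illustration in PyTorch (a library that postdates the paper), with a 1-D Gaussian standing in for the data distribution and small multilayer perceptrons for G and D; it uses the non-saturating generator loss the authors recommend in practice.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # G maps noise to 1-D samples; D outputs the probability that a sample is real.
    G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
    opt_g = torch.optim.SGD(G.parameters(), lr=0.05)
    opt_d = torch.optim.SGD(D.parameters(), lr=0.05)
    bce = nn.BCELoss()

    for step in range(200):
        real = torch.randn(64, 1) * 0.5 + 2.0       # samples from the "data distribution"
        noise = torch.randn(64, 4)
        fake = G(noise)

        # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
        opt_d.zero_grad()
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        d_loss.backward()
        opt_d.step()

        # Generator step: train G so that D classifies its samples as real,
        # i.e. maximize the probability of D making a mistake.
        opt_g.zero_grad()
        g_loss = bce(D(G(noise)), torch.ones(64, 1))
        g_loss.backward()
        opt_g.step()

    print("mean of generated samples:", G(torch.randn(1000, 4)).mean().item())  # drifts toward 2.0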

Dropout

Dropout: A Simple Way to Prevent Neural Networks from Overfitting, by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov (2014)

Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different thinned networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Dropout is a technique designed to prevent overfitting in deep neural networks by randomly dropping units and their connections during training, which forces the network to learn more robust features. At test time, the effect of averaging multiple thinned networks is approximated using a single network with adjusted weights, leading to significant improvements in performance across various supervised learning tasks.

This paper introduced a simple yet powerful regularization method that significantly enhanced the robustness and generalization of neural networks, profoundly impacting the development and performance of AI models in various domains.

  • Dropout: A method where random neurons in the network are ignored during training, forcing the remaining neurons to learn better.
  • Overfitting: A situation where a model learns the training data too well, including noise and details, which reduces its performance on new data.
  • Thinned Networks: Variants of the original neural network where some neurons and connections are randomly removed during training.
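
The mechanics are simple enough to show directly. Below is a minimal NumPy sketch of the scheme described in the paper: units are kept with probability p during training, sampling a different thinned network on every pass, and at test time nothing is dropped but activations (equivalently, outgoing weights) are scaled by p so their expected value matches training.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_train(activations, p_keep):
        # During training, each unit is kept with probability p_keep and dropped otherwise,
        # which samples one "thinned" network per forward pass.
        mask = rng.random(activations.shape) < p_keep
        return activations * mask

    def dropout_test(activations, p_keep):
        # At test time no units are dropped; activations are scaled by p_keep
        # to approximate averaging over all the thinned networks.
        return activations * p_keep

    h = np.array([1.0, 2.0, 3.0, 4.0])
    print("training pass :", dropout_train(h, p_keep=0.5))  # a random thinned version of h
    print("test-time pass:", dropout_test(h, p_keep=0.5))   # deterministic, scaled activations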

Long Short-Term Memory

Sequence to Sequence Learning with Neural Networks, by Ilya Sutskever, Oriol Vinyals, Quoc V. Le (2014)

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.

The paper introduces an end-to-end approach to sequence learning using multilayered Long Short-Term Memory (LSTM) networks. It demonstrates how these LSTM networks can map input sequences to fixed-dimensional vectors and then decode these vectors into target sequences, achieving significant performance in English to French translation tasks and surpassing traditional phrase-based statistical machine translation (SMT) systems.

This paper is important because it demonstrated the effectiveness of LSTM networks in handling sequence-to-sequence tasks, paving the way for advancements in neural machine translation and other sequence-based AI applications.

  • Sequence Learning: Teaching a model to understand and predict ordered data points, like sentences in a language.
  • Multilayered Long Short-Term Memory (LSTM): A type of artificial neural network designed to remember information over long periods, making it effective for tasks involving sequences.
  • Vector of Fixed Dimensionality: A list of numbers with a set length that represents complex data in a simplified form.
  • BLEU Score: A metric for evaluating the quality of text generated by a machine, compared to human-written text.
  • Phrase-Based Statistical Machine Translation (SMT): An older method of translating text using statistical models of phrases.
  • Hypotheses Reranking: Improving translation quality by re-scoring the candidate translations produced by another system and selecting a better one.
  • Word Order Sensitivity: The model’s ability to recognize the correct sequence of words.
  • Reversing Source Sentence Order: A technique that improves model performance by changing the order of words in the input sentence, making the learning process easier.
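
A minimal sketch of the encode-then-decode shape is shown below, using PyTorch (not the authors' implementation) with toy sizes and toy word ids of my own choosing; it reverses the source sentence as the paper recommends and feeds the gold target prefix to the decoder, as is done during training.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    vocab_size, emb_dim, hidden_dim = 32, 16, 32   # toy sizes, for illustration only

    embed = nn.Embedding(vocab_size, emb_dim)
    encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
    decoder = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
    project = nn.Linear(hidden_dim, vocab_size)    # maps decoder states to word scores

    source = torch.tensor([[5, 9, 2, 7]])          # a toy source "sentence" of word ids
    target = torch.tensor([[3, 8, 1]])             # a toy target prefix fed to the decoder

    # Reverse the source sentence, as the paper found this markedly improves learning.
    reversed_source = torch.flip(source, dims=[1])

    # The encoder compresses the whole source sentence into its final hidden state:
    # a vector of fixed dimensionality, regardless of sentence length.
    _, (h, c) = encoder(embed(reversed_source))

    # The decoder starts from that state and predicts the target one word at a time.
    decoder_out, _ = decoder(embed(target), (h, c))
    next_word_scores = project(decoder_out)
    print(next_word_scores.shape)                  # (1, 3, vocab_size): scores for each target position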

Soft-Align

Neural Machine Translation by Jointly Learning to Align and Translate, by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2014)

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

This paper presents a novel approach to neural machine translation (NMT) that addresses the limitations of traditional encoder-decoder models by introducing a mechanism for the model to automatically focus on relevant parts of a source sentence when generating translations. This soft alignment method enhances translation performance by enabling the model to dynamically identify and utilize pertinent information, leading to more accurate and intuitive translations compared to previous methods.

This paper is significant because it introduced the concept of attention mechanisms in neural networks, which revolutionized not only machine translation but also numerous other areas in artificial intelligence by improving the ability of models to process and generate sequences of data.

  • Neural Machine Translation (NMT): A method for translating languages using neural networks, which can learn to translate text end-to-end.
  • Encoder-Decoder Model: A type of neural network architecture where an encoder processes the input and a decoder generates the output.
  • Fixed-Length Vector: A compressed representation of the input sentence in a fixed size, which limits the amount of information the model can use.
  • Soft Alignment: A technique where the model dynamically focuses on different parts of the input sentence when generating each word of the translation, rather than relying on a fixed representation.
  • State-of-the-Art Phrase-Based System: The best-performing translation systems before neural methods, which relied on breaking down sentences into phrases and translating them.
  • Soft-Search Mechanism: A method that allows the model to selectively attend to relevant parts of the input without explicitly dividing the sentence into segments.
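
The soft alignment itself is a small computation. The NumPy sketch below is an illustration with random toy weights (in the real model these weights are learned jointly with the rest of the network): each encoder state is scored against the current decoder state, the scores are normalized with a softmax, and the resulting weights combine the encoder states into a context vector.

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_dim, attn_dim, source_len = 8, 6, 5     # toy sizes for illustration

    h = rng.normal(size=(source_len, hidden_dim))  # encoder states, one per source word
    s = rng.normal(size=(hidden_dim,))             # current decoder state

    # Randomly initialized weights of the small alignment model (learned in practice).
    W_s = rng.normal(size=(attn_dim, hidden_dim))
    W_h = rng.normal(size=(attn_dim, hidden_dim))
    v = rng.normal(size=(attn_dim,))

    # Score each source position against the decoder state (additive attention),
    # then turn the scores into weights that sum to 1 with a softmax.
    scores = np.array([v @ np.tanh(W_s @ s + W_h @ h_j) for h_j in h])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # The context vector is a weighted average of the encoder states: a soft alignment
    # that replaces the single fixed-length vector of the basic encoder-decoder.
    context = weights @ h
    print("alignment weights:", np.round(weights, 3))
    print("context vector shape:", context.shape)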

Adam

Adam: A Method for Stochastic Optimization, by Diederik P. Kingma, Jimmy Ba (2014)

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

This paper introduces Adam, an efficient algorithm for first-order gradient-based optimization in stochastic settings, which adapts estimates of lower-order moments to enhance performance. Adam is computationally efficient, requires little memory, and is effective for large-scale problems with noisy or sparse gradients. It offers intuitive hyper-parameter settings and demonstrates strong empirical performance compared to other optimization methods. The paper also explores theoretical aspects, including convergence properties and regret bounds, and presents AdaMax, a variant using the infinity norm.

This paper is crucial as it provides a robust and widely adopted optimization algorithm that significantly improved the training efficiency and performance of machine learning models, particularly in deep learning.

  • Stochastic optimization: A method of finding optimal solutions using random samples of data, which helps in handling large datasets efficiently.
  • First-order gradient-based optimization: An approach that uses the gradient (rate of change) of the function to iteratively find the minimum or maximum.
  • Adaptive estimates of lower-order moments: Adjusting calculations based on recent changes in the data to improve the optimization process.
  • Diagonal rescaling of the gradients: Adjusting the gradient values in a way that treats each dimension independently to improve optimization.
  • Non-stationary objectives: Problems where the target or objective function changes over time.
  • Regret bound: A measure of how much worse an algorithm performs compared to the best possible strategy in hindsight.
  • AdaMax: A variant of Adam that uses the infinity norm, which considers the maximum absolute value of the gradients to stabilize updates.
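
The update rule itself fits in a few lines. Below is a NumPy sketch of a single Adam step using the paper's default hyper-parameters, applied to a toy quadratic objective of my own choosing.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Update biased estimates of the first moment (mean) and second moment of the gradient.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        # Correct the bias introduced by initializing m and v at zero.
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # Per-coordinate update, invariant to diagonal rescaling of the gradients.
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    # Toy objective (my example, not from the paper): minimize f(theta) = sum(theta^2).
    theta = np.array([1.0, -2.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, 2001):
        grad = 2 * theta                       # gradient of the toy objective
        theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
    print("theta after optimization:", theta)  # close to (0, 0)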

--


Nuwan I. Senaratna
On Technology

I am a Computer Scientist and Musician by training. A writer with interests in Philosophy, Economics, Technology, Politics, Business, the Arts and Fiction.