Research Papers in Artificial Intelligence

A History of AI (Part 1)

1950 to 2000 (Perceptrons to Convolutional Neural Networks)

Nuwan I. Senaratna
On Technology

--

This article is the first in a series in which I present a history of Artificial Intelligence by reviewing the most important research papers in the field.

Perceptron

The perceptron: A probabilistic model for information storage and organization in the brain, by Frank Rosenblatt (1958)

To answer the questions of how information about the physical world is sensed, in what form information is remembered, and how information retained in memory influences recognition and behavior, a theory is developed for a hypothetical nervous system called a perceptron. The theory serves as a bridge between biophysics and psychology. It is possible to predict learning curves from neurological variables and vice versa. The quantitative statistical approach is fruitful in the understanding of the organization of cognitive systems.

This paper laid the groundwork for the development of more complex neural networks (like deep neural networks) and machine learning algorithms, demonstrating how machines can learn from data.

A perceptron is the foundational unit of an artificial neural network. It can learn and make decisions by adjusting weights based on input data.
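
As a rough illustration, here is the perceptron learning rule in Python; the AND dataset, learning rate, and epoch count are illustrative choices, not details from the paper:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Learn weights and a bias with the classic perceptron update rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            # Weights change only when the prediction is wrong.
            error = target - pred
            w += lr * error * xi
            b += lr * error
    return w, b

# Toy example: learn logical AND (linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([1 if xi @ w + b > 0 else 0 for xi in X])  # -> [0, 0, 0, 1]
```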

Back-Propagation

Learning representations by back-propagating errors, by David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams (1986)

We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.

This paper revolutionized the field of artificial neural networks by introducing the back-propagation algorithm, which allowed for the training of multi-layer networks (networks with multiple layers of neurons) and paved the way for the development of deep learning.

Back-propagation is a learning procedure for neural networks that adjusts the weights of connections by minimizing the difference between the actual and desired output vectors.

Its significance lies in its ability to create useful new features by enabling internal hidden units to represent important aspects of the task domain, distinguishing it from earlier methods like the perceptron-convergence procedure.

Efficient implementations of back-propagation rest on the chain rule of calculus, which allows the systematic calculation of the gradients needed to adjust the network’s weights.
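
A minimal sketch of the procedure on a tiny one-hidden-layer network, where each backward-pass line is one application of the chain rule (the XOR task and all hyper-parameters are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output

lr = 0.5
for _ in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error back, layer by layer.
    d_out = (out - y) * out * (1 - out)   # squared-error loss, output layer
    d_h = (d_out @ W2.T) * h * (1 - h)    # error attributed to hidden units
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))  # should approach [[0], [1], [1], [0]]
```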

Decision Trees

Induction of decision trees, by J. R. Quinlan (1986)

The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.

This paper significantly influenced the future of the field by introducing the ID3 algorithm, which became a foundational method for constructing decision trees and spurred further research into improving decision tree algorithms to handle noisy and incomplete data.

A decision tree is a model used for classification and regression tasks that makes decisions based on a hierarchical structure of rules derived from the features of the input data.

Efficiently synthesizing a decision tree means constructing the tree so that it optimizes for accuracy and computational resources while handling noisy and incomplete information effectively.

ID3 (Iterative Dichotomiser 3) is an algorithm for generating decision trees by using a top-down, greedy approach to recursively partition the data based on the attribute that maximizes information gain.

Information gain is a metric used to measure the effectiveness of an attribute in classifying a dataset by quantifying the reduction in entropy (uncertainty) when the dataset is split based on that attribute; maximizing information gain means selecting the attribute that provides the highest reduction in entropy, thus improving the decision tree’s accuracy.
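
For concreteness, entropy and information gain can be computed as follows; the toy “windy”/“play” attributes below are invented for illustration, not Quinlan’s data:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, attribute):
    """Reduction in entropy from splitting `labels` on `attribute`."""
    total = entropy(labels)
    remainder = 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return total - remainder

# Toy data: does 'windy' help predict 'play'?
play  = np.array(['yes', 'yes', 'no',  'no',  'yes', 'no'])
windy = np.array(['no',  'no',  'yes', 'yes', 'no',  'yes'])
print(information_gain(play, windy))  # 1.0 bit: a perfect split here
```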

Hidden Markov Models

A tutorial on hidden Markov models and selected applications in speech recognition, by L.R. Rabiner (1989)

This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source for acquiring the background required to pursue this area of research further. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.

This paper provided a comprehensive and practical guide to hidden Markov models (HMMs), making their application in speech recognition and other areas more accessible and understandable to researchers and practitioners.

A Discrete Markov Chain is a stochastic process consisting of a sequence of states where the probability of transitioning to the next state depends only on the current state and not on the preceding states.
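
A minimal two-state example (the weather states and probabilities are invented for illustration):

```python
import numpy as np

# Rows are the current state, columns the next state.
states = ["Rain", "Sun"]
P = np.array([[0.7, 0.3],    # Rain -> Rain, Rain -> Sun
              [0.4, 0.6]])   # Sun  -> Rain, Sun  -> Sun

rng = np.random.default_rng(1)
state = 0  # start in "Rain"
path = []
for _ in range(10):
    state = rng.choice(2, p=P[state])  # next state depends only on current
    path.append(states[state])
print(" -> ".join(path))
```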

A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to follow a Markov process with unobservable (hidden) states; the intuition is that observed data are probabilistically linked to these hidden states.

The three fundamental problems of HMMs are: the evaluation problem (calculating the probability of an observed sequence), the decoding problem (determining the most likely sequence of hidden states), and the learning problem (estimating the model parameters from data).
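
The evaluation problem, for example, is solved efficiently by the forward algorithm; below is a compact sketch with made-up parameters:

```python
import numpy as np

def forward(pi, A, B, obs):
    """P(obs | model) via the forward algorithm.
    pi: initial state probabilities, A: state transitions,
    B: emission probabilities, obs: sequence of observation indices."""
    alpha = pi * B[:, obs[0]]            # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # recurse: transition, then weight
    return alpha.sum()

# Two hidden states, two observable symbols (all numbers illustrative).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0]))  # likelihood of the observed sequence
```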

Rabiner applied HMMs to speech recognition by modeling speech signals as sequences of hidden states representing phonemes (smallest unit of sound in a language), allowing for the probabilistic decoding and recognition of spoken words based on observed acoustic signals.

Multilayer Feedforward Networks

Multilayer feedforward networks are universal approximators, by Kurt Hornik, Maxwell Stinchcombe, and Halbert White (1989)

This paper rigorously establishes that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.

This paper established the theoretical foundation for the universal approximation capability of neural networks, demonstrating that even simple neural network architectures can approximate complex functions to any desired level of accuracy, which has paved the way for the extensive use and development of neural networks in various applications.

A standard multilayer feedforward network is a type of artificial neural network where connections between nodes do not form cycles, consisting of an input layer, one or more hidden layers, and an output layer.

A Borel measurable function is a function that maps elements from one topological space to another, such that the preimage of any Borel set (a set in a σ-algebra generated by open sets) is also a Borel set. More intuitively, Borel measurability ensures that the function behaves in a predictable and measurable way.

An arbitrary squashing function is a type of activation function used in neural networks that maps a real-valued input to a bounded output, typically compressing inputs into a finite range, such as the sigmoid or tanh functions.

A universal approximator is a model or mathematical function capable of approximating any function to a desired level of accuracy given sufficient resources, such as enough hidden units in the context of neural networks.
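
As a quick numerical illustration (not a proof), a single hidden layer of tanh squashing units trained by gradient descent can fit a sine curve; all sizes and hyper-parameters here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

H = 20  # hidden units; more units allow a closer fit
W1, b1 = rng.normal(size=(1, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, 1)), np.zeros(1)

lr = 0.01
for _ in range(20000):
    h = np.tanh(x @ W1 + b1)       # squashing hidden layer
    pred = h @ W2 + b2             # linear output layer
    d_pred = 2 * (pred - y) / len(x)
    d_h = (d_pred @ W2.T) * (1 - h**2)
    W2 -= lr * h.T @ d_pred;  b2 -= lr * d_pred.sum(axis=0)
    W1 -= lr * x.T @ d_h;     b1 -= lr * d_h.sum(axis=0)

print(np.abs(pred - y).max())  # should shrink toward zero as H grows
```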

Support Vector Machines

A training algorithm for optimal margin classifiers, by Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik (1992)

A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.

This paper introduced Support Vector Machines (SVMs), which became a fundamental tool for classification tasks due to their ability to maximize the margin between classes and automatically adjust model complexity, leading to improved generalization and robustness in various applications, such as optical character recognition.

A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression that finds the optimal hyperplane to maximize the margin between different classes.

The SVM algorithm works by identifying the hyperplane that maximizes the margin between classes using supporting patterns, and it can use kernel functions to handle non-linear classification problems.

The margin is the distance between the decision boundary (the line or surface separating different classes) and the closest training patterns, with a larger margin generally leading to better generalization. A supporting pattern is a training pattern that lies closest to the decision boundary and directly influences its position and orientation.
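
A brief sketch using scikit-learn’s SVC, a modern implementation of the same optimal-margin idea (the clustered toy data are invented):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)  # large C ~ hard margin
# The supporting patterns are exposed as support vectors.
print(clf.support_vectors_)
print(clf.predict([[-1.5, -1.5], [1.5, 1.5]]))  # -> [0 1]
```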

The leave-one-out method is a cross-validation technique where the model is trained on all but one training pattern, and this process is repeated for each pattern to estimate the model’s generalization performance.
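
In code, the idea is a simple loop; this sketch wraps scikit-learn’s SVC, but any classifier would do:

```python
import numpy as np
from sklearn.svm import SVC

def leave_one_out_accuracy(X, y):
    """Train on all-but-one pattern, test on the held-out one, repeat."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        model = SVC(kernel="linear").fit(X[mask], y[mask])
        hits += model.predict(X[i:i + 1])[0] == y[i]
    return hits / len(X)
```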

The VC-dimension (Vapnik-Chervonenkis dimension) is a measure of the capacity or complexity of a set of functions, representing the largest number of points that can be shattered (perfectly classified) by the model.

Bagging

Bagging predictors, by Leo Breiman (1996)

Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.

This paper significantly influenced the field of machine learning by introducing the concept of ensemble methods, which improve prediction accuracy by combining multiple models, particularly benefiting from techniques like bootstrap replicates and highlighting the importance of predictor instability.

An ensemble method in machine learning is a technique that combines multiple models to improve overall prediction accuracy and robustness compared to using a single model.

Bagging (short for Bootstrap Aggregating) is a method for generating multiple versions of a predictor and using these to create an aggregated predictor which has improved accuracy.

Bagging works because it reduces the variance of the prediction by averaging multiple versions of the model, each trained on different subsets of the data, thereby smoothing out noise and mitigating overfitting.

Bootstrapping is a statistical method that involves sampling with replacement from a dataset to create multiple new training sets, and a bootstrap replicate is one such new training set.
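
Putting these pieces together, here is a minimal bagging sketch over decision trees, using scikit-learn’s tree as the unstable base learner (integer class labels and all parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Train trees on bootstrap replicates, then take a plurality vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap replicate: sample indices with replacement.
        idx = rng.integers(0, len(X_train), len(X_train))
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.array(votes)
    # Plurality vote across the ensemble for each test point.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```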

Convolutional Neural Networks

Gradient-based learning applied to document recognition, by Yann LeCun, Léon Bottou, Yoshua Bengio, and P. Haffner (1998)

Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.

This paper influenced the future of AI by demonstrating the superior performance of convolutional neural networks (CNNs) for recognizing 2D shapes, particularly in handwritten character recognition, and introducing the concept of graph transformer networks (GTNs) for training complex, multimodule systems globally, which paved the way for advancements in document recognition and other practical applications.

A convolutional operation is a mathematical process that involves applying a filter to an input (such as an image) to produce an output feature map that highlights certain aspects of the input (like edges in an image).
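
For instance, here is a single 2D convolution (strictly, cross-correlation, as in most deep-learning libraries) over a toy image in plain numpy:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge = np.array([[-1.0, 1.0]])  # responds to vertical edges
print(conv2d(image, edge))      # strongest response at the 0 -> 1 boundary
```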

Convolutional Neural Networks (CNNs) are a class of deep neural networks specifically designed to process and analyze data with a grid-like topology, such as images, by using convolutional layers to automatically and adaptively learn spatial hierarchies of features.
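
As a rough PyTorch sketch of a LeNet-style network (the layer sizes approximate LeNet-5; this is not the paper’s exact model):

```python
import torch
from torch import nn

class LeNetish(nn.Module):
    """Roughly LeNet-5-shaped: conv -> pool -> conv -> pool -> fc layers."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, 32, 32)
        return self.classifier(self.features(x))

print(LeNetish()(torch.zeros(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```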

High-dimensional patterns refer to data that have a large number of attributes or features, making them complex and challenging to classify or analyze, such as handwritten characters.

Graph Transformer Networks (GTNs) are a type of learning paradigm that enables the global training of multimodule systems by using gradient-based methods to optimize an overall performance measure.

Global training is a process where multiple components or modules of a system are trained simultaneously using a unified objective, allowing the entire system to be optimized holistically rather than in isolated parts.

--

Nuwan I. Senaratna
On Technology

I am a Computer Scientist and Musician by training. A writer with interests in Philosophy, Economics, Technology, Politics, Business, the Arts and Fiction.