Research Papers in Artificial Intelligence

A History of AI (Part 2)

2000 to 2010 (Random Forests to ImageNet)

Nuwan I. Senaratna
On Technology

--

This article is the second in a series in which I present a history of Artificial Intelligence by reviewing the most important research papers in the field.

Random Forests

Random Forests, by Leo Breiman (2001)

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

This paper established Random Forests as a powerful and robust method for classification and regression, highlighting their advantages over other methods like Adaboost.

A Random Forest is an ensemble learning method that combines many decision trees, each trained on a bootstrap sample of the data and splitting on a random subset of features, to improve overall predictive accuracy and control overfitting.

The Generalization Error of a Forest is the error rate that a Random Forest model achieves on unseen data; it converges to a limit as the number of trees in the forest increases, so adding more trees does not lead to overfitting.

Adaboost (Adaptive Boosting) is an ensemble learning algorithm that combines multiple weak classifiers, usually shallow decision trees, by iteratively re-weighting the training samples so that later classifiers focus on the hardest examples. Random Forests are considered better in some respects because they are more robust to noise, do not overfit as easily, and provide internal estimates of error, strength, correlation, and variable importance, offering more stable predictions than Adaboost when the data are noisy.
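
To make this concrete, here is a minimal sketch (using scikit-learn, which is my choice of library and not something from the paper) that trains a Random Forest and an AdaBoost ensemble on a synthetic dataset with noisy labels and compares their held-out accuracy; the dataset, noise level, and hyper-parameters are illustrative assumptions.

```python
# A minimal sketch, assuming scikit-learn is available; the dataset,
# noise level and hyper-parameters are illustrative, not Breiman's.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification problem with deliberately noisy labels.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest: many trees, each split chosen from a random feature subset.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# AdaBoost: a sequence of weak learners re-weighted towards hard examples.
ada = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("Random Forest accuracy:", rf.score(X_test, y_test))
print("AdaBoost accuracy:     ", ada.score(X_test, y_test))

# The forest also exposes internal estimates of variable importance.
print("Top feature importances:", sorted(rf.feature_importances_, reverse=True)[:5])
```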

Evolutionary Algorithms

A fast and elitist multiobjective genetic algorithm: NSGA-II, by K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan (2002)

Multi-objective evolutionary algorithms (MOEAs) that use non-dominated sorting and sharing have been criticized mainly for: (1) their O(MN³) computational complexity (where M is the number of objectives and N is the population size); (2) their non-elitism approach; and (3) the need to specify a sharing parameter. In this paper, we suggest a non-dominated sorting-based MOEA, called NSGA-II (Non-dominated Sorting Genetic Algorithm II), which alleviates all of the above three difficulties. Specifically, a fast non-dominated sorting approach with O(MN²) computational complexity is presented. Also, a selection operator is presented that creates a mating pool by combining the parent and offspring populations and selecting the best N solutions (with respect to fitness and spread). Simulation results on difficult test problems show that NSGA-II is able, for most problems, to find a much better spread of solutions and better convergence near the true Pareto-optimal front compared to the Pareto-archived evolution strategy and the strength-Pareto evolutionary algorithm — two other elitist MOEAs that pay special attention to creating a diverse Pareto-optimal front. Moreover, we modify the definition of dominance in order to solve constrained multi-objective problems efficiently. Simulation results of the constrained NSGA-II on a number of test problems, including a five-objective, seven-constraint nonlinear problem, are compared with another constrained multi-objective optimizer, and the much better performance of NSGA-II is observed.

This paper significantly advanced the field of multi-objective optimization by introducing NSGA-II, a more efficient and effective algorithm, setting a new standard for solving complex optimization problems.

An Evolutionary Algorithm is an optimization technique, inspired by natural selection, that evolves a population of candidate solutions over successive generations, while a Multi-objective Evolutionary Algorithm is a type of evolutionary algorithm that optimizes multiple conflicting objectives simultaneously.

The new technique, NSGA-II, was effective because it reduced computational complexity, incorporated elitism (ensuring the best solutions are preserved), and eliminated the need for a sharing parameter, resulting in better convergence and diversity in solutions.
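
To illustrate the core idea, below is a minimal sketch of fast non-dominated sorting, the O(MN²) ranking step at the heart of NSGA-II. This is my own illustrative Python implementation, not the authors' code, and it leaves out the crowding-distance and elitist selection machinery.

```python
# A minimal sketch of fast non-dominated sorting (minimization assumed).
import numpy as np

def fast_non_dominated_sort(objectives):
    """Rank a population into Pareto fronts.

    objectives: (N, M) array of N solutions evaluated on M objectives.
    Returns a list of fronts, each a list of solution indices.
    """
    n = len(objectives)
    dominated_by = [[] for _ in range(n)]        # solutions that i dominates
    domination_count = np.zeros(n, dtype=int)    # how many solutions dominate i
    fronts = [[]]

    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # i dominates j: no worse in every objective, better in at least one.
            if np.all(objectives[i] <= objectives[j]) and np.any(objectives[i] < objectives[j]):
                dominated_by[i].append(j)
            elif np.all(objectives[j] <= objectives[i]) and np.any(objectives[j] < objectives[i]):
                domination_count[i] += 1
        if domination_count[i] == 0:
            fronts[0].append(i)                  # non-dominated: first front

    k = 0
    while fronts[k]:
        next_front = []
        for i in fronts[k]:
            for j in dominated_by[i]:
                domination_count[j] -= 1
                if domination_count[j] == 0:
                    next_front.append(j)
        fronts.append(next_front)
        k += 1
    return fronts[:-1]

# Example: three 2-objective points; the first two are mutually non-dominated.
pts = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 5.0]])
print(fast_non_dominated_sort(pts))   # [[0, 1], [2]]
```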

Latent Dirichlet Allocation

Latent Dirichlet Allocation, by David M. Blei, Andrew Y. Ng and Michael I. Jordan (2003)

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

This paper significantly influenced the future of the field by providing a robust framework for topic modeling in text corpora, leading to advancements in various applications such as document classification, recommendation systems, and natural language processing.

A generative probabilistic model is a statistical model that explains how observed data is generated based on underlying random processes, allowing for the creation of new, similar data points.

A Bayesian model is a statistical model that incorporates prior knowledge or beliefs, updating them with new data; a Hierarchical Bayesian model extends this by having multiple levels of prior distributions, allowing for more complex data structures and dependencies.

LDA works by assuming each document is a mixture of topics and each topic is a mixture of words, providing a more flexible and interpretable way to model text data compared to previous methods like the mixture of unigrams and probabilistic Latent Semantic Indexing (pLSI).

The EM (Expectation-Maximization) algorithm is an iterative optimization technique used to find maximum likelihood estimates of parameters in models with latent variables, involving two steps: the Expectation step (estimating the latent variables) and the Maximization step (optimizing the parameters based on these estimates).
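
As a small illustration, the sketch below fits a two-topic LDA model to a toy corpus using scikit-learn's LatentDirichletAllocation, which, like the paper, relies on variational inference; the corpus, the number of topics, and all parameters are illustrative assumptions.

```python
# A minimal sketch, assuming scikit-learn; the tiny corpus and the choice
# of two topics are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets slid",
    "investors sold shares and bonds",
]

# Bag-of-words counts: LDA models each document as counts over a vocabulary.
vectorizer = CountVectorizer().fit(docs)
X = vectorizer.transform(docs)

# Two-topic LDA fitted with variational Bayes.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Per-document topic mixtures and the top words of each topic.
print(lda.transform(X))                     # document-topic proportions
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]
    print(f"topic {k}:", [vocab[i] for i in top])
```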

Dimensionality Reduction

Reducing the Dimensionality of Data with Neural Networks, by Geoffrey E. Hinton and R. R. Salakhutdinov (2006)

High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such “autoencoder” networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

This paper influenced the future of the field by demonstrating that deep autoencoder networks, when properly initialized, can effectively learn low-dimensional representations of data, outperforming traditional methods like principal components analysis (PCA) in reducing dimensionality.

Dimensionality refers to the number of variables or features in a dataset; high-dimensional data is simply data with a large number of such features. Reducing dimensionality is useful because it simplifies the data, making it easier to analyze and visualize, while also reducing computational costs and mitigating the risk of overfitting.

An autoencoder is a type of neural network designed to learn a compressed, low-dimensional representation (encoding) of data and then reconstruct the original input from this compressed version.

Principal components analysis (PCA) is a statistical method that transforms data into a set of orthogonal (uncorrelated) components ordered by the amount of variance they capture; the approach described in the paper is better because a deep autoencoder can capture complex, non-linear relationships in the data that PCA's linear projections cannot.
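
The sketch below shows the basic shape of such an autoencoder in PyTorch (my choice of framework, not the paper's): an encoder squeezes the input through a small central code layer and a decoder reconstructs the input, with the weights trained to minimize reconstruction error. Note that the paper's key contribution, the layer-by-layer pre-training used to initialize the weights, is omitted here for brevity.

```python
# A minimal sketch of a deep autoencoder; dimensions and data are illustrative.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=30):
        super().__init__()
        # Encoder: high-dimensional input -> small central code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )
        # Decoder: code -> reconstruction of the input.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)  # stand-in batch; real use would be e.g. image pixels
for _ in range(10):
    reconstruction, code = model(x)
    loss = loss_fn(reconstruction, x)   # reconstruction error drives training
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(code.shape)  # (64, 30): the learned low-dimensional codes
```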

High Dimensional Data Visualization

Visualizing Data using t-SNE, by Laurens van der Maaten and Geoffrey Hinton (2008)

We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets.

This paper introduced t-SNE, a technique that has become a standard for visualizing high-dimensional data due to its ability to reveal structure at multiple scales and produce clearer, more interpretable maps.

Visualizing high-dimensional data is important for understanding complex relationships and structures within the data, but it is difficult because high-dimensional spaces are challenging to represent in two or three dimensions without losing important information.

Stochastic Neighbor Embedding (SNE) is a method for visualizing high-dimensional data by converting similarities between data points into probabilities and then mapping these probabilities into a lower-dimensional space.

The new technique called “t-SNE” improves upon SNE by being easier to optimize, reducing the tendency to crowd points in the center, and producing better visualizations that reveal structures at different scales more effectively.
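
Here is a minimal sketch of producing a t-SNE map with scikit-learn's TSNE implementation (an assumption on my part; the dataset and perplexity value are illustrative choices, not the paper's experiments).

```python
# A minimal sketch, assuming scikit-learn; dataset and perplexity are
# illustrative choices rather than the paper's experiments.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional handwritten-digit images as the high-dimensional input.
X, y = load_digits(return_X_y=True)

# Embed into 2-D; perplexity roughly controls the effective neighbourhood size.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(embedding.shape)  # (1797, 2): one 2-D point per image, ready to scatter-plot
```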

ImageNet

ImageNet: A large-scale hierarchical image database, by Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei (2009)

The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500–1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

This paper significantly influenced the future of the field by providing a massive and diverse dataset that enabled the development of more advanced and accurate models for image recognition and classification.

The explosion of image data on the Internet was caused by the proliferation of digital cameras, smartphones, and social media platforms, and this is beneficial because it provides vast amounts of data that can be used to train and improve machine learning algorithms.

WordNet is a large lexical database of English that groups words into sets of synonyms called synsets, and ImageNet is similar because it uses this structure to organize millions of images into corresponding synsets, creating a hierarchical and semantically rich image database.
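
To see the kind of hierarchy ImageNet inherits, the sketch below walks up WordNet's hypernym ("is-a") chain for the word "dog" using NLTK, assuming the WordNet corpus has been downloaded; this is the semantic backbone that ImageNet populates with images.

```python
# A minimal sketch, assuming NLTK with the WordNet corpus installed
# (run nltk.download('wordnet') on first use).
from nltk.corpus import wordnet as wn

dog = wn.synsets('dog')[0]          # the first noun synset for "dog"
print(dog.name(), '-', dog.definition())

# Walk up the hypernym (is-a) chain: each step is a more general synset,
# mirroring the semantic hierarchy ImageNet attaches images to.
node = dog
while node.hypernyms():
    node = node.hypernyms()[0]
    print('  ->', node.name())
```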

Thanks for Reading! Feedback appreciated, especially if you think I’ve missed any important research.

