Research Papers in Artificial Intelligence

A History of AI (Part 4)

2015 to 2016 (Batch Normalization to YOLO)

Nuwan I. Senaratna
On Technology

--

This article is the 4th in a series in which I present a history of Artificial Intelligence by reviewing the most important research papers in the field.

#AIHistory #DeepLearning #MachineLearning #AIPapers #TechInnovation

Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, by Sergey Ioffe and Christian Szegedy (2015)

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.

This research introduced a way to speed up and improve the training of deep learning models, which are used in applications like image recognition. The authors found that normalizing the inputs to each layer of the model during training makes the training process faster and more stable.

This technique, called Batch Normalization, also made the models more accurate and sometimes even eliminated the need for other methods like Dropout. With this approach, they were able to achieve impressive results, even surpassing human accuracy in some cases.

This paper was influential because it introduced Batch Normalization, a technique that significantly improved the training efficiency and performance of deep neural networks. It addressed a major challenge in deep learning by reducing internal covariate shift, allowing for faster training, higher learning rates, and improved accuracy. This innovation made it easier to train deep learning models and contributed to the rapid advancements in AI applications.
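To make the mechanics concrete, here is a minimal numpy sketch of the training-time normalization step the paper describes, applied to one layer's activations over a mini-batch. The names are illustrative; gamma and beta correspond to the paper's learned scale and shift parameters, and eps is a small constant for numerical stability.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x: (batch_size, num_features) activations for one layer.
    gamma, beta: learned per-feature scale and shift parameters.
    """
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # restore representational power

# Toy usage: 4 examples, 3 features with very different scales.
x = np.random.randn(4, 3) * 10 + 5
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.std(axis=0))        # roughly 0 and 1 per feature
```

At inference time the paper replaces the per-batch statistics with population estimates accumulated during training; the sketch above covers only the training-time forward pass.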

  • Deep Neural Networks: A type of artificial intelligence model that mimics the human brain’s neural networks, used for tasks like image and speech recognition.
  • Internal Covariate Shift: Changes in the distribution of inputs to each layer of the neural network during training, which can slow down the learning process.
  • Normalization: A process of adjusting values measured on different scales to a common scale, which helps in stabilizing and speeding up training.
  • Mini-batch: A small subset of the training data used to update the model’s parameters in each training iteration.
  • Learning Rate: A parameter that controls how much the model’s parameters are adjusted with respect to the loss gradient during training.
  • Dropout: A regularization technique where randomly selected neurons are ignored during training to prevent overfitting.
  • Top-5 Test Error: A metric used in classification tasks, where the model’s prediction is considered correct if the true label is among its top 5 predicted labels.
  • ImageNet: A large database used for visual object recognition software research, containing millions of labeled images. See also Part 2.

Inception

Going Deeper With Convolutions, by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich (2015)

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation of this architecture, GoogLeNet, a 22 layers deep network, was used to assess its quality in the context of object detection and classification.

This research introduces a new deep learning model codenamed Inception, which is highly efficient in recognizing and classifying images. The design cleverly maximizes the use of computing resources, allowing the network to be deeper and wider without requiring more computational power. A specific version of this model, GoogLeNet, demonstrated superior performance in a major image recognition competition by processing images at multiple scales simultaneously.

This paper was influential because it introduced a novel neural network architecture that significantly improved image classification and detection performance. The Inception model’s efficient use of computational resources and its ability to process information at various scales set a new benchmark for neural network design, influencing subsequent advancements in AI and deep learning.
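The core idea lends itself to a short sketch: each Inception block runs 1x1, 3x3, and 5x5 convolutions (plus a pooling branch) in parallel on the same input and concatenates the results along the channel dimension, with cheap 1x1 convolutions reducing channels before the expensive filters. Below is a minimal PyTorch sketch; the channel counts are illustrative, and the ReLU activations of the real network are omitted for brevity.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """One Inception block: parallel multi-scale branches, concatenated."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.b3 = nn.Sequential(                         # 1x1 reduces channels
            nn.Conv2d(in_ch, 96, kernel_size=1),         # before the costly 3x3
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.bp = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # Each branch sees the same input at a different receptive-field scale.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

block = InceptionModule(192)
out = block(torch.randn(1, 192, 28, 28))
print(out.shape)   # torch.Size([1, 256, 28, 28]): 64 + 128 + 32 + 32 channels
```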

  • Deep Convolutional Neural Network (CNN): A type of neural network specifically designed to process data with a grid-like structure, such as images. It uses multiple layers to automatically and adaptively learn spatial hierarchies of features. See also Part 1.
  • ImageNet Large-Scale Visual Recognition Challenge (ILSVRC2014): An annual competition that evaluates algorithms for object detection and image classification at a large scale, using a dataset of millions of labeled images.
  • Computational Budget: The amount of computational resources (such as processing power and memory) that are allocated for running a neural network model.
  • Hebbian Principle: A theory in neuroscience suggesting that neurons that fire together, wire together. In the context of neural networks, it refers to the idea that synaptic efficacy increases when the presynaptic and postsynaptic neurons are activated simultaneously.
  • Multi-Scale Processing: An approach in image processing where the system analyzes information at various levels of detail or scale. This helps in recognizing objects regardless of their size in the image.
  • GoogLeNet: A specific implementation of the Inception architecture, which is 22 layers deep and was used to achieve state-of-the-art results in image classification and detection tasks.

Deep Q

Human-level control through deep reinforcement learning, by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg & Demis Hassabis (2015)

The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

This research describes the creation of a computer program that can learn to play and excel at a wide range of Atari 2600 video games just by looking at the game screen and score, similar to how a human would. The program uses advanced techniques in artificial intelligence to understand the game environment and develop strategies to win, achieving performance levels comparable to a professional human player.

This paper was groundbreaking because it demonstrated the potential of combining deep learning and reinforcement learning to create intelligent agents that can learn complex tasks directly from raw sensory inputs, such as pixels from video games. It showed that artificial intelligence could achieve human-level performance in a wide variety of tasks without requiring hand-designed features, paving the way for more general and adaptable AI systems.
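The learning rule at the heart of the agent is the temporal-difference update. The sketch below shows the classic tabular form of that update; DQN's contribution was to replace the table with a deep network over raw pixels, stabilized by techniques such as experience replay and a separate target network.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning (temporal-difference) update.

    Q: table of action values, shape (num_states, num_actions).
    alpha: learning rate; gamma: discount factor for future rewards.
    """
    td_target = r + gamma * np.max(Q[s_next])   # reward plus best next value
    Q[s, a] += alpha * (td_target - Q[s, a])    # move estimate toward target
    return Q

# Toy usage: 5 states, 2 actions, one observed transition.
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0])   # the value of action 1 in state 0 has moved toward the target
```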

  • Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions and receiving rewards or penalties.
  • High-Dimensional Sensory Inputs: Complex and detailed inputs like images or sounds that have many features, making them difficult for computers to process directly.
  • Hierarchical Sensory Processing Systems: Biological systems, like the human brain, that process sensory information in stages, from simple to complex.
  • Phasic Signals: Rapid bursts of activity from neurons, often associated with reward processing in the brain.
  • Temporal Difference Reinforcement Learning: An RL algorithm where the agent learns by comparing predicted rewards to the actual rewards received over time.
  • Deep Q-Network (DQN): A type of deep neural network specifically designed to learn optimal actions in reinforcement learning tasks directly from high-dimensional inputs like images.
  • End-to-End Reinforcement Learning: Training an AI system to learn directly from raw inputs to actions without requiring intermediate feature extraction or manual adjustments.
  • Classic Atari 2600 Games: Early video games used in this study as a testing ground for AI, known for their simplicity in graphics but complexity in gameplay.
  • Policies: Strategies or sets of rules that an AI agent follows to decide its actions in a given situation.
  • Hyperparameters: Settings or configurations that control the learning process of an AI model, such as learning rate or network architecture.

Region-based Convolutional Neural Network

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, by Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun (2015)

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. Code is available at https://github.com/ShaoqingRen/faster_rcnn.

This research introduces a method for object detection in images that significantly speeds up the process by integrating region proposal and object detection into a single, efficient system. By sharing the same convolutional layers for both tasks, the method can quickly generate region proposals and detect objects with high accuracy, achieving impressive results on standard benchmarks.

This paper was influential in the AI field because it dramatically improved the speed and accuracy of object detection systems. By combining region proposal generation and object detection into one network, it paved the way for real-time object detection applications and influenced many subsequent developments in computer vision and deep learning.
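A minimal sketch helps show what "nearly cost-free proposals" means in practice: the RPN head is just a small convolutional network that slides over the shared feature map and, at every spatial position, scores k anchor boxes and predicts their refinements (2k objectness scores and 4k box offsets, as in the paper). The PyTorch code below is a sketch with assumed shapes, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Region Proposal Network head: objectness + box offsets per anchor."""
    def __init__(self, in_ch=512, k=9):      # k anchors at each position
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, k * 2, kernel_size=1)   # object vs. not
        self.reg = nn.Conv2d(512, k * 4, kernel_size=1)   # box refinements

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

head = RPNHead()
feat = torch.randn(1, 512, 38, 50)            # a VGG-16-like feature map
scores, deltas = head(feat)
print(scores.shape, deltas.shape)             # (1, 18, 38, 50) and (1, 36, 38, 50)
```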

  • Region Proposal Algorithms: Techniques used to identify parts of an image that are likely to contain objects.
  • R-CNN (Region-based Convolutional Neural Network): A method for object detection that first proposes candidate regions in an image and then uses a convolutional neural network to classify and refine these regions to identify objects.
  • SPPnet (Spatial Pyramid Pooling Network): A network that improves the efficiency of object detection by allowing the input of images of varying sizes.
  • Fast R-CNN: An advanced object detection model that improves the speed and accuracy of the original R-CNN.
  • Region Proposal Network (RPN): A network that generates region proposals by predicting object locations and scores in an image.
  • Fully-Convolutional Network: A type of neural network that uses only convolutional layers, making it efficient for tasks like image recognition.
  • Object Bounds: The coordinates that define the rectangle around an object in an image.
  • Objectness Scores: Scores that indicate the likelihood of a region containing an object.
  • End-to-End Training: A training process where all parts of a system are trained simultaneously rather than in separate stages.
  • Alternating Optimization: A training strategy where different parts of the network are trained in turns to optimize the system.
  • VGG-16 Model: A deep convolutional network with 16 layers that is known for its high performance in image recognition tasks.
  • Frame Rate: The speed at which the system processes images, measured in frames per second (fps).
  • mAP (Mean Average Precision): A metric used to evaluate the accuracy of object detection models.
  • PASCAL VOC: A standard dataset used to benchmark the performance of object detection algorithms.

U-Net

U-Net: Convolutional Networks for Biomedical Image Segmentation, by Olaf Ronneberger, Philipp Fischer, Thomas Brox (2015)

There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at this http URL.

This research introduces a new method for analyzing medical images called U-Net. It is a type of neural network that can identify and outline structures within an image, such as cells or neurons. By augmenting a limited set of annotated examples, the method can be trained to perform very well even with few original images. It is fast and precise, making it ideal for tasks like tracking cells in medical research.

This paper was influential because it presented a novel and efficient way to perform image segmentation, which is crucial for many applications in biomedical research. The U-Net architecture’s ability to deliver accurate results with limited data and its speed of processing made it a cornerstone in the development of deep learning techniques for medical imaging and beyond. Its success in winning multiple challenges demonstrated its practical effectiveness and set a new standard in the field.
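The contract-expand-skip pattern is easy to see in miniature. The toy PyTorch model below has a single level; a real U-Net stacks several, and the original paper uses unpadded convolutions, so treat this strictly as a sketch.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions per stage, as in U-Net (padded here for simplicity).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    """One-level U-Net: contract for context, expand for localization."""
    def __init__(self):
        super().__init__()
        self.down = double_conv(1, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottom = double_conv(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.head = nn.Sequential(double_conv(128, 64), nn.Conv2d(64, 2, 1))

    def forward(self, x):
        d = self.down(x)                         # high-resolution features
        b = self.bottom(self.pool(d))            # coarse context
        u = self.up(b)                           # back to full resolution
        return self.head(torch.cat([u, d], 1))   # skip connection: concatenate

net = TinyUNet()
print(net(torch.randn(1, 1, 64, 64)).shape)      # (1, 2, 64, 64): per-pixel classes
```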

  • Data augmentation: Techniques to create new training data from existing data by applying random transformations like rotation or scaling, helping the AI model learn better.
  • Contracting path: Part of the U-Net that reduces the image size step-by-step, capturing important features.
  • Expanding path: The complementary part that increases the image size back to its original dimensions, allowing precise localization of features.
  • Sliding-window convolutional network: An older method where a small window moves across the image to classify parts of it, less efficient compared to U-Net.
  • ISBI challenge: A competitive event where researchers test their methods on biomedical image segmentation tasks.
  • Transmitted light microscopy: Techniques that use light passing through samples to create images, important in biology and medical research.
  • Phase contrast and DIC: Specific types of light microscopy techniques that enhance the contrast in transparent specimens.
  • GPU (Graphics Processing Unit): A powerful processor used to perform rapid calculations, essential for training and running deep learning models.

Residual Learning

Deep Residual Learning for Image Recognition, by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015)

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers (8x deeper than VGG nets) but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

This research introduces a new method for training very deep neural networks, called residual learning, which makes networks with many layers easier and more effective to train. The technique addresses the degradation problem, in which accuracy saturates and then drops as plain networks are made deeper, by having layers learn residual functions relative to their inputs. Using this method, the authors achieved groundbreaking results in image recognition competitions, significantly improving performance on challenging datasets.

This paper was highly influential because it demonstrated a practical solution to training deep neural networks, which were previously difficult to optimize. By enabling the training of much deeper networks, it paved the way for more advanced and accurate AI models in various visual recognition tasks. The success of residual networks (ResNets) established them as a key architecture in the development of deep learning models, influencing subsequent research and applications in AI.
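The building block is simple enough to sketch directly: instead of asking a stack of layers to learn a mapping H(x), the block learns the residual F(x) = H(x) - x and outputs F(x) + x through a shortcut connection, so an identity mapping is trivially available. A minimal PyTorch version, assuming matching input and output channels:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x)."""
    def __init__(self, ch):
        super().__init__()
        self.f = nn.Sequential(                  # F(x): the residual branch
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)          # shortcut adds the input back

blk = ResidualBlock(64)
print(blk(torch.randn(1, 64, 56, 56)).shape)     # shape preserved: (1, 64, 56, 56)
```

Because the shortcut carries the input forward unchanged, gradients have a direct path through the network, which is what makes depths of 100+ layers trainable.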

  • Residual Learning: A technique where layers in a neural network learn residual functions (the difference between the desired output and the input) instead of trying to learn the full output directly, making training more efficient.
  • Vanishing Gradients: A problem in training deep neural networks where the gradients (used to update the model) become too small, causing the training to stall.
  • Layers: Different stages in a neural network where computations are performed, with each layer building on the previous ones.
  • VGG Nets: A type of convolutional neural network architecture known for its simplicity and depth, developed by the Visual Geometry Group at the University of Oxford.
  • COCO: Common Objects in Context, a dataset used for object detection, segmentation, and captioning tasks.
  • ILSVRC: ImageNet Large Scale Visual Recognition Challenge, a prestigious competition in the field of computer vision.
  • CIFAR-10: A dataset used for training machine learning and computer vision algorithms, consisting of 60,000 32x32 color images in 10 different classes.
  • Object Detection: A computer vision task where the goal is to identify and locate objects within an image.
  • Localization: The task of not only detecting objects in an image but also determining their precise locations.
  • Segmentation: The process of partitioning an image into multiple segments or regions to simplify analysis.

YOLO

You Only Look Once: Unified, Real-Time Object Detection, by Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2016)

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

This research presents a new way to detect objects in images quickly and accurately. Unlike previous methods that repurpose classifiers in a multi-stage pipeline, this approach frames detection as a single regression problem, solved by one fast neural network in a single pass over the image. This makes it possible to analyze images in real time, identifying objects almost instantly. The system also generalizes well to unfamiliar domains, such as artwork, and is less likely to falsely detect objects in the background.

This paper was influential because it introduced a groundbreaking method for object detection that is both faster and more efficient than previous techniques. It changed how researchers approached the problem, leading to advancements in real-time applications like autonomous driving, security systems, and augmented reality.
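The "single regression" framing can be made concrete: the network maps a whole image to an S x S x (B*5 + C) tensor, and detection amounts to decoding that tensor. The numpy sketch below uses the paper's PASCAL VOC configuration (S=7, B=2, C=20); the real model additionally predicts square roots of the box width and height and applies non-maximum suppression, which are omitted here.

```python
import numpy as np

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes
pred = np.random.rand(S, S, B * 5 + C)    # stand-in for the network's output

def decode_cell(pred, row, col, img_w=448, img_h=448):
    """Decode one cell's first box into image coordinates (sketch)."""
    x, y, w, h, conf = pred[row, col, :5]
    cx = (col + x) / S * img_w            # x, y: offsets within the grid cell
    cy = (row + y) / S * img_h
    bw, bh = w * img_w, h * img_h         # w, h: relative to the whole image
    class_probs = pred[row, col, B * 5:]
    score = conf * class_probs.max()      # class-specific confidence
    return (cx, cy, bw, bh), score

box, score = decode_cell(pred, 3, 3)
print(box, score)
```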

  • YOLO (You Only Look Once): A neural network model for real-time object detection that processes images quickly by treating detection as a simple prediction task.
  • Regression Problem: In this context, predicting the position of objects and their class probabilities as continuous values.
  • Bounding Boxes: Rectangles drawn around objects in an image to indicate their position.
  • Class Probabilities: The likelihood that a detected object belongs to a particular category.
  • mAP (Mean Average Precision): A measure of the accuracy of object detection systems.
  • DPM (Deformable Parts Model): A previous method for object detection that uses a set of parts to represent an object.

Image by DALL·E 3

--

Nuwan I. Senaratna
On Technology

I am a Computer Scientist and Musician by training. A writer with interests in Philosophy, Economics, Technology, Politics, Business, the Arts and Fiction.