Research Papers in Artificial Intelligence, explained simply

A Recent History of AI (January 2022)

From ConvNeXt to Robo-Hiker

Nuwan I. Senaratna
On Technology

--

In my article series, A History of AI, I reviewed the most important research papers in Artificial Intelligence, from the Perceptron (1958) to ChatGPT (2022).

This is the first article in a new, ongoing series in which I look at the most important research published since then. I will share updates as new developments appear.

ConvNeXt

A ConvNet for the 2020s, by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie

This research explores how traditional neural networks used for image recognition, known as Convolutional Neural Networks (ConvNets), can be updated and improved to compete with the latest models called Transformers.

By making a series of modifications to the old ConvNet design, the researchers created a new version called ConvNeXt, which not only matches but often surpasses the performance of the newer Transformer models in various visual tasks like object detection and image segmentation.
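
To make this concrete, here is a simplified PyTorch sketch of the kind of building block the paper arrives at: a large-kernel depthwise convolution for spatial mixing, followed by a Transformer-style LayerNorm and an inverted-bottleneck MLP. Details such as layer scale and stochastic depth are omitted, so treat this as an illustration rather than the exact implementation.

```python
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified sketch of one ConvNeXt block (layer scale and stochastic
    depth, which the full model uses, are omitted for clarity)."""

    def __init__(self, dim: int):
        super().__init__()
        # Large-kernel depthwise convolution: mixes information spatially.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Transformer-style LayerNorm plus an inverted-bottleneck MLP:
        # mixes information across channels.
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                  # x: (batch, dim, height, width)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)          # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)          # back to channels-first
        return residual + x
```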

This study is important because it shows that traditional neural networks still have a lot of potential. By modernizing ConvNets, the researchers demonstrated that these models can achieve impressive results without the complexity and computational demands of newer Transformer models.

This could lead to more efficient and accessible AI technology in the future, benefiting a wide range of applications from medical imaging to autonomous vehicles.

  • ConvNet or CNN (Convolutional Neural Network): A type of deep learning model specifically designed to process and analyze visual data.
  • Image Classification: The task of identifying the subject of an image from a set of categories.
  • Object Detection: The process of identifying and locating objects within an image.
  • Semantic Segmentation: Dividing an image into parts and labeling each part based on what it represents.
  • ResNet: A popular type of ConvNet known for its ability to train very deep networks.
  • Swin Transformers: A variant of Vision Transformers that includes hierarchical structures to improve their performance on various vision tasks.
  • Inductive Biases: Assumptions made by a model about the data, which help guide its learning process.
  • ImageNet Top-1 Accuracy: A measure of how often the model correctly identifies the main subject of an image in the ImageNet dataset on its first guess.
  • COCO Detection: A benchmark for evaluating object detection models using the COCO (Common Objects in Context) dataset.
  • ADE20K Segmentation: A benchmark for evaluating semantic segmentation models using the ADE20K dataset.

Patches

Patches Are All You Need?, by Asher Trockman, J. Zico Kolter

Researchers have found that a new type of model, which processes images in small regions (or patches), performs better than some of the latest and most complex models. This new model, called ConvMixer, uses simple techniques to mix the information within the patches.

Despite its simplicity, it outperforms other advanced models in recognizing images, proving that sometimes simpler methods can be more effective.
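
The model is simple enough to sketch in a few lines of PyTorch. The sketch below closely follows the reference implementation given in the paper; the default hyperparameters here are just illustrative.

```python
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a sub-module with a skip connection."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch embedding: one strided convolution splits the image into patches.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        # Mixing blocks: a depthwise conv mixes within a spatial neighbourhood,
        # then a 1x1 conv mixes across channels.
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim))),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim))
          for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes))
```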

This research is significant because it challenges the assumption that more complex models are always better. By showing that a simpler model can outperform more sophisticated ones, it opens up new possibilities for developing efficient and powerful image recognition systems.

This could lead to faster, more accessible, and more resource-efficient technology in various fields, from medical imaging to autonomous driving.

  • Vision Transformer (ViT): A model that uses Transformer architecture, originally designed for natural language processing, to process images. It divides images into patches and processes them to recognize patterns.
  • Self-Attention: A mechanism in neural networks that allows the model to weigh the importance of different parts of the input data when making a decision.
  • Patch Embeddings: The process of dividing an image into smaller regions (patches) and converting each patch into a feature vector that the model can process.
  • ConvMixer: A new model proposed by the researchers that processes image patches using simple convolutional operations. It separates the mixing of spatial and channel information and maintains the same size and resolution throughout the network.
  • MLP-Mixer: A related model that processes images by mixing information across spatial and channel dimensions using multilayer perceptrons (MLPs), a simple type of neural network.

LaMDA

LaMDA: Language Models for Dialog Applications, by many authors.

This research introduces a new kind of language model designed to hold conversations with people. These models learn to understand and generate human-like responses from vast amounts of online conversation and other text. While making the model bigger improves its quality, it doesn’t always make it safer or more accurate.

To tackle this, the researchers fine-tuned the model with extra training to avoid harmful or biased responses and to ensure its answers are based on real facts. They also showed how this smart model can be useful in education and content recommendations.
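
The paper itself is about training data and fine-tuning rather than code, but the overall “generate, check, ground” loop can be caricatured in a few lines of Python. Everything below is a hypothetical stand-in: generate_candidates, safety_score, and lookup are placeholders for the fine-tuned dialog model, the safety classifier, and the external knowledge toolset, not real APIs.

```python
# Everything here is a hypothetical stand-in, not a real LaMDA API.

def generate_candidates(prompt, n=4):
    """Placeholder for sampling n candidate replies from the dialog model."""
    return [f"candidate reply {i} to: {prompt}" for i in range(n)]

def safety_score(reply):
    """Placeholder for the fine-tuned safety classifier (higher is safer)."""
    return 1.0

def lookup(query):
    """Placeholder for the external knowledge toolset (e.g. a search engine)."""
    return "retrieved evidence for: " + query

def respond(prompt):
    # 1. Sample several candidate replies.
    candidates = generate_candidates(prompt)
    # 2. Keep only the candidates the safety classifier accepts.
    safe = [c for c in candidates if safety_score(c) > 0.8]
    if not safe:
        return "I'd rather not answer that."
    # 3. Ground the chosen reply in an external source before returning it,
    #    so that factual claims can be checked against evidence.
    best = safe[0]
    evidence = lookup(best)
    return f"{best} (source: {evidence})"

print(respond("When was the Eiffel Tower built?"))
```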

This research is significant because it addresses two major issues in developing conversational AI: safety and accuracy. By improving how the model learns to avoid harmful or biased responses and how it checks facts, this work sets a new standard for creating reliable and trustworthy AI.

Future research can build on these techniques to make even better and safer AI systems for various applications, from customer service to personal assistants.

  • LaMDA: Stands for Language Models for Dialog Applications. It’s a type of AI model designed to hold conversations.
  • Transformer-based Neural Language Models: A modern type of AI that processes and generates text using a technique called transformers, which are highly effective for understanding language.
  • Parameters: In AI, parameters are the parts of the model that are learned from data. More parameters generally mean a more complex and capable model.
  • Pre-trained: The model is initially trained on a large amount of data to give it a general understanding before it’s fine-tuned for specific tasks.
  • Safety: Ensuring the AI’s responses are appropriate and do not cause harm or show unfair bias.
  • Factual Grounding: Making sure the AI’s responses are based on real and verifiable information.
  • Crowdworker-Annotated Data: Data labelled by people who were hired to provide examples of correct or incorrect responses.
  • Groundedness Metric: A way to measure how well the AI’s responses are based on actual facts.
  • External Knowledge Sources: Tools or databases the AI can use to find accurate information, like online searches or calculators.

Chain-of-Thought

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

This research explores how teaching large language models to think through problems step-by-step can greatly improve their ability to solve complex tasks.

By giving these models examples of how to break down problems into smaller steps, their performance on various challenging tasks, like math and logic problems, improves significantly.

The study shows that this method helps the models achieve top scores on difficult tests.
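
Chain-of-thought prompting needs no special machinery; it is just a prompt that demonstrates worked reasoning. The sketch below uses one of the paper’s own exemplars; call_llm is a placeholder for whichever large language model API you use.

```python
# `call_llm` is a placeholder for whichever large language model API you use.
# The exemplar question and reasoning below are taken from the paper.

COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""

def call_llm(prompt):
    """Placeholder: send the prompt to a large language model, return its text."""
    raise NotImplementedError

# Because the exemplar answer walks through its reasoning, the model tends to
# do the same before giving its final answer, e.g.:
#   "The cafeteria had 23 apples. They used 20, leaving 3. They bought 6 more,
#    so 3 + 6 = 9. The answer is 9."
# answer = call_llm(COT_PROMPT)
```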

This research is important because it demonstrates a simple yet powerful way to enhance the reasoning abilities of large language models. By improving how these models think through problems, we can expect better performance in a wide range of applications, from natural language processing to decision-making systems.

This approach may inspire new methods and techniques for training and utilizing AI in more sophisticated and human-like ways.

  • Chain-of-Thought Prompting: A method where a model is given examples of step-by-step reasoning to follow.
  • Large Language Models (LLM): AI systems trained on vast amounts of text data to understand and generate human language.
  • Intermediate Reasoning Steps: Breaking down a problem into smaller, manageable parts to solve it more effectively.
  • 540B-parameter: Refers to a language model with 540 billion parameters, indicating its complexity and capability.
  • GSM8K Benchmark: A challenging test set of math word problems used to measure the performance of language models.
  • Finetuned GPT-3: A version of the GPT-3 model that has been specifically trained further for a particular task.
  • Verifier: A system or method used to check the correctness of answers produced by a language model.

BLIP

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

This research introduces a new approach to combining image and text understanding, which helps computers better interpret and generate descriptions of images. Unlike previous methods, which focus on either understanding or generation, this new approach excels at both.

It improves performance by using a method to clean up and enhance the data it learns from, leading to better results in tasks like finding images based on text, creating image descriptions, and answering questions about images.

The new method even works well with videos without needing extra training.
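
The data-cleaning idea (the paper calls it “CapFilt”) can be sketched as a simple loop. The helpers below, captioner and filter_model, are hypothetical stand-ins for the paper’s fine-tuned captioning module and image-text matching filter; they are not real library objects.

```python
# `captioner` and `filter_model` are hypothetical stand-ins for the paper's
# fine-tuned captioning module and image-text matching filter.

def bootstrap(web_pairs, captioner, filter_model, threshold=0.5):
    """Clean up noisy (image, caption) pairs scraped from the web."""
    clean_pairs = []
    for image, web_caption in web_pairs:
        # Keep the original web caption only if the filter says it matches the image.
        if filter_model.match_score(image, web_caption) > threshold:
            clean_pairs.append((image, web_caption))
        # Write a synthetic caption and keep it too, if it matches well enough.
        synthetic = captioner.generate(image)
        if filter_model.match_score(image, synthetic) > threshold:
            clean_pairs.append((image, synthetic))
    # The cleaned pairs (plus curated data) are then used to pre-train a new model.
    return clean_pairs
```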

This research is significant because it presents a more effective way of teaching computers to understand and generate language related to images. By improving how the computer learns from noisy data, this approach sets a new standard for performance in various vision-language tasks.

It also shows that these techniques can be applied to videos, indicating a broader potential for future applications in areas like video analysis, multimedia content creation, and interactive AI systems. This research paves the way for more integrated and efficient AI systems that can seamlessly handle both visual and textual information.

  • Vision-Language Pre-training (VLP): A method for training AI to understand and generate information that combines both visual (image) and textual (language) data.
  • Understanding-Based Tasks: Tasks where the AI interprets and makes sense of images and text, like identifying objects in an image or answering questions about an image.
  • Generation-Based Tasks: Tasks where the AI creates new content based on images and text, such as writing captions for images or generating descriptive paragraphs.
  • Noisy Image-Text Pairs: Data collected from the internet that might be inaccurate or irrelevant because it hasn’t been curated or cleaned up.
  • Bootstrapping: A method of improving data quality by generating new, cleaner data from the existing noisy data.
  • Image-Text Retrieval: A task where the AI finds images that match a given text description.
  • Image Captioning: A task where the AI writes descriptions for images.
  • VQA (Visual Question Answering): A task where the AI answers questions about an image.
  • CIDEr: A metric used to evaluate the quality of image captions generated by AI.
  • Zero-Shot: The ability of the AI to perform a task without having been specifically trained for it.

Robo-Hiker

Learning robust perceptive locomotion for quadrupedal robots in the wild, by many authors.

This research focuses on improving the ability of four-legged robots to move quickly and efficiently over difficult terrain by using advanced perception technologies. Normally, robots rely on sensors that detect physical contact with the ground, which slows them down. The new method combines sensors that detect physical contact with visual sensors that “see” the terrain ahead.

This combination allows the robots to move faster and more steadily. The technology was tested in various tough environments, including an hour-long hike in the Alps, where the robot kept up with the speed recommended for human hikers.
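
As a rough illustration of the fusion idea, here is a toy PyTorch module that combines proprioceptive and exteroceptive inputs through a recurrent encoder with a learned gate. This is not the authors’ architecture (they use an attention-based recurrent encoder trained end to end in simulation); all dimensions and names here are placeholders.

```python
import torch
import torch.nn as nn

class BeliefEncoder(nn.Module):
    """Toy fusion module: combine proprioception with exteroceptive terrain
    samples through a recurrent encoder, and learn a gate that decides how
    much to trust the terrain reading. Dimensions are placeholders."""

    def __init__(self, proprio_dim=32, extero_dim=128, hidden_dim=128):
        super().__init__()
        self.extero_enc = nn.Linear(extero_dim, hidden_dim)
        self.gru = nn.GRU(proprio_dim + hidden_dim, hidden_dim, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())

    def forward(self, proprio, extero, hidden=None):
        # proprio: (batch, time, proprio_dim)  -- joint states, IMU, etc.
        # extero:  (batch, time, extero_dim)   -- terrain height samples
        e = torch.relu(self.extero_enc(extero))
        belief, hidden = self.gru(torch.cat([proprio, e], dim=-1), hidden)
        # When the terrain reading is unreliable (fog, snow, soft ground), the
        # gate can suppress it, and the policy falls back on proprioception.
        fused = self.gate(belief) * e
        return torch.cat([belief, fused], dim=-1), hidden
```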

This research is a breakthrough in robotic mobility. By successfully combining visual and physical sensing, the technology allows robots to move more efficiently and confidently in challenging environments.

This advancement opens new possibilities for using robots in places that are dangerous or difficult for humans to explore, like disaster sites, remote areas, or complex urban environments. It sets a new standard for robotic exploration and may inspire further innovations in the field of autonomous robotics.

  • Quadrupedal Robots: Robots with four legs designed to move similarly to animals like dogs or cats.
  • Exteroceptive Perception: The ability to perceive and understand the environment through external sensors, such as cameras.
  • Proprioception: The internal sensing mechanism that helps robots understand their own position and movements without relying on external cues.
  • Attention-Based Recurrent Encoder: A type of neural network that processes sequences of data by focusing on relevant parts, helping to integrate different types of sensory input.
  • Gait: The pattern of movement or walking, which can be adjusted for stability and efficiency.
  • End-to-End Training: A method where the entire system is trained simultaneously, allowing for better integration of different components.
  • Robustness: The ability to function effectively in various challenging conditions.
  • Autonomous: The capability of operating independently without human intervention.

Thanks for reading! Feedback appreciated, especially if you think I’ve missed any important research.

DALL·E 3

--

Nuwan I. Senaratna
On Technology

I am a Computer Scientist and Musician by training. A writer with interests in Philosophy, Economics, Technology, Politics, Business, the Arts and Fiction.