Research Papers in Artificial Intelligence

A History of AI (Part 5)

2017 to 2022 (Transformers to ChatGPT)

Nuwan I. Senaratna
On Technology

--

This article is the 5th in a series in which I present a history of Artificial Intelligence by reviewing the most important research papers in the field.

#ArtificialIntelligence #MachineLearning #DeepLearning #AIResearch #History

Transformers

Attention is All you Need, by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin (2017)

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single-model state of the art by 0.7 BLEU, achieving a BLEU score of 41.1.

This research introduces a simpler way to perform language translation using only an attention mechanism, eliminating the complex recurrent and convolutional structures used previously. The method not only produces better translations but also speeds up training significantly.

This paper was groundbreaking in the AI field because it introduced the transformer model, which revolutionized natural language processing by making models more efficient and effective. The use of attention mechanisms allowed for better handling of long-range dependencies in text, leading to significant improvements in tasks like translation, summarization, and question-answering.

  • Sequence transduction: Converting one sequence of symbols to another, like translating a sentence from English to German.
  • Recurrent neural networks (RNNs): A type of neural network where connections between nodes form a directed graph along a temporal sequence, allowing them to use their internal state (memory) to process sequences of inputs.
  • Convolutional neural networks (CNNs): A type of deep neural network often used for analyzing visual imagery, which uses convolutional layers that apply a filter to an input to create a feature map.
  • Encoder and decoder configuration: A framework used in neural networks where the encoder processes the input into a fixed-size vector, and the decoder processes this vector to produce the output sequence.
  • Attention mechanism: A technique that allows the model to focus on different parts of the input sequence when generating each part of the output sequence, improving performance on tasks like translation.
  • Parallelizable: The ability to run multiple calculations or processes simultaneously, which speeds up computation.
  • BLEU (Bilingual Evaluation Understudy): A metric for evaluating the quality of text which has been machine-translated from one language to another. It compares the machine’s output to that of a human.
  • Transformer model: A type of deep learning model introduced by this paper that relies entirely on self-attention mechanisms to process input sequences, improving efficiency and effectiveness in tasks involving sequential data.
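
To make the central idea concrete, here is a minimal NumPy sketch of the scaled dot-product attention the Transformer is built from. It is an illustration of the mechanism only, not the paper's full model, which adds multiple heads, learned projection matrices, and masking.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of values

# Toy self-attention: 3 tokens with 4-dimensional embeddings. In the real
# model, Q, K and V are separate learned linear projections of the input.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(X, X, X).shape)  # (3, 4)
```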

BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (2018)

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

This research introduces a new model called BERT, which helps computers understand and process human language better. Unlike previous models, BERT reads text in both directions, making it more accurate in understanding the context of words. This model can be fine-tuned for various tasks like answering questions or understanding sentences without needing major changes, achieving top results in several language processing challenges.

This paper was influential because it introduced a novel way of training language models, allowing them to understand context from both directions in a sentence. This bidirectional approach significantly improved the performance of models on many natural language processing tasks. BERT set new benchmarks in the field, pushing the boundaries of what was possible with machine understanding of language and influencing many subsequent developments in AI.

  • BERT (Bidirectional Encoder Representations from Transformers): A language model that processes text in both directions to understand context better.
  • Pre-train: The process of training a model on a large amount of data before fine-tuning it for specific tasks.
  • Fine-tune: Adjusting a pre-trained model with additional training for specific tasks.
  • GLUE (General Language Understanding Evaluation): A benchmark for evaluating the performance of language models on various natural language understanding tasks.
  • MultiNLI (Multi-Genre Natural Language Inference): A dataset used to test a model’s ability to understand and reason about the relationship between sentence pairs.
  • SQuAD (Stanford Question Answering Dataset): A dataset used to evaluate a model’s ability to answer questions based on a given text.
  • F1 Score: A measure of a model’s accuracy, considering both precision (correctly identified items) and recall (items that should be identified).
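
As a concrete illustration of the "one additional output layer" pattern the abstract describes, here is a minimal sketch using the Hugging Face transformers library and PyTorch (an assumption on my part; the paper itself released its own TensorFlow code). It loads pre-trained BERT with a fresh classification head, ready for fine-tuning.

```python
# Pre-trained BERT plus one randomly initialised classification layer,
# using the Hugging Face `transformers` library (not from the paper).
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2); the new head is untrained
# Fine-tuning would now train this model end-to-end on labelled examples.
```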

GPT

Language Models are Few-Shot Learners, by Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (2020)

We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.

Researchers developed a large language model, GPT-3, which significantly improves the ability of AI to understand and perform tasks with minimal instructions. GPT-3, with 175 billion parameters, can handle tasks like translation and answering questions just by being given examples through text, without needing to be specially trained for each task. While it excels in many areas, it still struggles with some tasks and faces challenges from its training on extensive internet data.

This paper was influential because it demonstrated that increasing the size of language models could lead to significant improvements in performance across a variety of tasks without requiring specialized training. This finding shifted the focus towards scaling models to achieve better results and highlighted the potential for more versatile AI applications.

  • Language Model: A type of AI that predicts and generates text based on the input it receives.
  • Few-Shot Learning: The ability of a model to learn and perform a task using only a few examples.
  • Parameters: Variables in the model that are adjusted during training to help the model make accurate predictions.
  • Autoregressive: A type of model that generates text one word at a time, using each previous word to predict the next.
  • Gradient Updates: Adjustments made to the parameters of the model during training to improve its performance.
  • Fine-Tuning: Training a pre-trained model on a specific task to improve its performance on that task.
  • NLP (Natural Language Processing): A field of AI focused on the interaction between computers and human language.
  • Translation: Converting text from one language to another.
  • Question-Answering: An AI task where the model provides answers to questions based on given information.
  • Cloze Tasks: Tasks where the model fills in missing words in a sentence or passage.
  • Web Corpora: Large collections of text data sourced from the internet used to train AI models.
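
Few-shot prompting needs no special tooling: the task is specified entirely in text, with no gradient updates. The sketch below builds a translation prompt in the spirit of the paper's demonstrations; the example word pairs are illustrative, not quoted from the paper.

```python
# Few-shot prompting: the task description and a handful of demonstrations
# are plain text. The model (not shown) is asked to continue the prompt;
# "fromage" is the expected completion. No fine-tuning takes place.
few_shot_prompt = (
    "Translate English to French.\n\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)
print(few_shot_prompt)
```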

ViT

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby (2020)

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

This research shows that a type of AI model called a “Transformer,” which has been very successful in understanding and generating human language, can also be used to analyze images. Traditionally, image recognition relied on a different type of model known as convolutional neural networks (CNNs). However, the researchers demonstrated that Transformers could be directly applied to images divided into small patches and still achieve excellent results, sometimes even better than CNNs. Moreover, Transformers require less computing power to train, making them more efficient.

This paper was influential because it challenged the long-standing dominance of convolutional neural networks (CNNs) in computer vision. By showing that Transformers, a model architecture successful in natural language processing, could be effectively adapted for image recognition, it opened up new possibilities for AI research and applications. It demonstrated the versatility of Transformers and spurred further research into their use across different types of data and tasks.

  • Image Patches: Small, non-overlapping sections of an image that can be processed individually by a model.
  • Benchmarks (ImageNet, CIFAR-100, VTAB): Standard datasets used to evaluate and compare the performance of different image recognition models.
  • Vision Transformer (ViT): A type of Transformer model specifically adapted for image recognition tasks.
  • Computational Resources: The amount of computing power, memory, and time required to train and run a machine learning model.
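
The title's "16x16 words" can be shown in a few lines: the image is cut into non-overlapping 16x16 patches, each flattened into a vector that plays the role of a word embedding. Below is a minimal NumPy sketch, assuming a standard 224x224 RGB input; the paper's full pipeline additionally applies a learned linear projection and adds position embeddings.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group pixels by patch
    return patches.reshape(-1, patch_size * patch_size * c)

image = np.zeros((224, 224, 3))                 # a standard-sized input
tokens = image_to_patches(image)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each a 768-dim "word"
```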

AlphaFold

Highly accurate protein structure prediction with AlphaFold, by John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Demis Hassabis (2021)

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence — the structure prediction component of the ‘protein folding problem’ — has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

This research presents AlphaFold, a groundbreaking AI tool that predicts the 3D structure of proteins based solely on their amino acid sequences. Understanding protein structures is crucial for comprehending their functions and for drug discovery. Traditionally, determining these structures has been a slow and labor-intensive process. AlphaFold overcomes this by using advanced machine learning techniques to achieve atomic-level accuracy in its predictions, even for proteins with no previously known structures. This innovation promises to accelerate biological research and medical advancements significantly.

This paper was influential because it showcased the power of AI in solving a complex, long-standing scientific problem. AlphaFold’s ability to predict protein structures with high accuracy demonstrated how deep learning could revolutionize fields beyond traditional computing, particularly in biology and medicine. This breakthrough illustrated the potential of AI to make significant contributions to scientific research and practical applications, inspiring further interdisciplinary advancements.

  • Proteins: Large, complex molecules made up of amino acids that perform many vital functions in living organisms.
  • Amino Acid Sequence: The order of amino acids in a protein, which determines its structure and function.
  • Protein Folding Problem: The challenge of predicting a protein’s 3D structure from its amino acid sequence.
  • Atomic Accuracy: The precision of predicting the exact position of each atom in a protein molecule.
  • CASP (Critical Assessment of protein Structure Prediction): A community-wide experiment to assess methods for predicting protein structures.
  • Multi-Sequence Alignments: A method to identify similarities and differences in protein sequences, helping to infer structural and functional information.

ChatGPT

Introducing ChatGPT, by OpenAI (2022)

We’ve trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer follow-up questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.

ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response.

We are excited to introduce ChatGPT to get users’ feedback and learn about its strengths and weaknesses. During the research preview, usage of ChatGPT is free. Try it now at chatgpt.com.

OpenAI has developed a conversational AI model called ChatGPT. This model can engage in dialogue, answer questions, admit mistakes, and handle various requests in a natural, conversational way. It is designed to be interactive, making it easier for users to get detailed responses and engage in meaningful conversations. The model is available for free during its research preview, and OpenAI is eager to gather user feedback to improve its performance.

This announcement was influential because it introduced a highly capable conversational AI that can understand and generate human-like text, significantly advancing the field of natural language processing (NLP). The interactive format of ChatGPT, along with its ability to handle complex queries and engage in meaningful dialogues, demonstrated a new level of AI sophistication and usability. It also highlighted the potential of AI to assist in various applications, from customer service to education, thereby broadening the scope of AI utility.

  • Conversational AI: Artificial Intelligence designed to interact with humans in a natural, conversational manner, understanding and generating human language.
  • Dialogue Format: A method of interaction where the AI engages in back-and-forth conversation with the user, making the exchange more interactive and natural.
  • InstructGPT: A sibling model to ChatGPT, specifically trained to follow detailed instructions provided in prompts and deliver precise responses.
  • Research Preview: A phase during which a new technology or model is made available to users for testing and feedback, allowing developers to gather data and improve the system.
  • Natural Language Processing (NLP): A field of AI focused on the interaction between computers and humans using natural language, enabling machines to understand and respond to text or speech inputs.
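
The dialogue format is easiest to see as data. Below is a minimal, purely illustrative sketch of a conversation as a list of role-tagged messages, the representation popularised by chat-style APIs; it is not OpenAI's internal format.

```python
# A purely illustrative representation of the dialogue format: each turn is
# a role-tagged message, and follow-up questions simply extend the list,
# giving the model the full conversational context on every turn.
conversation = [
    {"role": "user", "content": "How do I reverse a list in Python?"},
    {"role": "assistant", "content": "Use my_list[::-1], or my_list.reverse() in place."},
    {"role": "user", "content": "What is the difference between those two?"},  # follow-up
]
```
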
--

Nuwan I. Senaratna
On Technology

I am a Computer Scientist and Musician by training. A writer with interests in Philosophy, Economics, Technology, Politics, Business, the Arts and Fiction.