image by Disco Diffusion

History of Generative AI. Paper explained.

Iva @ Tesla Institute · Published in Artificialis · Mar 12, 2023 · 11 min read

Generative AI techniques like ChatGPT, DALL-E, and Codex can generate digital content such as images, text, and code. Recent progress in large-scale AI models has improved generative AI’s ability to understand intent and generate more realistic content. This text summarizes the history of generative models and their components, recent advances in AI-generated content for text, images, and across modalities, and the remaining challenges.

In recent years, Artificial Intelligence Generated Content (AIGC) has gained much attention beyond the computer science community, with society at large interested in the content generation products built by large tech companies. Technically, AIGC refers to using generative AI algorithms to produce content that satisfies a human instruction, where the instruction teaches and guides the model toward the task. This generation process usually comprises two steps: extracting intent information from the human instruction and generating content according to the extracted intent.

HISTORY OF GENERATIVE AI

Generative models have a long history in AI, dating back to the 1950s. Early models like Hidden Markov Models and Gaussian Mixture Models could only generate simple data, and the field saw major improvements with deep learning. In NLP, traditional sentence generation relied on N-gram language models, which struggled with long sentences; recurrent neural networks and Gated Recurrent Units later enabled modeling of longer dependencies, handling roughly 200 tokens. In CV, pre-deep-learning image generation used hand-designed features with limited complexity and diversity, until Generative Adversarial Networks and Variational Autoencoders enabled impressive image generation.

Advances in generative models followed different paths in the two fields but converged with the transformer, introduced for NLP in 2017. Transformers now dominate generative models across domains: in NLP, large language models like BERT and GPT are built on them, and in CV, Vision Transformers and Swin Transformers combine transformer layers with visual components. Transformers also enabled multimodal models like CLIP, a joint vision-language model pre-trained on massive text and image data, which can be paired with image generators to produce images from text prompts. In short, transformers revolutionized AI generation and made large-scale training practical.

FOUNDATIONS FOR AIGC:

Transformer

Transformer models are the foundation for many state-of-the-art NLP systems. Proposed to address limitations of RNNs, Transformers use self-attention to relate contexts across variable-length sequences. Transformers have an encoder to generate context vectors from input sequences and a decoder to produce outputs from context vectors. Each encoder/decoder layer uses multi-head attention, which weights input tokens by relevance, enabling modeling of long-range dependencies.
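
To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product self-attention for a single head; the shapes and random projection matrices are purely illustrative, not the full multi-head, multi-layer Transformer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head) projections (toy values)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of every token to every other token
    weights = softmax(scores, axis=-1)       # attention weights sum to 1 over the sequence
    return weights @ V                       # each output mixes the values by relevance

# Toy usage: 5 tokens, 16-dim model, one 8-dim head.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
out = self_attention(X, *(rng.normal(size=(16, 8)) for _ in range(3)))
print(out.shape)  # (5, 8)
```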

Pre-trained Transformer models fall into two types: autoregressive language modeling, which predicts the next token from the tokens that precede it (as in GPT), and masked language modeling, which predicts masked tokens from the surrounding context (as in BERT and RoBERTa). Transformers have become dominant in NLP because of their learning capacity and parallelism.
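
As a quick illustration of the masked objective, the Hugging Face transformers library exposes a fill-mask pipeline; the model name below is just a common public checkpoint chosen for the example:

```python
from transformers import pipeline

# BERT-style masked language modeling: predict the hidden token from both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("Transformers use [MASK] to model long-range dependencies."):
    print(candidate["token_str"], round(candidate["score"], 3))
```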

Reinforcement Learning from Human Feedback

The process of reinforcement learning with human feedback (RLHF) includes three main steps:

  1. First, we train a general language model on large datasets to get an initial model.
  2. Then, we train a reward model to encode how humans evaluate different responses to the same prompt. We show humans multiple possible responses and have them compare them in pairs. We use those comparisons to assign a score to each response.
  3. Finally, we further train the language model with reinforcement learning to maximize the reward model’s scores, using proximal policy optimization (PPO) to stabilize training. At each step we also include a penalty term that prevents the model from producing strange responses just to trick the reward model: the total reward equals the reward model’s score minus a penalty based on how far the model’s response drifts from the initial model’s response (a minimal sketch of this reward follows the list).
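
A minimal sketch of that per-step reward, assuming a scalar score from the reward model and token-level logits from the fine-tuned and initial models (the penalty weight beta and the use of a full-distribution KL term are illustrative simplifications):

```python
import torch
import torch.nn.functional as F

def rlhf_reward(reward_model_score, policy_logits, initial_logits, beta=0.1):
    """policy_logits / initial_logits: (seq_len, vocab) logits over the response tokens."""
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    initial_logprobs = F.log_softmax(initial_logits, dim=-1)
    # Average per-token KL divergence between the fine-tuned policy and the frozen initial model.
    kl = (policy_logprobs.exp() * (policy_logprobs - initial_logprobs)).sum(-1).mean()
    return reward_model_score - beta * kl  # total reward handed to the PPO update
```
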
Figure: Statistics of model size and training speed across different models and computing devices.

Computing

Recent hardware advancements have sped up AI model training. Powerful GPUs and TPUs can train large neural networks in days rather than weeks. For example, the NVIDIA A100 GPU trains the BERT-large model 7x faster than the V100 GPU and 11x faster than the T4. Google’s TPUs offer even higher performance than A100 GPUs. This progress has significantly increased model training efficiency and enabled larger, more complex AI models.

Cloud computing services have enabled training large AI models that were previously impossible. Access to powerful computing resources in the cloud allows researchers to spin up clusters of GPUs and TPUs as needed, facilitating more complex and accurate models that unlock new AI capabilities.

GENERATIVE AI

Generative language models (GLMs) are a type of NLP model trained to generate readable human language based on patterns and structures in the input data they have been exposed to. These models can be used for a wide range of NLP tasks such as dialogue systems, translation, and question answering. Encoder-decoder models can leverage both context information and autoregressive properties to improve performance across a variety of tasks.

The GPT language model uses an autoregressive decoder with self-attention to generate coherent text conditioned on the previous words. GPT-2 and GPT-3 improve on the model by scaling up the parameters and using more diverse datasets, while other models such as Gopher and BLOOM optimize the normalization and attention mechanisms to achieve better performance.
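
The autoregressive loop is easy to see with the public GPT-2 checkpoint via Hugging Face transformers; this hedged sketch does plain greedy decoding, appending one predicted token at a time:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Generative language models can", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                                   # greedy decoding of 20 new tokens
        logits = model(ids).logits                        # (1, seq_len, vocab_size)
        next_id = logits[:, -1].argmax(-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=-1)           # append and condition on it next step
print(tokenizer.decode(ids[0]))
```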

Encoder-decoder models are widely used in natural language processing, with the Text-to-Text Transfer Transformer (T5) being a popular example. T5 uses a “text-to-text” approach, transforming input and output data into standardized text format for pre-training on various NLP tasks. Other models, such as Switch Transformer, ExT5, BART, and HTLM, have been developed to improve upon T5 by incorporating different techniques, such as MoE routing algorithms, extended pre-training on diverse domains, and blending bidirectional and autoregressive properties for generation tasks.
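
The text-to-text framing can be tried with the small public T5 checkpoint; the task prefixes below come from T5’s own pre-training setup, while the input sentences are made up for illustration:

```python
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
# Every task is phrased as input text -> output text.
print(t5("translate English to German: The house is wonderful."))
print(t5("summarize: Generative models have a long history in AI, dating back to the 1950s. "
         "Early models like Hidden Markov Models and Gaussian Mixture Models generated simple data."))
```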

Vision Generative Models

Generative Adversarial Networks (GANs) generate new data by learning the distribution of real examples, using a generator and a discriminator. The structure of the generator and discriminator significantly affects a GAN’s stability and performance.
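
A compact PyTorch sketch of one GAN training step under toy assumptions (tiny MLPs, random stand-in data): the discriminator learns to separate real from generated samples, and the generator learns to fool it:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))  # generator
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))           # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, data_dim)       # stand-in for a batch of real examples
fake = G(torch.randn(32, latent_dim))  # generated samples from random noise

# Discriminator step: push real samples toward 1 and generated samples toward 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: make the discriminator label generated samples as real.
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```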

Various GAN models such as LAPGAN, DCGAN, Progressive GAN, SAGAN, BigGAN, StyleGAN, D2GAN, GMAN, MGAN, MAD-GAN, and CoGAN have been proposed to improve GAN’s ability to generate high-quality and diverse images.

Variational Autoencoders (VAEs) are generative models that map data to a probabilistic latent distribution and learn to produce reconstructions that stay close to the original input.
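
A minimal sketch of the VAE objective under those terms: a reconstruction loss plus a KL term that keeps the learned latent distribution close to a standard normal prior (the encoder/decoder networks are omitted and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_reconstructed, mu, logvar):
    """mu, logvar: parameters of the approximate posterior q(z|x) produced by the encoder."""
    reconstruction = F.mse_loss(x_reconstructed, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and the standard normal prior N(0, 1).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + kl
```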

Generative diffusion models (GDMs) are a cutting-edge class of probabilistic generative models that demonstrate state-of-the-art results in computer vision.
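
The core idea is easiest to see in the forward (noising) process of a DDPM-style model: data is gradually mixed with Gaussian noise, and a network is later trained to reverse the corruption. The linear schedule and tensor sizes below are illustrative choices:

```python
import torch

def noisy_sample(x0, t, alpha_bar):
    """x0: clean data; t: timestep index; alpha_bar: cumulative product of (1 - beta_t)."""
    eps = torch.randn_like(x0)
    xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
    return xt, eps  # the denoising network is trained to predict eps from (xt, t)

betas = torch.linspace(1e-4, 0.02, 1000)     # toy linear noise schedule
alpha_bar = torch.cumprod(1 - betas, dim=0)
xt, eps = noisy_sample(torch.randn(3, 32, 32), t=500, alpha_bar=alpha_bar)
```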

Multimodal Models

Multimodal generation is an important aspect of AI and involves learning a model that generates multiple types of data by understanding the connections and interactions between them. It can be challenging to learn the complex representation space of multimodal data compared to unimodal data, but recent advancements in modality-specific architectures have helped address this issue. Multimodal generation models are often used in real-world applications, and this section discusses the state-of-the-art models for various tasks, such as vision-language generation, text-audio generation, text-graph generation, and text-code generation.

Concatenated encoders combine embeddings from different modalities into a single representation, while cross-aligned encoders match up related information across modalities.
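
A toy PyTorch sketch of the two fusion styles, with made-up embedding shapes: concatenation stacks the text and image embeddings into one sequence for a shared encoder, while a cross-attention layer lets one modality attend over the other:

```python
import torch
import torch.nn as nn

text_emb = torch.randn(1, 12, 256)   # (batch, text_tokens, dim) — illustrative shapes
image_emb = torch.randn(1, 49, 256)  # (batch, image_patches, dim)

# Concatenated encoder: one joint sequence fed to a shared Transformer.
joint = torch.cat([text_emb, image_emb], dim=1)  # (1, 61, 256)

# Cross-aligned encoder: text queries attend over image keys/values.
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
aligned_text, _ = cross_attn(query=text_emb, key=image_emb, value=image_emb)
```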

VisualBERT, VL-BERT, and UNITER are vision-language models that use pre-training objectives such as masked language modeling, masked ROI classification, and image-text matching prediction to learn informative contextualized embeddings. SimVLM is a simplified variant that uses ViT as both the text and image encoder and achieves state-of-the-art performance on multiple vision-language tasks.

Cross-aligned encoders are another way to learn joint representation spaces by looking at interactions between different modalities. LXMERT is an example of this type of encoder, using Transformers to extract image and text features and a multimodal cross-attention module to generate visual, language, and multimodal embeddings.

To-text decoders are divided into jointly trained models and frozen models. Jointly trained models require complete cross-modal training to align the two modalities during pre-training, while frozen models freeze the language model and train only the image encoder.

To-image decoders are divided into GAN-based and diffusion-based methods, with GAN-based methods using a discriminator and a generator that accepts the text embedding and noise vector to generate output. StackGAN and AttnGAN are examples of GAN-based models that use a simple text encoder during instruction learning, while StyleCLIP uses contrastive learning to align text and image features. The success of prompting and in-context learning in NLP has led to increased attention towards multimodal in-context learning methods. Flamingo involves a frozen vision encoder and a frozen language encoder to get vision language representations, while VL dialogue with frozen language models generates interleaved multimodal data.
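
Since CLIP-style alignment comes up repeatedly here, this is a hedged usage sketch with the public CLIP checkpoint on Hugging Face; the image path and candidate captions are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
inputs = processor(text=["a photo of a dog", "a photo of a cat"],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Image-text similarity scores; softmax turns them into matching probabilities.
print(outputs.logits_per_image.softmax(dim=-1))
```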

Text-audio multimodal processing has grown in recent years, with models focusing on synthesis or recognition tasks. However, text-audio generation involves creating novel audio or text using multimodal models and differs from synthesis or recognition tasks. AdaSpeech and Lombard are proposed models for voice customization and generating highly intelligible speech, respectively. Cross-lingual generation is also an influential work for transferring voices across languages. The goal of this work is to focus on text-audio generation rather than synthesis or recognition tasks.

Text-Music Generation is a field where models are trained to create music from textual inputs, or generate descriptions and captions for music. Several works use cross-modal learning to find similarities between audio and lyrics or combine multiple types of information related to music. One example is JTAV, which fuses textual, acoustic, and visual information using cross-modal fusion and attentive pooling techniques. Another example is MusCaps, a music audio captioning model that generates descriptions of music audio content by processing audio-text inputs through a multimodal encoder and leveraging audio data pre-training to obtain effective musical feature representations.

Text-graph generation is a crucial area that can enhance NLP systems by converting natural language text into structured knowledge graphs. This helps machines work with well-organized and compressed content, and many works focus on extracting knowledge graphs from text to assist in text generation. Text-graph generation can also benefit computer-aided drug design by bridging molecule graphs with language descriptions.

Text-to-Molecule: Text2Mol is a system that retrieves molecule graphs based on language description using a BERT-based text encoder and a molecule encoder, while MolT5 proposes a self-supervised learning framework for de-novo molecule generation and captioning, but its string-based representation may result in structural information loss.

Text-code generation uses LLMs to generate programming code from natural language descriptions, but strategies are required to capture the mutual dependencies between NL and PL during semantic space alignment due to their inherent differences. Additionally, text-code models need to be able to understand PL’s rich structural information and syntax, and be multi-lingual for better generalization.
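
A hedged sketch of natural-language-to-code generation with an openly available code model via Hugging Face transformers; the CodeGen checkpoint is an illustrative choice, not one of the systems discussed in the text:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

prompt = "# Python function that returns the n-th Fibonacci number\ndef fibonacci(n):"
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64)  # continue the code from the NL comment
print(tokenizer.decode(out[0], skip_special_tokens=True))
```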

APPLICATIONS

ChatBot.

A chatbot is a program that mimics human conversation through text-based interfaces, leveraging language models to interpret and respond to user inquiries in a conversational manner, with applications ranging from customer support to answering common queries.

Xiaoice, a chatbot developed by Microsoft, uses advanced natural language processing, machine learning, and knowledge representation techniques to mimic human conversation and express empathy. Google’s Meena chatbot, trained on social media conversations, achieved state-of-the-art results in 2020, while Microsoft’s latest version of Bing incorporates ChatGPT to enable users to ask open-domain or conditioned questions and get results through conversation, paving the way for future chatbot development.

Art.

AI art generation involves using computer algorithms to create new pieces of art, with techniques like machine learning and large datasets of existing artwork to mimic famous artists or explore new styles. Companies like OpenAI and Stability.ai have launched art generation products, including the DALL-E and DreamStudio series, which use diffusion-based models for image generation based on text input. Google’s Imagen also uses diffusion for image editing and generation, outperforming other models in a study evaluating AI-generated image quality.
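
For readers who want to try diffusion-based text-to-image generation themselves, here is a hedged sketch with the open-source diffusers library; the checkpoint is a publicly available Stable Diffusion model used for illustration, not one of the products named above, and a GPU is assumed:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe("an oil painting of a lighthouse at dawn").images[0]  # text prompt in, PIL image out
image.save("lighthouse.png")
```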

Music.

Deep music generation is the use of AI to create new music, usually by using a piano roll to specify timing, pitch, velocity, and instrument for each note. AIVA and Jukebox are examples of AI music generators that can create music in multiple styles and with singing, respectively, and have gained recognition for their musical quality and capacity for artistic conditioning.

Code.

AI-based programming systems aim to complete tasks such as code completion, program repair, and natural language to code generation. LLMs have helped advance this field, with CodeGPT being a notable example that generates code based on a vast amount of source code data. CodeParrot is a programming learning platform that uses scaffolding to help students gradually build their coding skills, while Codex is designed to generate complete coding programs from scratch and can be adapted to multiple programming languages.

Education.

AIGC can enhance personalized education by utilizing various data types such as tutorial videos and academic papers. Google Research developed Minerva, a model based on the PaLM general language models and specialized datasets, capable of solving college-level quantitative tasks in subjects such as algebra, physics, and biology. Minerva uses techniques like few-shot prompting and scratchpad prompting to achieve state-of-the-art performance in reasoning tasks. Although it is not yet as good as humans, with continued improvement AIGC could offer affordable, personalized math tutors. Skillful Craftsman Education Technology is developing a class bot product powered by AIGC, with auto curriculum, AI tutors, and self-adaptive learning for online education, planned to ship in 2023.

Prompt Learning

Prompt learning is a relatively new concept built on pre-trained large language models: instead of fine-tuning, a prompt template is found so that the pre-trained model directly predicts the probability of the desired output from the prompt. This approach can save time and effort by allowing language models pre-trained on large amounts of raw text to be adapted to new domains without additional tuning.

Prompt engineering is an important step in this process, which can involve either discrete or continuous prompts. Answer engineering is another step that involves mapping the generated answer to the ground truth space.

In-context learning is a subset of prompt learning that involves adding a few input-label demonstration pairs and instructions to the prompt, which has been shown to be highly effective in improving language models’ performance.
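
As a small illustration, a few-shot prompt just places demonstration pairs and an instruction ahead of the new input; the reviews and labels below are invented for the example, and the assembled string would be sent to any generative language model:

```python
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: positive

Review: "It stopped working after a week and support never replied."
Sentiment: negative

Review: "Setup took five minutes and it just works."
Sentiment:"""

# The model is expected to continue the pattern with "positive".
print(few_shot_prompt)
```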

Security and Privacy

Although AI tools like ChatGPT can create content that sounds reasonable, that content may not always be factually accurate. This can lead to the spread of false information and misinformation, and there have been examples of AI-generated content being used to spread false narratives and propaganda. To address this issue, researchers have proposed metrics and models to measure the factual accuracy of AI-generated content and to train models to be more truthful. It is also important for AI-generated content to be safe, unbiased, and non-toxic; research has been conducted to improve the safety and ethics of language models, such as using human feedback to fine-tune models and align them with human preferences.

CONCLUSION

This survey covers the history of and recent improvements in generative AI, including models that can create text, images, and sounds. It also discusses how these models are used, concerns about their trustworthiness and responsibility, and the challenges and future directions of generative AI, with the goal of helping readers understand the field better.

