Unraveling the Magic: A Deep Dive into Generative AI

NeuroCortex.AI
19 min read · Mar 6, 2024


Part 1 here: Unleashing Creativity: An Introduction to Generative AI | by NeuroCortex.AI | Feb, 2024 | Medium

I hope you remember our friend Valerie, who discovered how useful generative AI is to her creative process. It was a life-altering experience for her, and she started talking about it to her network of friends and peers. One of the people she talked to was Alex, a programmer at a deep-tech startup, who agreed that it was a game changer for programmers as well.

He was a talented coder, known for his expertise in software design and writing automation suites. Alex had always been fascinated by the potential of AI to revolutionize the world, but he had never imagined what he would find when he stumbled upon the concept of Gen AI.

Next gen human-machine symbiosis as imagined by Generative AI based text to image models

It all started one evening when Alex attended a tech conference in the city. Among the myriad of presentations and demonstrations, one topic stood out to him: Generative AI. The speakers described it as the next leap in artificial intelligence, capable of creating virtual entities with unprecedented complexity and realism.

Intrigued, Alex began to delve deeper into the world of Gen AI. He spent countless hours studying research papers, attending workshops, and experimenting with different Gen AI algorithms. The more he learned, the more fascinated he became. This is the story of his in-depth learning.

Before proceeding further, our esteemed readers will need prerequisite knowledge of machine learning and deep learning, along with some implementation skills in Python.

Some resources for readers to start with:

  1. Ahead of AI | Sebastian Raschka, PhD | Substack
  2. Jay Alammar — Visualizing machine learning one concept at a time. (jalammar.github.io)
  3. Home — colah’s blog — Christopher Olah
  4. Lil’Log (lilianweng.github.io) — Lilian Weng
  5. Blog (huyenchip.com) — Chip Huyen
  6. jeremy.fast.ai — Jeremy Howard
  7. Cezanne Camacho — Machine and deep learning educator.
  8. ruder.io — Sebastian Ruder
  9. 12 Best Machine Learning Blogs to Read in 2021 (bloggingfordevs.com)
  10. Best Machine Learning Blogs to Follow in 2022 | Towards AI
  11. What Are the Best, Regularly Updated Machine Learning Blogs or Resources Available? (neptune.ai)
  12. The Best Machine Learning Blogs and Resources (stxnext.com)
  13. https://peterbloem.nl/blog/transformers — Peter Bloem

Let’s see what Alex learned as he explored this vast field.

Fundamentals of Generative AI

Various Gen AI architectures and their flow of information processing

At its core, generative AI utilizes deep learning to understand patterns within vast datasets. Unlike traditional AI, which relies on explicit programming, generative AI learns from data to make predictions and generate new content. The architecture typically involves neural networks, the building blocks of deep learning.

Generative AI is a field within artificial intelligence (AI) that focuses on creating systems capable of generating new content, such as images, text, music, and more, that is similar to data it has been trained on. These systems learn the underlying patterns and structures of the data they’re trained on and can then generate new, original content that follows those patterns.

Here are some fundamental concepts and techniques in generative AI:

Generative Models: These are the algorithms or architectures used to generate new data. Some common types include:

  • Generative Adversarial Networks (GANs): GANs consist of two neural networks, a generator and a discriminator, which are trained simultaneously. The generator learns to produce data that is indistinguishable from real data, while the discriminator learns to differentiate between real and fake data. Through this adversarial training process, GANs can generate realistic data.
  • Variational Autoencoders (VAEs): VAEs are a type of neural network architecture that learns to encode input data into a lower-dimensional latent space and then decode it back into the original data space. By sampling from the latent space, VAEs can generate new data points.
  • Autoregressive Models: These models generate data sequentially, where each element of the sequence is dependent on previous elements. Examples include autoregressive models like PixelCNN and WaveNet.

Let’s expand on the types of generative AI architectures that have changed the landscape of modern AI.

Generative Adversarial Networks (GANs)

The GAN architecture is an impressive use of unsupervised learning. The generator learns by optimizing a loss provided by the discriminator network, which in turn learns to tell the generator’s outputs apart from real data.

GANs are a cornerstone of generative AI architecture. Introduced by Ian Goodfellow and his colleagues in 2014, GANs consist of two neural networks — a generator and a discriminator — engaged in a continuous feedback loop. The generator creates content, and the discriminator evaluates its authenticity. This adversarial process refines the generator’s ability to produce increasingly realistic outputs.

Here’s how GANs work:

  1. Generator: The generator takes random noise or a latent vector as input and generates data samples (e.g., images) from this noise. Initially, the generator produces random outputs, but as training progresses, it learns to generate data that becomes increasingly similar to the real data from the training set.
  2. Discriminator: The discriminator is a binary classifier that learns to distinguish between real data samples from the training set and fake data samples generated by the generator. It is trained on a combination of real and fake data samples and learns to assign high probabilities to real samples and low probabilities to fake samples.
  3. Adversarial Training: During training, the generator and discriminator are trained iteratively in a minimax game framework. The generator tries to produce data samples that are indistinguishable from real samples, while the discriminator aims to correctly classify real and fake samples. The objective of the generator is to minimize the discriminator’s ability to distinguish between real and fake samples, while the objective of the discriminator is to maximize its accuracy in discriminating between them.
  4. Loss Functions: The generator and discriminator are optimized using different loss functions. The generator’s loss is typically the negative of the discriminator’s output when applied to fake samples, aiming to fool the discriminator. The discriminator’s loss combines losses from correctly classifying real and fake samples.
  5. Training Stability: Training GANs can be challenging due to issues such as mode collapse (where the generator produces limited varieties of samples) and vanishing gradients. Various techniques have been proposed to address these issues, including different architectures (e.g., Deep Convolutional GANs — DCGANs), regularization methods, and alternative loss functions.

Feel free to have a look at the following implementation of a Deep Convolutional GAN.
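
The original notebook embed may not render in this export; as a stand-in, here is a minimal GAN training loop in PyTorch that shows the adversarial updates described above. It uses fully connected networks rather than convolutions, and the sizes, hyperparameters, and dummy data batch are illustrative assumptions, not the referenced DCGAN implementation.

```python
# Minimal GAN training loop in PyTorch; an illustrative sketch, not the
# notebook referenced above. Shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn

latent_dim = 100

# Generator: maps a latent noise vector to a flattened 28x28 "image".
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)

# Discriminator: classifies an image as real (1) or fake (0).
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

criterion = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

# Dummy "real" batch standing in for a DataLoader over MNIST or similar data.
real_images = torch.rand(64, 28 * 28) * 2 - 1

for step in range(100):
    # Train the discriminator: push real samples toward 1, fakes toward 0.
    z = torch.randn(64, latent_dim)
    fake_images = generator(z).detach()
    d_loss = criterion(discriminator(real_images), torch.ones(64, 1)) + \
             criterion(discriminator(fake_images), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to make the discriminator output 1 on fakes.
    z = torch.randn(64, latent_dim)
    g_loss = criterion(discriminator(generator(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```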

GANs have been successfully applied to various tasks, including image generation, image-to-image translation, super-resolution, style transfer, text-to-image synthesis, and more. They have also inspired numerous variants and extensions, such as conditional GANs, Wasserstein GANs, and progressive GANs, which aim to improve training stability and generate higher-quality samples. GANs continue to be an active area of research with applications in art, entertainment, healthcare, and many other domains.

Autoencoders

Autoencoders are able to reconstruct the original signal from a latent representation

Autoencoders are another essential component, especially in unsupervised learning scenarios. These networks compress input data into a latent space representation and then reconstruct it. This architecture is widely used for tasks such as image denoising and feature extraction.

Autoencoders are a class of neural network architectures used for unsupervised learning of efficient data representations. They work by compressing the input data into a lower-dimensional latent space and then reconstructing the original input data from this representation. Autoencoders consist of two main components: an encoder and a decoder.

Here’s how autoencoders work:

  1. Encoder: The encoder network takes the input data and maps it to a lower-dimensional latent space representation. This latent representation captures the important features or patterns present in the input data. The encoder typically consists of several layers of neurons, gradually reducing the dimensionality of the input until it reaches the desired latent space dimension.
  2. Latent Space: The latent space is a compact and dense representation of the input data. It contains encoded information about the input, such as features or attributes that are relevant for reconstruction. The dimensionality of the latent space is a hyperparameter that needs to be chosen based on the complexity of the data and the desired level of compression.
  3. Decoder: The decoder network takes the latent representation produced by the encoder and reconstructs the original input data. The decoder’s goal is to generate output data that closely resembles the input data while minimizing the reconstruction error. Like the encoder, the decoder typically consists of several layers that gradually upsample the latent representation until it matches the dimensions of the original input.
  4. Training: Autoencoders are trained using a reconstruction loss function, which measures the difference between the input data and the reconstructed output. The most commonly used loss function is the mean squared error (MSE) loss, although other loss functions like binary cross-entropy may be used depending on the nature of the input data. The network’s parameters (weights and biases) are optimized to minimize the reconstruction loss using gradient-based optimization algorithms such as stochastic gradient descent (SGD) or Adam.
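
Putting the encoder, latent space, decoder, and reconstruction loss together, here is a minimal fully connected autoencoder sketch in PyTorch. The 784 → 32 dimensions are arbitrary assumptions for flattened 28×28 images.

```python
# Minimal autoencoder sketch in PyTorch; layer sizes are assumptions.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                        nn.Linear(128, 784), nn.Sigmoid())
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
mse = nn.MSELoss()

x = torch.rand(64, 784)            # dummy batch standing in for flattened images
for epoch in range(10):
    x_hat = autoencoder(x)         # encode to a 32-d latent space, then decode
    loss = mse(x_hat, x)           # reconstruction loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()

latent = encoder(x)                # compact 32-dimensional representation
```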

Autoencoders have several applications in machine learning and data compression:

  • Dimensionality Reduction: Autoencoders can learn compact representations of high-dimensional data, which can be useful for tasks like visualization and feature extraction.
  • Data Denoising: By training an autoencoder to reconstruct clean data from noisy inputs, it can be used for denoising and removing noise from datasets.
  • Anomaly Detection: Autoencoders can learn to reconstruct normal data patterns and are sensitive to deviations from these patterns. They can be used for anomaly detection by identifying data instances that cannot be accurately reconstructed.
  • Feature Learning: Autoencoders can be pre-trained on unlabeled data to learn useful features that can then be transferred to other supervised learning tasks, improving performance with limited labeled data.

Overall, autoencoders are versatile neural network architectures that can learn compact representations of data without the need for labeled training examples. They have been successfully applied in various domains, including computer vision, natural language processing, and signal processing.


Variational Autoencoders (VAEs)

In VAEs, the network learns a distribution over latent features and then regenerates the input by random sampling and reparameterization

A variational autoencoder (VAE) is a type of artificial neural network used in unsupervised learning. It is a generative model that learns to represent high-dimensional data in a lower-dimensional space, typically called the latent space or latent variable space. VAEs are particularly useful for tasks such as generating new data points similar to those in the training set.

Here’s how a variational autoencoder works:

  1. Encoder: The encoder network takes an input data point and maps it to a probability distribution in the latent space. Instead of directly outputting a point in the latent space, the encoder outputs the parameters of the probability distribution (mean and variance) representing where the input data point is likely to lie in the latent space.
  2. Sampling: Once the encoder has produced the parameters of the probability distribution, a point in the latent space is sampled from this distribution. This sampling step introduces stochastic variation, allowing the VAE to generate diverse outputs.
  3. Decoder: The sampled point from the latent space is then passed through the decoder network, which attempts to reconstruct the original input data point. The decoder network essentially tries to invert the encoding process.
  4. Loss Function: The parameters of the encoder and decoder networks are trained jointly by minimizing a loss function. The loss function typically consists of two parts: a reconstruction loss, which measures how well the decoder can reconstruct the input data, and a regularization term called the KL divergence, which encourages the learned latent space to follow a specific prior distribution, usually a Gaussian distribution. The KL divergence term ensures that the latent space remains well-structured and continuous.

By training the VAE on a dataset, it learns to generate new data points by sampling from the learned latent space. VAEs have been successfully applied in various domains, including image generation (e.g., generating new faces), image inpainting (filling in missing parts of images), text generation, and molecular design in drug discovery. They offer a powerful framework for learning rich representations of complex data distributions.

Have a look at the implementation of a convolutional VAE trained on the MNIST data.
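
The referenced notebook embed may not render in this export, so here is a simplified VAE sketch in PyTorch (fully connected rather than convolutional) that highlights the reparameterization trick and the two-part loss. All sizes are illustrative assumptions.

```python
# Minimal VAE forward pass and loss in PyTorch; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(input_dim, 256)
        self.mu = nn.Linear(256, latent_dim)        # mean of q(z|x)
        self.logvar = nn.Linear(256, latent_dim)    # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)        # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")        # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())     # KL(q(z|x) || N(0, I))
    return recon + kl

model = VAE()
x = torch.rand(64, 784)                             # dummy batch of flattened images
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
loss.backward()
```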

Recurrent Neural Networks (RNNs) & Long Short-Term Memory (LSTM)

The core architectural design of different types of RNN and their information flow

For sequential data generation, such as text or music, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks are employed. These architectures enable the model to maintain context and generate coherent sequences.

Recurrent Neural Networks (RNNs) are a class of neural network architectures designed to handle sequential data by maintaining an internal state or memory. Unlike traditional feedforward neural networks, which process each input independently, RNNs can capture temporal dependencies within sequences, making them well-suited for tasks such as time series prediction, natural language processing, and speech recognition.

RNNs have several variants and extensions, including Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), which address some of the limitations of basic RNNs, such as the vanishing gradient problem. LSTM networks, for example, introduce additional gating mechanisms to control the flow of information within the network and better capture long-term dependencies in the data.

RNNs have been successfully applied in various domains, including:

  • Natural Language Processing (NLP) tasks such as language modeling, machine translation, and sentiment analysis.
  • Time series prediction tasks such as stock price forecasting, weather forecasting, and speech recognition.
  • Sequential data generation tasks such as music composition, text generation, and handwriting generation.

Despite their effectiveness, RNNs have limitations, such as difficulty in capturing long-range dependencies and computational inefficiency for long sequences. Researchers continue to explore new architectures and training techniques to address these challenges and improve the performance of RNNs on various tasks.

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture designed to address the vanishing gradient problem and capture long-term dependencies in sequential data. LSTMs were introduced by Hochreiter and Schmidhuber in 1997 and have since become a popular choice for tasks involving sequential data, such as natural language processing, time series prediction, and speech recognition.

LSTMs are trained using backpropagation through time (BPTT), similar to traditional RNNs. They are particularly effective for tasks that require modeling long-range dependencies in sequential data, thanks to their ability to maintain and update a long-term memory state. LSTMs have been successfully applied in various applications, including machine translation, speech recognition, handwriting recognition, and sentiment analysis.
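
To make this concrete, here is a minimal next-token prediction setup built around PyTorch's LSTM; the vocabulary size, dimensions, and dummy batch are illustrative assumptions.

```python
# Minimal LSTM sequence model in PyTorch for next-token prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, hidden_dim = 50, 32, 64

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)         # predicts the next token

tokens = torch.randint(0, vocab_size, (8, 20))   # dummy batch: 8 sequences of 20 tokens
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each token from its prefix

outputs, (h, c) = lstm(embedding(inputs))        # h, c carry the long/short-term memory
logits = head(outputs)                           # shape: (8, 19, vocab_size)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # backpropagation through time (BPTT)
```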

Despite their effectiveness, LSTMs have some limitations, such as computational complexity and difficulty in capturing certain types of dependencies. However, they remain one of the most widely used architectures for sequential data tasks and have inspired further research into more advanced variants and alternatives, such as Gated Recurrent Units (GRUs) and Transformer models.

Seq2Seq models and the issue of attention

In recent years, the field of natural language processing (NLP) has witnessed remarkable advancements, and one of the prominent breakthroughs is the development of Sequence-to-Sequence (Seq2Seq) models. Seq2Seq models have revolutionized various NLP tasks by enabling the transformation of sequences from one domain to another, offering solutions to machine translation, text summarization, speech recognition, and more.

This visualization shows a translation task and how different tokens are given importance during translation

At its core, a Seq2Seq model consists of two main components: an encoder and a decoder. These components work in tandem to process input sequences and generate corresponding output sequences. The encoder takes the input sequence and compresses it into a fixed-size vector, often referred to as the context vector or the thought vector. This vector contains the salient information from the input sequence and serves as the initial state for the decoder.

The decoder then generates the output sequence by predicting one token at a time. It takes the context vector as its initial state and employs techniques like attention mechanisms to focus on different parts of the input sequence while generating each output token. The attention mechanism enables the model to capture the contextual relationship between the input and output sequences, improving the quality of the generated translations or summaries.

Classic encoder-decoder stack utilized in Seq2seq models
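
To ground this, here is a minimal GRU-based encoder-decoder sketch in PyTorch. The sizes, the greedy decoding loop, and the choice of GRUs are illustrative assumptions; note how the decoder only ever sees the encoder's final hidden state, the context vector.

```python
# Minimal seq2seq sketch: encode a sentence into one context vector, then
# decode one token at a time. Sizes and the greedy loop are assumptions.
import torch
import torch.nn as nn

src_vocab, tgt_vocab, embed_dim, hidden_dim = 100, 100, 32, 64

src_embed = nn.Embedding(src_vocab, embed_dim)
tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
out_proj = nn.Linear(hidden_dim, tgt_vocab)

src = torch.randint(0, src_vocab, (1, 10))       # dummy source sentence of 10 tokens

# Encode: only the final hidden state is kept as the context vector.
_, context = encoder(src_embed(src))

# Decode greedily, one token per step, starting from a <start> token (id 0 here).
token = torch.zeros(1, 1, dtype=torch.long)
hidden = context
for _ in range(12):
    output, hidden = decoder(tgt_embed(token), hidden)
    token = out_proj(output).argmax(dim=-1)      # most likely next token
```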

But this setup gives rise to another problem: which context vector should be used, and why should the decoder rely only on the final one? Moreover, because of its sequential design, it is not easy to parallelize during training or inference. Hence the need for an attention mechanism.

What is Attention?

When we think about the English word “Attention”, we know that it means directing your focus at something and taking greater notice. The attention mechanism in deep learning is based on this concept of directing focus: it pays greater attention to certain factors when processing the data.

In broad terms, Attention is one component of a network’s architecture, and is in charge of managing and quantifying the interdependence:

  1. Between the input and output elements (General Attention)
  2. Within the input elements (Self-Attention)

Let me give you an example of how Attention works in a translation task. Say we have the sentence “How was your day”, which we would like to translate to the French version — “Comment se passe ta journée”. What the Attention component of the network will do for each word in the output sentence is map the important and relevant words from the input sentence and assign higher weights to these words, enhancing the accuracy of the output prediction.

Weights are assigned to input words at each step of translation

Here’s how attention mechanisms work:

  1. Global Context: Traditional neural network architectures, such as feed-forward or recurrent networks, process input data sequentially or with fixed-size receptive fields. However, in many tasks, different parts of the input may have varying degrees of importance or relevance to the output. Attention mechanisms allow models to dynamically adjust the focus of their computations based on the input data.
  2. Attention Scores: At its core, an attention mechanism computes attention scores that indicate the relevance of each element in the input to the current context. These scores are often computed using a compatibility function that measures the similarity between the current context and each element of the input.
  3. Softmax Normalization: The attention scores are typically normalized using a softmax function to obtain attention weights. These weights represent the importance or relevance of each input element relative to the current context and sum to one. Thus, the attention weights act as a probability distribution over the input elements.
  4. Weighted Sum: Finally, the attention weights are used to compute a weighted sum of the input elements. This weighted sum serves as a context vector that captures the relevant information from the input for the current step of computation. The context vector is then used by the model to make predictions or generate outputs.

Different architectural designs of attention models as proposed by Bahdanau and Luong respectively. These models are precursors to the latest Transformer models.

Attention mechanisms play a crucial role in enhancing the performance of generative AI models. By allowing the model to focus on specific parts of the input sequence, attention mechanisms improve the quality of generated outputs.
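
As a concrete illustration of the score → softmax → weighted-sum recipe above, here is a minimal dot-product attention computation for a single decoder step; the tensors and sizes are dummy assumptions.

```python
# Dot-product attention for one decoder step over a set of encoder states.
import torch
import torch.nn.functional as F

encoder_states = torch.randn(10, 64)      # one hidden state per source token
decoder_state = torch.randn(64)           # current decoder hidden state (the query)

scores = encoder_states @ decoder_state   # 1. compatibility score per source token
weights = F.softmax(scores, dim=0)        # 2. normalize into a probability distribution
context = weights @ encoder_states        # 3. weighted sum = 64-d context vector
```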

But there is still a problem: even when combined with an attention mechanism, these seq2seq models are not easy to parallelize. They take a long time to train, which can be prohibitive. Thankfully, researchers at Google figured out the technique of multi-head self-attention, which solved the problem and pushed the attention mechanism into every large language model. The complete architecture is called the Transformer.

Visualization showcasing the information flow through a modern-day Transformer, which uses multi-head attention to perform a language translation task

Transformer architecture, popularized by models like GPT (Generative Pre-trained Transformer), has gained prominence in natural language processing tasks. It excels in capturing long-range dependencies and has been instrumental in the development of language models capable of generating coherent and contextually relevant text.

The Transformer architecture is a deep learning model introduced in the paper “Attention is All You Need” by Vaswani et al. in 2017. It has gained widespread popularity for its effectiveness in natural language processing tasks, particularly in machine translation, text generation, and other sequence-to-sequence tasks. The Transformer architecture is based on the self-attention mechanism, which allows it to capture global dependencies in input sequences without relying on recurrent or convolutional layers.

Here are the key components of the Transformer architecture:

  1. Self-Attention Mechanism: The core of the Transformer architecture is the self-attention mechanism, which computes attention weights that capture the importance of each token in the input sequence with respect to every other token. This allows the model to weigh the influence of each token on the representation of every other token, enabling it to capture long-range dependencies efficiently.
  2. Encoder and Decoder Stacks: The Transformer consists of an encoder and a decoder stack. Each stack is composed of multiple layers of self-attention and feed-forward neural networks. The encoder processes the input sequence, while the decoder generates the output sequence. Both the encoder and decoder layers are identical in architecture but operate differently during training and inference.
  3. Positional Encoding: Since the Transformer architecture does not inherently understand the order of tokens in a sequence like recurrent or convolutional models, positional encoding is added to the input embedding to provide positional information. This allows the model to distinguish between tokens based on their position in the sequence.
  4. Multi-Head Attention: To enhance the model’s ability to capture different types of dependencies, the self-attention mechanism in the Transformer employs multi-head attention. This involves computing multiple sets of attention weights in parallel, each representing a different “head” of attention. The outputs of the different heads are then concatenated and linearly transformed to produce the final attention output.
  5. Feedforward Neural Networks: Each layer in the Transformer architecture contains a feed-forward neural network, typically with a ReLU activation function. This network operates independently on each position in the sequence and projects the representations learned through self-attention into a higher-dimensional space.
  6. Layer Normalization and Residual Connections: To facilitate training and improve the flow of gradients, layer normalization and residual connections are used within each layer of the Transformer architecture. Layer normalization normalizes the activations of each layer, while residual connections allow the gradients to flow directly through the network without vanishing or exploding.
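
To see how these components fit together, here is a minimal sketch that wires sinusoidal positional encodings and PyTorch's built-in Transformer encoder layers around a token embedding. All dimensions are illustrative assumptions, not the configuration from the original paper.

```python
# Tiny Transformer encoder: embeddings + positional encoding + self-attention layers.
import math
import torch
import torch.nn as nn

d_model, n_heads, vocab_size, max_len = 64, 4, 1000, 128

embedding = nn.Embedding(vocab_size, d_model)

# Sinusoidal positional encoding, as in "Attention is All You Need".
position = torch.arange(max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pos_enc = torch.zeros(max_len, d_model)
pos_enc[:, 0::2] = torch.sin(position * div_term)
pos_enc[:, 1::2] = torch.cos(position * div_term)

# One encoder layer = multi-head self-attention + feed-forward network,
# with residual connections and layer normalization applied internally.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=256, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randint(0, vocab_size, (8, 32))   # dummy batch of token ids
x = embedding(tokens) + pos_enc[:32]             # inject positional information
contextual = encoder(x)                          # (8, 32, 64) contextual representations
```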

The Transformer architecture has demonstrated state-of-the-art performance in various natural language processing tasks and has become the basis for several subsequent models, such as BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-To-Text Transfer Transformer). It has also been adapted for tasks beyond natural language processing, such as image generation and reinforcement learning.

You can take a deep dive into the Transformer architecture with the following code by engineers at Google.

Diffusion models

Diffusion models are advanced machine learning algorithms that uniquely generate high-quality data by progressively adding noise to a dataset and then learning to reverse this process. This innovative approach enables them to create remarkably accurate and detailed outputs, from lifelike images to coherent text sequences. Central to their function is the concept of gradually degrading data quality, only to reconstruct it to its original form or transform it into something new. This technique enhances the fidelity of generated data and offers new possibilities in areas like medical imaging, autonomous vehicles, and personalized AI assistants.

Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps that slowly adds random noise to data, and then learn to reverse the diffusion process to construct desired data samples from the noise

The full mathematics of diffusion models goes beyond the scope of this blog, as it requires some background in non-equilibrium thermodynamics. But we will try to explain the overall process through a series of pictures.

The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data.

Here the diffusion process is split into a forward and a reverse process. The forward diffusion process turns an image into noise, and the reverse diffusion process is supposed to turn that noise back into the image.

Forward diffusion process

First, you need to know how to destroy structure in a data distribution.

Any image we take has some non-random distribution. We don’t know that distribution, but our goal is to destroy its structure, which we can do by repeatedly adding noise to the image. At the end of the process, we should end up with something close to pure noise.

Forward diffusion process using 10 steps: we gradually add noise to the data

Have a look at this interactive diagram to understand the process: https://erdem.pl/2023/11/step-by-step-visual-introduction-to-diffusion-models#forward-diffusion-diagram
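
For readers who prefer code to pictures, here is a small sketch of the forward process using a linear noise schedule and the closed-form expression for q(x_t | x_0). The schedule values and image shape are illustrative assumptions.

```python
# Forward diffusion sketch: corrupt an image by blending it with Gaussian noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta)

x0 = torch.rand(1, 3, 64, 64) * 2 - 1            # dummy image scaled to [-1, 1]

def q_sample(x0, t):
    """Sample x_t from x_0 directly: sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise."""
    noise = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise

x_mid = q_sample(x0, 500)    # halfway: heavily corrupted, some structure remains
x_end = q_sample(x0, 999)    # final step: close to pure Gaussian noise
```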

Reverse Diffusion process

As you probably figured out, the goal of the reverse diffusion process is to convert pure noise into an image. To do that, we’re going to use a neural network, much like the generator network in a GAN. The only difference is that our network has an easier job, because it doesn’t have to do all the work in one step.

Learning in this framework involves estimating small perturbations to a diffusion process. Estimating small perturbations is more tractable than explicitly describing the full distribution with a single, non-analytically-normalizable, potential function. Furthermore, since a diffusion process exists for any smooth target distribution, this method can capture data distributions of arbitrary form.

High level view of one step from the reverse diffusion process
Reverse diffusion process

Diffusion models use a modified U-Net architecture; more details about it can be found in the references.
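
As a rough sketch of what a single reverse step looks like in DDPM-style sampling (not the full training procedure), here is one denoising update, assuming a hypothetical noise-prediction network `unet` and the `betas` / `alphas_bar` schedule from the forward-process sketch above.

```python
# One reverse (denoising) step: subtract the predicted noise, then add a small
# amount of fresh noise. `unet` is a hypothetical noise-prediction network.
import torch

def p_sample_step(unet, x_t, t, betas, alphas_bar):
    """Estimate x_{t-1} from x_t using the network's noise prediction."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    eps_pred = unet(x_t, t)                                  # predicted noise in x_t
    mean = (x_t - beta_t / (1 - alphas_bar[t]).sqrt() * eps_pred) / alpha_t.sqrt()
    if t == 0:
        return mean                                          # no noise at the final step
    return mean + beta_t.sqrt() * torch.randn_like(x_t)      # add scaled Gaussian noise
```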

Some of the most popular diffusion models, which have gained widespread attention for their impressive capabilities in image generation, include DALL·E 2, DALL·E 3, Sora, Stable Diffusion, Midjourney, NAI Diffusion, and Imagen.

Some handpicked outputs provided by OpenAI’s DALL·E 2
One can get extremely creative with text prompts and obtain incredible results, as showcased by this output from DALL·E 3
OpenAI Sora has made a leap forward in text-to-video models, and it looks like it will get better with each future iteration

Conclusion

Generative AI architectures represent a remarkable fusion of advanced machine learning techniques, probabilistic modeling, and neural network innovations. Driven by advances in deep learning, probabilistic modeling, and reinforcement learning, they have evolved significantly since their inception, from early statistical methods to groundbreaking models like GANs and Transformers. Through probabilistic frameworks, representation learning, and adversarial training, these architectures have demonstrated the ability to generate diverse and realistic outputs across domains such as art, music, and text. While challenges remain, the future of generative AI holds immense promise, with potential applications ranging from creative expression to scientific discovery and beyond. As researchers continue to push the boundaries of what is possible, the transformative impact of generative AI on society is poised to grow.

References

  1. The Math Behind GANs — Jake Tae
  2. Math behind GAN (generative adversarial networks) & its applications (labellerr.com)
  3. Introduction to autoencoders (jeremyjordan.me)
  4. The Mathematics of Variational Auto-Encoders — David Stutz
  5. transformers.pdf (johnthickstun.com)
  6. A mathematician’s introduction to transformers and large language models — JSC Accelerating Devices Lab (fz-juelich.de)
  7. Transformer Math 101 — EleutherAI Blog
  8. Transformers from Scratch (e2eml.school)
  9. Step-by-step diffusion process

Generative Adversarial Networks (GANs):

  • Goodfellow, I., et al. (2014). “Generative Adversarial Nets.”
  • Arjovsky, M., et al. (2017). “Wasserstein GAN.”
  • Karras, T., et al. (2019). “A Style-Based Generator Architecture for Generative Adversarial Networks.”

Variational Autoencoders (VAEs):

  • Kingma, D. P., & Welling, M. (2013). “Auto-Encoding Variational Bayes.”
  • Rezende, D. J., et al. (2014). “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.”
  • Higgins, I., et al. (2017). “beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.”

Other Generative Models:

  • Radford, A., et al. (2015). “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.”
  • Oord, A. van den, et al. (2016). “Conditional Image Generation with PixelCNN Decoders.”
  • Vaswani, A., et al. (2017). “Attention is All You Need.”

Applications and Advances:

  • Brock, A., et al. (2019). “Large Scale GAN Training for High Fidelity Natural Image Synthesis.”
  • Karras, T., et al. (2020). “Training Generative Adversarial Networks with Limited Data.”
  • Radford, A., et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.”
