Introduction to Generative AI (GenAI): Understanding Deep Learning Foundations

Dhruv Pamneja
11 min read · May 12, 2024


Imagine AI that creates realistic images, writes music, or even generates different creative text formats. This isn’t science fiction anymore — it’s Generative AI!

As usage of "AI" and related terms has surged in popular tech discourse, the field of machine learning and artificial intelligence has grown expansively over the past few years. Barring some noise, it really isn't an overblown science-fiction conversation anymore; it has tangible applications in modern businesses.

In this series, I will present an in-depth overview of the field of Generative AI, attempting to understand the roots and origins of the popular frameworks and architectures on top of which tools such as ChatGPT, Llama and Gemini have been built and are used to solve real-world problems.

By explaining these components, their history, their individual use cases, and how they integrate to tackle industry-grade problems and generate revenue, this series aims to build a theoretical foundation that provides a holistic view of the field. I believe it is important to have a certain awareness of how these frameworks have progressed to power today's high-end applications, and a theoretical knowledge base will be advantageous for all of us.

As the modules progress and we grasp the core concepts, I will then proceed to build end-to-end Generative AI projects covering multiple use cases, so we can see how applications emerge from the knowledge of previous modules. With this, I also intend to bring a sense of practicality to the series through project building and deployment.

This entire series builds on the coursework designed by the iNeuron team and is inspired by the course Mastering Generative AI with OpenAI, LangChain, and LlamaIndex V2, taught by the instructors Sunny Savita and Boktiar Ahmed Bappy. Along with that, a key resource for this series is the deep learning series by Krish Naik, whose thorough explanations of deep learning and generative AI concepts will be referenced directly here.

I would encourage readers to explore this coursework, as it provides a pleasant experience for anyone who wishes to foray into the domain of Generative AI.

So with that said, let us commence this journey. I hope that by the end of it, both you and I can take away something valuable and build applications using this technology.

Data

Before we discuss the architectures in use, I would like to give a brief overview of the categories of data that exist, and of what we will primarily be working with in this series.

Data can be broadly divided into two categories:

  • Structured Data
  • Unstructured Data

To begin with, structured data can also be called labeled/supervised data. It is usually presented as data with multiple independent variables and a dependent variable: the independent variables, or features, define the qualities of a given record, while the dependent variable gives the output based on them. A neat way to visualise this is in tabular form; a small sketch follows after the feature types below.

These features can usually be further classified into two categories, namely:

Numerical Feature

  • Generally an integer or float value

Categorical Feature

  • Generally a string, or a serialised JSON/list value
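
To make this concrete, here is a small illustrative sketch in Python (the column names and values are hypothetical, and pandas is assumed) of a structured dataset with numerical and categorical features plus a dependent variable:

```python
import pandas as pd

# Hypothetical structured dataset: each row is one record.
# "age" and "income" are numerical features, "city" is categorical,
# and "purchased" is the dependent (target) variable.
df = pd.DataFrame({
    "age": [25, 32, 47],                      # numerical (integer)
    "income": [30000.0, 52000.0, 81000.0],    # numerical (float)
    "city": ["Delhi", "Mumbai", "Pune"],      # categorical (string)
    "purchased": [0, 1, 1],                   # dependent variable
})

X = df[["age", "income", "city"]]   # independent variables (features)
y = df["purchased"]                 # dependent variable (label)
print(df)
```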

As for unstructured data, it can be viewed as unlabelled/unsupervised data: data with independent features only, which do not classify the data into any given category as such. For our purposes, we will be working with this type of data.

We primarily use unstructured data for two kinds of tasks, described as follows:

Language Based Tasks

  • Here, we usually have textual or sequence-based data, used to solve language tasks such as question answering, text summarisation, text generation, language translation, building chatbots, etc.

Vision Based Tasks

  • Here, the data is usually image or video based (i.e. matrix-based data), used for vision tasks such as image classification, object detection, object tracking, OCR, etc.

Let us now proceed to learn about the deep learning architectures that are prerequisites for working with generative AI.

History

To understand the exact domain this field falls in, consider the image below:

Here, we can see that Generative AI is a relatively new phenomenon; however, it is built on existing research in deep learning and related fields. The frameworks we will examine are as follows:

Artificial Neural Network (ANN)

  • Inspired by the human brain, ANNs are interconnected layers of neurons that process information and generate outputs. They are fundamental to deep learning and solve real-world problems like image recognition and machine translation.

Components:

  • Single Layer: Data flows through the input, a hidden layer (with weights and biases), and activation functions to produce an output.
  • Forward Propagation: Input moves through the network to generate an output.
  • Loss Function: Compares predicted output with actual value to evaluate the network’s performance.
  • Backward Propagation: Updates weights to minimise the loss and improve future predictions.

Multi-Layer Networks: Stacking multiple single-layer networks creates more complex and powerful architectures for solving intricate problems.
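
To ground these components, below is a minimal NumPy sketch (toy data and arbitrary layer sizes, not from the course material) of a single-hidden-layer network running forward propagation, computing a loss, and updating its weights via backward propagation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))               # 4 toy samples, 3 input features
y = np.array([[0.], [1.], [1.], [0.]])    # actual values

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # hidden layer weights/biases
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # output layer weights/biases
lr = 0.1

for step in range(1000):
    # Forward propagation: input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    loss = np.mean((y_hat - y) ** 2)      # loss function (MSE)

    # Backward propagation: chain rule, then gradient-descent update
    d_out = 2 * (y_hat - y) / len(X) * y_hat * (1 - y_hat)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = d_out @ W2.T * h * (1 - h)
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1
```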

Convolutional Neural Network (CNN)

  • Powerhouse for Image and Video Processing: Convolutional Neural Networks are specifically designed to excel at tasks like image recognition, object detection, and segmentation.
  • Key Components: The core operations of CNNs involve:
      • Convolution: Extracting features from local image regions using filters (kernels).
      • Pooling: Reducing data dimensionality while retaining key information, achieving location invariance.
      • Flattening: Converting the multi-dimensional output into a single vector for feeding into a regular ANN.
  • Mimicking the Human Visual System: CNNs’ architecture is inspired by the hierarchical structure of the human visual cortex, enabling them to learn complex visual representations and achieve remarkable performance in various vision-related applications.
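
As a rough illustration of these components (the layer sizes and input shape here are arbitrary choices, not prescribed), a minimal PyTorch sketch wiring convolution, pooling, and flattening into a small ANN head:

```python
import torch
import torch.nn as nn

# Minimal CNN mirroring the components above: convolution for feature
# extraction, pooling for downsampling, flattening, then an ANN head.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution: 8 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling: halve spatial size
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                # feature maps -> single vector
    nn.Linear(16 * 7 * 7, 10),                   # regular ANN: e.g. 10 classes
)

x = torch.randn(1, 1, 28, 28)    # one dummy 28x28 grayscale image
print(model(x).shape)            # torch.Size([1, 10])
```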

Recurrent Neural Network (RNN)

  • Core Function: RNNs excel in handling sequential data like language due to their recurrent nature. They process information iteratively, retaining the context from previous inputs to understand the current data point. This makes them ideal for tasks like Q&A, translation, and text generation.
  • Challenges and Solutions: RNNs face the vanishing gradient problem when dealing with long sequences. To overcome this, the LSTM and GRU variants emerged. LSTMs use dedicated gates (input, forget, output) and a separate cell state to retain long-term context, while GRUs merge this machinery into fewer gates for efficiency.
  • Limitations and Future Advancements: While RNNs revolutionised language processing, their sequential processing limits their scalability with massive data. This paved the way for parallel processing architectures like Transformers and BERT, which opened doors to even more powerful language models.
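
To get a feel for the recurrent nature described above, here is a short PyTorch sketch (the dimensions are arbitrary) showing an LSTM and a GRU carrying state across a sequence:

```python
import torch
import torch.nn as nn

# An LSTM reads a sequence step by step; the hidden state (h) and cell
# state (c) carry context from earlier inputs to later ones.
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)  # lighter variant

x = torch.randn(2, 10, 16)     # batch of 2 sequences, 10 steps, 16 features
out, (h, c) = lstm(x)          # out: per-step outputs; (h, c): final states
print(out.shape, h.shape)      # torch.Size([2, 10, 32]) torch.Size([1, 2, 32])

out_g, h_g = gru(x)            # a GRU has no separate cell state
```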

Generative Adversarial Network (GAN)

  • Function and Architecture: GANs consist of two neural networks: a generator and a discriminator. The generator creates new data points mimicking real data, while the discriminator tries to distinguish real from fake. This adversarial training process pushes both networks to improve, leading to realistic data generation.
  • Challenges: GANs face limitations like vanishing gradients (hindering learning) and mode collapse (getting stuck in specific data patterns).
  • Applications and Future: Despite these limitations, GANs have immense potential in image generation, music composition, and various other fields due to their ability to create realistic and diverse data.
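
The adversarial training process can be sketched in a few lines of PyTorch. This is a toy illustration (the "real" data and network sizes are made up), not a production GAN:

```python
import torch
import torch.nn as nn

# Generator maps noise to fake samples; discriminator scores samples
# as real (1) or fake (0).
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(64, 2) + 3.0    # stand-in for the real data distribution
noise = torch.randn(64, 8)

# Discriminator step: push D(real) -> 1 and D(fake) -> 0
fake = G(noise).detach()           # detach: don't update G on this step
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to fool the discriminator, push D(G(noise)) -> 1
g_loss = bce(D(G(noise)), torch.ones(64, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```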

As we proceed, knowledge of and practical familiarity with these architectures will be very beneficial when learning about generative AI, as most state-of-the-art (SOTA) models in the industry run on a variation or adaptation of them. Do check out the articles linked above for an in-depth explanation.

Evolution

Now, to build on RNNs and see how they were further modified, we can view the timeline below to understand how the field of NLP progressed over time and produced the models used for generative AI as we know it today.

Encoder, Decoder and Attention

Problem: DNNs struggle with variable-length sequential data like text.

Solution (2014): Encoder-Decoder with LSTMs

  • Encoder: Processes input sequence (e.g., a sentence) into a context vector capturing overall meaning.
  • Decoder: Uses context vector and its memory to predict output sequence one element at a time (e.g., translation).
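
A minimal PyTorch sketch of this 2014-style setup (the sizes and dummy data are arbitrary assumptions, and token selection is elided) might look like this:

```python
import torch
import torch.nn as nn

# Encoder: compress the input sequence into a fixed-size context vector
# (the final hidden/cell states). Decoder: start from that context and
# emit the output sequence one element at a time.
enc = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)
dec = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)
proj = nn.Linear(64, 100)        # project to a 100-word output vocabulary

src = torch.randn(1, 12, 16)     # source sequence: 12 steps of embeddings
_, (h, c) = enc(src)             # (h, c) is the context vector

tok = torch.zeros(1, 1, 16)      # start-of-sequence embedding
for _ in range(5):               # generate 5 output steps
    out, (h, c) = dec(tok, (h, c))
    logits = proj(out[:, -1])    # scores over the vocabulary
    # in a real model: pick/sample the next token and embed it as `tok`
```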

Limitation: Single context vector struggles with longer sequences.

Attention Mechanism (2015):

  • Focuses on relevant parts of the input sequence for each output prediction.
  • Connects each encoder element to each decoder LSTM via an attention layer.
  • Allows decoder to “pay attention” to important parts, not just the context vector.
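
As a rough sketch of the idea (using simple dot-product scoring and arbitrary dimensions), one decoder step with attention could look like:

```python
import torch
import torch.nn.functional as F

# One decoder step with dot-product attention: score every encoder state
# against the current decoder state, softmax into weights, and take a
# weighted sum instead of relying on a single context vector.
enc_states = torch.randn(12, 64)     # one encoder state per input position
dec_state = torch.randn(64)          # current decoder hidden state

scores = enc_states @ dec_state      # relevance of each input position
weights = F.softmax(scores, dim=0)   # attention weights, sum to 1
context = weights @ enc_states       # weighted sum: the attended context
```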

Conclusion:

  • Encoder-decoder with attention is powerful for sequence-to-sequence tasks.
  • Attention mechanism significantly improves performance, especially for longer sequences.

ULMFiT: Transfer Learning in NLP

The Problem: Training NLP models often requires massive amounts of data and computational power.

Transfer Learning to the Rescue: This technique reuses knowledge from pre-trained models on new tasks, improving performance and reducing training time.

ULMFiT Makes a Difference: This innovative model excels in NLP tasks even with limited data by using transfer learning from a massive pre-trained language model.

How it Works:

  1. Pre-training: The model learns general language understanding from a large dataset.
  2. Fine-tuning: The model adapts to a specific NLP task (e.g., sentiment analysis).
  3. Classification: The model performs the desired NLP task.
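
ULMFiT itself was built on an AWD-LSTM in the fastai library; as a loose modern analogue of the same pre-train then fine-tune recipe, here is a sketch using the Hugging Face transformers library (the model name is just an example):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stage 1 (pre-training) has already been done for us: we download a
# model pre-trained on a large general-domain corpus.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2   # fresh head for a 2-class task
)

# Stage 2 (fine-tuning): train on the task-specific data as usual,
# e.g. with a standard PyTorch loop or the Trainer API.
# Stage 3 (classification): run the fine-tuned model on new inputs.
inputs = tokenizer("This movie was great!", return_tensors="pt")
logits = model(**inputs).logits   # sentiment scores for the two classes
```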

Benefits: ULMFiT achieves good results with less data compared to traditional methods.

Limitations: Pre-training requires significant resources. Data quality can also impact performance.

The Future: ULMFiT holds promise for various NLP tasks, especially in non-English languages and with limited data.

Transformers

Transformers address limitations of Encoder-Decoder models:

  • Sequential processing: Limited efficiency and scalability for complex tasks.
  • Long-range dependencies: Attention mechanism only partially addressed this.

Transformer model overview:

  • Encoder-decoder architecture at its core.
  • Multiple encoder and decoder layers stacked together.
  • Self-attention mechanism within each encoder layer.

Encoders:

  • Process the entire input sequence in parallel (unlike RNNs, which go one word at a time).
  • Each word goes through:
      • Embedding layer: Converts words to vectors.
      • Self-attention layer: Captures relationships between words in the sequence.
      • Feed-forward neural network: Processes the data.
      • Residual connections and layer normalisation for better training.

Self-attention:

  • Allows each word to attend to all other words in the sequence.
  • Improves understanding of context for better word embeddings.
  • Uses multiple heads (e.g., 8) to capture different types of dependencies.
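
A quick sketch of multi-head self-attention using PyTorch's built-in module (the sizes are chosen arbitrarily):

```python
import torch
import torch.nn as nn

# Self-attention over a sequence: every position queries every other
# position, and 8 heads capture different types of dependencies.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 512)   # 10 word embeddings of dimension 512
out, weights = attn(x, x, x)  # query = key = value = x (self-attention)
print(out.shape)              # torch.Size([1, 10, 512])
print(weights.shape)          # torch.Size([1, 10, 10]): word-to-word weights
```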

Positional encoding:

  • Considers word order in the sequence.
  • Adds positional information to word embeddings.
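
The original Transformer paper used sinusoidal positional encodings; here is a small NumPy sketch of that formula:

```python
import numpy as np

# Sinusoidal positional encoding: each position gets a unique pattern of
# sines and cosines, added to the word embeddings so that the model can
# make use of word order.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]    # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]      # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # even dimensions use sin, odd dimensions use cos
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

embeddings = np.random.randn(10, 512)            # 10 word embeddings
x = embeddings + positional_encoding(10, 512)    # inject order information
```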

Decoders:

  • Similar architecture to encoders.
  • Include encoder-decoder attention layer to attend to encoder outputs.
  • Generate output sequence one word at a time.

Final layers:

  • Linear layer and softmax layer: Convert decoder output to probabilities for next word prediction.

Conclusion:

  • Transformers are a powerful architecture for NLP tasks.
  • Challenges include computational cost and interpretability.
  • Future directions: efficient architectures and explainable AI techniques.

Now, one point I would like to highlight is how transformers and ULMFiT enabled the creation of the Large Language Models (LLMs) that are now a core part of the generative AI industry. The combined properties of transformers, ULMFiT, and language models allowed this architecture to be created and implemented.

First, let us understand what a Language Model (LM) is. To put it simply, it is a probabilistic model that predicts the next word or passage, given some contextual data. Although other tasks, such as question answering and machine translation, were tried as the base task of the LM, next-word generation proved most effective for the purpose of learning language and context, as it allowed the model to learn the sequential nature of language and the relationships between words.

The model tries to predict or generate the next word based on a fixed amount of context it receives. Through this process, it eventually learns a basic understanding of the language and of the dependencies that create context.
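
As a toy illustration of this probabilistic next-word idea (a bigram model on a made-up corpus, far simpler than any LLM but the same principle):

```python
from collections import Counter, defaultdict

# A tiny bigram language model: count which word follows which, then
# predict the likely next word given one word of context.
corpus = "the cat sat on the mat and the cat slept".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    following = counts[word]
    total = sum(following.values())
    # probability distribution over next words, given the context
    return {w: c / total for w, c in following.most_common()}

print(predict_next("the"))   # {'cat': 0.67, 'mat': 0.33} (approximately)
```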

As we have seen above, one key advantage the transformer model brings is the ability to process huge amounts of textual data in parallel, rather than feeding it to the model sequentially. Apart from saving huge amounts of computational resources, this also gives the model a larger context, allowing it to function more accurately and understand the language better. It can take longer sentences into its context, something an LSTM, GRU, or Encoder-Decoder model could not.

Coupled with this were the techniques introduced in the ULMFiT model:

Unsupervised pre-training

  • This allows the model to train on a large corpus of general-domain language data, capturing general properties of the language and learning how words interact within a broader context. Although this is a costly process, various models trained this way are open-sourced.

Supervised fine-tuning

  • This method fine-tunes the LM on the target data. Given the pre-training from the previous stage, this part converges faster, as the model only has to adapt to the scope and parameters of the target data; this allows robust LMs to be trained even on smaller datasets.
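
In plain PyTorch, the spirit of this stage (gradual unfreezing and discriminative learning rates, with hypothetical module names and sizes) can be sketched as:

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained "body" plus a fresh task "head".
embed = nn.Embedding(10000, 128)                  # pre-trained
encoder = nn.LSTM(128, 256, batch_first=True)     # pre-trained
head = nn.Linear(256, 2)                          # new: e.g. 2-class sentiment

# Gradual unfreezing: start by training only the head...
for p in list(embed.parameters()) + list(encoder.parameters()):
    p.requires_grad = False

# ...then unfreeze with discriminative learning rates: the pre-trained
# layers get gentler updates than the freshly initialised head.
for p in list(embed.parameters()) + list(encoder.parameters()):
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": embed.parameters(),   "lr": 1e-5},
    {"params": encoder.parameters(), "lr": 1e-4},
    {"params": head.parameters(),    "lr": 1e-3},
])
```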

The synergy of these components (language models, transformers, and ULMFiT) led to the birth of LLMs, which have revolutionised the field of NLP and how we perceive real-world tasks in this domain today. Do check out the articles linked above for an in-depth explanation.

Conclusion

As we reach the end of this article, we now have a brief introduction to the world of generative AI, how deep learning architectures brought this field to its present state, and how dynamic the field continues to be.

In this part, the focus is more on the theoretical aspects of the models and infrastructure of this domain, as I believe the field requires at least an overview, if not an in-depth understanding, of its roots in deep learning architectures in order to understand and build better within it.

In the upcoming parts, I will explain Large Language Models (LLMs) further and build on the practical aspects of the domain, showing how the above literature has been used to build for and contribute to the industry.

Are you curious to learn more about the inner workings of Generative AI and how it’s shaping our world? Stay tuned for the next instalment of this series, where we’ll delve deeper into Large Language Models (LLMs) and explore their practical applications!

In the meantime, feel free to subscribe to my blog or follow me on social media to get notified of new posts. With this, I would like to conclude this article and sincerely hope that you enjoyed reading this and it added value to your learning as well. Your feedback and questions are always welcome!

Credits

I would like to take the opportunity to express my gratitude towards Sunny Savita and Boktiar Ahmed Bappy, for their extremely detailed coursework at the iNeuron platform, which has allowed me to learn and present the above article. You can check out this course here.

Also, I would like to take the opportunity to thank Krish Naik for his deep learning series on his YouTube channel, which has allowed me to learn and present the above article. You can check out his YouTube channel here. Thanks for reading!
