Large Language Models Explained — I

Sanjay Balaji
5 min read · Jun 17, 2024


Have you ever wondered how ChatGPT works? Why is the GPT-3.5 version of ChatGPT free while GPT-4 is paid? Are there any alternatives to it?
In this article, I will try to answer the questions revolving around ChatGPT. I will stick to the technical details in words, without going into the mathematics or the actual code behind ChatGPT.

Evolution of ChatGPT. Credits: iq.opengenus.org

Introduction

To understand ChatGPT, you need to know what a neural network is. We all know that the functional unit of the human brain is the neuron. By forming a network of neurons, the brain remembers things and thinks on its own. Over the past few decades, many scientists have tried to imitate the human brain artificially to produce similar intelligence. This broad field of study is known as Artificial Intelligence.

Basic Architecture of a Neural Network

Similar to the human brain, a neural network (of interest here) consists of several neurons that can ingest and process information. An abstract architecture of a basic neural network would look something like this:

Multi Layered Neural Network. Credits: V7Labs.com

It consists of an input layer through which the input is fed into the network, several hidden layers that transform the information, and an output layer that produces the result, whose form depends on the task the network is built for. All these neurons are interconnected, and each connection carries a weight that determines how strongly one neuron influences the next. Generally, a neural network consists of neurons, weights, biases, activation functions, and optimizers (the details of these are beyond the aim of this article and will be covered in further articles).
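To make these pieces concrete, here is a minimal sketch of a forward pass through such a network, written in NumPy with made-up sizes and random weights. It only illustrates the roles of weights, biases, and an activation function; it is not how real networks are built or trained.

```python
import numpy as np

def relu(x):
    # Activation function: keep positive values, zero out negative ones
    return np.maximum(0, x)

# A made-up input with 3 features (the "input layer")
x = np.array([0.5, -1.2, 3.0])

# Hidden layer: 4 neurons, each with its own weights and a bias
W_hidden = np.random.randn(4, 3)   # 4 neurons x 3 inputs
b_hidden = np.random.randn(4)
hidden = relu(W_hidden @ x + b_hidden)

# Output layer: 2 neurons producing the network's prediction
W_out = np.random.randn(2, 4)
b_out = np.random.randn(2)
output = W_out @ hidden + b_out

print(output)  # an untrained, meaningless prediction, but the mechanics are the same
```

Training is the process of adjusting those weights and biases (with an optimizer) so that the outputs become useful instead of random.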

Evolution of RNN to LLM

Traditional language models like the n-gram model predict the likelihood of the next word in a sentence based on the previous words. Another model, the Hidden Markov Model (HMM), describes the probabilistic relationship between a sequence of observations and a sequence of hidden states. So a language model, in general, is probabilistic and aims to predict the next word in a sentence. This idea is the foundation of large language models too, except that they do it with a huge number of neurons!
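As a toy illustration of “predicting the next word from the previous words”, here is a tiny bigram (2-gram) model sketched in Python. The three-sentence corpus is made up; a real n-gram model would be estimated from an enormous amount of text.

```python
from collections import Counter, defaultdict

# A tiny made-up "corpus"
corpus = [
    "the capital of india is new delhi",
    "the capital of france is paris",
    "new delhi is the capital of india",
]

# Count how often each word follows each previous word
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def predict_next(word):
    # Return the word most frequently seen after `word` in the corpus
    return following[word].most_common(1)[0][0]

print(predict_next("capital"))  # 'of'
print(predict_next("new"))      # 'delhi'
```

Everything that follows, all the way up to GPT, is essentially a more and more powerful way of making this same kind of prediction.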

Now, there are different types of neural networks, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Feed-Forward Neural Networks, etc. The road to ChatGPT starts with the RNN. Of course, a basic RNN has a lot of disadvantages, so the Long Short-Term Memory network (LSTM) was built with the RNN as its basis. Even the LSTM did not work well enough, so the Encoder-Decoder model was proposed: an Encoder takes in the input sequence and compresses it into a fixed-length vector, which is sent to the Decoder to produce the output sequence. (Note: every Encoder and Decoder has neural networks inside it.) But when the input sequence is too long, that single fixed-length vector loses information. To address this problem, the Attention mechanism was proposed, which allows the Decoder to look back at the input sequence and attend only to its important sections (a small code sketch of this computation follows the note below). Finally, the Transformer architecture combined the attention mechanism with the Encoder-Decoder model and revolutionised the industry! The Transformer is particularly good at processing long data sequences and is well suited to NLP tasks such as machine translation and text generation.

(Note: I have not explained anything in the above paragraph in detail. Instead, the references at the end of this article cover each of these ideas in depth, and I will also explain them in detail in upcoming articles.)
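That said, the heart of the attention mechanism fits in a few lines. Below is a minimal NumPy sketch of scaled dot-product attention, the building block the Transformer uses; the query, key, and value matrices here are random stand-ins rather than anything learned.

```python
import numpy as np

def softmax(x):
    # Softmax over the last axis, shifted for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Each output position is a weighted mix of the values V,
    # weighted by how well its query matches every key.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of every query to every key
    weights = softmax(scores)         # attention weights; each row sums to 1
    return weights @ V

# Random stand-ins: a sequence of 5 tokens, each an 8-dimensional vector
Q = np.random.randn(5, 8)
K = np.random.randn(5, 8)
V = np.random.randn(5, 8)

print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```

The attention weights are what let the model “choose to attend” to the important parts of the sequence instead of squeezing everything into one fixed-length vector.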

Large Language Models are nothing but language models trained on huge (hence “large”) amounts of data. So the underlying model of any LLM today is predominantly a Transformer trained on huge amounts of data.

Transformer Architecture. Credits: labellerr.com

ChatGPT and How It Works

Generative Pre-trained Transformer (GPT) is one such LLM, and it uses only the Decoder part of the Transformer (BERT uses only the Encoder, and BART uses both the Encoder and the Decoder). For this article, we will consider GPT-3.5. It is trained on enormous amounts of data, with about 96 layers of neurons and 175B parameters. ChatGPT is a fine-tuned version of GPT-3 that adds a technique called ‘Reinforcement Learning from Human Feedback’ (RLHF). The 3.5 version is focused only on text generation, which means it answers in text the questions we ask in text.
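GPT-3.5 itself is only reachable through OpenAI’s service, but its much smaller open relative GPT-2 is also a decoder-only Transformer, so it can stand in for a quick illustration of “text in, text out”. The sketch below uses the Hugging Face transformers library; the model name and generation settings are just one possible choice.

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2: a small, openly available decoder-only model in the GPT family
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of India is"
inputs = tokenizer(prompt, return_tensors="pt")

# Continue the prompt by generating up to 10 new tokens, one at a time
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Do not expect GPT-2 to answer as well as ChatGPT; it is orders of magnitude smaller and has had no RLHF fine-tuning, which is exactly the gap that GPT-3.5 and RLHF close.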

But how exactly does it do this?

For every question we ask, the input passes through all 96 layers of GPT, which then predicts the next word; that word is appended to the text and the whole process repeats for the word after it. (Now you know why ChatGPT gives its answer word by word rather than all at once like a search engine.)

Let’s say we ask, “What is the capital of India?”. This input is sent to GPT word by word (with all the necessary vector generation), and using all the context words it predicts the next word “The”, then “capital”, “of”, “India”, “is”, and “New Delhi” in turn. This knowledge is gained by the LLM during training: the words “capital”, “India”, and “Delhi” would have occurred together in the training data far more often than with other words.
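To see how an answer comes out word by word, here is a toy greedy-generation loop. It rebuilds the same bigram counts as the earlier sketch and is far simpler than what GPT actually does, but the loop structure (predict a word, append it, predict again) is the same.

```python
from collections import Counter, defaultdict

corpus = [
    "the capital of india is new delhi",
    "the capital of france is paris",
    "new delhi is the capital of india",
]

# Count how often each word follows each previous word
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def generate(start_word, max_words=6):
    # Greedy generation: repeatedly append the most likely next word
    words = [start_word]
    for _ in range(max_words):
        counts = following[words[-1]]
        if not counts:          # nothing ever followed this word
            break
        words.append(counts.most_common(1)[0][0])
    return " ".join(words)

print(generate("the"))  # "the capital of india is new delhi"
```

GPT does the same thing with vectors and 96 layers of attention instead of simple counts, and with a vastly larger vocabulary and context.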

Comparison of Different GPT versions. Credits: iq.opengenus.org

Conclusion

At the end of the day, GPT is also an LLM, with a neural network at its foundation. So when we ask a question, it is like running the pre-trained model on a new input, and the answer we get is the model’s prediction. But where is the model running? Can we make the model run on our local machine? Can we make the pre-trained model answer domain-specific questions?
I will answer all these questions in my next article. Until then, stay thirsty for knowledge!
Thank you for reading.

References

  1. https://www.labellerr.com/blog/evolution-of-neural-networks-to-large-language-models/
  2. https://www.v7labs.com/blog/neural-network-architectures-guide
  3. https://medium.com/@amanatulla1606/transformer-architecture-explained-2c49e2257b4c
  4. https://www.geeksforgeeks.org/hidden-markov-model-in-machine-learning/
  5. https://iq.opengenus.org/gpt-3-5-model/
  6. https://manikanthgoud123.medium.com/what-is-rlhf-reinforcement-learning-from-human-feedback-d0ec88e0866c#:~:text=RLHF%20involves%20an%20ongoing%20training,creative%20and%20user%2Daligned%20results.
