LLM: Overview

Gurleen Sodhi
7 min read · May 20, 2024


LLM stands for Large Language Model, which is a type of neural network trained on massive amounts of text data that can be found online, such as:

i. Scraped web pages

ii. Books

iii. Transcripts

Anything that is text-based can be used to train a large language model.

Now, take a step back: what is a neural network? A neural network is a series of algorithms that try to recognize patterns in data; in effect, they attempt to simulate how the human brain works. LLMs are a specific type of neural network focused on understanding natural language, and as mentioned, LLMs learn by reading tons of books, articles, and internet text. There's really no limitation there.

Difference between traditional programming and LLMs

Traditional programming is instruction-based: if x, then y. You explicitly tell the computer what to do with a set of instructions to execute. With LLMs, you are teaching the computer not how to do things, but how to learn to do things. Let me explain with an example.

Image recognition: for this to work with the traditional approach, you would have to hardcode every single rule for how to identify different letters ('a', 'b', 'c', 'd').

But everybody's handwritten letters look different, so how do you use traditional programming to identify every possible variation? That's where the AI approach comes in. Instead of giving the computer explicit instructions for how to identify a handwritten letter, you give it a bunch of examples of what handwritten letters look like, and it can infer what a new handwritten letter looks like based on all of the examples it has seen (see the sketch after the list below). This is what sets machine learning and LLMs apart: this approach to programming is much more flexible and adaptable, meaning models can learn from their mistakes and inaccuracies, and are thus more scalable. Hence, with an LLM a user can do:

i. Summarization

ii. Text Generation

iii. Creative writing

iv. Q&A
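To make the contrast concrete, here is a minimal sketch in Python using scikit-learn's bundled handwritten-digits dataset. The rule-based function is a hypothetical example; the learned model simply sees labeled examples and infers the patterns itself.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Traditional approach: hand-coded rules. Brittle, because everyone's
# handwriting looks different.
def looks_like_one(image_8x8):
    # Hypothetical rule: a "1" concentrates its ink in the middle columns.
    return image_8x8[:, 3:5].sum() > 0.5 * image_8x8.sum()

# Learning approach: show the computer labeled examples and let it
# generalize to new, unseen handwriting.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print("accuracy on unseen digits:", model.score(X_test, y_test))
```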

How do LLMs work?

It can be split into three steps:

1. Tokenization: neural networks are trained to split long text into individual tokens. A token is roughly three-quarters of a word on average, so a shorter word like "high" is probably just one token, but a longer word like "summarization" is going to be split into multiple pieces. The way tokenization happens is actually different for each model.
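To see this in practice, here is a minimal sketch using OpenAI's tiktoken library (one choice of tokenizer among many; each model family ships its own):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4

print(enc.encode("high"))           # a short word: a single token
print(enc.encode("summarization"))  # a longer word: several tokens
# Show which pieces the longer word was split into.
print([enc.decode([t]) for t in enc.encode("summarization")])
```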

2. Embeddings: the LLM turns those tokens into embedding vectors, that is, numerical representations of the tokens. This makes it easier for the computer to read and understand each word and how the different words relate to each other. These numbers correspond to positions in an embedding space, which can be stored in a vector database.

Word embeddings are placed into a vector database. These databases are storage and retrieval mechanisms that are highly optimized for vectors (again, just long series of numbers). Once tokens are converted into these vectors, the model can easily see which words are related to other words based on how similar and close their embeddings are. This is how an LLM is able to predict the next word based on the previous words.
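Here is a toy version of that storage-and-retrieval idea in plain NumPy. The stored vectors are random placeholders; in practice they would come from an embedding model:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 64))                 # 1,000 stored embeddings
db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize once up front

def search(query, k=3):
    """Return the indices of the k stored vectors closest to the query."""
    q = query / np.linalg.norm(query)
    scores = db @ q          # cosine similarity against every stored vector
    return np.argsort(scores)[::-1][:k]

print(search(rng.normal(size=64)))
```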

A vector database captures relationships between data as vectors in multidimensional space, which is just a lot of numbers. Vectors are objects with a magnitude and a direction, both of which influence how similar one vector is to another. This is how LLMs represent words: each word is turned into a vector capturing its semantic meaning and its relationship to other words.
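A tiny numeric illustration of how direction drives similarity between vectors:

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([4.0, 2.0])   # same direction as a, twice the magnitude
c = np.array([-1.0, 2.0])  # perpendicular to a

def cosine(u, v):
    # Dot product divided by the two magnitudes: the result depends on the
    # angle between the vectors, so it captures "direction" similarity.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))  # 1.0 -> same direction, maximally similar
print(cosine(a, c))  # 0.0 -> unrelated directions
```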

3. Transformers: the final step in the process is the matrix representation that can be made out of those vectors. This is done by extracting information out of the numbers and placing it into a matrix, through an algorithm called multi-head attention.

The output of the algorithm is a set of numbers that tells the model how much each word and its order contribute to the sentence as a whole.

The input matrix gets converted to an output matrix, which is then converted back into natural language; a word is the final output of the whole process.
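A minimal sketch of the attention computation at the heart of this step, for a single head (multi-head attention runs several of these in parallel and concatenates the results):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # input matrix: 4 tokens, 8-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly tokens attend to each other
weights = softmax(scores)                # each row sums to 1
output = weights @ V                     # output matrix, same shape as input
print(weights.round(2))
```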

NOTE: building an LLM from scratch is often not necessary for the vast majority of use cases. Instead, prompt engineering or fine-tuning an existing model can be better suited than building an LLM from scratch.

How much does it cost to train a model?

Take the LLM recently put out by Meta: there are significant computational costs associated with training the model on 16K Nvidia H100 GPUs. For more details, visit the link.

Now, based on the Llama 3 numbers, we'll say a 10-billion-parameter model takes on the order of 100,000 GPU-hours to train, while a 100-billion-parameter model takes about a million GPU-hours. So how can we translate that into a dollar amount? There are two options:

1. Rent the GPUs and compute we need to train our model from any of the big cloud providers. For example, renting an Nvidia A100 costs roughly $1–2 per GPU per hour.

→ 10B model: ~$150,000

→ 100B model: ~$1,500,000

2. Buy the hardware. An Nvidia A100 costs about $10,000.

→ GPU cluster: ~$10,000 × 1,000 GPUs = ~$10,000,000

Here, $10 million is not the only cost: training will also consume a tremendous amount of energy.
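Back-of-the-envelope arithmetic behind the figures above (the $1.50 rate is an assumed midpoint of the $1–2 range):

```python
rental_rate = 1.5          # dollars per A100 GPU-hour
hours_10b = 100_000        # approx. GPU-hours for a 10B-parameter model
hours_100b = 1_000_000     # approx. GPU-hours for a 100B-parameter model

print(f"10B model:  ${hours_10b * rental_rate:,.0f}")   # ~$150,000
print(f"100B model: ${hours_100b * rental_rate:,.0f}")  # ~$1,500,000

gpu_price, cluster_size = 10_000, 1_000  # buying ~1,000 A100s
print(f"cluster: ${gpu_price * cluster_size:,}")        # ~$10,000,000
```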

Steps of building a LLM

  1. Data Curation
  2. Model Architecture
  3. Training at Scale
  4. Evaluation

Data Curation —

The quality of the model is driven by the quality of your data, so it's super important to get the training data right. LLMs require large training datasets; to get a sense of the scale, GPT-3 was trained on half a trillion tokens.

1. Quality filtering: remove "low-quality" text from the dataset, using one of two techniques.

With the classifier-based technique, you take a small, high-quality dataset and use it to train a text-classification model that automatically scores text as good or bad, low or high quality. That precludes the need for a human to read a trillion words of text to assess its quality.
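A minimal sketch of the classifier-based approach with scikit-learn; the seed documents and labels here are made-up placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny hand-labeled seed set: 1 = high quality, 0 = low quality.
seed_docs = [
    "A well-edited paragraph explaining photosynthesis in clear prose.",
    "BUY CHEAP pills $$$ click here now!!!",
]
seed_labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(seed_docs, seed_labels)

# Score new documents automatically instead of reading them all by hand.
corpus = ["The history of the printing press.", "FREE $$$ CLICK CLICK"]
keep = [doc for doc in corpus if clf.predict([doc])[0] == 1]
print(keep)
```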

With the heuristic-based technique, you use various rules of thumb to filter the text. This could mean removing specific words (like explicit text) or using various statistical properties of the text to do the filtering.
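A toy heuristic filter; the block-list and thresholds are assumptions for illustration:

```python
BANNED_WORDS = {"viagra", "lottery"}  # placeholder block-list

def passes_heuristics(doc: str) -> bool:
    words = doc.split()
    if not words:
        return False
    # Rule of thumb 1: drop documents containing banned words.
    if any(w.lower().strip(".,!") in BANNED_WORDS for w in words):
        return False
    # Rule of thumb 2: extreme average word lengths often indicate junk
    # (symbol spam, boilerplate, machine-generated noise).
    avg_len = sum(len(w) for w in words) / len(words)
    return 2.0 < avg_len < 12.0

print(passes_heuristics("You won the lottery!"))       # False
print(passes_heuristics("A short, ordinary sentence."))  # True
```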

2. Deduplication: several instances of the same (or similar) text can bias the model, so it's better to remove such duplicates to avoid disrupting training.
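A sketch of exact deduplication via hashing; real pipelines also use fuzzy techniques such as MinHash to catch near-duplicates:

```python
import hashlib

def dedupe(docs):
    """Keep the first occurrence of each distinct document."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

print(dedupe(["Hello world.", "hello world.", "Something else."]))
```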

3. Privacy redaction: the removal of sensitive and confidential information.
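A toy redaction pass using regular expressions; the patterns are illustrative, not production-grade:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Reach jane.doe@example.com or 555-123-4567."))
```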

4. Tokenization: translate the text into numbers, because neural networks do not understand text directly; they understand numbers. A common approach is the byte-pair encoding (BPE) algorithm, which takes a corpus of text and derives from it an efficient subword vocabulary. In other words, it figures out the best choice of subwords or character sequences to define a vocabulary from which the entire corpus can be represented.
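A minimal sketch of training a BPE vocabulary with the Hugging Face tokenizers library; the corpus and vocabulary size are placeholders:

```python
# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

corpus = ["the quick brown fox", "summarization of long documents"] * 100
trainer = BpeTrainer(vocab_size=500, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("summarization").tokens)  # learned subword pieces
```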

Model Architecture —

Transformers — Defining model architecture

As far as LLMs go, Transformers have emerged as the state-of-the-art architecture. A Transformer is a neural network architecture that strictly uses attention mechanisms to map inputs to outputs.

Attention mechanisms learn dependencies between different elements of a sequence based on both position and content, that is, based on where words appear and what is being talked about. For example:

Consider a sentence like "He hit the baseball with the bat." Here, the appearance of "baseball" implies that "bat" is probably a baseball bat and not a nocturnal mammal. The content is the words making up this context: the context drives what word is going to come next and the meaning of each word. But content isn't enough; the positioning of these words is also important.
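Position is typically injected by adding positional encodings to the token embeddings; here is a minimal sketch of the sinusoidal scheme from the original Transformer paper:

```python
import numpy as np

def positional_encoding(num_tokens: int, dim: int) -> np.ndarray:
    """Sinusoidal encodings: even dimensions use sine, odd use cosine."""
    pos = np.arange(num_tokens)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Each row is added to the embedding of the token at that position, so the
# same word at different positions gets a different representation.
print(positional_encoding(4, 8).round(2))
```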

Training at Scale-

Step three is training these models at scale. Again, the central challenge of LLMs is their scale: when you're training on trillions of tokens with billions of parameters, there is a lot of computational cost involved, and it is practically impossible to train one of these models without employing some computational tricks to speed up the training process.
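One common trick is mixed-precision training; here is a minimal PyTorch sketch (the text above doesn't specify which tricks are meant, so this is just one example, and it assumes a CUDA GPU is available):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 512, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).pow(2).mean()  # forward pass runs largely in float16

scaler.scale(loss).backward()      # scale the loss to avoid float16 underflow
scaler.step(optimizer)
scaler.update()
```

Other standard tricks include gradient checkpointing and data/model parallelism across many GPUs.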

Evaluation -

Having a trained model in hand is just the starting place. To see what the model actually does and how it works in the context of the desired use case, model evaluation becomes important. There are many benchmark datasets out there, such as ARC, HellaSwag, MMLU, and TruthfulQA.

What’s next?

Base models are typically a starting point; you build something more practical on top of them. There are two directions:

1. Prompt engineering, which is feeding prompts into the language model and harvesting its completions for a particular use case (see the sketch after this list).

2. Model fine-tuning, which takes a pre-trained model and adapts it to a particular use case.
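A minimal prompt-engineering sketch with the OpenAI Python SDK; the model name is a placeholder, and the API key is read from the environment:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # uses the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat model works
    messages=[
        {"role": "system", "content": "You are a concise technical summarizer."},
        {"role": "user", "content": "Summarize what a large language model is."},
    ],
)
print(response.choices[0].message.content)
```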
