How LLMs Work? Explained in 9 Steps — Transformer Architecture

Kamna Sinha · Published in Data At The Core! · 4 min read · Dec 24, 2023
Reference: "Attention Is All You Need" (Vaswani et al., 2017): https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Transformer Architecture:

The power of the transformer architecture lies in its ability to learn the relevance and context of all of the words in a sentence.

Tokenizers:

The transformer architecture is split into two distinct parts, the encoder and the decoder. These components work in conjunction with each other and share a number of similarities.
Machine-learning models are just big statistical calculators, and they work with numbers, not words. So before passing text into the model to process, you must first tokenize the words using a tokenizer.

This converts the words into numbers, with each number representing a position in a dictionary of all the possible words that the model can work with.
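To make this concrete, here is a minimal sketch of word-level tokenization against a made-up vocabulary. Real tokenizers (for example, byte-pair encoding) split text into subword units, but the core idea is the same: each token maps to an integer ID, its position in the model's vocabulary.

```python
# A minimal word-level tokenizer over an invented vocabulary (illustration only).
vocab = {"the": 0, "teacher": 1, "taught": 2, "student": 3, "a": 4, "book": 5}

def tokenize(text: str) -> list[int]:
    """Map each lowercase word to its ID, i.e. its position in the vocabulary."""
    return [vocab[word] for word in text.lower().split()]

print(tokenize("The teacher taught the student"))
# [0, 1, 2, 0, 3]
```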

Embeddings:

Now that your input is represented as numbers, you can pass it to the embedding layer.
Embedding vector spaces have been used in natural language processing for some time; previous-generation language algorithms such as Word2vec use this same concept.

In this simple case, each word has been matched to a token ID, and each token is mapped into a vector.

If you imagine a vector size of just three, you could plot the words in a three-dimensional space and see the relationships between those words:

3-Dimensional vector space — a simple example for vector embeddings
related words in the embedding space

You can now see how words that are located close to each other in the embedding space are related, and how you can measure the distance between words as the angle between their vectors (cosine similarity), which gives the model the ability to mathematically understand language.

distance between the words as an angle
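As a rough sketch of that idea, the snippet below measures the angle between embedding vectors with cosine similarity. The three-dimensional vectors here are made up purely for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration.
embeddings = {
    "book":   np.array([0.90, 0.10, 0.20]),
    "novel":  np.array([0.85, 0.15, 0.25]),
    "banana": np.array([0.10, 0.90, 0.30]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["book"], embeddings["novel"]))   # ~0.99, related words
print(cosine_similarity(embeddings["book"], embeddings["banana"]))  # ~0.27, unrelated words
```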

LLM Training

Below is a basic diagram showing how attention weights are assigned during LLM training, so that the model understands the context of language based on the entire sentence and can better predict the next word.

LLM Training and Attention weights

STEPS TO GO THROUGH THIS MODEL

We will simplify the workings of the transformer architecture for LLMs and walk through, in 9 steps, how an input passes through the model.

Step 1. Before passing text into the model to process, you must first tokenize the words.

Step 2. Once the input is represented as numbers, you can pass it to the embedding layer. This layer is a trainable vector embedding space, a high-dimensional space where each token is represented as a vector and occupies a unique location within that space.
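A minimal sketch of Step 2, assuming a tiny vocabulary and an embedding width of 512: the embedding layer is just a trainable matrix with one row per token ID. It is randomly initialised here in place of learned values.

```python
import numpy as np

vocab_size, d_model = 6, 512                      # assumed sizes for illustration

# The embedding layer is a trainable matrix: one row (vector) per token ID.
# Randomly initialised here; during training these rows are learned.
embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02

token_ids = [0, 1, 2, 0, 3]                       # output of the tokenizer
token_vectors = embedding_matrix[token_ids]       # look up one vector per token
print(token_vectors.shape)                        # (5, 512)
```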

Step 3. As you add the token vectors into the base of the encoder or the decoder, you also add positional encoding. By adding the positional encoding, you preserve the information about the word order and don’t lose the relevance of the position of the word in the sentence.
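One common choice, used in the original transformer paper, is fixed sinusoidal positional encodings that are simply added to the token embeddings. Below is a NumPy sketch; the sequence length and embedding width are assumed for illustration.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding as in the original transformer paper."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

token_vectors = np.random.randn(5, 512)                  # stand-in token embeddings
x = token_vectors + positional_encoding(5, 512)          # position info is simply added
print(x.shape)                                           # (5, 512)
```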

Step 4. Once you’ve summed the input tokens and the positional encodings, you pass the resulting vectors to the self-attention layer.

Step 5. The self-attention weights that are learned during training and stored in these layers reflect the importance of each word in that input sequence to all other words in the sequence.

Step 6. Now that all of the attention weights have been applied to your input data, the output is processed through a fully-connected feed-forward network.
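Steps 4 to 6 can be sketched as scaled dot-product self-attention for a single head. The weight matrices and sizes below are random placeholders rather than trained values, and a real transformer uses multiple heads plus the feed-forward network on top.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # how much each token attends to every other
    weights = softmax(scores)                    # attention weights, each row sums to 1
    return weights @ V                           # weighted mix of value vectors

seq_len, d_model = 5, 64                         # assumed sizes for illustration
x = np.random.randn(seq_len, d_model)            # token embeddings + positional encoding
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)       # (5, 64)
```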

Step 7. The output of this layer is a vector of logits: one unnormalized score for each and every token in the tokenizer dictionary.

Step 8. You can then pass these logits to a final softmax layer, where they are normalized into a probability score for each word. This output includes a probability for every single word in the vocabulary, so there’s likely to be thousands of scores here.

Step 9. One single token will have a score higher than the rest. This is the most likely predicted token.
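Steps 7 to 9 can be sketched in a few lines: the logits are normalised with a softmax and the highest-probability token is picked (greedy decoding). The tiny vocabulary and logit values are invented for illustration.

```python
import numpy as np

# One logit per token in an assumed 6-word vocabulary (invented values).
logits = np.array([2.1, 0.3, -1.0, 4.5, 0.0, 1.2])

probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax: normalise logits into probabilities

predicted_id = int(np.argmax(probs))          # greedy decoding: pick the most likely token
print(predicted_id, round(probs[predicted_id], 2))   # 3 0.86
```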

The following diagram shows the above steps in order, with a diagrammatic representation of each step for better understanding.

Watch this space for more on LLMs.
