So, how does an LLM work?

Sethu Iyer
Feb 17, 2024 · 10 min read


In this post, we are going to dive into the fundamentals of how LLMs work while keeping the language understandable, so that you are not lost in a rabbit hole. This post aims to be beginner friendly, yet offer deeper insight into the workings of LLMs for the interested.

For brevity, some details are excluded; where they are, assume that part works as intended.

LLMs, or large language models, are the tools we use to model human language. By model, we mean a framework that we control and understand, and that we can use in creative ways to make sense of something we do not yet understand.

The key ingredient of an LLM is a neural network. The term neural is inspired by the human brain, a pattern-recognition powerhouse. Let's first understand the network part of it.

In the real world, we often come across the principle of emergence, where different things come together to form a cohesive entity that is greater than the sum of its parts. Greater in the sense of the behaviour it exhibits.

Source: https://www.independent.co.uk/news/uk/home-news/incredible-image-shows-group-of-starlings-in-shape-of-giant-bird-daniel-biber-a8138216.html

The image above is such an example: a flock of starlings coming together to form the shape of a much bigger bird. Imagine the extent of communication required to achieve something like this.

We can try to understand the communication between birds by representing each bird as an atomic concept and each interaction between two birds as a "thread" connecting those two concepts.

With this way of thinking, we can see that a complex network demands an intricate web of communication. But even then, the information flow respects causality and typically follows a sequence.

For example, let's say that at some instant a portion of the flock performs an action A, to which another portion of the flock responds with an action B. This can start a chain A -> B -> C -> ... -> A that produces the big bird's flying motion.

Sample communication network between birds

The intelligence here lies in where the information flows and how it is transformed between two time stamps during the communication flow. A neural network models exactly that.

Complete information flow between the two sets of birds

Initially, we do not know where the information should flow or how much of it should flow. What we can control is how the information is transformed at each step and, at the initial step, what basic patterns we aim to capture.

Now, in order to understand which basic patterns to capture, we should understand what we are trying to model here. In the context of this blog post, that is human language: linguistics.

Human language is built from a set of grammatical rules and has structure. If we consider each individual word as an atomic unit, then within a particular sentence the communication flow between the words is certainly high.

Dependency Tree of an example phrase | Source: https://www.researchgate.net/figure/CoreNLP-output-of-a-dependency-tree-for-our-example-phrase_fig4_328908023

And this is not limited to words and sentences; the same holds between paragraphs and between sections of a large book. In other words, language exhibits a property similar to the image shown below.

The universe as like human brain | Source: https://foglets.com/the-universe-as-like-human-brain-discover-scientists/

It means that across zoom levels we see similar kinds of patterns. Stress on the word similar: it is not a perfect imitation. And because these patterns are observed on the same entity across different scales, we call this self-similarity.

So, the neural network should take causality into consideration when modelling the information flow, and the transformation at each step should respect the self-similar structure of the input.

Now let's focus on the neural part of the neural network, which deals with how we process and transform the information. We take a bottom-up approach to understanding language, starting from words.

The initial step in mathematically formulating language involves representing words as numerical structures.

For that, we use the underlying idea that “a word is characterized by the company it keeps” — which was popularized by Firth in the 1950s.

For example, let’s assume we have four words represented by A, B, C, and D, taken from a novel. We aim to calculate the number of times each word occurs in the context of the other words in the novel.

To define context, let's say that for a word w, the 4 words to its left and the 4 words to its right represent the word w's context.

In this framework, one could examine which sets of words co-occur most often, either globally or within a local window delimited by some page numbers, ultimately determining which pair of words appears together most frequently.

For example, for word A, we consider the four words to the left and the four words to the right, and calculate how many times words B, C, and D occur within this context.
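To make this concrete, here is a minimal Python sketch of that counting process. The toy "novel" and the window size are made up purely for illustration.

```python
from collections import defaultdict

corpus = "A B C A B B A B D D A B B C".split()   # toy "novel"
window = 4                                        # 4 words on each side define the context

cooc = defaultdict(int)
for i, center in enumerate(corpus):
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[(center, corpus[j])] += 1        # count corpus[j] inside center's context

print(cooc[("A", "C")])   # how often C shows up in A's context
```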

Representation of distributional hypothesis

In the above example, we see a strong correlation between A and C. This means the numbers we use to represent A have to be close to the numbers we use to represent C.

In order to get to that representation, we first start with a rudimentary representation.

A is represented by [0,0,0,1], B by [0,0,1,0], C by [0,1,0,0], and so on. Then we read the novel. The word dimension here is 4.

Suppose that in the first context window, A is the central word and C occurs 2 times.

To represent the fact that C is now closer to A, we can nudge C in the direction of A:

C_new = C_old + (number of occurrences / total number of occurrences) * A_old

This is a very simplistic example of the process of learning: given a correct answer, adjust the representation so that the output moves closer to the correct answer.
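As a toy sketch of that nudge rule, using the one-hot vectors from above and made-up counts:

```python
import numpy as np

A = np.array([0., 0., 0., 1.])   # rudimentary one-hot representation of A
C = np.array([0., 1., 0., 0.])   # rudimentary one-hot representation of C

count, total = 2, 8              # made-up numbers: C filled 2 of the 8 context slots around A
C = C + (count / total) * A      # nudge C toward A, proportional to how often they co-occur

print(C)                         # C now leans slightly in A's direction
```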

But in the context of language, there can often be multiple correct answers. That's the reason we rely on probability and sampling techniques and use the learning process to approximate a distribution, instead of a single correct answer. This probabilistic approach allows language models to capture the inherent uncertainty and variability present in natural language.

Extending this simplistic example, by learning to predict the context words given a target word, and to predict the target word given its context words, we can learn word-level representations. These word embeddings are learned in such a way that similar words have similar vector representations, which captures semantic relationships between words in the vector space. This gives us a good starting point in the quest to understand language.
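In practice, this predict-the-context / predict-the-target idea is what word2vec-style models implement. A minimal sketch with the gensim library, assuming it is installed; the two toy sentences are made up:

```python
from gensim.models import Word2Vec   # assumes the gensim library is available

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]   # toy corpus

# sg=1 selects skip-gram: predict context words from the target word
model = Word2Vec(sentences, vector_size=16, window=4, min_count=1, sg=1)

print(model.wv.most_similar("cat"))  # words whose learned vectors sit closest to "cat"
```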

To extend the concept of word embeddings to sentences, we need to consider two additional factors: the interactions between words in the sentence and the position of each word within the sentence. The order of words can significantly affect the meaning of the sequence.

To inject the positional information, we need a representation that follows the "self-similar" structure of language (to make the life of these pattern-finding machines easier), that is "adjustable" enough to accommodate errors, and that is relatively simple for us to understand.

Enter the circle (or, in higher dimensions, the sphere).

We use angle-based encoding here. Initially, we allocate points around the circle at 90-degree intervals. Then, upon encountering a duplicate, we divide the angle by 2 and continue allocating points; if another duplicate is found, we divide the angle by 2 again and keep going. This way, our representation range is finite, yet the formulation accommodates many points and follows a "self-similar" structure, making life easier for the pattern-finding machines.

This is the core idea of positional embedding. And since this is a unit circle, a point on its boundary can be represented by its angle with respect to the horizontal and vertical axes, and we only change the angles during learning.
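Here is one possible reading of that angle-halving allocation as a toy Python sketch; the function and its stopping rule are my own illustration, not a standard library routine:

```python
def allocate_angles(n_positions):
    """Toy reading of the scheme above: sweep the circle at 90-degree spacing,
    then halve the spacing and sweep again, skipping angles already taken."""
    angles, step = [], 90.0
    while len(angles) < n_positions:
        candidate = 0.0
        while candidate < 360.0 and len(angles) < n_positions:
            if candidate not in angles:        # skip slots that are already occupied
                angles.append(candidate)
            candidate += step
        step /= 2.0                            # halve the spacing for the next sweep
    return angles

print(allocate_angles(8))  # [0.0, 90.0, 180.0, 270.0, 45.0, 135.0, 225.0, 315.0]
```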

So, the order of the words is taken care of. The next step is to model the interactions between these words, considering the positional information and maintaining the "self-similarity" across scales. After obtaining word and positional embeddings, we have a set of signals, one along the sentence dimension and one along the word dimension.

Just to clarify, in a sequence like ABCABBABDDABBC, the word dimension is 4 (the number of unique words, typically the size of the vocabulary), and the sentence dimension is 14.

To find the similarity between two signals, one approach is to use the Fourier transform, which decomposes the signals into their constituent frequencies. By comparing the frequency components of the signals, we can assess their similarity in terms of their spectral content. This idea is captured in Google's FNet paper, and it offers an efficient alternative to the more computationally intensive attention mechanism.
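As a toy sketch of the FNet-style idea (not the paper's exact implementation): mix the tokens with a 2D Fourier transform over the sequence and hidden dimensions and keep only the real part, with no learned attention weights at all.

```python
import numpy as np

def fourier_mix(x):
    """FNet-style token mixing sketch: 2D FFT across the sequence and hidden
    dimensions, keeping only the real part of the result."""
    return np.real(np.fft.fft2(x))

tokens = np.random.randn(14, 8)     # toy input: 14 positions, 8-dimensional signals
mixed = fourier_mix(tokens)
print(mixed.shape)                  # (14, 8): every output position now "hears" every input position
```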

Another approach is to leave it to the network to learn how the information between the words should flow.

In order to justify the information flow between a pair of words, we ask three questions:

  1. Are they related in general with respect to the language?
  2. Are they related with respect to a particular context?
  3. Are they related with respect to their positions in the sentence?

We extract the answers to these three questions from the set of high-dimensional signals we have after positional embedding, by letting the signal flow through three checkpoints: the first checkpoint is the Key, which captures the linguistic relations between signals; the second is the Query, which captures the contextual relations between signals; and the final one is the Value, which captures the positional relations between the signals.

Attention mechanisms use these three components to determine the importance of different words (tokens) in a sentence relative to each other.
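A minimal NumPy sketch of the scaled dot-product attention that sits underneath this Q/K/V story; the shapes and values are illustrative only:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: scores say how much each token should
    listen to every other token; the output blends the Values accordingly."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax per row
    return weights @ V                                         # weighted blend of Values

# toy example: 5 tokens, 8-dimensional Q/K/V signals
Q, K, V = (np.random.randn(5, 8) for _ in range(3))
print(attention(Q, K, V).shape)   # (5, 8)
```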

And because language has a self-similar structure, this formulation will help us capture patterns across scales, from word levels to paragraph levels.

To simulate the distributional hypothesis discussed earlier in this expanded architecture, we use the problem of Masked Language Modelling, where we hide the target word and ask the network to predict it based on the words in its context, letting the network adjust the parameters of the entire architecture.
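For a hands-on feel of masked language modelling, the Hugging Face transformers library exposes it directly; this assumes the library and a model such as bert-base-uncased are available, and the example sentence is made up:

```python
from transformers import pipeline   # assumes the Hugging Face transformers library is installed

# masked language modelling: hide a word and ask the model to fill it back in
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for guess in unmasker("The flock of birds formed the shape of a giant [MASK]."):
    print(guess["token_str"], round(guess["score"], 3))
```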

So far, this constitutes the Encoder part of the Transformer. The name "Encoder" reflects our process of converting the structure of language into numerical representations using similarly structured components, and we further model the interactions between these components using a neural network.

Now, let’s delve into the workings of the decoder component within the Transformer architecture. While the encoder is responsible for processing the input sequence and extracting meaningful representations, the decoder takes these representations and generates output sequences based on them.

Gulliver's tunnel or encoder? Who knows?

So, finally, we can see that the first part of the black-box LLM is a Gulliver's tunnel: the part on the right represents our increasingly compact understanding.

Similar to the encoder, the decoder operates using self-attention mechanisms, but with a twist. It not only attends to the input sequence but also attends to the output sequence generated so far. This enables the decoder to generate outputs one step at a time, while considering both the input and previously generated output.

The decoder’s operation can be broken down into several key steps. First, it receives the encoded representations of the input sequence from the encoder. These representations serve as the foundation for generating the output sequence.

Next, the decoder employs a self-attention mechanism to focus on relevant parts of the input and output sequences simultaneously. This allows it to capture dependencies between different elements of the sequences and ensure coherence in the generated output.

Once the relevant information has been attended to, the decoder generates the output sequence step by step. At each step, it predicts the next token in the sequence based on the attended representations and the previously generated tokens. This process continues until a special end-of-sequence token is generated, indicating the completion of the output sequence.
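A toy sketch of that step-by-step loop; decoder_step here is a hypothetical stand-in for the full encoder-decoder stack, assumed to return a probability distribution over the vocabulary for the next token:

```python
import numpy as np

def generate(decoder_step, prompt_ids, eos_id, max_len=50):
    """Toy autoregressive loop: extend the sequence one token at a time until
    the end-of-sequence token appears or the length limit is reached."""
    tokens = list(prompt_ids)
    for _ in range(max_len):
        probs = decoder_step(tokens)        # attends to the input and the output so far
        next_id = int(np.argmax(probs))     # greedy choice: the most likely next token
        tokens.append(next_id)
        if next_id == eos_id:               # stop once the end-of-sequence token appears
            break
    return tokens
```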

To train the decoder, a fitting problem is translation. Let's say we have to translate a sentence from English to French.

We train the encoder on both English and French independently using masked language modelling; then, in the decoder, we add a cross-attention layer to learn the interactions between the QKV decompositions of English and French, so that the decoder can learn which part of the source sentence to pay attention to, based on the words generated so far.

On this joint representation we use neural networks, and in the final layer we apply known functions (a softmax, in practice) so that generation becomes akin to picking the right word from the vocabulary (in this case, French).
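A minimal sketch of that final step, turning raw scores into a probability distribution over a tiny, made-up French vocabulary and sampling a word from it:

```python
import numpy as np

def sample_word(logits, vocabulary, temperature=1.0):
    """Turn final-layer scores into a probability distribution over the
    vocabulary (a softmax) and sample one word from it."""
    z = np.array(logits) / temperature
    probs = np.exp(z - z.max())     # subtract the max for numerical stability
    probs /= probs.sum()
    return np.random.choice(vocabulary, p=probs)

# made-up scores over a made-up three-word vocabulary
print(sample_word([2.0, 0.5, 0.1], ["chat", "chien", "maison"]))
```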

The same architecture can then be used to train question-answering models. For each (question, answer) pair, we can make the model learn, via cross-attention, which part of the answer to focus on for a particular question.

So, when the training is done right and a quality dataset is provided, we can ask the model a question and it will generate an answer.

A sneak peek into the black box of LLMs

So far, we have seen that the decoder only generates one word at a time. Fortunately, using principles from the AI systems that learn chess, Go, and other games requiring planning, we can extend the objective beyond just the next word: we can exert control over the entire generated sequence and hence steer LLMs to produce output in the style we want.

And because we convert the inputs into a high-dimensional "signal", we can also use techniques from signals and systems (like Fourier analysis, wavelets, and structured state-space sequence models) to capture the aspect of "self-similarity". This flexibility also lets us take inputs from multiple modalities, like text, video, image, and music, and work across modalities: image captioning, text-to-image generation, text-to-video, and so on.

I hope this article was able to explain the core concepts without getting overly complex, and without oversimplifying what these models actually do.
