From Words to Vectors: Inside the LLM Transformer Architecture

Harika Panuganty
4 min read · Aug 8, 2023



Have you ever wondered how LLMs (Large Language Models) work? How they can seemingly ‘think’ when asked a question and produce an answer?

While humans rely on their brains to process information, LLMs use numbers, vectors, and statistical calculations to mimic this cognitive ability.

At the heart of LLMs lies the transformer architecture, a powerful neural network structure that revolutionized natural language processing.

In this article, we will explore the workings of the Transformer architecture and its pivotal role in LLMs. But first, let’s briefly recap neural networks, which are a type of machine learning model inspired by the structure and neural plasticity of the human brain. Neural networks learn from labeled training examples, adjusting the strengths of connections or weights through a method known as backpropagation. This process minimizes the difference between the neural network’s predictions and the correct answers. When the neural network encounters new data, it can make predictions by generalizing from data it has already seen.
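To make “adjusting weights through backpropagation” concrete, here is a minimal sketch, entirely my own illustration rather than code from any particular library, of a one-weight-plus-bias model trained with gradient descent on a toy labeled dataset. Every number and variable name is made up for the example.

```python
import numpy as np

# Toy labeled training data (illustrative values only): y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w, b = 0.0, 0.0          # connection "strengths" (weights) to be learned
learning_rate = 0.01

for step in range(2000):
    pred = w * x + b                  # forward pass: the model's predictions
    error = pred - y                  # difference from the correct answers
    # Backpropagation for this tiny model: gradients of the mean squared error
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Nudge the weights in the direction that reduces the error
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")   # approaches w=2, b=1
```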

Before the emergence of the Transformer architecture, recurrent neural networks (RNNs) were widely used for processing sequential data due to their ability to capture context and dependencies. For instance, when given a sentence, the RNN would process each word and update the hidden state, retaining information from the previous word. As the model progresses through the words in the sentence, the hidden state is continuously updated, enabling the model to retain and utilize relevant information from earlier steps. Unfortunately, RNNs process data strictly sequentially, which makes training and inference slow and makes it difficult to retain information across long data sequences.
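As a rough illustration of that sequential bottleneck, here is a minimal sketch of a vanilla RNN step in NumPy; the dimensions and the randomly initialized weight matrices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4

# Randomly initialized weights (illustrative only)
W_xh = rng.normal(size=(hidden_size, embed_size))   # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

sentence = [rng.normal(size=embed_size) for _ in range(5)]  # 5 word vectors

h = np.zeros(hidden_size)  # hidden state starts empty
for word_vec in sentence:
    # Each word must wait for the previous hidden state: no parallelism here
    h = np.tanh(W_xh @ word_vec + W_hh @ h + b_h)

print(h)  # the final hidden state summarizes the whole sequence
```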

However, in 2017, Google researchers introduced a new approach in their paper “Attention is All You Need” (Vaswani et al.). This approach largely replaced RNNs and served as the foundation for the powerful LLMs we use today. Transformers are built on the concept of self-attention, enabling the model to weigh the importance of different words in a sentence based on their relevance to each other. The Transformer architecture, depicted in [Figure 1], comprises an ‘Encoder’, which processes the input text, and a ‘Decoder’, responsible for generating the next word in the sequence.

[Figure 1] The Transformer architecture, from “Attention is All You Need” (Vaswani et al.)

Let’s go through the process and understand each step of the architecture.

The Encoder (Inputs)

Since computers don’t understand natural language the same way that we do, our first step is to convert the words of our sentence into numbers or vectors. Each word in an input sentence is broken up into units called tokens through a process known as tokenization. These tokens are then transformed into vectors which are known as vector embeddings. Each vector embedding is mapped to a point in space, where similar vectors cluster together, forming what we call the embedding space. This entire process, from word to token to vector mapping, is known as input embedding.
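Here is a minimal sketch of that word → token → vector pipeline, using a toy vocabulary, a naive whitespace “tokenizer,” and a randomly initialized embedding table; real tokenizers (such as byte-pair encoding) and learned embedding tables are, of course, far larger.

```python
import numpy as np

# Toy vocabulary and a naive whitespace "tokenizer" (illustrative only)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
sentence = "the cat sat on the mat"
token_ids = [vocab[word] for word in sentence.split()]

# Embedding table: one vector per token in the vocabulary (random here, dim 8)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

# Input embedding: look up a vector for each token in the sentence
input_embeddings = embedding_table[token_ids]
print(input_embeddings.shape)   # (6, 8): six tokens, each an 8-dimensional vector
```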

In natural language, the same word can play a different role depending on where it appears in a sentence, yet the attention mechanism on its own has no built-in sense of word order. To account for this, we use positional encoding: a vector that encodes each word’s position in the sentence is added to its embedding, providing positional information or ‘context’ to the embedded vectors.
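For reference, the sinusoidal positional encoding described in the original paper can be sketched roughly like this; the vectors it produces are simply added element-wise to the input embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention is All You Need"."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(d_model)[None, :]         # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])      # odd dimensions use cosine
    return pe

# Added to the embeddings so the model knows where each word sits in the sentence
# encoded_inputs = input_embeddings + positional_encoding(6, 8)
```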

These position-encoded input vectors then enter the Encoder block, starting with the multi-head attention layer. This layer enables the model to focus on different aspects of the input sentence simultaneously using multiple ‘heads,’ each performing an attention mechanism. Attention, in this context, allows the model to determine which parts of the input sentence are most relevant. An attention vector is produced for each word, establishing the word’s significance relative to other words within the input sequence.
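Here is a minimal sketch of the scaled dot-product attention computed inside a single head; the query, key, and value projection matrices are randomly initialized purely for illustration. A multi-head layer simply runs several such heads in parallel, each with its own projections, and concatenates their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # each word scores every word's relevance to it
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per word
    return weights @ V                   # one attention vector per word

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                            # 6 words, dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
attn_vectors = self_attention(x, W_q, W_k, W_v)        # shape (6, 8)
```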

These attention vectors then proceed to a feed-forward neural network which, unlike the recurrent networks discussed earlier, has a strictly unidirectional flow of information and treats each position independently. In this step, the neural network helps the model capture essential feature representations, patterns and relationships within the data. Additionally, the vectors are transformed into a format that the subsequent encoder/decoder block can understand.
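A minimal sketch of that position-wise feed-forward sub-layer, which in the original paper is two linear transformations with a ReLU in between, applied to each word vector independently; the dimensions and random weights here are illustrative.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network, applied to each word vector independently."""
    hidden = np.maximum(0, x @ W1 + b1)   # linear + ReLU
    return hidden @ W2 + b2               # project back to the model dimension

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

attn_vectors = rng.normal(size=(6, d_model))       # stand-in for the attention output
out = feed_forward(attn_vectors, W1, b1, W2, b2)   # same shape as the input: (6, 8)
```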

The Decoder (Outputs)

The initial steps of the Decoder mirror those of the Encoder. However, in the Decoder, the output embedding represents vectors from the target sequence, which are augmented with positional encoding vectors to convey context.

These augmented vectors are fed into the Decoder block, which begins with a masked multi-head attention layer. Here, the words that come after the current position are masked from the model’s view. While the Encoder block can attend to every word in the input sentence, the Decoder block can only attend to the words that precede the current position, so the model never ‘peeks’ at future words.
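A minimal sketch of the causal (look-ahead) mask this layer relies on: attention scores for future positions are set to a very large negative value before the softmax, so their attention weights become effectively zero. This is my own NumPy illustration, not code from the paper.

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: position i may only attend to positions <= i."""
    return np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

def masked_softmax(scores, mask):
    scores = np.where(mask, -1e9, scores)   # hide future words from the model
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))             # raw attention scores for 5 words
weights = masked_softmax(scores, causal_mask(5))
print(np.round(weights, 2))                  # the upper triangle is all zeros
```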

The next block, the encoder-decoder multi-head attention layer, generates attention vectors that relate each word being generated to the words of the input sequence. These attention vectors then traverse the feed-forward layer, a linear layer, and a softmax layer to predict the next word in the sequence. The feed-forward layer processes each attention vector independently of the others, which allows those computations to run in parallel, facilitating incredibly fast processing times and empowering the model to tackle complex tasks efficiently.
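To make that last step concrete, here is a sketch of the final linear + softmax projection that turns a decoder vector into a probability distribution over the vocabulary; the toy vocabulary, sizes, and random weights are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5
vocab = ["the", "cat", "sat", "on", "mat"]       # toy vocabulary (illustrative)

decoder_output = rng.normal(size=d_model)        # vector for the current position
W_out = rng.normal(size=(d_model, vocab_size))   # final linear layer

logits = decoder_output @ W_out                  # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax: scores -> probabilities

next_word = vocab[int(np.argmax(probs))]         # pick the most likely next word
print(next_word, np.round(probs, 2))
```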

The transformer architecture revolutionized natural language processing and has become the backbone of large language models. Through a multi-step process, words are transformed into numerical vectors using tokenization, embedding, and positional encoding. Attention layers then enable the model to focus on relevant parts of the input sentence, generating attention vectors that feed into final blocks to predict the next word. By leveraging these self-attention mechanisms, transformers have allowed LLMs to understand context and generate human-like responses with remarkable speed and scalability.

Thank you for reading!
