Understanding Large Language Models: Architecture and Self-Attention Explained

Tejaswi Kashyap
4 min read · Jul 30, 2023


Large language models have revolutionized natural language processing, enabling computers to understand and generate human-like text. Based on the transformer architecture, these models have become the cornerstone of modern NLP applications. In this article, we’ll delve into the workings of large language models, explore their architecture, and focus on the crucial components of the encoder, decoder, and self-attention.

  1. How Large Language Models Work: At their core, large language models are neural networks designed to process sequential data, such as sentences or paragraphs. They use a variant of the transformer architecture, which allows them to learn complex patterns and dependencies in language data.
  2. The Transformer Architecture: The transformer architecture, introduced by Vaswani et al. in 2017, revolutionized sequence-to-sequence learning. It consists of an encoder and a decoder, each built from stacked layers that contain self-attention. Self-attention, also known as scaled dot-product attention, is a crucial mechanism that enables the model to weigh the importance of different words in a sentence when processing a given word. A minimal code sketch of this encoder-decoder structure appears just below.
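
To make this concrete, here is a minimal sketch, assuming PyTorch as the framework, that instantiates the original encoder-decoder transformer with torch.nn.Transformer. The dimensions and dummy tensors are illustrative only, not a recommendation.

```python
# Minimal sketch (assumes PyTorch): the original encoder-decoder transformer.
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,           # size of token embeddings
    nhead=8,               # number of attention heads
    num_encoder_layers=6,  # stacked encoder layers
    num_decoder_layers=6,  # stacked decoder layers
    batch_first=True,
)

# Dummy embedded sequences: (batch, sequence length, d_model)
src = torch.rand(2, 10, 512)  # input sentence fed to the encoder
tgt = torch.rand(2, 7, 512)   # output sequence fed to the decoder

out = model(src, tgt)         # contextualized decoder states
print(out.shape)              # torch.Size([2, 7, 512])
```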

Let’s take a look at the encoder and decoder parts of the transformer.

The Encoder: The encoder is the first part of the transformer architecture. It processes the input sequence and transforms it into a rich, contextualized representation. Each encoder layer contains two sub-layers:

  a. Self-Attention Layer: This layer computes the self-attention mechanism. It allows the model to focus on different words in the input sequence while encoding a specific word. The model learns which words are essential for understanding the current word, capturing long-range dependencies efficiently.
  b. Feed-Forward Neural Network: After computing self-attention, the output passes through a feed-forward neural network, which introduces non-linearity and further refines the contextualized representation.
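
Below is a simplified sketch of such an encoder layer, again assuming PyTorch. The residual connections and layer normalization from the original paper are included for completeness, and the dimensions are illustrative.

```python
# Simplified encoder layer: self-attention + feed-forward (PyTorch assumed).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Sub-layer a: multi-head self-attention over the input sequence
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Sub-layer b: position-wise feed-forward network
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Each word attends to every word in the same sequence (Q = K = V = x)
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)    # residual connection + layer norm
        x = self.norm2(x + self.ff(x))  # refine with the feed-forward network
        return x

x = torch.rand(2, 10, 512)        # (batch, sequence length, d_model)
print(EncoderLayer()(x).shape)    # torch.Size([2, 10, 512])
```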

The Decoder: The decoder is the second part of the transformer architecture. It generates the output sequence based on the contextualized representation from the encoder. Each decoder layer contains the following sub-layers:

  a. Masked Self-Attention Layer: The decoder self-attention layer allows the model to attend to the earlier positions in the output sequence while predicting the word at the current position. This enables the model to maintain coherence and relevance throughout the generated sequence.
  b. Encoder-Decoder Attention Layer: This layer helps the decoder focus on relevant parts of the input sequence during the decoding process. It allows the model to align the input and output sequences effectively.

As in the encoder, each decoder layer then passes its output through a feed-forward neural network that further refines the representation.
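
Here is a matching sketch of a decoder layer under the same assumptions. Note the causal mask in the self-attention step and the extra cross-attention over the encoder output.

```python
# Simplified decoder layer: masked self-attention, cross-attention, feed-forward.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, y, memory):
        # a. Masked self-attention: each position sees only earlier positions
        L = y.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        out, _ = self.self_attn(y, y, y, attn_mask=causal)
        y = self.norms[0](y + out)
        # b. Encoder-decoder attention: queries come from the decoder,
        #    keys/values from the encoder output ("memory")
        out, _ = self.cross_attn(y, memory, memory)
        y = self.norms[1](y + out)
        # Feed-forward refinement, as in the encoder
        return self.norms[2](y + self.ff(y))

memory = torch.rand(2, 10, 512)          # encoder output
y = torch.rand(2, 7, 512)                # embedded target sequence so far
print(DecoderLayer()(y, memory).shape)   # torch.Size([2, 7, 512])
```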

The part that makes the transformer work like a transformer: self-attention.

Self-Attention: Self-attention is a fundamental building block of large language models. It allows the model to compute the importance of each word in a sentence with respect to the word under consideration. The attention scores are calculated from the similarity of the words' learned representations: words that are semantically related receive higher attention scores, which in turn shape the final representation of the word.

Here’s a brief explanation of the self-attention mechanism and how it works (a code sketch follows the list):

  1. Input Representation: Before feeding the sentence into the self-attention layer, each word in the sentence is transformed into three vectors: the Query (Q), the Key (K), and the Value (V). These vectors are obtained by multiplying the word embeddings by learnable weight matrices.
  2. Dot Product Attention: Next, the self-attention mechanism computes attention scores between each pair of words (i.e., Q and K vectors). The attention score represents how much a word should attend to other words in the sentence. This is done by taking the dot product between the Query vector of a word and the Key vectors of all the words in the sentence, and scaling the result by the square root of the key dimension, which is why this is called scaled dot-product attention.
  3. Softmax and Attention Weights: The dot product attention scores are then passed through a softmax function, which converts them into a probability distribution. The softmax ensures that the attention weights sum up to 1, determining the focus of each word on other words in the sentence.
  4. Contextual Representation: Finally, the attention weights are multiplied by the Value vectors of the words to compute the weighted sum, which produces the contextual representation for each word. This contextual representation captures the word’s context within the entire sentence, incorporating information from other words.
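
Putting the four steps together, here is a from-scratch sketch of single-head scaled dot-product self-attention, again assuming PyTorch. The random matrices stand in for learned weights, and the sizes are illustrative.

```python
# Single-head scaled dot-product self-attention, following the four steps above.
import math
import torch

torch.manual_seed(0)
seq_len, d_model, d_k = 5, 512, 64   # 5 "words", illustrative sizes

x = torch.rand(seq_len, d_model)     # word embeddings for one sentence

# Step 1: project embeddings into Query, Key, and Value vectors
W_q, W_k, W_v = (torch.rand(d_model, d_k) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Step 2: dot products between every Query and every Key, scaled by sqrt(d_k)
scores = Q @ K.T / math.sqrt(d_k)    # shape: (seq_len, seq_len)

# Step 3: softmax turns each row into attention weights that sum to 1
weights = torch.softmax(scores, dim=-1)

# Step 4: weighted sum of Value vectors = contextual representation per word
context = weights @ V                # shape: (seq_len, d_k)
print(weights.sum(dim=-1))           # each row sums to 1
print(context.shape)                 # torch.Size([5, 64])
```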
[Figure: Architecture of the Transformer]
